What is Apache Kafka?
- Created at LinkedIn and open-sourced in 2011
- Extremely scalable
- Message bus
- For example, Brad Pitt making Instagram posts: each post is an event delivered to many consumers
- Good for distributed event streaming
- Good for event based systems
- Producers -> apps that send messages to the partitions of a particular topic
- Topics -> split into partitions
- Why not let producers send messages directly to consumers? Because that does not scale
- Guarantee: messages a producer sends to a given partition are stored in order within that partition (there is no ordering guarantee across partitions)
- Partitions (and therefore topics) live on brokers; a broker is basically a Kafka server
- Consumers can pull on-demand, at their own pace
- Producers publish messages through client libraries
- You can have multiple producers, brokers, and consumers, which requires a lot of bandwidth
- Replicas of each partition are kept for fault tolerance
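The ordering guarantee above can be sketched without a running cluster. This is a minimal simulation, not Kafka client code: `choose_partition`, `produce`, the user keys, and the partition count are all hypothetical names chosen for illustration, and CRC32 stands in for the hash used by Kafka's default partitioner (murmur2).

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count for this illustration


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition index. CRC32 is only a stand-in hash here."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(topic: dict, key: str, value: str) -> int:
    """Append (key, value) to the partition chosen for the key; return the partition."""
    partition = choose_partition(key)
    topic[partition].append((key, value))
    return partition


# A "topic" is modelled as a mapping from partition index to an ordered list.
topic = defaultdict(list)
events = [
    ("user-1", "post-1"),
    ("user-2", "post-1"),
    ("user-1", "post-2"),
    ("user-2", "post-2"),
    ("user-1", "post-3"),
]
for key, value in events:
    produce(topic, key, value)

for partition, records in sorted(topic.items()):
    print(partition, records)
```

Because the same key always hashes to the same partition, all of `user-1`'s posts sit in one list in the order they were sent, while posts from different users may be interleaved or spread across partitions.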
Motivating Example
- World Cup
- The producer puts all events on the queue, e.g. a Brazil substitution or a goal by Neymar
- Consumers read from the queue
- Problem #1: Too many events on the queue, and the server hosting the queue runs out of space
- Solution #1: Scale horizontally (more servers); the issue is that events are then not read in order
- Problem #2: Events are now not processed in the correct order
- Solution #2: Partition by game, so all events of one game stay in order on one partition
- In general, distribute/partition by a key that matches the ordering you need (e.g. by user in a user-centric system)
- Problem #3: Consumers can't keep up with the rate of new events
- Solution #3: Consumer groups -> each event is processed by exactly one consumer in the group
- Introduce topics: each event is associated with a topic, and each consumer subscribes to the topics it cares about
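The consumer-group idea from Solution #3 can be sketched as a partition assignment problem: each partition is owned by exactly one consumer in the group, so each event is processed once per group. The function `assign_partitions` and the consumer names below are hypothetical, and the round-robin strategy is a simplification, not Kafka's actual assignor implementation.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over the consumers of one group.

    Every partition gets exactly one owner, so no event is processed
    twice within the group.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions (e.g. four games) shared by two consumers.
before = assign_partitions(range(4), ["c1", "c2"])

# If c2 leaves the group, the same partitions are reassigned to the survivors.
after = assign_partitions(range(4), ["c1"])

print(before)
print(after)
```

Adding consumers beyond the number of partitions would leave some of them idle in this model, which is one reason the partition count bounds the parallelism of a group.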
Overview
- Broker: The servers (physical or virtual) that hold the queue
- Partition: The 'queue'. An ordered, immutable sequence of messages that we append to, like a log file. Each broker can hold multiple partitions. Exists physically. The unit for scaling data
- Topic: A logical grouping of partitions. You publish to and consume from topics in Kafka. Exists in code.
- Producer: writes messages/records to topics
- Consumer: reads messages/records off of topics
- Look under the hood
- The producer creates and publishes a message/record
- A message has headers, a key, a value, and a timestamp
- You can also add a record via the CLI
- The message is then routed to the correct topic, broker, and partition (when a key is set, the partition is derived from a hash of the key)
- The message is appended to the partition's append-only log file
- Consumers read the next message based on their offset
- Each partition has a leader and is replicated to follower replicas
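The append-and-read-by-offset flow above can be modelled in a few lines. This is a toy in-memory sketch, not how Kafka stores data on disk: the classes `Record`, `Partition`, and `Consumer` and the `game-1` key are hypothetical names for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A message with the four parts listed above."""
    key: str
    value: str
    headers: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


class Partition:
    """An append-only log; a record's offset is its position in the log."""

    def __init__(self):
        self._log = []

    def append(self, record: Record) -> int:
        self._log.append(record)
        return len(self._log) - 1  # offset assigned to the new record

    def read(self, offset: int, max_records: int = 10) -> list:
        return self._log[offset:offset + max_records]


class Consumer:
    """Pulls records at its own pace and remembers the next offset to read."""

    def __init__(self, partition: Partition):
        self.partition = partition
        self.next_offset = 0

    def poll(self, max_records: int = 10) -> list:
        records = self.partition.read(self.next_offset, max_records)
        self.next_offset += len(records)
        return records


partition = Partition()
for value in ["kickoff", "goal", "sub"]:
    partition.append(Record(key="game-1", value=value))

consumer = Consumer(partition)
first = [r.value for r in consumer.poll(max_records=2)]
partition.append(Record(key="game-1", value="full time"))
rest = [r.value for r in consumer.poll()]
print(first, rest)
```

Reading never removes anything from the log; the consumer only advances its own offset, which is why several consumers can read the same partition independently.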
When to use it
- Anytime you need a message queue
- Processing can be done asynchronously, e.g. YouTube transcoding
- In-order message processing, e.g. a Ticketmaster waiting queue
- Decoupling producer and consumer so they can scale independently, e.g. LeetCode or an online judge
- Anytime you need a stream
- You need to process a lot of data in real time, e.g. an ad click aggregator
- A stream of messages needs to be processed by multiple consumers simultaneously, e.g. Messenger or FB Live comments
What you should know for deep dives
- Scalability
- Constraints:
- Aim for < 1 MB per message
- Rule of thumb: one broker holds up to ~1 TB of data and handles ~10k messages per second (actual limits depend on hardware and message size)
- How to scale?
- More brokers
- Choose a good partition key
- How to handle a hot partition?
- Remove the key (would work if order does not matter)
- Compound the key, e.g. AdId:1-10 (a suffix from 1 to 10 appended to the ad ID) or AdId:UserId
- Backpressure - slow down the producer
- Fault tolerance & Durability
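- Relevant settings: acks (acks=all) and replication factor (3 is a common choice; the broker default is 1)
- What happens when a consumer goes down? (Kafka itself is built not to go down)
- A single consumer resumes from its last committed offset when it restarts
- In a consumer group, Kafka rebalances the partitions across the remaining consumers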
- Errors & retries
- Performance Optimizations
- Batch messages in producer
- Compress messages in producer
- Partition key is where you should start
- Retention Policies
- Two settings:
- retention.ms (default 7 days): how long to keep a message
- retention.bytes (default -1, i.e. no size limit): when to start purging based on partition size; the 1 GB default belongs to the log segment size (segment.bytes)
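The "compound the key" fix for a hot partition can be illustrated with a small simulation. It is a sketch under stated assumptions: `partition_for`, `salted_key`, the ad ID `ad-42`, the partition count, and the bucket count are hypothetical, and CRC32 again stands in for the real partitioner's hash.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8   # assumed partition count
SALT_BUCKETS = 10    # matches the AdId:1-10 idea from the notes


def partition_for(key: str) -> int:
    """Stand-in partitioner: hash the key, take it modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def salted_key(ad_id: str, rng: random.Random) -> str:
    """Append a random suffix in 1..SALT_BUCKETS to spread one hot ad over several keys."""
    return f"{ad_id}:{rng.randint(1, SALT_BUCKETS)}"


rng = random.Random(0)  # seeded so the run is repeatable

# 1000 clicks on one viral ad, keyed by the plain ad ID: one partition takes them all.
plain = Counter(partition_for("ad-42") for _ in range(1000))

# The same clicks with a compound key: the load spreads over several partitions.
salted = Counter(partition_for(salted_key("ad-42", rng)) for _ in range(1000))

print("plain key: ", dict(plain))
print("salted key:", dict(salted))
```

The trade-off is that events for one ad are no longer ordered relative to each other, and a consumer that needs a per-ad total has to combine the results from all suffixes.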