What is Apache Kafka?

Created at LinkedIn; open-sourced in 2011

  • Extremely scalable
  • Message bus
  • For example, fanning out Brad Pitt's Instagram posts to his followers
  • Good for distributed event streaming
  • Good for event-driven systems
  • Producers -> apps that send messages to various partitions for a particular topic
  • Topics -> split into partitions
  • Why not just let producers send msgs directly to consumers? Because it doesn't scale: producers and consumers get tightly coupled and must run at the same speed
  • The guarantee: whenever a producer sends a msg to a partition, it is ordered within that partition (not across partitions)
  • Partitions and topics live within a broker, which is basically a Kafka server
  • Consumers can pull on-demand, at their own pace
  • Producers publish messages using Kafka client libraries (see the sketch below)
  • You can have many producers, brokers, and consumers, which requires a lot of network bandwidth
  • Keep replicas of each partition so a broker failure doesn't lose data
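To make the producer side concrete, a minimal sketch with the standard Java client. The broker address (localhost:9092), topic name (posts), and key are made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PostProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key picks the partition: same key -> same partition -> ordered.
            // "posts" is a hypothetical topic name.
            producer.send(new ProducerRecord<>("posts", "user-123", "new post: hello"));
        } // close() flushes any buffered messages
    }
}
```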
  1. Motivating Example

    • World Cup
      • Producer will put all events on the queue, like "substitution for Brazil" or "goal by Neymar"
      • Consumers read from the queue
      • Problem #1 Too many events on the queue and the server hosting queue running out of space
      • Solution #1 Scale horizontally (more servers hosting the queue)
      • Problem #2 Now events are not processed in the correct order across servers
      • Solution #2 Partition by game, so each game's events stay on one queue, in order
      • In general, we need to distribute/partition by a key that preserves the ordering we care about (the game here; the user in other systems)
      • Problem #3 Consumer can’t keep up with the rate of new events
      • Solution #3 Consumer groups -> each event is processed by exactly one consumer in the group
      • Introduce topics: each event is associated with a topic, and each consumer subscribes to the topics it cares about (see the sketch below)
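A sketch of the consumer-group idea with the Java client (topic name, group id, and broker address are illustrative): every consumer started with the same group.id splits the topic's partitions among themselves, so each event is handled by exactly one of them.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MatchEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "scoreboard-updaters"); // members of this group share the work
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("world-cup-events")); // hypothetical topic
            while (true) {
                // Poll pulls the next batch at the consumer's own pace
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("game=%s event=%s offset=%d%n", r.key(), r.value(), r.offset());
                }
            }
        }
    }
}
```

Run two copies of this with the same group.id and Kafka rebalances the partitions between them automatically.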
  2. Overview

    • Broker: The servers (physical or virtual) that hold the queue
    • Partition: The "queue". An ordered, immutable sequence of messages that we append to, like a log file. Each broker can hold multiple partitions. Exists physically on disk. The unit of scaling data
    • Topic: A logical grouping of partitions. You publish to and consume from topics in Kafka. Exists in code.
    • Producer: writes messages/records to topics
    • Consumer: reads messages/records off of topics
    • Look under the hood
      • The producer creates and publishes a message/record structure
      • A message has headers, a key, a value, and a timestamp (see the sketch after this list)
      • You can also append a record from the CLI, e.g. kafka-console-producer.sh --bootstrap-server localhost:9092 --topic <topic>
      • Kafka then routes the message to the correct topic, broker, and partition
      • The message is appended to the partition's append-only log file
      • Consumers read the next message based on their offset
      • Each partition has a leader and is replicated to followers
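To make the record anatomy concrete, a sketch that builds a record with headers, key, value, and timestamp, then asks the broker where it landed. Topic, key, and header names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class RecordAnatomy {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // topic, partition (null = derive it from the key), timestamp, key, value
            ProducerRecord<String, String> record = new ProducerRecord<>(
                    "world-cup-events", null, System.currentTimeMillis(),
                    "brazil-vs-france", "goal: Neymar");
            record.headers().add("trace-id", "abc123".getBytes(StandardCharsets.UTF_8));

            // The broker's ack tells us which partition and offset the record was appended at
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}
```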
  3. When to use it

    1. Anytime you need a message queue
      • Processing can be done async, e.g. YouTube transcoding
      • In-order message processing, e.g. Ticketmaster waiting queue
      • Decouple producer and consumer so they can scale independently, e.g. LeetCode or an online judge
    2. Anytime you need a stream
      • Need to process a lot of data in real time, e.g. an ad click aggregator
      • A stream of messages needs to be processed by multiple consumers simultaneously, e.g. Messenger or FB Live comments
  4. What you should know for deep dives

    1. Scalability
      • Constraints:
        • Aim for < 1MB per message
        • One broker can handle up to ~1 TB of data & ~10k messages per second (rule of thumb)
      • How to scale?
        • More brokers
        • Choose a good partition key
      • How to handle a hot partition?
        • Remove the key so messages are distributed round-robin (works only if order does not matter)
        • Compound the key, e.g. AdId:1-10 (random suffix) or AdId:UserId (see the sketch below)
        • Backpressure - slow down the producer
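A sketch of the compound-key idea: appending a random 1-10 suffix to a hot key spreads one ad's events across up to ten partitions, at the cost of ordering across those sub-keys. The topic name and id format are made up:

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HotPartitionKeys {
    static String compoundKey(String adId) {
        // AdId:1-10 style - a random suffix spreads a hot ad over up to 10 partitions
        return adId + ":" + ThreadLocalRandom.current().nextInt(1, 11);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("ad-clicks", compoundKey("ad-42"), "click"));
        }
    }
}
```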
    2. Fault tolerance & Durability
      • Relevant settings (sketch after this list):
        • acks (acks=all waits for every in-sync replica before acknowledging)
        • replication factor (3 is the common default)
      • What happens when a consumer goes down? (Kafka itself stays up)
        • On restart, it just resumes reading from its last committed offset
        • If it was in a consumer group, the group rebalances partitions across the remaining consumers
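A sketch of the replication side using the Java Admin client: the replication factor is fixed when the topic is created, while acks=all is a producer setting (shown in the retry sketch below). Topic name and counts here are illustrative:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class DurableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 3: each partition can survive two broker failures
            admin.createTopics(List.of(new NewTopic("world-cup-events", 3, (short) 3)))
                 .all().get();
        }
    }
}
```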
    3. Errors & retries
      • Producer side: the client retries transient send failures for you (config sketch below)
      • Consumer side: Kafka does not retry for you - commit offsets only after successful processing, and route poison messages to a dead-letter topic
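A sketch of the producer-side retry knobs, using real config names from the standard Java client (the values are illustrative, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");                 // wait for all in-sync replicas
        props.put("retries", 5);                  // retry transient send failures
        props.put("retry.backoff.ms", 200);       // pause between attempts
        props.put("delivery.timeout.ms", 30_000); // total time budget incl. retries
        props.put("enable.idempotence", true);    // retries won't write duplicates

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records as usual; transient broker errors are retried automatically
        }
    }
}
```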
    4. Performance Optimizations
      1. Batch messages in the producer (config sketch below)
      2. Compress messages in the producer
      3. Choosing a good partition key is where you should start
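A sketch of batching and compression, again with real Java-client config names and illustrative values:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("batch.size", 64 * 1024);       // batch up to 64 KB per partition (default 16 KB)
        props.put("linger.ms", 10);               // wait up to 10 ms for a batch to fill
        props.put("compression.type", "snappy");  // compress each batch on the producer

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... records sent from here are batched and compressed automatically
        }
    }
}
```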
    5. Retention Policies
      1. Two settings (sketch below):
        1. retention.ms (default 7 days) - how long to keep a message
        2. retention.bytes (default -1, i.e. unlimited) - when to start purging based on partition size
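A sketch of setting both retention configs on an existing topic with the Java Admin client (topic name and values are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "world-cup-events");
            admin.incrementalAlterConfigs(Map.of(topic, List.of(
                    // keep messages for 1 day...
                    new AlterConfigOp(new ConfigEntry("retention.ms", "86400000"),
                                      AlterConfigOp.OpType.SET),
                    // ...or until the partition exceeds ~1 GB, whichever comes first
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"),
                                      AlterConfigOp.OpType.SET)
            ))).all().get();
        }
    }
}
```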