
Apache Kafka: A Practical Guide for Developers

Apache Kafka has become the backbone of real-time data infrastructure for companies like LinkedIn, Netflix, and Uber.
It is designed for high-throughput, fault-tolerant, distributed event streaming with near-real-time delivery.

In this guide, we’ll cover Kafka’s architecture, core concepts, schema management, consumer groups, and stream processing, all explained with practical insights to help developers build production-ready applications.

Why Kafka?

Unlike traditional message queues, Kafka is not just a broker; it is a distributed event streaming platform.

Key advantages:

  • High throughput: Millions of messages per second at scale.
  • Durability: Data is persisted on disk and replicated across brokers.
  • Scalability: Horizontal scaling with partitions and brokers.
  • Flexibility: Use Kafka as a message bus, event log, or real-time analytics backbone.

Kafka Architecture

Core Principles

  • Leader-Follower Architecture: Producers write to the partition leader; followers replicate the data for fault tolerance.
  • Topic-Based Messaging: Topics are append-only logs storing ordered events.
  • Partitions for Scalability: Topics are split across brokers for parallelism.
  • Consumer Groups: Multiple consumers can share the workload by dividing partitions.
  • Durability with Replication: A replication factor of 3 is the usual production setting (the broker default is 1); it keeps data available when a broker fails.
  • Delivery Semantics: Kafka supports at least once, at most once, and exactly once processing.

Topics

A topic in Kafka is similar to a table in a database:

  • Stores a stream of immutable messages.
  • Can be serialized in JSON, Avro, Protobuf, or custom formats.
  • Messages are retained for a configurable period (retention.ms, default: 7 days) rather than kept forever.
  • Replication ensures durability and failover.
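
A minimal sketch of creating such a topic with the Java AdminClient, assuming a broker at localhost:9092 and a hypothetical payments topic:

  import java.time.Duration;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.kafka.clients.admin.Admin;
  import org.apache.kafka.clients.admin.AdminClientConfig;
  import org.apache.kafka.clients.admin.NewTopic;
  import org.apache.kafka.common.config.TopicConfig;

  public class CreateTopic {
    public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker

      try (Admin admin = Admin.create(props)) {
        // Hypothetical "payments" topic: 6 partitions, replication factor 3, 7-day retention.
        NewTopic payments = new NewTopic("payments", 6, (short) 3)
            .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(Duration.ofDays(7).toMillis())));
        admin.createTopics(List.of(payments)).all().get();
      }
    }
  }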

Partitions & Offsets

Kafka achieves scalability and ordering with partitions:

  • A topic can have many partitions, distributed across brokers.
  • Within a partition, messages are ordered.
  • Each message has an offset (a unique incremental ID).
  • Offsets are never reused, even if data is deleted.

⚠️ Best practice: choose the number of partitions carefully, load test, and avoid exceeding broker limits (e.g., 4K partitions per broker).
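
To see that offsets are tracked per partition, the consumer API can report each partition's next offset; a sketch against the hypothetical payments topic above:

  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import java.util.stream.Collectors;
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.TopicPartition;
  import org.apache.kafka.common.serialization.StringDeserializer;

  public class ShowOffsets {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
      props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
      props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        // One TopicPartition per partition of the (hypothetical) topic.
        List<TopicPartition> partitions = consumer.partitionsFor("payments").stream()
            .map(p -> new TopicPartition(p.topic(), p.partition()))
            .collect(Collectors.toList());

        // End offsets are per partition and only ever grow; they are never reused.
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
        endOffsets.forEach((tp, offset) ->
            System.out.printf("partition %d -> next offset %d%n", tp.partition(), offset));
      }
    }
  }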

Message Keys

A message key determines how messages are distributed:

  • Key = NULL → Round-robin distribution across partitions.
  • Key ≠ NULL → All messages with the same key go to the same partition (ordering guarantee).
  • The default partitioner hashes the key with the Murmur2 algorithm to pick the partition.

💡 Example: In a money transfer system, use customer_id as the key so all transactions for a customer remain ordered.
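
A sketch of that pattern, using a hypothetical transfers topic and customer ID: because the default partitioner hashes the key, every send for the same customer reports the same partition.

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.clients.producer.RecordMetadata;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class KeyedProducer {
    public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

      try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        for (int i = 1; i <= 3; i++) {
          // Key = customer_id, so all of this customer's transfers map to one partition.
          ProducerRecord<String, String> record =
              new ProducerRecord<>("transfers", "customer-42", "transfer #" + i); // hypothetical topic/key
          RecordMetadata meta = producer.send(record).get();
          System.out.printf("key=%s partition=%d offset=%d%n",
              record.key(), meta.partition(), meta.offset());
        }
      }
    }
  }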

Message Format:

  • Key (nullable)
  • Value (the payload; a null value acts as a tombstone in compacted topics)
  • Headers (metadata map)
  • Partition + Offset
  • Timestamp
  • Compression codec (none, gzip, snappy, lz4, zstd), applied per record batch
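
The sketch below sets each of these fields explicitly; the payment-events topic and trace-id header are made up for illustration, and compression is configured on the producer and applied to record batches.

  import java.nio.charset.StandardCharsets;
  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class AnnotatedRecord {
    public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed local broker
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy"); // none, gzip, snappy, lz4, zstd

      try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
        ProducerRecord<String, String> record = new ProducerRecord<>(
            "payment-events",               // topic (hypothetical)
            null,                           // partition: null lets the partitioner decide
            System.currentTimeMillis(),     // timestamp
            "customer-42",                  // key (nullable)
            "{\"amount\": 99.50}");         // value
        record.headers().add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8)); // header metadata

        producer.send(record).get();        // the broker assigns the final partition + offset
      }
    }
  }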

Consumer Groups

Consumers in Kafka scale horizontally with consumer groups:

  • Each partition is consumed by only one consumer in a group.
  • Multiple groups can read the same topic independently.
  • If a consumer crashes, Kafka reassigns its partitions.

Delivery Semantics:

  1. At least once → Default, may cause duplicates.
  2. At most once → Lowest latency, risk of data loss.
  3. Exactly once → Achieved with the idempotent/transactional producer APIs or Kafka Streams (processing.guarantee=exactly_once_v2).
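
A consumer-group sketch with at-least-once behaviour (hypothetical payments-app group reading the transfers topic): auto-commit is off and offsets are committed only after processing, so a crash may reprocess records but never lose them; committing before processing would flip this to at-most-once.

  import java.time.Duration;
  import java.util.List;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.serialization.StringDeserializer;

  public class GroupConsumer {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed local broker
      props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-app");              // consumers sharing this id split the partitions
      props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // we commit manually below
      props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
      props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
      props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(List.of("transfers"));                             // hypothetical topic
        while (true) {
          ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
          for (ConsumerRecord<String, String> record : records) {
            // Process first, commit after -> at-least-once (duplicates possible after a crash).
            System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                record.partition(), record.offset(), record.key(), record.value());
          }
          consumer.commitSync();
        }
      }
    }
  }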

Replication & Durability

  • Typical replication factor: 3.
  • Ensures high availability and failover.
  • With higher replication:
    • Disk usage grows.
    • Write latency increases (especially with acks=all).

Acknowledgements (acks):

  • acks=0 → Fire-and-forget (fastest, possible data loss).
  • acks=1 → Leader acknowledgement only.
  • acks=all → Leader plus all in-sync replicas acknowledge (safest, slower).
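
A producer configuration sketch tuned for durability: acks=all plus idempotence trades latency for the strongest guarantees the producer can offer; min.insync.replicas is the matching topic/broker setting.

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class DurableProducerConfig {
    public static KafkaProducer<String, String> create() {
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

      props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for the leader and all in-sync replicas
      props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");  // retries do not create duplicates
      // Pair with min.insync.replicas=2 on the topic so acks=all actually requires two copies.
      return new KafkaProducer<>(props);
    }
  }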

Serialization & Schema Registry

Kafka supports multiple serialization formats:

  • Avro (binary, schema-based) – most common.
  • Protobuf, Thrift – alternatives with strong typing.
  • Schema Registry ensures data consistency:
    • Producers register the schema and embed its ID with each message.
    • Consumers fetch schema and deserialize accordingly.

⚠️ Note: Avro has no primitive date type; dates and timestamps are represented with logical types (e.g., date as int, timestamp-millis as long).
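
A producer sketch assuming Confluent's Schema Registry and its KafkaAvroSerializer (the registry URL and the Payment schema are illustrative): the serializer registers the schema and embeds its ID in every message, and consumers resolve that ID back to a schema when deserializing.

  import java.util.Properties;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerConfig;
  import org.apache.kafka.clients.producer.ProducerRecord;
  import org.apache.kafka.common.serialization.StringSerializer;

  public class AvroProducer {
    // Illustrative schema for a payment event.
    private static final String PAYMENT_SCHEMA = """
        {"type": "record", "name": "Payment", "fields": [
          {"name": "customer_id", "type": "string"},
          {"name": "amount", "type": "double"}
        ]}""";

    public static void main(String[] args) throws Exception {
      Properties props = new Properties();
      props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
      props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
      // Confluent's Avro serializer registers the schema and prefixes each value with its ID.
      props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "io.confluent.kafka.serializers.KafkaAvroSerializer");
      props.put("schema.registry.url", "http://localhost:8081");             // assumed registry location

      Schema schema = new Schema.Parser().parse(PAYMENT_SCHEMA);
      GenericRecord payment = new GenericData.Record(schema);
      payment.put("customer_id", "customer-42");
      payment.put("amount", 99.50);

      try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
        producer.send(new ProducerRecord<>("payments", "customer-42", payment)).get();
      }
    }
  }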

ZooKeeper vs KRaft

  • ZooKeeper → Used historically for cluster metadata and coordination.
  • KRaft → Kafka’s new built-in consensus protocol.
    • Kafka 3.x supports both.
    • Kafka 4.0 removes ZooKeeper entirely; KRaft is the only metadata mode.

Kafka Streams

Kafka Streams is a stream processing library that ships with Kafka; it runs inside your application rather than as a separate cluster.

Supports:

  • Transformations (map, filter, groupBy, aggregate)
  • Joins across topics
  • Windowed operations (e.g., hourly counts)
  • Publishing results back to topics
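
A sketch of a windowed aggregation with hypothetical page-views input and page-views-hourly output topics: count events per key in one-hour windows and publish the results back to Kafka.

  import java.time.Duration;
  import java.util.Properties;
  import org.apache.kafka.common.serialization.Serdes;
  import org.apache.kafka.streams.KafkaStreams;
  import org.apache.kafka.streams.StreamsBuilder;
  import org.apache.kafka.streams.StreamsConfig;
  import org.apache.kafka.streams.kstream.Consumed;
  import org.apache.kafka.streams.kstream.Produced;
  import org.apache.kafka.streams.kstream.TimeWindows;

  public class HourlyCounts {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-counts");        // hypothetical app id
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed local broker
      props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.StringSerde.class.getName());
      props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.StringSerde.class.getName());

      StreamsBuilder builder = new StreamsBuilder();
      builder.stream("page-views", Consumed.with(Serdes.String(), Serdes.String())) // hypothetical input
          .groupByKey()
          .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))     // hourly buckets
          .count()
          .toStream((windowedKey, count) ->                                   // flatten the windowed key
              windowedKey.key() + "@" + windowedKey.window().startTime())
          .to("page-views-hourly", Produced.with(Serdes.String(), Serdes.Long())); // hypothetical output

      KafkaStreams streams = new KafkaStreams(builder.build(), props);
      streams.start();
      Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
  }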

Example use cases:

  • Fraud detection
  • Real-time analytics dashboards
  • Event-driven microservices

Rebalancing in Consumer Groups

Rebalancing redistributes partitions when:

  • A consumer joins or leaves a group.
  • New partitions are added.

Steps:

  1. Trigger – Membership or partition changes.
  2. Coordinator – One broker manages partition assignments.
  3. Assignment – Partitions balanced across consumers.
  4. Cooperative Rebalance (Kafka 2.4+, via the CooperativeStickyAssignor) – Minimizes downtime by letting consumers keep unaffected partitions during reassignment.
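
Cooperative rebalancing is opted into on the consumer side through the assignment strategy; a minimal sketch (the group name is illustrative):

  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

  public class CooperativeRebalanceConfig {
    public static Properties consumerProps() {
      Properties props = new Properties();
      props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
      props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-app");             // hypothetical group
      // Cooperative sticky assignment: only the partitions that actually move are revoked,
      // so the rest keep being consumed while the rebalance runs.
      props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
          CooperativeStickyAssignor.class.getName());
      return props;
    }
  }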

Common Mistakes to Avoid

  1. Too Many Partitions – Causes overhead; always benchmark.
  2. Misunderstanding Batching – Producer batching (batch.size, linger.ms) is a throughput optimization, not an atomicity guarantee; use transactions when you need atomic multi-record writes.
  3. Ignoring Schema Evolution – Always manage schema compatibility in production.
  4. Low-Cardinality Keys – Leads to hotspots and uneven distribution.
  5. Expecting SQL-like Ordering – Ordering only within partitions, not across the topic.

Final Thoughts

Apache Kafka is more than a message broker; it is a foundation for event-driven systems.
It powers real-time analytics, streaming ETL, and microservices communication at scale.

To use Kafka effectively:

  • Choose partition keys wisely.
  • Configure replication & acks for your SLA.
  • Use a schema registry for long-term compatibility.
  • Design for consumer groups & rebalancing.

If your applications require real-time, scalable, fault-tolerant event pipelines, Kafka is the platform to bet on.