Chapter 2 of 4

Data Streams: Kafka and Protobuf

Created Apr 28, 2026 Updated May 4, 2026

When data arrives not as one big chunk but as a continuous stream — sensor readings, user events, service logs — you need infrastructure specifically designed for that mode. Kafka and Protobuf are often used together: Kafka provides the durable event log and the routing / fan-out machinery, while Protobuf provides compact, typed, schema-driven messages on top of it. Kafka has become one of the standard engines for streaming data transport in this kind of architecture, and Protobuf is one of the common formats for serializing the structured messages that travel over it (alongside Avro, JSON Schema, plain JSON, and raw bytes, depending on the team).

Kafka

Kafka is a distributed streaming message platform developed at LinkedIn in 2011 and later donated to the Apache Foundation. Architecturally, it is an append-only distributed log: message producers (producers) append events to the end of the log, and consumers (consumers) can read the log from any position at any time. This is fundamentally different from traditional message queues like RabbitMQ, where a message is usually removed after it is acknowledged by a consumer — in Kafka, messages live on disk until the retention policy expires, and several independent consumers can read the same stream from different positions.

Architectural components

Broker — a Kafka server that serves part of the topics. A cluster usually consists of at least 3 brokers.
Topic — a named message channel (for example, user_events or orders).
Partition — a topic is split into N partitions for parallelism. Each partition is a separate ordered log on disk.
Offset — the sequential number of a message within a partition. Consumers track their position by committing offsets, usually to Kafka's internal __consumer_offsets topic.
Replication Factor — how many copies of a partition to keep on different brokers (usually 3). One broker is the leader, the others are followers.
Consumer Group — a group of N consumers that share partitions among themselves. Exactly one consumer in the group reads a given partition — this guarantees processing order within a partition.

Topics and partitions

The fundamental unit of data organization is the topic, a named message channel. Each topic is physically split into several partitions, and that split is the main mechanism for parallelism in Kafka. Each partition is a separate ordered log on disk, and message ordering is guaranteed only within a single partition, not across partitions.

If you write to a topic with ten partitions, Kafka distributes messages between them either round-robin or by key. Messages with the same key end up in the same partition — as long as the partition count and the partitioning strategy remain unchanged. This matters when you need ordering for a specific entity: for example, all events of a single user or all readings of a single sensor must be processed in the order they arrive. (If you later expand the number of partitions, the key-to-partition mapping changes, which is one reason teams plan partition count carefully and rarely change it once the topic is in production.)

Offset and storing position

A consumer's position in a partition is called the offset — it is simply the sequential number of a message. The consumer controls its position by committing offsets, most commonly into Kafka's internal __consumer_offsets topic, although an application can also store offsets externally if it needs to coordinate them with its own database (for transactional consume-then-write patterns). Either way, the consumer controls the position; Kafka does not remove messages just because one consumer has read them. This is fundamentally different from systems where the broker manages consumer state.

This approach gives an important property: a consumer can at any moment "rewind" and reread old messages if needed — for example, when rolling out a new version of processing logic that has to recompute past events.

Replication and leaders

A Kafka cluster consists of several servers — brokers (usually at least three for production). Each partition is replicated across several brokers according to the replication factor parameter (the standard is 3). One of the brokers becomes the leader for the partition and accepts all writes and reads; the others are followers that asynchronously copy data from the leader. If the leader fails, Kafka automatically elects a new one from those followers that managed to sync.

Consumer Groups and read scaling

When several consumers join a consumer group, Kafka distributes the partitions between them so that each partition is processed by exactly one consumer in the group. This gives horizontal scaling of reads: if you have a topic with 10 partitions and 5 consumers in the group, each one will get 2 partitions. Adding a 6th consumer triggers a rebalance — a redistribution of partitions. If there are more consumers than partitions, the extras will sit idle — therefore, the number of partitions defines the maximum read parallelism.

Write acknowledgment levels (acks)

A critically important parameter on writes is the acknowledgment level. It has three values:

acks=0 — the producer sends a message and does not wait for any acknowledgment (fire-and-forget). The fastest mode, but if the broker fails the message is lost.
acks=1 — the leader acknowledged the write. The message will not be lost if a follower fails, but it can be lost if the leader fails before replicating the data.
acks=all — the leader waits until the write is acknowledged by enough in-sync replicas to satisfy min.insync.replicas. Together with a replication factor greater than 1 and a sensible min.insync.replicas setting, this gives strong durability against broker failure, at the cost of higher write latency. The exact guarantee depends on all three settings together: acks=all alone with min.insync.replicas=1 is weaker than the same setting with min.insync.replicas=2.

Retention and log compaction

The lifetime of messages in a topic is governed by the retention policy. Usually, messages are kept either for a configured time (retention.ms, often 7 days by default), or until the partition size exceeds a limit (retention.bytes), after which the oldest segments are deleted.

An alternative is log compaction: instead of deleting old messages by age or size, Kafka keeps only the latest message per key. This makes the topic behave like a durable changelog for the latest state of each key — useful for state reconstruction, materialized views, and changelog topics in stream-processing frameworks. It is not the same as event sourcing in its strict sense, which requires preserving the full event history; log compaction may delete older versions of the same key over time.

Delivery guarantees

In the usual configuration, Kafka applications are built with at-least-once semantics: after a message has been durably written to Kafka, consumers can retry processing it, so data is usually not lost. Duplicates are possible if processing succeeds but the offset commit fails — on the next consumer run the same message is re-read and re-processed. At-most-once and exactly-once patterns are also possible, depending on when offsets are committed and whether transactions are used.

Exactly-once semantics are achievable for Kafka-to-Kafka workflows using the idempotent producer (deduplication by producer ID and sequence number, introduced together with Kafka transactions in 0.11) plus Kafka transactions, which allow atomic writes across multiple partitions. Transactions reduce throughput somewhat, which is the usual cost.

The important caveat is that exactly-once in this Kafka-native sense covers only the Kafka boundary. When the consumer writes to an external system — a database, a warehouse, an HTTP API — the application still needs idempotent writes (upserts on a deterministic key, request-deduplication tokens) or its own transactional strategy on the destination side. Kafka transactions don't extend across the network into other systems on their own.

Typical scenarios

The classic niche for Kafka is collecting a large volume of events from many sources and dispatching those events to several consumers. Typical scenarios:

User events for analytics. The same stream is read at once by several systems: a real-time dashboard, a batch pipeline into the warehouse, a fraud detector, recommender retraining. Partitioning by user_id preserves the order of one user's actions, while different users are processed in parallel.
Transactions and orders in e-commerce — an orders stream partitioned by customer_id or region_id.
IoT and device telemetry — a stream of readings from thousands of devices, partitioning by device_id preserves order within a single device.
Application logs from many microservices — Kafka as the central bus between producers (services) and consumers (ELK, long-term storage, alerting).

In all of these scenarios, Kafka's key property is partitioning by key, which preserves order within a "logical entity" (user, order, device) while parallelizing processing across different entities.

Alternatives

RabbitMQ — a classic message broker with queues, optimized for point-to-point delivery rather than streaming. Suitable for tasks where you need flexible routing logic (exchange types, bindings) and a small message volume.
AWS Kinesis — a managed Kafka-like service from Amazon. More expensive, but with no operations burden.
Google Pub/Sub — serverless, with autoscaling and no broker configuration.
Redis Streams — a lightweight alternative for small streams, when Kafka is overkill.

Protobuf (Protocol Buffers)

Protobuf is a binary serialization format plus a schema description language, developed by Google and open-sourced in 2008. Today its main use is serializing messages in gRPC and in streaming systems like Kafka, where compactness and parsing speed matter. When you push millions of messages per second, the difference between parsing JSON and parsing Protobuf is measured in real money for CPU.

Workflow

Working with Protobuf has three stages: describing the schema in a .proto file, generating classes for the language you need, and using those classes in your code.

First you write the schema:

// 1. Schema description in order_event.proto
syntax = "proto3";

message OrderEvent {
  string order_id = 1;   // field = tag number (NOT index)
  int64 timestamp = 2;
  double amount = 3;
  repeated string item_ids = 4;  // array
}

The key point here is the numbers = 1, = 2, etc. These are not field indices, they are tag numbers — they identify the field in the wire format and must under no circumstances be changed, otherwise compatibility breaks.

Next, classes for the desired language are generated from the schema:

# 2. Generate classes for the language
protoc --python_out=. order_event.proto
# → creates order_event_pb2.py with the OrderEvent class

The generated class is then used as an ordinary language object, with serialization and deserialization methods:

# 3. Usage
event = OrderEvent(order_id="ord-abc", timestamp=1234567890, amount=42.5)
binary_data = event.SerializeToString()  # bytes for Kafka
parsed = OrderEvent.FromString(binary_data)

Wire format — why it's compact

Protobuf's compactness on the wire is explained by several design decisions:

Each field in a serialized message is encoded as a triple tag_number + wire_type + value.
Numeric types use varint encoding: small numbers take one byte, large ones up to ten. The value 42 fits into a single byte, 1_000_000_000 fits into five.
Missing fields are not serialized at all, unlike JSON where you have to write "field": null and waste space on both the field name and the value.
Field names are entirely absent in the binary representation — only tag numbers are present. This is what produces the main space savings.

An additional speed gain comes from the absence of lexical parsing: in JSON the parser walks the text character by character, identifying tokens, while in Protobuf each tag immediately says "an int32 will be here" or "a string will be here".

Protobuf vs JSON

Compared to JSON, Protobuf gives you:

Compactness 3–10× better (no field names, varint instead of decimal strings).
Parsing speed several times higher (no need to lex text).
Strict typing — producers and consumers use generated classes instead of ad-hoc dictionaries; many mistakes are caught earlier than with raw JSON. Schema-compatibility issues against already-written messages still need a registry with compatibility checks to be caught reliably.
Schema evolution — adding new fields is backward compatible: old clients simply ignore unknown tag numbers.

Schema evolution rules

When evolving a schema you have to follow a few rules so as not to break compatibility with old clients or with old already-written messages:

✅ Add new optional fields with new tag numbers — safe; old clients just don't see them.
✅ Removing fields is allowed, but the tag number of a removed field must under no circumstances be reused in the future. Otherwise old messages written with that tag will be misinterpreted.
❌ Changing the type of an existing field — generally unsafe and should be avoided unless you understand the Protobuf wire-compatibility rules very precisely. A few specific type changes are wire-compatible (some integer-width changes among signed/unsigned varint types, for example), but most are not, and getting it wrong silently breaks deserialization on old data.
❌ Changing a tag number — not allowed, breaks compatibility entirely.

Typical scenarios

Protobuf is a good fit whenever services need to exchange structured messages at high volume with a schema that is maintained separately from the data itself. Typical scenarios:

Kafka messages — common pairing in data engineering. Protobuf is often used instead of JSON as the message format: same producer/consumer flow, but 3–10× smaller payloads and several times faster deserialization.
Internal gRPC APIs — service-to-service calls where both sides are under your control and you want compile-time type safety across languages.
Event sourcing stores — persisting domain events that must be read back reliably years later; the schema registry enforces that old events remain decodable after schema evolution.
Mobile → backend pipelines — client SDKs emit compact binary payloads over slow or metered connections; Protobuf cuts bandwidth and parse time on both ends.

In all of these, the common thread is a stable schema shared between known parties, high message volume or tight latency requirements, and the ability to maintain a schema registry.

Alternatives

Among alternative binary formats, the main competitors are Apache Avro and Apache Thrift.

Apache Avro. Avro object container files embed the schema with the data, which makes the resulting datasets self-describing — useful for batch storage on HDFS, S3, or other big-data sinks. In Kafka, however, Avro is most often used together with a schema registry (Confluent Schema Registry being the canonical example) rather than embedding the full schema in every message; the message carries a small schema ID and the registry handles the rest. Protobuf and Avro are both viable Kafka-message formats, with similar registry patterns.
Apache Thrift (Facebook) was once popular in the RPC world, but today it is losing to the gRPC + Protobuf combination.
MessagePack — a binary "JSON-like" format without a schema. More compact than JSON, but weaker than Protobuf in compression and safety (no compile-time checks).
FlatBuffers (also Google) — optimized for the case where you need to read individual fields without fully deserializing the entire message (games, mobile).

Kafka + Protobuf together

Kafka moves messages: it handles routing, ordering, retention, and fan-out to many consumers. It is format-agnostic — you can push raw bytes, JSON, Avro, or Protobuf. Protobuf handles what Kafka doesn't: it defines what those bytes mean, enforces a schema at the producer and consumer sides, and reduces network and storage overhead, improving overall throughput efficiency.

In a typical production setup:

A .proto file defines the event schema (e.g., UserEvent, OrderPlaced).
The producer serializes each event with Protobuf and writes the binary payload to a Kafka topic.
Multiple consumer groups read from that topic independently — each one deserializes with the same generated class.
A schema registry (e.g., Confluent Schema Registry) stores .proto versions and ensures producers and consumers stay compatible as the schema evolves.

The combination gives you high-throughput, ordered delivery (Kafka) with compact, typed, schema-versioned messages (Protobuf) — which is why it has become a very common stack in event-driven systems.