
Building a Multi-Client CDC Data Pipeline with Kafka and Flink

How I designed a real-time ETL pipeline using Debezium, Kafka, Flink, MongoDB, and ClickHouse.

  • Kafka
  • Flink
  • ClickHouse
  • Microservices

The Problem

We had multiple clients, each with its own database.

We needed:

  • Real-time analytics
  • Data isolation per client
  • No heavy queries hitting transactional databases
  • Near real-time updates for dashboards

The challenge wasn’t just moving data.

It was doing it safely, at scale, without breaking production systems.


The Architecture

The flow looked like this:

Client Databases
      ↓
Debezium (CDC)
      ↓
Kafka
      ↓
Flink (Transformation Layer)
      ↓
MongoDB (Raw)   +   ClickHouse (Analytics)

Each layer had a clear responsibility.


Step 1: CDC with Debezium

Instead of modifying application code to emit events,
we used Change Data Capture (CDC).

Debezium tails the database's transaction log (the write-ahead log in PostgreSQL, the binlog in MySQL) and publishes each row change into Kafka.

Benefits:

  • No extra logic in services
  • Guaranteed alignment with database state
  • Easy to add new downstream consumers

But CDC emits row-level changes, not business meaning.

That responsibility shifts to the processing layer.
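
To make this concrete, here is a minimal sketch of a connector config, assuming PostgreSQL sources and one connector per client database. Every value is a placeholder, and exact property names (topic.prefix versus the older database.server.name) depend on the Debezium version:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a Debezium connector config for one client's database.
// All values are placeholders; in practice a map like this is sent as
// JSON to Kafka Connect's REST API (POST /connectors), once per client.
public class ConnectorConfigSketch {

    static Map<String, String> debeziumConfigFor(String clientId) {
        Map<String, String> config = new HashMap<>();
        config.put("connector.class",
                   "io.debezium.connector.postgresql.PostgresConnector");
        config.put("plugin.name", "pgoutput");                       // logical decoding plugin
        config.put("database.hostname", clientId + "-db.internal");  // placeholder host
        config.put("database.port", "5432");
        config.put("database.user", "debezium");
        config.put("database.password", "***");
        config.put("database.dbname", "app");
        config.put("topic.prefix", "cdc." + clientId);               // namespaces topics per client
        config.put("table.include.list", "public.users,public.orders");
        return config;
    }
}
```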


Step 2: Kafka as the Backbone

Kafka became our central event log.

Important design decisions:

  • Topics named by domain (e.g. user.updated, order.created)
  • Messages include client_id
  • Partition strategy carefully chosen to balance ordering and distribution

In multi-client systems, partitioning is critical.
Choose keys too naively, and one noisy client can affect everyone.
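
One way to strike that balance, sketched below with illustrative names: use client_id plus an entity id as the record key, so per-entity ordering holds while a busy client's traffic still spreads across partitions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Composite key: ordering is guaranteed per entity, but a single
            // noisy client is spread across partitions instead of
            // saturating one.
            String key = "acme" + ":" + "order-42";
            producer.send(new ProducerRecord<>("order.created", key,
                    "{\"client_id\":\"acme\",\"order_id\":\"order-42\"}"));
        }
    }
}
```

Kafka's default partitioner hashes the record key, so records sharing a composite key always land on the same partition.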


Step 3: Flink as the Brain

Raw CDC events are messy.

Flink handled:

  • Filtering
  • Data normalization
  • Deduplication
  • Client-level enrichment
  • Aggregation (when needed)

We treated Kafka as the event log,
and Flink as the transformation engine.

Flink also gave us:

  • Stateful processing
  • Checkpointing
  • Controlled parallelism
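
A rough sketch of the deduplication piece, with an assumed event shape (Flink's state APIs also shift slightly between versions): key the stream by client, then use keyed state to drop repeats.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical parsed CDC event: client tag plus a unique event id
// (e.g. derived from Debezium's transaction metadata).
class ChangeEvent {
    public String clientId;
    public String eventId;
    public String payload;
}

public class DedupSketch {

    public static DataStream<ChangeEvent> deduplicate(
            StreamExecutionEnvironment env, DataStream<ChangeEvent> events) {
        env.enableCheckpointing(60_000); // keyed state survives restarts

        return events
            // Keying by client keeps every client's state isolated.
            .keyBy(e -> e.clientId)
            .process(new KeyedProcessFunction<String, ChangeEvent, ChangeEvent>() {
                private transient ValueState<String> lastSeen;

                @Override
                public void open(Configuration parameters) {
                    lastSeen = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("lastSeenEventId", String.class));
                }

                @Override
                public void processElement(ChangeEvent event, Context ctx,
                                           Collector<ChangeEvent> out) throws Exception {
                    // Simplified: drops an event only if it repeats the
                    // previous one for this client. Real deduplication would
                    // track a TTL'd set of seen ids per entity.
                    if (!event.eventId.equals(lastSeen.value())) {
                        lastSeen.update(event.eventId);
                        out.collect(event);
                    }
                }
            });
    }
}
```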


Dual Storage Strategy

We didn’t insert directly into ClickHouse.

Instead, we split storage into two layers:

MongoDB (Raw Layer)

  • Stores near-original events
  • Useful for replay
  • Helpful for debugging
  • Flexible schema

ClickHouse (Analytics Layer)

  • Pre-transformed data
  • Columnar storage
  • Optimized for dashboards
  • High ingestion throughput

This separation made the system more resilient.

If transformation logic changed,
we could replay from raw data safely.
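
In Flink terms, the split looked roughly like the sketch below. The sink classes are hypothetical stand-ins for the real MongoDB and ClickHouse connectors, and the mapping is deliberately simplified:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Minimal stand-ins for this sketch.
class RawEvent {
    public String clientId;
    public String json;
}

class AnalyticsRow {
    public String clientId;
    public String metric;
    public long value;
}

public class DualStorageSketch {

    // Both sinks are placeholders for the real connector sinks
    // (a MongoDB sink for raw events, a batching ClickHouse sink).
    public static void wire(DataStream<RawEvent> events,
                            SinkFunction<RawEvent> mongoRawSink,
                            SinkFunction<AnalyticsRow> clickHouseSink) {
        // Raw layer: near-original events, kept for replay and debugging.
        events.addSink(mongoRawSink);

        // Analytics layer: the same stream, flattened into the shape
        // the dashboards query.
        events.map(new MapFunction<RawEvent, AnalyticsRow>() {
            @Override
            public AnalyticsRow map(RawEvent e) {
                AnalyticsRow row = new AnalyticsRow();
                row.clientId = e.clientId;
                row.metric = "orders"; // simplified placeholder mapping
                row.value = 1L;
                return row;
            }
        }).addSink(clickHouseSink);
    }
}
```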


Multi-Client Isolation

Every event carried a client_id.

We enforced isolation by:

  • Including client_id in partition strategy
  • Keying Flink streams by client
  • Filtering queries per client in analytics

Without strict tagging,
multi-client pipelines become dangerous quickly.
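
One habit that reinforces this, sketched here with assumed field names: validate the tag at the edge of the pipeline and fail loudly, instead of patching untagged events downstream.

```java
import java.util.Map;

// Hypothetical guard at the pipeline's edge. The event shape (a parsed
// map with a client_id field) is an assumption for this sketch.
public class TaggingGuard {

    public static String requireClientId(Map<String, Object> event) {
        Object clientId = event.get("client_id");
        if (clientId == null || clientId.toString().isEmpty()) {
            // Fail loudly: an untagged event must never flow downstream.
            throw new IllegalArgumentException("event missing client_id: " + event);
        }
        return clientId.toString();
    }
}
```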


Failure Scenarios We Designed For

In distributed systems, failure is normal.

We assumed:

  • Connectors would restart
  • Consumers would crash
  • Events would be retried
  • Network partitions would happen

So we built:

  • Idempotent writes
  • Deduplication logic
  • Monitoring for consumer lag
  • Clear separation between raw and transformed data

Exactly-once is a goal.
At-least-once is reality.

Design accordingly.
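
As one concrete shape idempotent writes can take (the database, collection, and id format here are assumptions): key each raw document by a unique event id, so a retried delivery replaces the document instead of duplicating it.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

public class IdempotentWriteSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // placeholder URI
            MongoCollection<Document> raw =
                client.getDatabase("raw_events").getCollection("changes");

            Document event = new Document("_id", "acme:order-42:lsn-1234") // unique event id as the key
                .append("client_id", "acme")
                .append("payload", "{\"status\":\"created\"}");

            // replaceOne with upsert: the first delivery inserts, any retry
            // overwrites the same document. At-least-once delivery then
            // leaves no duplicates in the raw layer.
            raw.replaceOne(Filters.eq("_id", event.get("_id")), event,
                           new ReplaceOptions().upsert(true));
        }
    }
}
```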


Lessons Learned

  • CDC simplifies event generation but complicates semantics.
  • Raw data storage is critical for resilience.
  • Analytics databases must be modeled differently from transactional ones.
  • Multi-client systems require strict isolation discipline.
  • Observability is not optional in streaming systems.

Final Thoughts

This pipeline wasn’t just about moving data.

It was about designing a system that:

  • Doesn’t overload transactional databases
  • Scales with new clients
  • Handles failures gracefully
  • Supports real-time analytics
  • Remains debuggable under pressure

Event-driven architecture adds complexity.

But when done correctly,
it gives you scalability, isolation, and resilience by design.