
Building a Multi-Client CDC Data Pipeline with Kafka and Flink

How I designed a real-time ETL pipeline using Debezium, Kafka, Flink, MongoDB, and ClickHouse.

  • Kafka
  • Flink
  • ClickHouse
  • Microservices

The Problem

We had multiple clients, each with its own database.

We needed:

  • Real-time analytics
  • Data isolation per client
  • No heavy queries hitting transactional databases
  • Near real-time updates for dashboards

The challenge wasn’t just moving data.

It was doing it safely, at scale, without breaking production systems.


The Architecture

The flow looked like this:

Client Databases
      ↓
Debezium (CDC)
      ↓
Kafka
      ↓
Flink (Transformation Layer)
      ↓
MongoDB (Raw)   +   ClickHouse (Analytics)

Each layer had a clear responsibility.


Step 1: CDC with Debezium

Instead of modifying application code to emit events,
we used Change Data Capture (CDC).

Debezium tails the database's transaction log (the write-ahead log in PostgreSQL, the binlog in MySQL) and publishes each row change into Kafka.

Benefits:

  • No extra logic in services
  • Guaranteed alignment with database state
  • Easy to add new downstream consumers

But CDC emits row-level changes, not business meaning.

That responsibility shifts to the processing layer.
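
To make this concrete, here is a minimal sketch of a connector config, assuming PostgreSQL sources and one connector per client database. Every value is a placeholder, and exact property names (topic.prefix versus the older database.server.name) depend on the Debezium version:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a Debezium connector config for one client's database.
// All values are placeholders; in practice a map like this is sent as
// JSON to Kafka Connect's REST API (POST /connectors), once per client.
public class ConnectorConfigSketch {

    static Map<String, String> debeziumConfigFor(String clientId) {
        Map<String, String> config = new HashMap<>();
        config.put("connector.class",
                   "io.debezium.connector.postgresql.PostgresConnector");
        config.put("plugin.name", "pgoutput");                       // logical decoding plugin
        config.put("database.hostname", clientId + "-db.internal");  // placeholder host
        config.put("database.port", "5432");
        config.put("database.user", "debezium");
        config.put("database.password", "***");
        config.put("database.dbname", "app");
        config.put("topic.prefix", "cdc." + clientId);               // namespaces topics per client
        config.put("table.include.list", "public.users,public.orders");
        return config;
    }
}
```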


Step 2: Kafka as the Backbone

Kafka became our central event log.

Important design decisions:

  • Topics named by domain (e.g. user.updated, order.created)
  • Messages include client_id
  • Partition strategy carefully chosen to balance ordering and distribution

In multi-client systems, partitioning is critical.
Choose keys too naively, and one noisy client can affect everyone.
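
One way to strike that balance, sketched below with illustrative names: use client_id plus an entity id as the record key, so per-entity ordering holds while a busy client's traffic still spreads across partitions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Composite key: ordering is guaranteed per entity, but a single
            // noisy client is spread across partitions instead of
            // saturating one.
            String key = "acme" + ":" + "order-42";
            producer.send(new ProducerRecord<>("order.created", key,
                    "{\"client_id\":\"acme\",\"order_id\":\"order-42\"}"));
        }
    }
}
```

Kafka's default partitioner hashes the record key, so records sharing a composite key always land on the same partition.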


Step 3: Flink as the Brain

Raw CDC events are messy.

Flink handled:

  • Filtering
  • Data normalization
  • Deduplication
  • Client-level enrichment
  • Aggregation (when needed)

We treated Kafka as the event log,
and Flink as the transformation engine.

Flink also gave us:

  • Stateful processing
  • Checkpointing
  • Controlled parallelism
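
A rough sketch of the deduplication piece, with an assumed event shape (Flink's state APIs also shift slightly between versions): key the stream by client, then use keyed state to drop repeats.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical parsed CDC event: client tag plus a unique event id
// (e.g. derived from Debezium's transaction metadata).
class ChangeEvent {
    public String clientId;
    public String eventId;
    public String payload;
}

public class DedupSketch {

    public static DataStream<ChangeEvent> deduplicate(
            StreamExecutionEnvironment env, DataStream<ChangeEvent> events) {
        env.enableCheckpointing(60_000); // keyed state survives restarts

        return events
            // Keying by client keeps every client's state isolated.
            .keyBy(e -> e.clientId)
            .process(new KeyedProcessFunction<String, ChangeEvent, ChangeEvent>() {
                private transient ValueState<String> lastSeen;

                @Override
                public void open(Configuration parameters) {
                    lastSeen = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("lastSeenEventId", String.class));
                }

                @Override
                public void processElement(ChangeEvent event, Context ctx,
                                           Collector<ChangeEvent> out) throws Exception {
                    // Simplified: drops an event only if it repeats the
                    // previous one for this client. Real deduplication would
                    // track a TTL'd set of seen ids per entity.
                    if (!event.eventId.equals(lastSeen.value())) {
                        lastSeen.update(event.eventId);
                        out.collect(event);
                    }
                }
            });
    }
}
```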


Dual Storage Strategy

We didn’t insert directly into ClickHouse.

Instead, we split storage into two layers:

MongoDB (Raw Layer)

  • Stores near-original events
  • Useful for replay
  • Helpful for debugging
  • Flexible schema

ClickHouse (Analytics Layer)

  • Pre-transformed data
  • Columnar storage
  • Optimized for dashboards
  • High ingestion throughput

This separation made the system more resilient.

If transformation logic changed,
we could replay from raw data safely.
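
In Flink terms, the split looked roughly like the sketch below. The sink classes are hypothetical stand-ins for the real MongoDB and ClickHouse connectors, and the mapping is deliberately simplified:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Minimal stand-ins for this sketch.
class RawEvent {
    public String clientId;
    public String json;
}

class AnalyticsRow {
    public String clientId;
    public String metric;
    public long value;
}

public class DualStorageSketch {

    // Both sinks are placeholders for the real connector sinks
    // (a MongoDB sink for raw events, a batching ClickHouse sink).
    public static void wire(DataStream<RawEvent> events,
                            SinkFunction<RawEvent> mongoRawSink,
                            SinkFunction<AnalyticsRow> clickHouseSink) {
        // Raw layer: near-original events, kept for replay and debugging.
        events.addSink(mongoRawSink);

        // Analytics layer: the same stream, flattened into the shape
        // the dashboards query.
        events.map(new MapFunction<RawEvent, AnalyticsRow>() {
            @Override
            public AnalyticsRow map(RawEvent e) {
                AnalyticsRow row = new AnalyticsRow();
                row.clientId = e.clientId;
                row.metric = "orders"; // simplified placeholder mapping
                row.value = 1L;
                return row;
            }
        }).addSink(clickHouseSink);
    }
}
```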


Multi-Client Isolation

Every event carried a client_id.

We enforced isolation by:

  • Including client_id in partition strategy
  • Keying Flink streams by client
  • Filtering queries per client in analytics

Without strict tagging,
multi-client pipelines become dangerous quickly.
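
One habit that reinforces this, sketched here with assumed field names: validate the tag at the edge of the pipeline and fail loudly, instead of patching untagged events downstream.

```java
import java.util.Map;

// Hypothetical guard at the pipeline's edge. The event shape (a parsed
// map with a client_id field) is an assumption for this sketch.
public class TaggingGuard {

    public static String requireClientId(Map<String, Object> event) {
        Object clientId = event.get("client_id");
        if (clientId == null || clientId.toString().isEmpty()) {
            // Fail loudly: an untagged event must never flow downstream.
            throw new IllegalArgumentException("event missing client_id: " + event);
        }
        return clientId.toString();
    }
}
```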


Failure Scenarios We Designed For

In distributed systems, failure is normal.

We assumed:

  • Connectors would restart
  • Consumers would crash
  • Events would be retried
  • Network partitions would happen

So we built:

  • Idempotent writes
  • Deduplication logic
  • Monitoring for consumer lag
  • Clear separation between raw and transformed data

Exactly-once is a goal.
At-least-once is reality.

Design accordingly.
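
As one concrete shape idempotent writes can take (the database, collection, and id format here are assumptions): key each raw document by a unique event id, so a retried delivery replaces the document instead of duplicating it.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

public class IdempotentWriteSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // placeholder URI
            MongoCollection<Document> raw =
                client.getDatabase("raw_events").getCollection("changes");

            Document event = new Document("_id", "acme:order-42:lsn-1234") // unique event id as the key
                .append("client_id", "acme")
                .append("payload", "{\"status\":\"created\"}");

            // replaceOne with upsert: the first delivery inserts, any retry
            // overwrites the same document. At-least-once delivery then
            // leaves no duplicates in the raw layer.
            raw.replaceOne(Filters.eq("_id", event.get("_id")), event,
                           new ReplaceOptions().upsert(true));
        }
    }
}
```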


Lessons Learned

  • CDC simplifies event generation but complicates semantics.
  • Raw data storage is critical for resilience.
  • Analytics databases must be modeled differently from transactional ones.
  • Multi-client systems require strict isolation discipline.
  • Observability is not optional in streaming systems.

Final Thoughts

This pipeline wasn’t just about moving data.

It was about designing a system that:

  • Doesn’t overload transactional databases
  • Scales with new clients
  • Handles failures gracefully
  • Supports real-time analytics
  • Remains debuggable under pressure

Event-driven architecture adds complexity.

But when done correctly,
it gives you scalability, isolation, and resilience by design.