2025-12-02

Building Our Real-Time Data Platform: A Retrospective from the Trenches

How we built CDC, clickstream, and S3/Lambda pipelines into a unified real-time analytics platform.

When I look back at the last year, I don’t remember a single “big bang moment” where everything came together. What I remember instead is a long sequence of small, deliberate engineering decisions — each one solving a specific problem — that eventually turned into a real-time data platform running at scale at Allen Digital.

This wasn’t the hardest architecture I’ve built, but it was easily the broadest. It touched everything: CDC from transactional databases, clickstream ingestion from apps and the web, third-party integrations landing in S3, schema evolution, data quality, materialized views, batch processing, observability, and the one thing that every real-time system eventually teaches you: boring engineering wins.

I didn’t build this alone — we were a small team — but I drove the platform end to end. Some nights everything was on fire. Some days nothing worked. And eventually, people on the floor started recognizing the platform without me having to explain it. That’s when I knew we’d built something real.

This is the retrospective I wish I had read before starting.


The Three Pipelines That Became the Backbone

Early on, I realized that trying to force every data source into one ingestion pipeline would be a mistake. Different sources behave differently — they fail differently, scale differently, and need different guarantees.

Instead, we built three pipelines. Not thirty. Not one pipeline pretending to be thirty. Just three, each optimized for its job:

  1. CDC Pipeline for operational databases

  2. Clickstream Pipeline for high-volume user events

  3. S3 → Lambda Pipeline for third-party systems

Everything else in the platform emerged from these three foundational lanes.


1. CDC Pipeline — Where It All Started

This was the first piece we built — the part that convinced people the platform was real.

The Flow

Databases (Mongo, MySQL, PostgreSQL, MS-SQL)
→ Kafka Connect Source Connector
→ Kafka Topic
→ Modified Kafka Connect Sink Connector
→ ClickHouse Raw Table (MergeTree)
→ Materialized View
→ Final Queryable Table (ReplacingMergeTree)
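
To make the Raw → MV → Final shape concrete, here is a minimal ClickHouse sketch. Every table, column, and name in it is illustrative rather than our production DDL, and the raw layout in particular depends on what your sink connector actually writes.

    -- Raw landing table: append-only, written by the sink connector.
    CREATE TABLE orders_raw
    (
        payload     String,                   -- CDC event as JSON
        op          LowCardinality(String),   -- c / u / d
        source_ts   DateTime64(3),
        ingested_at DateTime DEFAULT now()
    )
    ENGINE = MergeTree
    ORDER BY source_ts;

    -- Final queryable table: one logical row per key, deduplicated
    -- asynchronously by ReplacingMergeTree using updated_at as the version.
    CREATE TABLE orders
    (
        order_id   UInt64,
        status     LowCardinality(String),
        amount     Decimal(18, 2),
        updated_at DateTime64(3)
    )
    ENGINE = ReplacingMergeTree(updated_at)
    ORDER BY order_id;

    -- Materialized view: extracts typed columns from the raw payload
    -- and feeds the final table on every insert into orders_raw.
    CREATE MATERIALIZED VIEW orders_mv TO orders AS
    SELECT
        JSONExtractUInt(payload, 'order_id')                 AS order_id,
        JSONExtractString(payload, 'status')                 AS status,
        toDecimal64(JSONExtractFloat(payload, 'amount'), 2)  AS amount,
        source_ts                                            AS updated_at
    FROM orders_raw;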

What Worked

  • Log-based CDC meant zero load on production databases.

  • Kafka gave us replayability, which saved us more times than I can count.

  • Raw → MV → Final table gave us a clean separation between ingestion and consumption.

What Surprised Me

ClickHouse’s ReplacingMergeTree is powerful, but you have to respect what it actually gives you:
eventual upsert semantics.
Merges happen asynchronously.
Duplicates appear until they don’t.

In the early months, this confused every analytics person who touched the system.

Eventually I learned the mantra:

Use FINAL when correctness matters, use argMax when performance matters.
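
Against the illustrative orders table from the earlier sketch, the two query shapes look like this:

    -- Correctness first: FINAL deduplicates at query time, at a cost.
    SELECT order_id, status
    FROM orders FINAL
    WHERE order_id = 42;

    -- Performance first: pick the latest version per key with argMax.
    SELECT
        order_id,
        argMax(status, updated_at) AS status
    FROM orders
    WHERE order_id = 42
    GROUP BY order_id;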

Once we internalized that, the pipeline became stable — boring even.
And boring is exactly what a CDC pipeline should be.


2. Clickstream Pipeline — Our Highest Throughput System

Next came clickstream ingestion. This pipeline eventually carried more events per second than all other pipelines combined.

The Flow

Web/App Clients
→ Pixel Service (Go + Kratos)
→ Kafka Topic
→ Modified Kafka Connect Sink
→ Raw Table
→ Materialized View
→ Final Table (ReplacingMergeTree)

Why This Pipeline Felt Different

Clickstream data is chaotic.
Schemas evolve. New properties show up unannounced. Device metadata is noisy.
And our events had hundreds of possible fields across different product surfaces.

If we had tried to model all of them statically on day one, we would have failed.

Instead, we treated raw events as JSON and let the MV extract and type fields into a flat, queryable table. This gave us the flexibility of a schemaless system with the performance of a columnar database.
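
The pattern looks roughly like the sketch below. The field names are made up and the real events carried far more properties, but the shape is the same: a thin raw table, a typed final table, and an MV in between.

    -- Raw events land as opaque JSON.
    CREATE TABLE clickstream_raw
    (
        event       String,                 -- full event payload as JSON
        received_at DateTime DEFAULT now()
    )
    ENGINE = MergeTree
    ORDER BY received_at;

    -- Flat, typed table that analysts actually query. The ORDER BY here
    -- is the funnel-friendly layout described in the next section.
    CREATE TABLE clickstream_events
    (
        event_name LowCardinality(String),
        event_date Date,
        device_id  String,
        pathname   String,
        event_ts   DateTime64(3),
        properties String                   -- everything we did not model, kept as JSON
    )
    ENGINE = ReplacingMergeTree(event_ts)
    ORDER BY (event_name, event_date, device_id, pathname, event_ts);

    -- The MV promotes only the fields we actually query into typed columns.
    CREATE MATERIALIZED VIEW clickstream_mv TO clickstream_events AS
    SELECT
        JSONExtractString(event, 'name')                                   AS event_name,
        toDate(parseDateTime64BestEffort(JSONExtractString(event, 'ts')))  AS event_date,
        JSONExtractString(event, 'device_id')                              AS device_id,
        JSONExtractString(event, 'pathname')                               AS pathname,
        parseDateTime64BestEffort(JSONExtractString(event, 'ts'))          AS event_ts,
        JSONExtractRaw(event, 'properties')                                AS properties
    FROM clickstream_raw;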

The Hard Part

Funnels.

Funnels are where every database cries.

We optimized ORDER BY around event name, date, device info, pathname, and event timestamp. Funnel queries that used to take minutes dropped by an order of magnitude.
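
For a sense of what those queries look like, here is a hypothetical three-step funnel using ClickHouse’s windowFunnel over a table laid out like the earlier clickstream sketch. The step names and the 30-minute window are invented for illustration.

    SELECT level, count() AS devices
    FROM
    (
        SELECT
            device_id,
            windowFunnel(1800)(
                toDateTime(event_ts),
                event_name = 'view_course',
                event_name = 'start_checkout',
                event_name = 'payment_success'
            ) AS level
        FROM clickstream_events
        WHERE event_date >= today() - 7
          AND event_name IN ('view_course', 'start_checkout', 'payment_success')
        GROUP BY device_id
    )
    GROUP BY level
    ORDER BY level;

Because event_name and event_date lead the ORDER BY, the WHERE clause prunes most of the table before the funnel aggregation ever runs.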

But the real lesson?

If your schema is wrong, no amount of hardware will fix your queries.

That insight guided every decision after this point.


3. S3 → Mother Lambda Pipeline — The Most “Engineering” Part of the Platform

Third-party data ingestion started simple and quickly became unmanageable:

  • 11+ data sources

  • Different schemas

  • Different formats (Parquet, CSV.gz, JSON)

  • Different partitioning strategies

  • Different update patterns

  • Different teams owning each source

At first, each source had its own Lambda pipeline. Very quickly, this became unsustainable.

The Turning Point

We consolidated everything into a single, config-driven Mother Lambda.

One Lambda to rule them all.

It pulled configuration per source, and each source got:

  • Its own transformation logic

  • Its own error handling

  • Its own schema definition

  • Its own ClickHouse table

  • Automatic retries, DLQs, and failover between clusters

This single decision reduced complexity across the platform more than anything else we built.

In hindsight, we should have done it earlier.


The Real Story: Three Months of Firefighting

People often see a mature data platform and imagine architectural clarity from day one.

The truth is, the first few months were chaos.

  • Kafka Connect connectors crashed after pod restarts.

  • Lambda connections timed out at the worst possible times.

  • Materialized views dropped events during migration windows.

  • Schema drift broke dashboards unexpectedly.

  • MergeTree tables accumulated too many parts.

  • Funnel queries froze dashboards.

  • Raw tables grew faster than planned.

  • Spark jobs revealed missing fields late in the pipeline.

Every week was a new fire.
Every fire taught us something that became a principle.

And every principle became a piece of the final platform.


The Principles That Emerged (The Things I Actually Learned)

1. Real-time systems reward boring engineering

The flashiest pieces broke the most.
The boring pieces built confidence.

2. You don’t need a dozen microservices

We built three pipelines, not thirty.
And each one is still in production today.

3. ClickHouse is powerful — but you must respect its rules

  • The FINAL modifier is both a blessing and a curse.

  • Partitioning makes or breaks performance.

  • ORDER BY is not an afterthought.

  • ReplacingMergeTree is eventually consistent — design around that.

4. Data quality isn’t an add-on

Freshness checks, schema drift detection, and spot checks prevented entire classes of downstream issues.
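
As one illustrative example, a freshness check can be a single scheduled query against the final table; the table name and the 15-minute threshold here are made up.

    SELECT
        dateDiff('second', max(event_ts), now())       AS lag_seconds,
        dateDiff('second', max(event_ts), now()) > 900 AS is_stale
    FROM clickstream_events
    WHERE event_date >= today() - 1;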

5. Schema changes require choreography

Zero-downtime schema evolution forced us to implement a two-phase migration process (sketched below):

  • temp table + temp MV

  • backfill

  • atomic rename

  • recreate MV

  • buffer window to avoid loss
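
In ClickHouse terms, the sequence looked roughly like the sketch below, reusing the hypothetical orders tables from earlier. Names and the added column are illustrative, and the atomic swap via EXCHANGE TABLES assumes the Atomic database engine.

    -- Phase 1: build the new shape alongside the old one, dual-write, backfill.
    CREATE TABLE orders_v2 AS orders;
    ALTER TABLE orders_v2 ADD COLUMN coupon_code String DEFAULT '';

    CREATE MATERIALIZED VIEW orders_mv_v2 TO orders_v2 AS
    SELECT
        JSONExtractUInt(payload, 'order_id')                 AS order_id,
        JSONExtractString(payload, 'status')                 AS status,
        toDecimal64(JSONExtractFloat(payload, 'amount'), 2)  AS amount,
        source_ts                                            AS updated_at,
        JSONExtractString(payload, 'coupon_code')            AS coupon_code
    FROM orders_raw;

    -- Backfill history; ReplacingMergeTree absorbs any overlap with the dual-write.
    INSERT INTO orders_v2
    SELECT *, '' FROM orders;

    -- Phase 2: drop the old MVs, swap tables atomically, recreate the main MV.
    DROP TABLE orders_mv;
    DROP TABLE orders_mv_v2;
    EXCHANGE TABLES orders AND orders_v2;

    CREATE MATERIALIZED VIEW orders_mv TO orders AS
    SELECT
        JSONExtractUInt(payload, 'order_id')                 AS order_id,
        JSONExtractString(payload, 'status')                 AS status,
        toDecimal64(JSONExtractFloat(payload, 'amount'), 2)  AS amount,
        source_ts                                            AS updated_at,
        JSONExtractString(payload, 'coupon_code')            AS coupon_code
    FROM orders_raw;

The buffer window in the list above exists to cover gaps like the moment between dropping and recreating the MV, which is where the raw table and Kafka replayability earn their keep.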

It sounds complicated only if you haven’t experienced the pain of losing data mid-migration.

6. Observability is not optional

Kafka lag, ClickHouse replication lag, connector health, invalid events, S3 processing durations — everything needed to be monitored, or we were blind.
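
On the ClickHouse side, a surprising amount of that signal comes straight from system tables. Two illustrative checks (thresholds invented):

    -- Replicas falling behind by more than a minute.
    SELECT database, table, absolute_delay
    FROM system.replicas
    WHERE absolute_delay > 60;

    -- Tables drifting toward the "too many parts" wall.
    SELECT database, table, count() AS active_parts
    FROM system.parts
    WHERE active
    GROUP BY database, table
    ORDER BY active_parts DESC
    LIMIT 10;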


The Moment I Felt Proud

There wasn’t a single deploy or migration that made me feel like the platform was “done.”
But I remember two moments clearly:

  • When I received the Rising Star award.

  • When colleagues across the floor finally said, “The data is reliable now.”

That meant more than numbers or dashboards.


What the Platform Looks Like Today

A year later:

  • 2.5K+ events/sec on clickstream

  • Hundreds of tables across CDC, clickstream, and external sources

  • Near real-time freshness

  • No daily firefighting

  • Fully automated data quality

  • Fact and dimension tables powering dashboards and analytics

  • Three ClickHouse clusters (ingestion, analytics, ad hoc)

  • Three ingestion pipelines, still intact, still doing their job

Most importantly:
The platform is predictable.
Predictability is the real definition of “enterprise-grade.”


Final Thoughts

If there’s one thing I hope engineers take away from this, it’s this:

A real-time data platform is not a single heroic architecture.
It’s a series of practical engineering decisions that survive reality.

You don’t need perfect design.
You don’t need dozens of services.
You don’t need a team of fifty.

You need:

  • A few well-designed pipelines

  • An OLAP store you understand deeply

  • Defensive engineering

  • Good observability

  • And patience

The rest you figure out along the way — exactly like we did.