2025-12-04

Building an In-House Mixpanel Alternative: Our Journey to Processing 20 Billion Events

How we built a production-grade product analytics platform from scratch—and what we learned along the way.

At Allen Digital, we serve millions of students preparing for competitive exams like JEE and NEET. Understanding how students interact with our platform—what features they use, where they drop off, how they learn—is critical to improving outcomes. Like many growing companies, we started with Mixpanel.

But as our scale grew to billions of events per month, we faced a familiar crossroads: keep paying escalating vendor costs, or build something ourselves.

We chose to build. This is the story of how we created a full-featured product analytics platform that rivals Mixpanel—handling event ingestion, identity resolution, and analytics at scale.


The Problem with Vendor Analytics at Scale

When you're processing a few million events, Mixpanel is a no-brainer. The UI is polished, the SDKs work out of the box, and you're up and running in days.

But at our scale—20+ billion events and growing—three problems emerged:

  1. Cost: Vendor pricing scales linearly with volume. Ours didn't need to.
  2. Flexibility: We needed to join analytics data with internal systems (test scores, batch assignments, learning paths). Mixpanel's data lives in their walled garden.
  3. Latency for custom queries: Complex queries that took seconds in SQL took forever through vendor APIs.

The question wasn't whether we should build, but how.


The Architecture: Events In, Insights Out

We designed a system with five layers:

┌─────────────────────────────────────────────────────────────────────┐
│                         CLIENT LAYER                                │
│   Android SDK    iOS SDK    Web SDK    Backend Services             │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      INGESTION LAYER                                │
│                 parmanu-event-service (Go)                          │
│         • Event validation & enrichment                             │
│         • Identity resolution (IDM APIs)                            │
│         • Worker pool for high throughput                           │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      STREAMING LAYER                                │
│              Amazon MSK (Kafka) + Kafka Connect                     │
│         • 22+ topics for different event types                      │
│         • Exactly-once delivery to ClickHouse                       │
│         • Dead letter queues for failed events                      │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       STORAGE LAYER                                 │
│                        ClickHouse                                   │
│         • Schema-on-read raw tables (flexibility)                   │
│         • Materialized views for query optimization                 │
│         • 600+ flattened columns for app events                     │
└─────────────────────────────────┬───────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      ANALYTICS LAYER                                │
│                    ClickSight (React UI)                            │
│         • Funnels, Retention, Cohorts, User Paths                   │
│         • Event trends with breakdowns                              │
│         • SQL Playground for ad-hoc analysis                        │
└─────────────────────────────────────────────────────────────────────┘

Let me walk through each layer.


Layer 1: The Ingestion Gateway

We built parmanu-event-service in Go using the Kratos framework. It's the single entry point for all events—from mobile apps, web clients, and backend services.

Why a dedicated gateway?

Validation: Every event is validated against a configuration stored in ClickHouse. Unknown event types are rejected.
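
Conceptually, that check is a lookup against the registered event configurations. A sketch of what it amounts to (table and column names here are illustrative, not our production schema):

SELECT count() > 0 AS is_known_event
FROM quantumHub.eventConfigurations
WHERE eventName = 'question_answered'
  AND status = 'approved';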

Enrichment: We automatically extract UTM parameters, parse user-agent strings, and attach server-side metadata (CDN headers, timestamps). Clients don't need to send this.

Throughput: A worker pool architecture handles bursty traffic. Events are queued in channels and processed by workers that produce to Kafka asynchronously.

// Simplified worker loop: each worker drains the shared task queue,
// produces the event to Kafka, and replies on the task's result channel.
func (ew *EventWorker) EventTaskListener() {
    for task := range ew.taskQueue {
        response, _ := ew.produceEvent(task) // error handling elided in this sketch
        task.result <- response
    }
}

Client SDKs

We built SDKs for Android, iOS, and Web that handle:

  • Batching events for network efficiency
  • Persisting events during offline mode
  • Automatic retry with exponential backoff
  • Session and device ID management

Layer 2: Identity Resolution

One of the hardest problems in analytics is connecting anonymous users to logged-in users. When someone browses your app anonymously, then signs up, you need to stitch those sessions together.

We built an Identity Management (IDM) service with three core APIs:

Init — When the app launches

POST /v1/idm/init
→ Returns: { pixelSessionId: "uuid", pixelDeviceId: "uuid" }

The SDK stores these IDs and attaches them to every event. Even before login, we can track a user's journey.
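
For instance, a pre-login journey can already be pulled up by device ID alone. A sketch against the flattened events table described later (column names follow the ClickSight schema config; the device ID value is a placeholder):

SELECT event_name, count() AS events
FROM quantum_data.pixel_service_app_events
WHERE pixel_device_id = 'device-uuid-from-init'
  AND pixel_properties_user_id = ''  -- anonymous: no user attached yet
GROUP BY event_name
ORDER BY events DESC;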

Identify — When the user logs in

POST /v1/idm/identify
{ pixelSessionId, pixelDeviceId, userId }
→ Stores: identity_mappings(session, device, user)

This creates a permanent link: "device X and session Y belong to user Z."

Reset — When the user logs out

POST /v1/idm/reset
→ Returns: { pixelSessionId: "new-uuid" }

New session, same device. The identity graph preserves history while respecting session boundaries.

The mapping tables in ClickHouse allow us to query user behavior across sessions:

SELECT *
FROM events e
JOIN identity_mappings im ON e.pixel_session_id = im.session_id
WHERE im.user_id = '12345'
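
The mapping table behind that join is deliberately simple. A minimal sketch (the database name, engine choice, and mapped_at column are assumptions, not the exact production DDL):

CREATE TABLE pixelService.identity_mappings (
    session_id String,
    device_id String,
    user_id String,
    mapped_at DateTime  -- when the identify call linked this session/device to the user
) ENGINE = ReplacingMergeTree()
ORDER BY (user_id, session_id);  -- repeated identify calls for the same session collapse to one row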

Layer 3: Streaming with Kafka + ClickHouse

Events flow from our gateway to Kafka, then to ClickHouse via Kafka Connect.

Why Kafka Connect (not Kafka Engine tables)?

ClickHouse has native Kafka Engine tables, but they have limitations at scale. Kafka Connect gave us:

  • Exactly-once semantics: Critical for accurate analytics
  • Dead letter queues: Failed events go to sideline topics, not lost forever
  • Schema flexibility: The connector inserts JSON directly

Each event type has its own topic and connector:

- name: sink.prod.clickhouse.pixel-service.app-events
  topics: v2.quantumHub.appEvents
  tasksMax: 96 # High parallelism for volume
  exactlyOnce: true
  errors.tolerance: all
  errors.deadletterqueue.topic.name: v2.quantumHub.appEvents.sideline

Sideline Recovery

Events can fail for various reasons—schema mismatches, transient ClickHouse issues. Instead of dropping them, we:

  1. Route failures to *.sideline topics
  2. Store them in *_sideline ClickHouse tables
  3. Run a daily Airflow job that retries them

# Simplified sideline retry
def retry_sideline_events():
    sideline_events = spark.read.table("pixelService.appEvents_sideline")
    # Apply fixes and re-insert to main table
    sideline_events.write.insertInto("pixelService.appEvents")

Layer 4: The Schema Strategy

This was one of our most debated decisions: schema-on-read vs. schema-on-write.

Raw Tables: Schema-on-Read

CREATE TABLE pixelService.appEvents (
    id String,
    event String,  -- Full JSON payload
    eventTimestamp DateTime,
    serverTimestamp DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (serverTimestamp, id);

The event column stores the entire JSON. New fields? No ALTER TABLE needed. Just send them.
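
For example, if clients start sending a hypothetical new field tomorrow, it is queryable immediately from the raw payload, with no migration:

SELECT
    JSONExtractString(event, 'event_name') AS event_name,
    JSONExtractString(event, 'experiment_variant') AS variant,  -- hypothetical new field, no DDL required
    count() AS events
FROM pixelService.appEvents
WHERE serverTimestamp >= now() - INTERVAL 1 DAY
GROUP BY event_name, variant
ORDER BY events DESC;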

Derived Tables: Schema-on-Write

For fast querying, we use Materialized Views that flatten the JSON:

CREATE MATERIALIZED VIEW pixel_service_app_events_mv TO quantum_data.pixel_service_app_events
AS SELECT
    JSONExtractString(event, 'event_name') AS event_name,
    JSONExtractString(event, '$os') AS os,
    JSONExtractString(event, 'pathname') AS pathname,
    -- 600+ more columns
FROM pixelService.appEvents;

This gives us columnar storage benefits (compression, predicate pushdown) without sacrificing ingestion flexibility.

Trade-off: Adding a new column to the derived table requires updating the MV. We accept this because it's infrequent and the query performance gain is worth it.


Layer 5: The Analytics UI (ClickSight)

We built a React-based analytics UI called ClickSight that talks directly to ClickHouse over HTTP.

Schema Adapter Pattern

ClickSight is schema-agnostic. A configuration file maps logical fields to physical columns:

{
  "schema": {
    "columns": {
      "event_name": "event_name",
      "timestamp": "server_timestamp",
      "user_id": "pixel_properties_user_id"
    },
    "user_identifier": {
      "type": "computed",
      "expression": "if(pixel_properties_user_id != '', pixel_properties_user_id, pixel_device_id)"
    }
  }
}

This means ClickSight can work with any ClickHouse event table—not just ours.
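
Concretely, when a product manager plots a daily-active-users trend, ClickSight compiles the logical fields above into a query along these lines (a sketch; the actual generated SQL differs):

SELECT
    toDate(server_timestamp) AS day,
    uniqExact(if(pixel_properties_user_id != '', pixel_properties_user_id, pixel_device_id)) AS active_users
FROM quantum_data.pixel_service_app_events
WHERE ist_date >= today() - 7
GROUP BY day
ORDER BY day;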

Features We Built

  • Funnels: ClickHouse windowFunnel() function with time windows
  • Retention: Cohort-based return rate analysis
  • User Paths: Sankey diagrams with D3.js
  • Trends: Time-series with daily/weekly/monthly granularity
  • Breakdowns: GROUP BY with multi-property support
  • SQL Playground: Monaco editor with syntax highlighting
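
To give a feel for the funnel implementation, a three-step funnel with a 24-hour conversion window reduces to a single windowFunnel() pass (event names and the time range here are examples):

SELECT
    countIf(level >= 1) AS step_1_app_open,
    countIf(level >= 2) AS step_2_test_started,
    countIf(level >= 3) AS step_3_question_answered
FROM
(
    SELECT
        if(pixel_properties_user_id != '', pixel_properties_user_id, pixel_device_id) AS user_identifier,
        windowFunnel(86400)(  -- 24-hour window, in seconds
            toDateTime(server_timestamp),
            event_name = 'app_open',
            event_name = 'test_started',
            event_name = 'question_answered'
        ) AS level
    FROM quantum_data.pixel_service_app_events
    WHERE ist_date >= today() - 7
    GROUP BY user_identifier
);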

Performance Optimizations

Querying 20B rows directly is slow. We optimized with:

  1. Partition pruning: Queries always filter on ist_date (date partition key)
  2. Primary key design: (event_name, ist_date) matches common query patterns
  3. Query caching: Session storage for dropdown values
  4. CTEs over huge IN clauses: Scalable funnel queries

-- Instead of: WHERE user_id IN (10000 values)
-- We use:
WITH step_1_users AS (
    SELECT user_identifier FROM user_funnels WHERE funnel_level >= 1
)
SELECT count(*) FROM events
WHERE user_identifier IN (SELECT * FROM step_1_users)
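
The first two optimizations are really decisions baked into the derived table's layout. A simplified sketch of the DDL (the real table has 600+ columns):

CREATE TABLE quantum_data.pixel_service_app_events (
    event_name String,
    ist_date Date,
    server_timestamp DateTime64(3),
    pixel_device_id String,
    pixel_properties_user_id String
    -- ... plus the remaining flattened columns
) ENGINE = MergeTree()
PARTITION BY ist_date             -- enables partition pruning on date filters
ORDER BY (event_name, ist_date);  -- primary key matches common query patterns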

Self-Service Event Registration

We didn't want platform engineers to be a bottleneck. So we built quantum-hub, an admin service for event configuration.

The Approval Flow

  1. Create: Product team registers a new event type
  2. Approve: Platform team reviews and approves
  3. Auto-provision: The system creates:
    • Kafka topics (forward + sideline)
    • ClickHouse tables
    • Read-only database user
  4. Deploy: Connector config is added to GitOps
  5. Done: Client SDKs can now send the new event

The registration request from step 1 looks like:

POST /v1/eventConfiguration
{
    "eventName": "question_answered",
    "contactEmail": "product-team@allen.in",
    "autoCapture": {
        "userDetails": true,
        "userAgent": true
    }
}
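
Behind the scenes, the ClickHouse side of the auto-provision step comes down to a handful of DDL statements; roughly (object names and the password placeholder are illustrative):

-- Raw table for the new event type (same shape as other raw tables)
CREATE TABLE IF NOT EXISTS pixelService.questionAnswered (
    id String,
    event String,  -- Full JSON payload
    eventTimestamp DateTime,
    serverTimestamp DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (serverTimestamp, id);

-- Matching sideline table for failed inserts
CREATE TABLE IF NOT EXISTS pixelService.questionAnswered_sideline
AS pixelService.questionAnswered;

-- Read-only access for the requesting team
CREATE USER IF NOT EXISTS product_team_ro IDENTIFIED BY 'generated-password';
GRANT SELECT ON pixelService.questionAnswered TO product_team_ro;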

What We Learned

1. Start with raw JSON, optimize later

We debated upfront schema design for weeks. In hindsight, starting with schema-on-read (raw JSON) and adding MVs for performance was the right call. You can't predict what fields you'll need.

2. Exactly-once matters more than you think

Early on, we had duplicate events from retry logic. Building with exactly-once semantics from day one saved us painful deduplication later.

3. Dead letter queues are not optional

The first time ClickHouse had an outage, we lost events. After adding sideline topics and a daily retry job, we stopped losing data.

4. The UI is harder than the backend

Building a usable analytics UI took longer than the entire ingestion pipeline. If we did it again, we'd start with an open-source UI (like PostHog or Metabase) and customize it.

5. Identity resolution is table stakes

Anonymous-to-user linking isn't a nice-to-have. Without it, your funnels are meaningless because you can't track users across sessions.


The Numbers

  • Total events stored: 20+ billion
  • Daily ingestion: ~200M+ events
  • Kafka Connect tasks: 200+
  • ClickHouse query latency (P95): < 5 seconds
  • Event types supported: 22+
  • Team size: ~5 engineers

Was It Worth It?

Financially: We estimate annual savings of $350K+ compared to Mixpanel pricing at our scale.

Technically: We have full control. Need to join analytics with test scores? One SQL query. Need to debug a weird user session? Full access to raw data.

Operationally: It requires care and feeding. Kafka Connect, ClickHouse MVs, Airflow jobs—these need monitoring and maintenance.

Verdict: For our scale and use case, absolutely worth it. For a smaller company, I'd still recommend starting with a vendor and migrating later.


Open Source?

ClickSight, our analytics UI, is designed to be schema-agnostic. We're exploring open-sourcing it for teams that have event data in ClickHouse but lack a product analytics UI.

If you're interested, reach out.


Conclusion

Building a Mixpanel alternative isn't about replicating every feature—it's about solving your specific analytics needs with the flexibility and economics that make sense for your scale.

For us, that meant:

  • A Go-based ingestion gateway with worker pools
  • Kafka for reliable event streaming
  • ClickHouse for blazing-fast OLAP queries
  • Identity resolution built into the core
  • A React-based UI for product teams

If you're processing hundreds of millions of events and frustrated with vendor limitations, know that building your own is possible. It's a lot of work, but the control and cost savings are real.


Have questions? Reach out on LinkedIn.