Building an In-House Mixpanel Alternative: Our Journey to Processing 20 Billion Events
How we built a production-grade product analytics platform from scratch—and what we learned along the way.
At Allen Digital, we serve millions of students preparing for competitive exams like JEE and NEET. Understanding how students interact with our platform—what features they use, where they drop off, how they learn—is critical to improving outcomes. Like many growing companies, we started with Mixpanel.
But as our scale grew to billions of events per month, we faced a familiar crossroads: keep paying escalating vendor costs, or build something ourselves.
We chose to build. This is the story of how we created a full-featured product analytics platform that rivals Mixpanel—handling event ingestion, identity resolution, and analytics at scale.
The Problem with Vendor Analytics at Scale
When you're processing a few million events, Mixpanel is a no-brainer. The UI is polished, the SDKs work out of the box, and you're up and running in days.
But at our scale—20+ billion events and growing—three problems emerged:
- Cost: Vendor pricing scales linearly with volume. Ours didn't need to.
- Flexibility: We needed to join analytics data with internal systems (test scores, batch assignments, learning paths). Mixpanel's data lives in their walled garden.
- Latency for custom queries: Complex queries that took seconds in SQL took forever through vendor APIs.
The question wasn't whether we should build, but how.
The Architecture: Events In, Insights Out
We designed a system with four layers:
┌───────────────────────────────────────────────────┐
│                   CLIENT LAYER                    │
│  Android SDK  iOS SDK  Web SDK  Backend Services  │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│                  INGESTION LAYER                  │
│            parmanu-event-service (Go)             │
│   • Event validation & enrichment                 │
│   • Identity resolution (IDM APIs)                │
│   • Worker pool for high throughput               │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│                  STREAMING LAYER                  │
│        Amazon MSK (Kafka) + Kafka Connect         │
│   • 22+ topics for different event types          │
│   • Exactly-once delivery to ClickHouse           │
│   • Dead letter queues for failed events          │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│                   STORAGE LAYER                   │
│                    ClickHouse                     │
│   • Schema-on-read raw tables (flexibility)       │
│   • Materialized views for query optimization     │
│   • 600+ flattened columns for app events         │
└─────────────────────────┬─────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────┐
│                  ANALYTICS LAYER                  │
│               ClickSight (React UI)               │
│   • Funnels, Retention, Cohorts, User Paths       │
│   • Event trends with breakdowns                  │
│   • SQL Playground for ad-hoc analysis            │
└───────────────────────────────────────────────────┘
Let me walk through each layer.
Layer 1: The Ingestion Gateway
We built parmanu-event-service in Go using the Kratos framework. It's the single entry point for all events—from mobile apps, web clients, and backend services.
Why a dedicated gateway?
Validation: Every event is validated against a configuration stored in ClickHouse. Unknown event types are rejected.
Enrichment: We automatically extract UTM parameters, parse user-agent strings, and attach server-side metadata (CDN headers, timestamps). Clients don't need to send this.
Throughput: A worker pool architecture handles bursty traffic. Events are queued in channels and processed by workers that produce to Kafka asynchronously.
// Simplified worker loop: each worker drains the shared task queue
func (ew *EventWorker) EventTaskListener() {
	for task := range ew.taskQueue {
		// Produce to Kafka asynchronously; the caller waits on task.result
		response, err := ew.produceEvent(task)
		if err != nil {
			// error handling (retry, logging) omitted in this simplified version
		}
		task.result <- response
	}
}
Client SDKs
We built SDKs for Android, iOS, and Web that handle:
- Batching events for network efficiency
- Persisting events during offline mode
- Automatic retry with exponential backoff
- Session and device ID management
Layer 2: Identity Resolution
One of the hardest problems in analytics is connecting anonymous users to logged-in users. When someone browses your app anonymously, then signs up, you need to stitch those sessions together.
We built an Identity Management (IDM) service with three core APIs:
Init — When the app launches
POST /v1/idm/init
→ Returns: { pixelSessionId: "uuid", pixelDeviceId: "uuid" }
The SDK stores these IDs and attaches them to every event. Even before login, we can track a user's journey.
Identify — When the user logs in
POST /v1/idm/identify
{ pixelSessionId, pixelDeviceId, userId }
→ Stores: identity_mappings(session, device, user)
This creates a permanent link: "device X and session Y belong to user Z."
Reset — When the user logs out
POST /v1/idm/reset
→ Returns: { pixelSessionId: "new-uuid" }
New session, same device. The identity graph preserves history while respecting session boundaries.
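Under the hood, those links land in a mapping table. Here is a minimal sketch of what such a table can look like in ClickHouse (illustrative, not our exact production DDL); a ReplacingMergeTree keeps only the latest mapping per device and session pair:

-- Illustrative identity mapping table (not the exact production schema)
CREATE TABLE identity_mappings (
    session_id String,                       -- pixelSessionId from /v1/idm/init
    device_id  String,                       -- pixelDeviceId from /v1/idm/init
    user_id    String,                       -- set by /v1/idm/identify on login
    mapped_at  DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(mapped_at)     -- keep the latest row per sorting key
ORDER BY (device_id, session_id);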
The mapping tables in ClickHouse allow us to query user behavior across sessions:
SELECT *
FROM events e
JOIN identity_mappings im ON e.pixel_session_id = im.session_id
WHERE im.user_id = '12345'
Layer 3: Streaming with Kafka + ClickHouse
Events flow from our gateway to Kafka, then to ClickHouse via Kafka Connect.
Why Kafka Connect (not Kafka Engine tables)?
ClickHouse has native Kafka Engine tables, but they have limitations at scale. Kafka Connect gave us:
- Exactly-once semantics: Critical for accurate analytics
- Dead letter queues: Failed events go to sideline topics, not lost forever
- Schema flexibility: The connector inserts JSON directly
Each event type has its own topic and connector:
- name: sink.prod.clickhouse.pixel-service.app-events
  topics: v2.quantumHub.appEvents
  tasksMax: 96            # High parallelism for volume
  exactlyOnce: true
  errors.tolerance: all
  errors.deadletterqueue.topic.name: v2.quantumHub.appEvents.sideline
Sideline Recovery
Events can fail for various reasons—schema mismatches, transient ClickHouse issues. Instead of dropping them, we:
- Route failures to *.sideline topics
- Store them in *_sideline ClickHouse tables
- Run a daily Airflow job that retries them
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Simplified sideline retry (scheduled as a daily Airflow task)
def retry_sideline_events():
    sideline_events = spark.read.table("pixelService.appEvents_sideline")
    # Apply fixes and re-insert into the main table
    sideline_events.write.insertInto("pixelService.appEvents")
Layer 4: The Schema Strategy
This was one of our most debated decisions: schema-on-read vs. schema-on-write.
Raw Tables: Schema-on-Read
CREATE TABLE pixelService.appEvents (
    id              String,
    event           String,         -- Full JSON payload
    eventTimestamp  DateTime,
    serverTimestamp DateTime64(3)
) ENGINE = MergeTree()
ORDER BY (serverTimestamp, id);
The event column stores the entire JSON. New fields? No ALTER TABLE needed. Just send them.
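To make that concrete, here is a hedged example of the kind of ad-hoc query this enables. Suppose the apps start sending a scroll_depth property tomorrow (a hypothetical field, used here purely for illustration); it is queryable immediately from the raw column, no migration required:

-- Query a brand-new field with no schema change (scroll_depth is hypothetical)
SELECT
    JSONExtractString(event, 'event_name') AS event_name,
    JSONExtractInt(event, 'scroll_depth')  AS scroll_depth,
    count()                                AS occurrences
FROM pixelService.appEvents
WHERE serverTimestamp >= now() - INTERVAL 1 DAY
GROUP BY event_name, scroll_depth
ORDER BY occurrences DESC;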
Derived Tables: Schema-on-Write
For fast querying, we use Materialized Views that flatten the JSON:
CREATE MATERIALIZED VIEW pixel_service_app_events_mv
TO quantum_data.pixel_service_app_events
AS SELECT
    JSONExtractString(event, 'event_name') AS event_name,
    JSONExtractString(event, '$os')        AS os,
    JSONExtractString(event, 'pathname')   AS pathname,
    -- 600+ more columns
FROM pixelService.appEvents;
This gives us columnar storage benefits (compression, predicate pushdown) without sacrificing ingestion flexibility.
Trade-off: Adding a new column to the derived table requires updating the MV. We accept this because it's infrequent and the query performance gain is worth it.
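For the curious, here is roughly what that infrequent change involves, sketched under the assumption that we drop and recreate the view (recent ClickHouse versions can also modify the MV query in place). The scroll_depth column is again a hypothetical stand-in:

-- 1. Add the new column to the flattened target table (illustrative column)
ALTER TABLE quantum_data.pixel_service_app_events
    ADD COLUMN scroll_depth Int32 DEFAULT 0;

-- 2. Recreate the materialized view with the extra extraction
DROP VIEW pixel_service_app_events_mv;
CREATE MATERIALIZED VIEW pixel_service_app_events_mv
TO quantum_data.pixel_service_app_events
AS SELECT
    JSONExtractString(event, 'event_name') AS event_name,
    JSONExtractInt(event, 'scroll_depth')  AS scroll_depth
    -- ... plus the existing 600+ extractions
FROM pixelService.appEvents;

Note that the recreated MV only applies to new inserts; backfilling historical rows is a separate (and much bigger) job.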
Layer 5: The Analytics UI (ClickSight)
We built a React-based analytics UI called ClickSight that talks directly to ClickHouse over HTTP.
Schema Adapter Pattern
ClickSight is schema-agnostic. A configuration file maps logical fields to physical columns:
{
  "schema": {
    "columns": {
      "event_name": "event_name",
      "timestamp": "server_timestamp",
      "user_id": "pixel_properties_user_id"
    },
    "user_identifier": {
      "type": "computed",
      "expression": "if(pixel_properties_user_id != '', pixel_properties_user_id, pixel_device_id)"
    }
  }
}
This means ClickSight can work with any ClickHouse event table—not just ours.
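To show how the adapter plays out, here is a simplified sketch of the kind of trends query such a config could produce, with the computed user_identifier expression substituted inline (the exact SQL ClickSight generates will differ; this is just the shape):

-- Daily active users over the last 30 days, using the adapter's logical fields
SELECT
    toDate(server_timestamp) AS day,
    uniqExact(if(pixel_properties_user_id != '',
                 pixel_properties_user_id,
                 pixel_device_id)) AS daily_active_users
FROM quantum_data.pixel_service_app_events
WHERE server_timestamp >= today() - 30
GROUP BY day
ORDER BY day;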
Features We Built
| Feature | Implementation |
|---|---|
| Funnels | ClickHouse windowFunnel() function with time windows |
| Retention | Cohort-based return rate analysis |
| User Paths | Sankey diagrams with D3.js |
| Trends | Time-series with daily/weekly/monthly granularity |
| Breakdowns | GROUP BY with multi-property support |
| SQL Playground | Monaco editor with syntax highlighting |
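As a taste of how the funnel feature maps to SQL, here is a hedged sketch of a two-step funnel on top of windowFunnel(); the event names and the 24-hour window are illustrative. windowFunnel() returns, per user, the deepest step reached within the window, and the outer query counts users at each level:

-- How far do users get from viewing the signup page to signing up within 24h?
SELECT
    level,
    count() AS users
FROM
(
    SELECT
        pixel_device_id,
        windowFunnel(86400)(                      -- 24-hour window, in seconds
            toDateTime(server_timestamp),
            event_name = 'signup_page_viewed',    -- step 1 (illustrative event name)
            event_name = 'signup_completed'       -- step 2 (illustrative event name)
        ) AS level
    FROM quantum_data.pixel_service_app_events
    WHERE ist_date >= today() - 7
    GROUP BY pixel_device_id
)
GROUP BY level
ORDER BY level;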
Performance Optimizations
Querying 20B rows directly is slow. We optimized with:
- Partition pruning: Queries always filter on ist_date (the date partition key)
- Primary key design: (event_name, ist_date) matches common query patterns
- Query caching: Session storage for dropdown values
- CTEs over huge IN clauses: Scalable funnel queries
-- Instead of: WHERE user_id IN (10000 values)
-- We use:
WITH step_1_users AS (
    SELECT user_identifier FROM user_funnels WHERE funnel_level >= 1
)
SELECT count(*) FROM events
WHERE user_identifier IN (SELECT * FROM step_1_users)
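The first two optimizations live in the table definition itself. Here is a sketch of the derived table's shape with the column list trimmed; the partition granularity shown is illustrative, and the ORDER BY matches the primary key described above:

-- Illustrative derived-table definition showing the partition and sorting keys
CREATE TABLE quantum_data.pixel_service_app_events (
    ist_date         Date,
    event_name       String,
    server_timestamp DateTime64(3),
    pixel_device_id  String
    -- ... 600+ flattened columns
) ENGINE = MergeTree()
PARTITION BY ist_date              -- queries filtering on ist_date prune whole partitions
ORDER BY (event_name, ist_date);   -- matches the most common query pattern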
Self-Service Event Registration
We didn't want platform engineers to be a bottleneck. So we built quantum-hub, an admin service for event configuration.
The Approval Flow
1. Create: Product team registers a new event type
2. Approve: Platform team reviews and approves
3. Auto-provision: The system creates:
   - Kafka topics (forward + sideline)
   - ClickHouse tables
   - Read-only database user
4. Deploy: Connector config is added to GitOps
5. Done: Client SDKs can now send the new event
POST /v1/eventConfiguration
{
  "eventName": "question_answered",
  "contactEmail": "product-team@allen.in",
  "autoCapture": {
    "userDetails": true,
    "userAgent": true
  }
}
What We Learned
1. Start with raw JSON, optimize later
We debated upfront schema design for weeks. In hindsight, starting with schema-on-read (raw JSON) and adding MVs for performance was the right call. You can't predict what fields you'll need.
2. Exactly-once matters more than you think
Early on, we had duplicate events from retry logic. Building with exactly-once semantics from day one saved us painful deduplication later.
3. Dead letter queues are not optional
The first time ClickHouse had an outage, we lost events. After adding sideline topics and a daily retry job, we stopped losing data.
4. The UI is harder than the backend
Building a usable analytics UI took longer than the entire ingestion pipeline. If we did it again, we'd start with an open-source UI (like PostHog or Metabase) and customize it.
5. Identity resolution is table stakes
Anonymous-to-user linking isn't a nice-to-have. Without it, your funnels are meaningless because you can't track users across sessions.
The Numbers
| Metric | Value |
|---|---|
| Total events stored | 20+ billion |
| Daily ingestion | 200M+ events |
| Kafka Connect tasks | 200+ |
| ClickHouse query latency (P95) | < 5 seconds |
| Event types supported | 22+ |
| Team size | ~5 engineers |
Was It Worth It?
Financially: We estimate annual savings of $350K+ compared to Mixpanel pricing at our scale.
Technically: We have full control. Need to join analytics with test scores? One SQL query. Need to debug a weird user session? Full access to raw data.
Operationally: It requires care and feeding. Kafka Connect, ClickHouse MVs, Airflow jobs—these need monitoring and maintenance.
Verdict: For our scale and use case, absolutely worth it. For a smaller company, I'd still recommend starting with a vendor and migrating later.
Open Source?
ClickSight, our analytics UI, is designed to be schema-agnostic. We're exploring open-sourcing it for teams that have event data in ClickHouse but lack a product analytics UI.
If you're interested, reach out.
Conclusion
Building a Mixpanel alternative isn't about replicating every feature—it's about solving your specific analytics needs with the flexibility and economics that make sense for your scale.
For us, that meant:
- A Go-based ingestion gateway with worker pools
- Kafka for reliable event streaming
- ClickHouse for blazing-fast OLAP queries
- Identity resolution built into the core
- A React-based UI for product teams
If you're processing hundreds of millions of events and frustrated with vendor limitations, know that building your own is possible. It's a lot of work, but the control and cost savings are real.
Have questions? Reach out on LinkedIn.