2025-12-03

ClickHouse at Scale: Optimization Techniques That Saved Us Millions of Rows and Hours of Query Time

Battle-tested ClickHouse optimization techniques used in production for a 200+ TB real-time data platform.

After two years of running ClickHouse as the core analytical engine for a 200+ TB real-time platform (2.5K+ events/sec, billions of rows ingested daily), we've learned exactly what makes ClickHouse fly—and what makes it crawl. These are the concrete, battle-tested optimization techniques that turned multi-minute queries into sub-second responses and kept our clusters healthy under constant write pressure.

I'm sharing the playbook we actually use in production, ranked by impact.


1. Primary Key = Your Most Important Decision (Do It Wrong and Nothing Else Matters)

Most performance disasters we've seen (including our own early ones) trace back to a bad ORDER BY.

Real example – Clickstream events (before vs after):

Bad: Generic ordering → 2–3 minute funnel queries

ORDER BY (event_timestamp, user_identifier)

Good: Aligned with actual query patterns → funnel queries dropped from 2–3 minutes to 8–12 seconds, then to ~800 ms with the other optimizations below

ORDER BY (event_name, ist_date, pathname, `$os`, event_timestamp, user_identifier)

Rule of thumb: Put the columns you filter or group by most frequently first, ordered from lowest to highest cardinality (exactly what the example above does: event_name first, event_timestamp and user_identifier last).

Acceptable trade-off: longer primary keys (5–7 columns is fine). ClickHouse can skip entire granules and parts whenever a prefix of the key matches your WHERE filters, and the same ordering speeds up GROUP BY on those columns.
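
For reference, here's a minimal sketch of a table built around that key; the table name, column types, and partition choice are illustrative assumptions rather than our exact production DDL:

CREATE TABLE quantum_data.events
(
    event_name       LowCardinality(String),   -- see section 4
    ist_date         Date,
    pathname         LowCardinality(String),
    `$os`            LowCardinality(String),
    event_timestamp  DateTime64(3),
    user_identifier  String,
    revenue          Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ist_date)                -- see section 2
ORDER BY (event_name, ist_date, pathname, `$os`, event_timestamp, user_identifier);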


2. Partitioning: Monthly Is the Sweet Spot (Almost Always)

We tried daily partitions early on. Result: too many parts → slow merges → "too many parts" errors.

Production standard now:

PARTITION BY toYYYYMM(date_column)    -- monthly
-- or for very high-volume tables:
PARTITION BY toYYYYMMDD(date_column)  -- only if >50M rows/day

Monthly partitions give you:

  • Fewer parts (critical for MergeTree health)
  • Still good prune-ability for time-range queries
  • Faster OPTIMIZE/backup operations
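
A quick way to sanity-check this is to watch active part counts per partition. A small query against the built-in system.parts table (database/table names are assumptions from the examples above):

SELECT
    partition,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND database = 'quantum_data' AND table = 'events'
GROUP BY partition
ORDER BY partition;

If any single partition regularly holds hundreds of active parts, your partition key (or merge settings, see section 8) needs another look.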

3. ReplacingMergeTree + argMax > FINAL Clause (For Large Tables)

SELECT ... FINAL is convenient but kills performance on tables >10–50M rows.

Our pattern for latest-state tables:

-- Instead of SELECT ... FINAL (which merges/deduplicates at read time)
SELECT
    user_id,
    argMax(name, updated_at) AS name,
    argMax(email, updated_at) AS email,
    argMax(status, updated_at) AS status
FROM quantum_data.users
GROUP BY user_id

10–100x faster on 500M+ row tables. We even wrap this in materialized views for dashboards.
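
For context, the kind of table this pattern assumes is a ReplacingMergeTree versioned by the update timestamp. A sketch (the exact schema is an assumption):

CREATE TABLE quantum_data.users
(
    user_id     String,
    name        String,
    email       String,
    status      LowCardinality(String),
    updated_at  DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

Background merges gradually deduplicate old versions, while the argMax query above returns the latest state regardless of how far merges have progressed.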


4. LowCardinality() Is Magic (Use It Everywhere Possible)

Strings kill performance. LowCardinality turns them into dictionary-encoded integers.

Before (slow scans):

pathname String,
`$os` String

After (10x–20x faster filters):

pathname LowCardinality(String),
`$os` LowCardinality(String),
event_name LowCardinality(String)

Apply to any column with <1M distinct values (OS, browser, country, event_name, etc.).
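
Converting existing columns is a one-statement ALTER per column; it rewrites the column on disk, so schedule it off-peak. A sketch against the assumed events table:

ALTER TABLE quantum_data.events MODIFY COLUMN pathname LowCardinality(String);
ALTER TABLE quantum_data.events MODIFY COLUMN `$os` LowCardinality(String);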


5. Materialized Columns for Expensive Expressions

Stop computing the same thing repeatedly.

Common winners:

ist_date       Date MATERIALIZED toDate(server_timestamp, 'Asia/Kolkata'),
user_identifier String MATERIALIZED if(user_id != '', user_id, device_id),
is_new_user    UInt8 MATERIALIZED (created_at >= today() - 7)

These are computed once at insert time and stored like any other column, so there is no per-query cost.
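
They can also be retrofitted onto an existing table. A sketch using the is_new_user example above (MATERIALIZE COLUMN, available on recent ClickHouse versions, backfills parts written before the column existed):

ALTER TABLE quantum_data.events
    ADD COLUMN is_new_user UInt8 MATERIALIZED (created_at >= today() - 7);
ALTER TABLE quantum_data.events MATERIALIZE COLUMN is_new_user;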


6. Skip Indexes That Actually Work

Most people ignore them. We don't.

-- For range queries on non-primary-key columns
ALTER TABLE events ADD INDEX idx_revenue (revenue) TYPE minmax GRANULARITY 4;

-- For IN queries or equality on medium-cardinality columns
ALTER TABLE events ADD INDEX idx_campaign (campaign_id) TYPE set(10000) GRANULARITY 4;

-- Note: skip-index GRANULARITY is measured in index granules (index_granularity rows
-- each, 8192 by default), not in rows, so small values like 1–4 are what you want.

These saved several 30+ second queries on cold filters.
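
One gotcha: an index added via ALTER only applies to parts written afterwards. To cover existing data, materialize it (same table/index names as above):

ALTER TABLE events MATERIALIZE INDEX idx_revenue;
ALTER TABLE events MATERIALIZE INDEX idx_campaign;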


7. Projections for Pre-Aggregated Dashboards

For repeated aggregation patterns (daily active users, revenue by day, etc.):

ALTER TABLE events ADD PROJECTION daily_summary
(
    SELECT
        toDate(server_timestamp) AS day,
        count() AS events,
        sum(revenue) AS revenue
    GROUP BY day
);

-- Build the projection for data that is already in the table
ALTER TABLE events MATERIALIZE PROJECTION daily_summary;

ClickHouse rewrites matching aggregate queries to read the projection automatically → near-instant results, no full scan.
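
A dashboard query that benefits looks exactly like the unoptimized version; no query changes are needed (schema as assumed above):

SELECT
    toDate(server_timestamp) AS day,
    count() AS events,
    sum(revenue) AS revenue
FROM events
GROUP BY day
ORDER BY day;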


8. Control Merge Speed (Don't Let Background Merges Ruin Your Life)

Default settings are too aggressive on write-heavy clusters.

Our tuned settings:

max_bytes_to_merge_at_max_space_in_pool = 50GiB   -- prevent huge merges
merge_with_ttl_timeout = 86400                    -- allow TTL merges to finish
background_pool_size = 4                          -- don't starve writes
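
Scope matters here: the first two are MergeTree-level settings you can change per table, while background_pool_size is a server-level setting (config/profile), so it can't be set via ALTER. A sketch for the table-level ones, with the byte value spelled out because table settings take plain numbers (table name assumed):

ALTER TABLE quantum_data.events
    MODIFY SETTING
        max_bytes_to_merge_at_max_space_in_pool = 53687091200,  -- ~50 GiB
        merge_with_ttl_timeout = 86400;                         -- seconds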

9. ZSTD Compression + Codec Tuning

CODEC(ZSTD(3))                 -- best balance for most data
-- For numeric columns with small variance (timestamps, slowly changing values):
CODEC(Delta, ZSTD(1))
-- For monotonically increasing IDs (Gorilla is better suited to float gauges):
CODEC(DoubleDelta, ZSTD(1))

We routinely get 10:1 compression (200TB raw → 20TB stored).
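
To check what each column actually compresses to, system.columns exposes compressed vs. uncompressed sizes (database/table names assumed):

SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = 'quantum_data' AND table = 'events'
ORDER BY data_compressed_bytes DESC;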


10. Three-Cluster Pattern (The Real Scalability Unlock)

  • Ingestion cluster: small nodes optimized for writes, with merge aggressiveness dialed down
  • Analytics cluster: large nodes replicating from ingestion, serves dashboards
  • Ad-hoc cluster: throwaway capacity for data-science exploration

This separation saved us more times than I can count.
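
One concrete piece of glue, as a hedged sketch (the cluster name and schema are assumptions): dashboards never hit the ingestion nodes directly; they read through a Distributed table that fans out over the analytics cluster.

CREATE TABLE quantum_data.events_dist AS quantum_data.events
ENGINE = Distributed('analytics_cluster', 'quantum_data', 'events', rand());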


Bonus: The "Force Merge" Job We Run Weekly

SQL:

-- Find tables that have accumulated too many active parts
SELECT database, table, count() AS parts
FROM system.parts
WHERE active AND database = 'quantum_data'
GROUP BY database, table
HAVING parts > 100

Then, from Python (clickhouse-connect shown here as one possible client):

import clickhouse_connect                        # pip install clickhouse-connect
client = clickhouse_connect.get_client(host="localhost")
for table in dirty_tables:                       # fully qualified names from the query above
    client.command(f"OPTIMIZE TABLE {table} FINAL")

Run during low-traffic hours. Keeps query performance predictable.


Results After Applying All of the Above

Metric                   Before Optimizations    After Optimizations
p95 query latency        30–180 sec              1–3 sec
Funnel query time        2–3 minutes             800 ms–2 sec
Storage used             200 TB raw              20 TB compressed
Too-many-parts errors    Daily                   Never
Dashboard refresh time   5–15 min                Real-time

ClickHouse is incredibly fast out of the box—but it becomes magical when you align every knob with your actual workload.

If you're running ClickHouse at scale and want to compare notes (or just vent about funnel queries), find me on LinkedIn. Happy querying!