sakib.ninja

ClickHouse: A Practical Guide for Developers

ClickHouse has become one of the fastest-growing databases in the analytics world, powering dashboards, log pipelines, and real-time reporting systems at companies like Yandex, Cloudflare, and Uber.
It’s a column-oriented OLAP database designed for lightning-fast queries on billions of rows, making it ideal for event data, time-series, and large-scale BI workloads.

In this post, I’ll walk you through ClickHouse’s core principles, schema design strategies, table engines, data types, and example queries to help you use it effectively in production.

Why ClickHouse?

Unlike transactional databases, ClickHouse is built for analytics at scale:

  • Columnar storage - Only reads the columns you query, saving I/O.
  • Immutable parts - Append-only model with background merges for efficiency.
  • Compression + vectorized execution - Queries over terabytes return in milliseconds.
  • SQL support - A familiar SQL dialect with rich analytical functions.
  • Scalability - Replication and sharding allow clusters to grow horizontally.

The tradeoff: ClickHouse is not transactional. It’s best for analytics, not for systems needing strict ACID guarantees.

Core Principles

  • Column-oriented: Stores data by column, not row, for better compression and scan speed.
  • Immutable parts: Updates/deletes create new parts; merges clean up in background.
  • Schema-first: You must define schemas upfront; types directly impact performance.
  • Query-driven design: Denormalization and pre-aggregation are common.
  • Replication & Sharding: Replication provides fault tolerance; sharding distributes data.

Schema Design Guidelines

  1. Partition by time or domain - Commonly by day/month for time-series.
  2. Primary Key = Sort Key - Defines row order within partitions; critical for query speed.
  3. Use the right MergeTree engine:
    • MergeTree - General purpose
    • ReplacingMergeTree - Keep the latest version
    • SummingMergeTree - Pre-sum numeric values
    • AggregatingMergeTree - Store aggregate states
  4. Avoid excessive mutations - Deletes/updates are costly.
  5. Denormalize for queries - Prefer flat, query-ready tables over joins.
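As an illustration of guideline 3, here is a minimal ReplacingMergeTree sketch (the table name and columns are hypothetical): after merges, only the row with the highest updated_at survives for each user_id.

CREATE TABLE user_state
(
    user_id    UInt64,
    updated_at DateTime,
    status     String
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

Note that deduplication happens only at merge time, so queries may still see duplicates until parts are merged (or unless you query with FINAL).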

Partitions, Primary Keys & Indexes

  • Partition Key: Splits data (e.g., monthly partitions).
  • Primary Key: Defines ordering inside partitions for fast range scans.
  • Data Skipping Indexes: Let queries skip blocks without relevant data.
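A data skipping index is declared per column expression; a minimal sketch against the events table created later in this post (the index name and granularity are arbitrary choices):

ALTER TABLE events ADD INDEX idx_value value TYPE minmax GRANULARITY 4;

The minmax index stores the minimum and maximum of value per block of granules, so a filter like WHERE value > 100 can skip blocks whose range cannot match.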

Table Engines

The MergeTree family is the foundation of ClickHouse:

  • MergeTree - Core storage engine with background merges.
  • ReplacingMergeTree - Replace old rows by primary key.
  • SummingMergeTree - Aggregate sums during merges.
  • AggregatingMergeTree - Store aggregates for query acceleration.
  • ReplicatedMergeTree - Adds fault tolerance.
  • Distributed - Routes queries across shards.
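For example, a SummingMergeTree table (hypothetical name and columns) collapses rows that share a sort key by summing the numeric columns during merges:

CREATE TABLE daily_totals
(
    event_date Date,
    action     String,
    hits       UInt64
) ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, action);

Because merges run asynchronously, queries should still wrap hits in sum(...) to combine rows that have not been merged yet.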

Data Types

ClickHouse supports rich data types:

  • Numeric: Int32, UInt64, Float64, Decimal
  • String: String, FixedString
  • Date & Time: Date, DateTime, DateTime64
  • UUID: Universally unique identifier
  • Array & Nested: Arrays of any type, Nested structures
  • Map: Key-value pair storage
  • Nullable(T): Any type can be wrapped as Nullable
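A sketch combining several of these types in one table (the name and columns are illustrative):

CREATE TABLE profiles
(
    id        UUID,
    tags      Array(String),
    attrs     Map(String, String),
    last_seen Nullable(DateTime)
) ENGINE = MergeTree
ORDER BY id;

Keep in mind that Nullable(T) stores an extra null mask per value, so prefer non-nullable columns with default values where the data model allows it.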

Built-in Functions

  • Aggregate: sum(), avg(), count(), quantile(), uniqExact()
  • Window: rank(), row_number(), lagInFrame() (moving averages via avg() OVER a row frame)
  • Array: arrayJoin(), arrayMap()
  • Time: toDate(), toStartOfMonth()
  • JSON: JSONExtract(), JSON_VALUE()
  • TTL: Automatic data expiration via a table-level TTL clause (column + INTERVAL) - a storage feature rather than a function.
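A quick illustration of a few of these functions, assuming the events table created in the Example Queries section below:

SELECT toStartOfMonth(event_date) AS month,
       uniqExact(user_id)         AS users,
       quantile(0.95)(value)      AS p95_value
FROM events
GROUP BY month
ORDER BY month;

Note the parametric aggregate syntax: quantile takes its level in the first set of parentheses and the column in the second.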

ClickHouse Write & Read Path

Write Path: Append-only Parts → Background Merges

  • Inserts create new immutable parts.
  • Background merges compact small parts and apply engine-specific logic (e.g., replacing or summing rows) along with pending mutations.
  • Unlike Cassandra, ClickHouse has no per-write consistency levels; with ReplicatedMergeTree, inserts are replicated asynchronously by default.

Read Path: Columnar Scan → Indexes → Aggregation

  • Queries scan only required columns.
  • Partition pruning + skipping indexes avoid scanning irrelevant blocks.
  • Vectorized execution boosts aggregation speed.
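You can check how much pruning a query actually gets with EXPLAIN (supported in recent ClickHouse versions); for the events table defined below:

EXPLAIN indexes = 1
SELECT count() FROM events WHERE event_date >= '2025-10-01';

The output shows which partitions and primary-key granules were selected versus skipped, which is a quick way to verify that your partition and sort keys are doing their job.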

Example Queries

Create Table

CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    action     String,
    value      Float32
) ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

Insert Data

INSERT INTO events (event_date, user_id, action, value)
VALUES ('2025-10-12', 12345, 'click', 1.0);

Aggregate Query

SELECT event_date, action, count(*) AS events, avg(value) AS avg_value
FROM events
WHERE event_date BETWEEN '2025-10-01' AND '2025-10-12'
GROUP BY event_date, action
ORDER BY event_date;

Delete Old Data

ALTER TABLE events DELETE WHERE event_date < '2025-01-01';
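ALTER TABLE ... DELETE is an asynchronous mutation that rewrites the affected parts. When old data aligns with the partition key, dropping whole partitions is far cheaper:

ALTER TABLE events DROP PARTITION 202412;

Here 202412 is a toYYYYMM partition ID; DROP PARTITION detaches and removes the data files directly, with no mutation or part rewrite.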

TTL for Automatic Cleanup

CREATE TABLE events_ttl
(
    event_date DateTime,
    user_id    UInt64,
    action     String
) ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)
TTL event_date + INTERVAL 90 DAY;

Practice Use Cases

  • Ad Click Analytics - (event_date, user_id, ad_id)
  • Log Processing - Store and query billions of log lines.
  • IoT/Time-Series - Efficiently handle sensor streams.
  • BI Dashboards - Real-time aggregations at scale.

Rebalancing in ClickHouse

Unlike Cassandra, ClickHouse does not auto-rebalance on shard changes:

  • Shard keys must be chosen up front.
  • To rebalance, reshard manually using INSERT INTO … SELECT into a new distributed table.
  • Replication keeps full copies of each shard's data on its replicas; it provides fault tolerance, not rebalancing.

This provides control but requires planning for scalability.
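A manual reshard is typically just an INSERT ... SELECT from the old distributed table into a new one with the desired sharding key (the table names here are hypothetical):

INSERT INTO events_distributed_new
SELECT * FROM events_distributed_old;

The Distributed engine on the target table routes each row to its shard according to the new table's sharding expression.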

Common Mistakes to Avoid

  • Using it like OLTP - ClickHouse is not for transactions.
  • Too many small parts - Batch inserts into thousands of rows or more; frequent tiny inserts create many parts and expensive merges.
  • Wrong primary key - Poor sort keys kill performance.
  • Overusing mutations - Frequent deletes/updates hurt efficiency.
  • No partition strategy - Without partitions, queries degrade over time.
  • Ignoring replication/sharding design - Must be planned at schema stage.

Final Thoughts

ClickHouse shines when you need analytics on massive datasets in real time. It’s the go-to choice for event pipelines, BI dashboards, log analytics, and IoT workloads, but not for transactional apps.

The key mindset shift: Think append-only, columnar, and query-driven schema design.

If your system needs instant insights from billions of rows, ClickHouse is one of the best tools in your arsenal.
