Add Projections page #3608

Merged
merged 14 commits into from
Apr 10, 2025
1 change: 1 addition & 0 deletions docs/data-modeling/index.md
@@ -14,5 +14,6 @@ This section is about data modeling in ClickHouse and contains the following top
| [Schema Design](/data-modeling/schema-design) | Discusses ClickHouse schema design for optimal performance, considering factors like queries, data updates, latency, and volume. |
| [Dictionary](/dictionary) | An explainer on how to define and use dictionaries to improve query performance and enrich data. |
| [Materialized Views](/materialized-views) | Information on Materialized Views and Refreshable Materialized Views in ClickHouse. |
| [Projections](/data-modeling/projections) | Information on Projections in ClickHouse. |
| [Data Compression](/data-compression/compression-in-clickhouse) | Discusses various compression modes in ClickHouse and how to optimize data storage and query performance by choosing the right compression method for your specific data types and workloads. |
| [Denormalizing Data](/data-modeling/denormalization) | Discusses the denormalization approach used in ClickHouse which aims to improve query performance by storing related data in a single table. |
382 changes: 382 additions & 0 deletions docs/data-modeling/projections.md
@@ -0,0 +1,382 @@
---
slug: /data-modeling/projections
title: 'Projections'
description: 'Page describing what projections are, how they can be used to improve
query performance, and how they differ from materialized views.'
keywords: ['projection', 'projections', 'query optimization']
---

import projections_1 from '@site/static/images/data-modeling/projections_1.png';
import projections_2 from '@site/static/images/data-modeling/projections_2.png';
import Image from '@theme/IdealImage';

# Projections

## Introduction {#introduction}

ClickHouse offers various mechanisms for speeding up analytical queries on large
amounts of data in real-time scenarios. One such mechanism is the use of
_Projections_. Projections help optimize queries by storing the data reordered
by attributes of interest. This can be:

1. A complete reordering
2. A subset of the original table with a different order
3. A precomputed aggregation (similar to a Materialized View) but with an ordering
aligned to the aggregation.
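
As a rough illustration of these three forms, the sketch below declares one projection of each kind on a small table. The table `events`, its columns, and the projection names are hypothetical and exist only for this example.

```sql
-- Hypothetical table used only to illustrate the three kinds of projection.
CREATE TABLE events
(
    user_id     UInt64,
    event_time  DateTime,
    url         String,
    duration_ms UInt32,

    -- 1. A complete reordering: every column, sorted by user_id.
    PROJECTION prj_by_user
    (
        SELECT * ORDER BY user_id
    ),

    -- 2. A subset of the columns with a different order.
    PROJECTION prj_durations
    (
        SELECT user_id, duration_ms ORDER BY duration_ms
    ),

    -- 3. A precomputed aggregation, stored per url.
    PROJECTION prj_total_by_url
    (
        SELECT url, sum(duration_ms) GROUP BY url
    )
)
ENGINE = MergeTree
ORDER BY event_time;
```

Declaring projections at table creation means every inserted part carries them from the start, so no separate materialization step should be needed for this table.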

## How do Projections work? {#how-do-projections-work}

Practically, a Projection can be thought of as an additional, hidden table
alongside the original table. The projection can have a different row order, and
therefore a different primary index, from that of the original table, and it can
automatically and incrementally pre-compute aggregate values. As a result, using
Projections provides two "tuning knobs" for speeding up query execution:

- **Properly using primary indexes**
- **Pre-computing aggregates**

Projections are in some ways similar to [Materialized Views](/materialized-views),
which also allow you to have multiple row orders and pre-compute aggregations
at insert time.
Projections are automatically updated and
kept in sync with the original table, unlike Materialized Views, which are
explicitly updated. When a query targets the original table,
ClickHouse automatically samples the primary keys and chooses a table that can
generate the same correct result but requires the least amount of data to be
read, as shown in the figure below:

<Image img={projections_1} size="lg" alt="Projections in ClickHouse"/>

## Examples {#examples}

### Filtering on columns which aren't in the primary key {#filtering-without-using-primary-keys}

In this example, we'll show you how to add a projection to a table.
We'll also look at how the projection can be used to speed up queries which filter
on columns which are not in the primary key of a table.

For this example, we'll be using the New York Taxi Data
dataset available at [sql.clickhouse.com](https://sql.clickhouse.com), which is
ordered by `pickup_datetime`.

Let's write a simple query to find all the trip IDs for which passengers
tipped their driver more than $200:

```sql runnable
SELECT
tip_amount,
trip_id,
dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min
FROM nyc_taxi.trips WHERE tip_amount > 200 AND trip_duration_min > 0
ORDER BY tip_amount, trip_id ASC
```

Notice that because we are filtering on `tip_amount`, which is not in the table's `ORDER BY`, ClickHouse
has to do a full table scan. Let's speed this query up.

To preserve the original table and results, we'll create a new table and copy the data using an `INSERT INTO SELECT`:

```sql
CREATE TABLE nyc_taxi.trips_with_projection AS nyc_taxi.trips;
INSERT INTO nyc_taxi.trips_with_projection SELECT * FROM nyc_taxi.trips;
```

To add a projection, we use the `ALTER TABLE` statement together with the
`ADD PROJECTION` clause:

```sql
ALTER TABLE nyc_taxi.trips_with_projection
ADD PROJECTION prj_tip_amount
(
SELECT *
ORDER BY tip_amount, dateDiff('minutes', pickup_datetime, dropoff_datetime)
)
```

Adding a projection only affects newly inserted data, so after adding it we need
to run the `MATERIALIZE PROJECTION` statement so that the existing data is
physically ordered and rewritten according to the query specified above:

```sql
ALTER TABLE nyc_taxi.trips_with_projection MATERIALIZE PROJECTION prj_tip_amount
```
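
Materializing a projection runs as a mutation, so on a large table it may take a while to finish. A minimal sketch for checking its progress, assuming the table lives in the `nyc_taxi` database as created earlier, is to query the `system.mutations` table:

```sql
-- List the mutations for the table, most recent first.
-- is_done = 1 means the projection has been written for all existing parts;
-- parts_to_do counts the parts that still need to be rewritten.
SELECT
    command,
    is_done,
    parts_to_do,
    latest_fail_reason
FROM system.mutations
WHERE database = 'nyc_taxi' AND table = 'trips_with_projection'
ORDER BY create_time DESC;
```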

Let's run the query again now that we've added the projection:

```sql runnable
SELECT
tip_amount,
trip_id,
dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min
FROM nyc_taxi.trips_with_projection WHERE tip_amount > 200 AND trip_duration_min > 0
ORDER BY tip_amount, trip_id ASC
```

Notice how we were able to decrease the query time substantially, and needed to scan
fewer rows.

We can confirm that our query above did indeed use the projection we made by
querying the `system.query_log` table:

```sql
SELECT query, projections
FROM system.query_log
WHERE query_id='<query_id>'
```

```response
┌─query─────────────────────────────────────────────────────────────────────────┬─projections──────────────────────┐
│ SELECT ↴│ ['default.trips.prj_tip_amount'] │
│↳ tip_amount, ↴│ │
│↳ trip_id, ↴│ │
│↳ dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min↴│ │
│↳FROM trips WHERE tip_amount > 200 AND trip_duration_min > 0 │ │
└───────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────┘
```
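
Another way to check, sketched here under the assumption that your ClickHouse version provides the `force_optimize_projection` setting, is to make the query fail when no projection can be used. `EXPLAIN` can also show where the query reads from, although the exact output format differs between versions.

```sql
-- With force_optimize_projection = 1 the query throws an exception
-- instead of silently falling back to the base table when no projection
-- can serve it.
SELECT
    tip_amount,
    trip_id,
    dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min
FROM nyc_taxi.trips_with_projection
WHERE tip_amount > 200 AND trip_duration_min > 0
ORDER BY tip_amount, trip_id ASC
SETTINGS force_optimize_projection = 1;

-- Inspect the query plan and look for the projection name in the read step.
EXPLAIN indexes = 1
SELECT trip_id
FROM nyc_taxi.trips_with_projection
WHERE tip_amount > 200;
```

The setting is mainly useful in tests, where a silent fallback to a full scan would otherwise go unnoticed.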

### Using projections to speed up UK price paid queries {#using-projections-to-speed-up-UK-price-paid}

To demonstrate how projections can be used to speed up query performance, let's
take a look at an example using a real-life dataset. For this example we'll be
using the table from our [UK Property Price Paid](https://clickhouse.com/docs/getting-started/example-datasets/uk-price-paid)
tutorial with 30.03 million rows. This dataset is also available within our
[sql.clickhouse.com](https://sql.clickhouse.com/?query_id=6IDMHK3OMR1C97J6M9EUQS)
environment.

If you would like to see how the table was created and data inserted, you can
refer to ["The UK property prices dataset"](/getting-started/example-datasets/uk-price-paid)
page.

We can run two simple queries on this dataset. The first lists the counties in London with
the highest prices paid, and the second calculates the average price for each county:

```sql runnable
SELECT
county,
price
FROM uk.uk_price_paid
WHERE town = 'LONDON'
ORDER BY price DESC
LIMIT 3
```

```sql runnable
SELECT
county,
avg(price)
FROM uk.uk_price_paid
GROUP BY county
ORDER BY avg(price) DESC
LIMIT 3
```

Notice that, despite being very fast, both queries performed a full table scan of all 30.03 million rows,
because neither `town` nor `price` was in the `ORDER BY` clause when we
created the table:

```sql
CREATE TABLE uk.uk_price_paid
(
...
)
ENGINE = MergeTree
--highlight-next-line
ORDER BY (postcode1, postcode2, addr1, addr2);
```

Let's see if we can speed these queries up using projections.

To preserve the original table and results, we'll create a new table and copy the data using an `INSERT INTO SELECT`:

```sql
CREATE TABLE uk.uk_price_paid_with_projections AS uk.uk_price_paid;
INSERT INTO uk.uk_price_paid_with_projections SELECT * FROM uk.uk_price_paid;
```

We create and populate the projection `prj_obj_town_price`, which produces an
additional (hidden) table with a primary index, ordered by town and price, to
optimize the query that lists the counties in a specific town for the highest
paid prices:

```sql
ALTER TABLE uk.uk_price_paid_with_projections
(ADD PROJECTION prj_obj_town_price
(
SELECT *
ORDER BY
town,
price
))
```

```sql
ALTER TABLE uk.uk_price_paid_with_projections
(MATERIALIZE PROJECTION prj_obj_town_price)
SETTINGS mutations_sync = 1
```

The [`mutations_sync`](/operations/settings/settings#mutations_sync) setting is
used to force synchronous execution.

We create and populate the projection `prj_gby_county`, an additional (hidden) table
that incrementally pre-computes the `avg(price)` aggregate values for all existing
130 UK counties:

```sql
ALTER TABLE uk.uk_price_paid_with_projections
(ADD PROJECTION prj_gby_county
(
SELECT
county,
avg(price)
GROUP BY county
))
```

```sql
ALTER TABLE uk.uk_price_paid_with_projections
(MATERIALIZE PROJECTION prj_gby_county)
SETTINGS mutations_sync = 1
```

:::note
If there is a `GROUP BY` clause used in a projection like in the `prj_gby_county`
projection above, then the underlying storage engine for the (hidden) table
becomes `AggregatingMergeTree`, and all aggregate functions are converted to
`AggregateFunction`. This ensures proper incremental data aggregation.
:::
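
To make the note above more concrete, the following sketch shows roughly what a comparable setup would look like if built by hand with a separate table and a Materialized View. This is only a conceptual analogy, not what ClickHouse creates internally. The names `county_avg_price` and `county_avg_price_mv` are hypothetical, and the column types are assumed to match those of the tutorial's table.

```sql
-- Hypothetical target table holding partial aggregation states per county.
CREATE TABLE uk.county_avg_price
(
    county    LowCardinality(String),
    avg_price AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY county;

-- Hypothetical view that writes an avg state for each inserted block.
CREATE MATERIALIZED VIEW uk.county_avg_price_mv TO uk.county_avg_price AS
SELECT
    county,
    avgState(price) AS avg_price
FROM uk.uk_price_paid
GROUP BY county;

-- Reading back requires merging the stored states explicitly.
SELECT
    county,
    avgMerge(avg_price) AS avg_price_value
FROM uk.county_avg_price
GROUP BY county
ORDER BY avg_price_value DESC
LIMIT 3;
```

With a projection, the state handling and the rewrite to a merge at query time happen automatically, which is why the original `avg(price)` query can stay unchanged.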

The figure below is a visualization of the main table `uk_price_paid_with_projections`
and its two projections:

<Image img={projections_2} size="lg" alt="Visualization of the main table uk_price_paid_with_projections and its two projections"/>

If we now run the query that lists the counties in London for the three highest
paid prices again, we see an improvement in query performance:

```sql runnable
SELECT
county,
price
FROM uk.uk_price_paid_with_projections
WHERE town = 'LONDON'
ORDER BY price DESC
LIMIT 3
```

Likewise, for the query that lists the U.K. counties with the three highest
average-paid prices:

```sql runnable
SELECT
county,
avg(price)
FROM uk.uk_price_paid_with_projections
GROUP BY county
ORDER BY avg(price) DESC
LIMIT 3
```

Note that both queries target the original table, and that before we created
the two projections both resulted in a full table scan (all 30.03 million rows
were streamed from disk).

Also note that the query that lists the counties in London for the three highest
paid prices streams 2.17 million rows. When we directly used a second table
optimized for this query, only 81.92 thousand rows were streamed from disk.

The reason for the difference is that the `optimize_read_in_order`
optimization isn’t currently supported for projections.

We can inspect the `system.query_log` table to see that ClickHouse
automatically used the two projections for the two queries above (see the
`projections` column below):

```sql
SELECT
tables,
query,
query_duration_ms::String || ' ms' AS query_duration,
formatReadableQuantity(read_rows) AS read_rows,
projections
FROM clusterAllReplicas(default, system.query_log)
WHERE (type = 'QueryFinish') AND (tables = ['uk.uk_price_paid_with_projections'])
ORDER BY initial_query_start_time DESC
LIMIT 2
FORMAT Vertical
```

```response
Row 1:
──────
tables: ['uk.uk_price_paid_with_projections']
query: SELECT
county,
avg(price)
FROM uk_price_paid_with_projections
GROUP BY county
ORDER BY avg(price) DESC
LIMIT 3
query_duration: 5 ms
read_rows: 132.00
projections: ['uk.uk_price_paid_with_projections.prj_gby_county']

Row 2:
──────
tables: ['uk.uk_price_paid_with_projections']
query: SELECT
county,
price
FROM uk_price_paid_with_projections
WHERE town = 'LONDON'
ORDER BY price DESC
LIMIT 3
SETTINGS log_queries=1
query_duration: 11 ms
read_rows: 2.29 million
projections: ['uk.uk_price_paid_with_projections.prj_obj_town_price']

2 rows in set. Elapsed: 0.006 sec.
```

## When to use Projections? {#when-to-use-projections}

Projections are an appealing feature for new users as they are automatically
maintained as data is inserted. Furthermore, queries can just be sent to a
single table where the projections are exploited where possible to speed up
the response time.

This is in contrast to Materialized Views, where the user has to select the
appropriate optimized target table or rewrite their query, depending on the
filters. This places greater emphasis on user applications and increases
client-side complexity.

Despite these advantages, projections come with some inherent limitations which
users should be aware of, and they should therefore be deployed sparingly.

- Projections don't allow using different TTLs for the source table and the
(hidden) target table, whereas Materialized Views do.
- Projections don't currently support `optimize_read_in_order` for the (hidden)
target table.
- Lightweight updates and deletes are not supported for tables with projections.
- Materialized Views can be chained: the target table of one Materialized View
can be the source table of another Materialized View, and so on. This is not
possible with projections.
- Projections don't support joins, but Materialized Views do.
- Projections don't support filters (`WHERE` clause), but Materialized Views do.

We recommend using projections when:

- A complete re-ordering of the data is required. While the expression in the
projection can, in theory, use a `GROUP BY`, Materialized Views are more
effective for maintaining aggregates. The query optimizer is also more likely
to exploit projections that use a simple reordering, i.e., `SELECT * ORDER BY x`.
Users can select a subset of columns in this expression to reduce the storage
footprint.
- Users are comfortable with the associated increase in storage footprint and
overhead of writing data twice. Test the impact on insertion speed and
[evaluate the storage overhead](/data-compression/compression-in-clickhouse).
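
One way to estimate that overhead, sketched here on the assumption that the `system.projection_parts` table is available in your version and using the table from the UK price paid example, is to sum the on-disk size of each projection's active parts:

```sql
-- Rows and on-disk size per projection for one table.
-- Only active parts are counted, so parts already replaced by merges
-- are ignored.
SELECT
    name AS projection,
    formatReadableQuantity(sum(rows)) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.projection_parts
WHERE database = 'uk'
  AND table = 'uk_price_paid_with_projections'
  AND active
GROUP BY name;
```

Comparing these figures with the base table's size in `system.parts` gives a rough sense of how much extra storage each projection costs.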

## Related content {#related-content}
- [A Practical Introduction to Primary Indexes in ClickHouse](/guides/best-practices/sparse-primary-indexes#option-3-projections)
- [Materialized Views](/materialized-views)
- [ALTER PROJECTION](/sql-reference/statements/alter/projection)