Add Projections page #3608
Merged
14 commits
6bdf57d add page on projections (Blargian)
6344b2b update example (Blargian)
110eb1e update query_log result (Blargian)
6894a2e fix typo in primary index of example image (Blargian)
7cf63c1 PRIMARY KEY -> ORDER BY (Blargian)
3ea0687 review comments (Blargian)
d119523 incorporate review feedback (Blargian)
5770ef8 update doc with runnable code blocks (Blargian)
99741fe Add projections to landing pages (Blargian)
06e3ea3 small fixes (Blargian)
9771894 Tweaks (gingerwizard)
1936c01 Update projections.md (gingerwizard)
3f07c5c Update projections.md (gingerwizard)
07f69bf Update projections.md (gingerwizard)
@@ -0,0 +1,382 @@
---
slug: /data-modeling/projections
title: 'Projections'
description: 'Page describing what projections are, how they can be used to improve
query performance, and how they differ from materialized views.'
keywords: ['projection', 'projections', 'query optimization']
---

import projections_1 from '@site/static/images/data-modeling/projections_1.png';
import projections_2 from '@site/static/images/data-modeling/projections_2.png';
import Image from '@theme/IdealImage';

# Projections

## Introduction {#introduction}

ClickHouse offers several mechanisms for speeding up analytical queries on large
amounts of data in real-time scenarios. One such mechanism is _Projections_.
Projections help optimize queries by reordering data by attributes of interest.
This can be:

1. A complete reordering
2. A subset of the original table with a different order
3. A precomputed aggregation (similar to a Materialized View) but with an ordering
   aligned to the aggregation.

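
As a rough illustration of these three forms, the sketch below declares one
projection of each kind directly in a `CREATE TABLE` statement. The table
`example_events`, its columns, and the projection names are hypothetical and are
not part of the datasets used on this page:

```sql
-- Hypothetical table used only to illustrate the three kinds of projection.
CREATE TABLE example_events
(
    event_time DateTime,
    user_id UInt64,
    country LowCardinality(String),
    amount Float64,
    -- 1. A complete reordering: all columns, sorted by user_id
    PROJECTION prj_by_user (SELECT * ORDER BY user_id),
    -- 2. A subset of the columns with a different order
    PROJECTION prj_amounts (SELECT user_id, amount ORDER BY amount),
    -- 3. A precomputed aggregation, grouped by country
    PROJECTION prj_sum_by_country (SELECT country, sum(amount) GROUP BY country)
)
ENGINE = MergeTree
ORDER BY event_time;
```

Projections declared this way are populated as rows are inserted, so no separate
materialization step is needed for a table that starts out empty.
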
## How do Projections work? {#how-do-projections-work}

Practically, a Projection can be thought of as an additional, hidden table attached
to the original table. The projection can have a different row order, and therefore
a different primary index, from that of the original table, and it can automatically
and incrementally pre-compute aggregate values. As a result, using Projections
provides two "tuning knobs" for speeding up query execution:

- **Properly using primary indexes**
- **Pre-computing aggregates**

Projections are in some ways similar to [Materialized Views](/materialized-views),
which also allow you to have multiple row orders and pre-compute aggregations
at insert time.
Projections are automatically updated and
kept in sync with the original table, unlike Materialized Views, which are
explicitly updated. When a query targets the original table,
ClickHouse automatically samples the primary keys and chooses a table that can
generate the same correct result but requires the least amount of data to be
read, as shown in the figure below:

<Image img={projections_1} size="lg" alt="Projections in ClickHouse"/>

## Examples {#examples}

### Filtering on columns which aren't in the primary key {#filtering-without-using-primary-keys}

In this example, we'll show you how to add a projection to a table.
We'll also look at how the projection can be used to speed up queries which filter
on columns which are not in the primary key of a table.

For this example, we'll be using the New York Taxi Data
dataset available at [sql.clickhouse.com](https://sql.clickhouse.com), which is ordered
by `pickup_datetime`.

Let's write a simple query to find all the trip IDs for which passengers
tipped their driver more than $200:

```sql runnable
SELECT
    tip_amount,
    trip_id,
    dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min
FROM nyc_taxi.trips WHERE tip_amount > 200 AND trip_duration_min > 0
ORDER BY tip_amount, trip_id ASC
```

Notice that because we are filtering on `tip_amount`, which is not in the `ORDER BY`, ClickHouse
had to do a full table scan. Let's speed this query up.

To preserve the original table and results, we'll create a new table and copy the data using an `INSERT INTO SELECT`:

```sql
CREATE TABLE nyc_taxi.trips_with_projection AS nyc_taxi.trips;
INSERT INTO nyc_taxi.trips_with_projection SELECT * FROM nyc_taxi.trips;
```

To add a projection, we use the `ALTER TABLE` statement together with the `ADD PROJECTION`
clause:

```sql
ALTER TABLE nyc_taxi.trips_with_projection
ADD PROJECTION prj_tip_amount
(
    SELECT *
    ORDER BY tip_amount, dateDiff('minutes', pickup_datetime, dropoff_datetime)
)
```

After adding a projection, it is necessary to run the `MATERIALIZE PROJECTION`
statement so that the data in it is physically ordered and rewritten according
to the query specified above:

```sql
ALTER TABLE nyc_taxi.trips_with_projection MATERIALIZE PROJECTION prj_tip_amount
```

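
`MATERIALIZE PROJECTION` runs as a mutation, so by default the statement can
return before the rewrite has finished. One way to check on progress, sketched
here under the assumption that the database and table names match those used
above, is to query the `system.mutations` table:

```sql
SELECT
    command,
    create_time,
    parts_to_do,  -- number of data parts still waiting to be rewritten
    is_done       -- 1 once the mutation has completed
FROM system.mutations
WHERE database = 'nyc_taxi' AND table = 'trips_with_projection'
ORDER BY create_time DESC
```

Until `is_done` is `1`, some existing parts may not yet contain the projection,
and queries that touch them may not benefit from it.
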
Let's run the query again now that we've added the projection:

```sql runnable
SELECT
    tip_amount,
    trip_id,
    dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min
FROM nyc_taxi.trips_with_projection WHERE tip_amount > 200 AND trip_duration_min > 0
ORDER BY tip_amount, trip_id ASC
```

Notice how we were able to decrease the query time substantially and needed to scan
fewer rows.

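
If we would rather have the query fail than silently fall back to a full scan,
one option is the `force_optimize_projection` setting. The sketch below assumes
the setting is available in your ClickHouse version and that projection
optimization is enabled; with it set, ClickHouse raises an exception when no
projection can be used for the query:

```sql
SELECT
    tip_amount,
    trip_id,
    dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min
FROM nyc_taxi.trips_with_projection
WHERE tip_amount > 200 AND trip_duration_min > 0
ORDER BY tip_amount, trip_id ASC
-- Fail instead of scanning the base table when no projection applies
SETTINGS force_optimize_projection = 1
```

This is mainly useful while testing, since an error in production is rarely
preferable to a slower but correct result.
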
We can confirm that our query above did indeed use the projection we made by
querying the `system.query_log` table:

```sql
SELECT query, projections
FROM system.query_log
WHERE query_id='<query_id>'
```

```response
┌─query─────────────────────────────────────────────────────────────────────────┬─projections──────────────────────┐
│ SELECT ↴│ ['default.trips.prj_tip_amount'] │
│↳ tip_amount, ↴│ │
│↳ trip_id, ↴│ │
│↳ dateDiff('minutes', pickup_datetime, dropoff_datetime) AS trip_duration_min↴│ │
│↳FROM trips WHERE tip_amount > 200 AND trip_duration_min > 0 │ │
└───────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────┘
```

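
The `<query_id>` placeholder above has to be filled in by hand. If the ID is not
at hand, a sketch like the following looks up recent finished queries against
the table instead. It assumes that query logging is enabled and that the table
name matches the one used in this example; `SYSTEM FLUSH LOGS` is included
because log entries are written asynchronously:

```sql
-- Make sure recent entries have been written to system.query_log
SYSTEM FLUSH LOGS;

SELECT
    query_id,
    event_time,
    read_rows,
    projections
FROM system.query_log
WHERE type = 'QueryFinish'
  AND has(tables, 'nyc_taxi.trips_with_projection')
ORDER BY event_time DESC
LIMIT 5
```

An empty `projections` array for a query means that no projection was used for it.
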
### Using projections to speed up UK price paid queries {#using-projections-to-speed-up-UK-price-paid}

To demonstrate how projections can be used to improve query performance, let's
take a look at an example using a real-life dataset. For this example we'll be
using the table from our [UK Property Price Paid](https://clickhouse.com/docs/getting-started/example-datasets/uk-price-paid)
tutorial with 30.03 million rows. This dataset is also available within our
[sql.clickhouse.com](https://sql.clickhouse.com/?query_id=6IDMHK3OMR1C97J6M9EUQS)
environment.

If you would like to see how the table was created and data inserted, you can
refer to ["The UK property prices dataset"](/getting-started/example-datasets/uk-price-paid)
page.

We can run two simple queries on this dataset. The first lists the counties in London which
have the highest prices paid, and the second calculates the average price for the counties:

```sql runnable
SELECT
    county,
    price
FROM uk.uk_price_paid
WHERE town = 'LONDON'
ORDER BY price DESC
LIMIT 3
```

```sql runnable
SELECT
    county,
    avg(price)
FROM uk.uk_price_paid
GROUP BY county
ORDER BY avg(price) DESC
LIMIT 3
```

Notice that, although both queries are very fast, each one performed a full table scan of all
30.03 million rows, because neither `town` nor `price` was in the `ORDER BY` clause when we
created the table:

```sql
CREATE TABLE uk.uk_price_paid
(
    ...
)
ENGINE = MergeTree
--highlight-next-line
ORDER BY (postcode1, postcode2, addr1, addr2);
```

Let's see if we can speed these queries up using projections.

To preserve the original table and results, we'll create a new table and copy the data using an `INSERT INTO SELECT`:

```sql
CREATE TABLE uk.uk_price_paid_with_projections AS uk.uk_price_paid;
INSERT INTO uk.uk_price_paid_with_projections SELECT * FROM uk.uk_price_paid;
```

We create and populate projection `prj_obj_town_price`, which produces an
additional (hidden) table with a primary index, ordering by town and price, to
optimize the query that lists the counties in a specific town for the highest
paid prices:

```sql
ALTER TABLE uk.uk_price_paid_with_projections
    (ADD PROJECTION prj_obj_town_price
    (
        SELECT *
        ORDER BY
            town,
            price
    ))
```

```sql
ALTER TABLE uk.uk_price_paid_with_projections
    (MATERIALIZE PROJECTION prj_obj_town_price)
SETTINGS mutations_sync = 1
```

The [`mutations_sync`](/operations/settings/settings#mutations_sync) setting is
used to force synchronous execution.

We create and populate projection `prj_gby_county`, an additional (hidden) table
that incrementally pre-computes the `avg(price)` aggregate values for all existing
130 UK counties:

```sql
ALTER TABLE uk.uk_price_paid_with_projections
    (ADD PROJECTION prj_gby_county
    (
        SELECT
            county,
            avg(price)
        GROUP BY county
    ))
```

```sql
ALTER TABLE uk.uk_price_paid_with_projections
    (MATERIALIZE PROJECTION prj_gby_county)
SETTINGS mutations_sync = 1
```

:::note
If there is a `GROUP BY` clause used in a projection like in the `prj_gby_county`
projection above, then the underlying storage engine for the (hidden) table
becomes `AggregatingMergeTree`, and all aggregate functions are converted to
`AggregateFunction`. This ensures proper incremental data aggregation.
:::

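
To make the note above more concrete, the sketch below builds a roughly
comparable structure by hand, using an `AggregatingMergeTree` table and the
`-State` / `-Merge` combinators. The table name `uk.county_avg_price_sketch` is
hypothetical, and the column types are assumed to match the UK price paid
tutorial (`county` as `LowCardinality(String)`, `price` as `UInt32`). The hidden
table that ClickHouse maintains for a projection is managed internally and is
not created this way by the user:

```sql
-- Hypothetical stand-in for the hidden table behind a GROUP BY projection
CREATE TABLE uk.county_avg_price_sketch
(
    county LowCardinality(String),
    -- Partial aggregation state for avg(price), not a final number
    avg_price_state AggregateFunction(avg, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY county;

-- Store intermediate states with the -State combinator
INSERT INTO uk.county_avg_price_sketch
SELECT
    county,
    avgState(price)
FROM uk.uk_price_paid
GROUP BY county;

-- Combine the stored states into final values with the -Merge combinator
SELECT
    county,
    avgMerge(avg_price_state) AS avg_price
FROM uk.county_avg_price_sketch
GROUP BY county
ORDER BY avg_price DESC
LIMIT 3
```

Storing states rather than final averages is what allows rows from later inserts
to be merged in correctly.
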
The figure below is a visualization of the main table `uk_price_paid_with_projections`
and its two projections:

<Image img={projections_2} size="lg" alt="Visualization of the main table uk_price_paid_with_projections and its two projections"/>

If we now run the query that lists the counties in London for the three highest
paid prices again, we see an improvement in query performance:

```sql runnable
SELECT
    county,
    price
FROM uk.uk_price_paid_with_projections
WHERE town = 'LONDON'
ORDER BY price DESC
LIMIT 3
```

Likewise, for the query that lists the U.K. counties with the three highest
average-paid prices:

```sql runnable
SELECT
    county,
    avg(price)
FROM uk.uk_price_paid_with_projections
GROUP BY county
ORDER BY avg(price) DESC
LIMIT 3
```

Note that both queries target the original table, and that both queries resulted
in a full table scan (all 30.03 million rows got streamed from disk) before we
created the two projections.

Also, note that the query that lists the counties in London for the three highest
paid prices is streaming 2.17 million rows. When we directly used a second table
optimized for this query, only 81.92 thousand rows were streamed from disk.

The reason for the difference is that currently, the `optimize_read_in_order`
optimization isn't supported for projections.

We inspect the `system.query_log` table to see that ClickHouse
automatically used the two projections for the two queries above (see the
projections column below):

```sql
SELECT
    tables,
    query,
    query_duration_ms::String || ' ms' AS query_duration,
    formatReadableQuantity(read_rows) AS read_rows,
    projections
FROM clusterAllReplicas(default, system.query_log)
WHERE (type = 'QueryFinish') AND (tables = ['uk.uk_price_paid_with_projections'])
ORDER BY initial_query_start_time DESC
LIMIT 2
FORMAT Vertical
```

```response
Row 1:
──────
tables: ['uk.uk_price_paid_with_projections']
query: SELECT
    county,
    avg(price)
FROM uk_price_paid_with_projections
GROUP BY county
ORDER BY avg(price) DESC
LIMIT 3
query_duration: 5 ms
read_rows: 132.00
projections: ['uk.uk_price_paid_with_projections.prj_gby_county']

Row 2:
──────
tables: ['uk.uk_price_paid_with_projections']
query: SELECT
    county,
    price
FROM uk_price_paid_with_projections
WHERE town = 'LONDON'
ORDER BY price DESC
LIMIT 3
SETTINGS log_queries=1
query_duration: 11 ms
read_rows: 2.29 million
projections: ['uk.uk_price_paid_with_projections.prj_obj_town_price']

2 rows in set. Elapsed: 0.006 sec.
```

## When to use Projections? {#when-to-use-projections}

Projections are an appealing feature for new users as they are automatically
maintained as data is inserted. Furthermore, queries can just be sent to a
single table, and the projections are exploited where possible to speed up
the response time.

This is in contrast to Materialized Views, where the user has to select the
appropriate optimized target table or rewrite their query, depending on the
filters. This places greater emphasis on user applications and increases
client-side complexity.

Despite these advantages, projections come with some inherent limitations that
users should be aware of, and should therefore be deployed sparingly.

- Projections don't allow using different TTLs for the source table and the
  (hidden) target table; Materialized Views allow different TTLs.
- Projections don't currently support `optimize_read_in_order` for the (hidden)
  target table.
- Lightweight updates and deletes are not supported for tables with projections.
- Materialized Views can be chained: the target table of one Materialized View
  can be the source table of another Materialized View, and so on. This is not
  possible with projections.
- Projections don't support joins, but Materialized Views do.
- Projections don't support filters (`WHERE` clause), but Materialized Views do.

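
For instance, the filter limitation in the last item can be worked around with a
Materialized View. The sketch below is illustrative only: the target table
`uk.london_prices_sketch` and the view `uk.london_prices_mv_sketch` are
hypothetical names, and the column types are assumed to match the UK price paid
tutorial:

```sql
-- Hypothetical target table holding only London rows, ordered by price
CREATE TABLE uk.london_prices_sketch
(
    county LowCardinality(String),
    price UInt32
)
ENGINE = MergeTree
ORDER BY price;

-- The view filters rows at insert time, which a projection cannot do
CREATE MATERIALIZED VIEW uk.london_prices_mv_sketch
TO uk.london_prices_sketch
AS SELECT
    county,
    price
FROM uk.uk_price_paid
WHERE town = 'LONDON';
```

Unlike with a projection, queries must then target `uk.london_prices_sketch`
explicitly, and only rows inserted into `uk.uk_price_paid` after the view is
created are captured.
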
We recommend using projections when:

- A complete re-ordering of the data is required. While the expression in the
  projection can, in theory, use a `GROUP BY`, materialized views are more
  effective for maintaining aggregates. The query optimizer is also more likely
  to exploit projections that use a simple reordering, i.e., `SELECT * ORDER BY x`.
  Users can select a subset of columns in this expression to reduce storage
  footprint.
- Users are comfortable with the associated increase in storage footprint and
  overhead of writing data twice. Test the impact on insertion speed and
  [evaluate the storage overhead](/data-compression/compression-in-clickhouse).

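
As a starting point for evaluating that overhead, the sketch below sums the
on-disk size of each projection using the `system.projection_parts` table. It
assumes the table from the example above and assumes that the `name` column
holds the projection name for each projection part; verify this against your
ClickHouse version before relying on the numbers:

```sql
SELECT
    name AS projection_name,  -- assumed to be the projection name
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.projection_parts
WHERE database = 'uk'
  AND table = 'uk_price_paid_with_projections'
  AND active  -- ignore parts that have already been merged away
GROUP BY name
```

If a projection turns out not to be worth its cost, it can be removed with
`ALTER TABLE uk.uk_price_paid_with_projections DROP PROJECTION prj_gby_county`.
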
## Related content {#related-content}
- [A Practical Introduction to Primary Indexes in ClickHouse](/guides/best-practices/sparse-primary-indexes#option-3-projections)
- [Materialized Views](/docs/materialized-views)
- [ALTER PROJECTION](/sql-reference/statements/alter/projection)