diff --git a/docs/about-us/adopters.md b/docs/about-us/adopters.md index 112493c1720..43fcc059334 100644 --- a/docs/about-us/adopters.md +++ b/docs/about-us/adopters.md @@ -6,7 +6,7 @@ sidebar_position: 60 description: 'A list of companies using ClickHouse and their success stories' --- -The following list of companies using ClickHouse and their success stories is assembled from public sources, thus might differ from current reality. We’d appreciate it if you share the story of adopting ClickHouse in your company and [add it to the list](https://github.com/ClickHouse/clickhouse-docs/blob/main/docs/about-us/adopters.md), but please make sure you won’t have any NDA issues by doing so. Providing updates with publications from other companies is also useful. +The following list of companies using ClickHouse and their success stories is assembled from public sources, thus might differ from current reality. We'd appreciate it if you share the story of adopting ClickHouse in your company and [add it to the list](https://github.com/ClickHouse/clickhouse-docs/blob/main/docs/about-us/adopters.md), but please make sure you won't have any NDA issues by doing so. Providing updates with publications from other companies is also useful.
diff --git a/docs/about-us/distinctive-features.md b/docs/about-us/distinctive-features.md index eeb418e13ac..5e08106b61d 100644 --- a/docs/about-us/distinctive-features.md +++ b/docs/about-us/distinctive-features.md @@ -78,7 +78,7 @@ ClickHouse provides various ways to trade accuracy for performance: ## Adaptive Join Algorithm {#adaptive-join-algorithm} -ClickHouse adaptively chooses how to [JOIN](../sql-reference/statements/select/join.md) multiple tables, by preferring hash-join algorithm and falling back to the merge-join algorithm if there’s more than one large table. +ClickHouse adaptively chooses how to [JOIN](../sql-reference/statements/select/join.md) multiple tables, by preferring hash-join algorithm and falling back to the merge-join algorithm if there's more than one large table. ## Data Replication and Data Integrity Support {#data-replication-and-data-integrity-support} diff --git a/docs/about-us/history.md b/docs/about-us/history.md index 1a1e16497db..3546af768e7 100644 --- a/docs/about-us/history.md +++ b/docs/about-us/history.md @@ -37,7 +37,7 @@ There is a widespread opinion that to calculate statistics effectively, you must However data aggregation comes with a lot of limitations: - You must have a pre-defined list of required reports. -- The user can’t make custom reports. +- The user can't make custom reports. - When aggregating over a large number of distinct keys, the data volume is barely reduced, so aggregation is useless. - For a large number of reports, there are too many aggregation variations (combinatorial explosion). - When aggregating keys with high cardinality (such as URLs), the volume of data is not reduced by much (less than twofold). diff --git a/docs/about-us/support.md b/docs/about-us/support.md index 36fa685b600..9d832acb728 100644 --- a/docs/about-us/support.md +++ b/docs/about-us/support.md @@ -28,7 +28,7 @@ Customers can only log Severity 3 tickets for single replica services across tie You can also subscribe to our [status page](https://status.clickhouse.com) to get notified quickly about any incidents affecting our platform. :::note -Please note that only Subscription Customers have a Service Level Agreement on Support Incidents. If you are not currently a ClickHouse Cloud user – while we will try to answer your question, we’d encourage you to go instead to one of our Community resources: +Please note that only Subscription Customers have a Service Level Agreement on Support Incidents. If you are not currently a ClickHouse Cloud user – while we will try to answer your question, we'd encourage you to go instead to one of our Community resources: - [ClickHouse Community Slack Channel](https://clickhouse.com/slack) - [Other Community Options](https://github.com/ClickHouse/ClickHouse/blob/master/README.md#useful-links) diff --git a/docs/architecture/cluster-deployment.md b/docs/architecture/cluster-deployment.md index 574b9390303..142498ffd6f 100644 --- a/docs/architecture/cluster-deployment.md +++ b/docs/architecture/cluster-deployment.md @@ -3,12 +3,12 @@ slug: /architecture/cluster-deployment sidebar_label: 'Cluster Deployment' sidebar_position: 100 title: 'Cluster Deployment' -description: 'By going through this tutorial, you’ll learn how to set up a simple ClickHouse cluster.' +description: 'By going through this tutorial, you''ll learn how to set up a simple ClickHouse cluster.' 
--- This tutorial assumes you've already set up a [local ClickHouse server](../getting-started/install.md) -By going through this tutorial, you’ll learn how to set up a simple ClickHouse cluster. It’ll be small, but fault-tolerant and scalable. Then we will use one of the example datasets to fill it with data and execute some demo queries. +By going through this tutorial, you'll learn how to set up a simple ClickHouse cluster. It'll be small, but fault-tolerant and scalable. Then we will use one of the example datasets to fill it with data and execute some demo queries. ## Cluster Deployment {#cluster-deployment} @@ -19,7 +19,7 @@ This ClickHouse cluster will be a homogeneous cluster. Here are the steps: 3. Create local tables on each instance 4. Create a [Distributed table](../engines/table-engines/special/distributed.md) -A [distributed table](../engines/table-engines/special/distributed.md) is a kind of "view" to the local tables in a ClickHouse cluster. A SELECT query from a distributed table executes using resources of all cluster’s shards. You may specify configs for multiple clusters and create multiple distributed tables to provide views for different clusters. +A [distributed table](../engines/table-engines/special/distributed.md) is a kind of "view" to the local tables in a ClickHouse cluster. A SELECT query from a distributed table executes using resources of all cluster's shards. You may specify configs for multiple clusters and create multiple distributed tables to provide views for different clusters. Here is an example config for a cluster with three shards, with one replica each: @@ -48,7 +48,7 @@ Here is an example config for a cluster with three shards, with one replica each ``` -For further demonstration, let’s create a new local table with the same `CREATE TABLE` query that we used for `hits_v1` in the single node deployment tutorial, but with a different table name: +For further demonstration, let's create a new local table with the same `CREATE TABLE` query that we used for `hits_v1` in the single node deployment tutorial, but with a different table name: ```sql CREATE TABLE tutorial.hits_local (...) ENGINE = MergeTree() ... @@ -63,7 +63,7 @@ ENGINE = Distributed(perftest_3shards_1replicas, tutorial, hits_local, rand()); A common practice is to create similar distributed tables on all machines of the cluster. This allows running distributed queries on any machine of the cluster. There's also an alternative option to create a temporary distributed table for a given SELECT query using [remote](../sql-reference/table-functions/remote.md) table function. -Let’s run [INSERT SELECT](../sql-reference/statements/insert-into.md) into the distributed table to spread the table to multiple servers. +Let's run [INSERT SELECT](../sql-reference/statements/insert-into.md) into the distributed table to spread the table to multiple servers. ```sql INSERT INTO tutorial.hits_all SELECT * FROM tutorial.hits_v1; @@ -99,10 +99,10 @@ Here is an example config for a cluster of one shard containing three replicas: ``` -To enable native replication [ZooKeeper](http://zookeeper.apache.org/), is required. ClickHouse takes care of data consistency on all replicas and runs a restore procedure after a failure automatically. It’s recommended to deploy the ZooKeeper cluster on separate servers (where no other processes including ClickHouse are running). +To enable native replication, [ZooKeeper](http://zookeeper.apache.org/) is required. 
ClickHouse takes care of data consistency on all replicas and runs a restore procedure after a failure automatically. It's recommended to deploy the ZooKeeper cluster on separate servers (where no other processes including ClickHouse are running). :::note Note -ZooKeeper is not a strict requirement: in some simple cases, you can duplicate the data by writing it into all the replicas from your application code. This approach is **not** recommended, as in this case, ClickHouse won’t be able to guarantee data consistency on all replicas. Thus, it becomes the responsibility of your application. +ZooKeeper is not a strict requirement: in some simple cases, you can duplicate the data by writing it into all the replicas from your application code. This approach is **not** recommended, as in this case, ClickHouse won't be able to guarantee data consistency on all replicas. Thus, it becomes the responsibility of your application. ::: ZooKeeper locations are specified in the configuration file: diff --git a/docs/best-practices/_snippets/_async_inserts.md b/docs/best-practices/_snippets/_async_inserts.md new file mode 100644 index 00000000000..02e12fb39ee --- /dev/null +++ b/docs/best-practices/_snippets/_async_inserts.md @@ -0,0 +1,64 @@ +import Image from '@theme/IdealImage'; +import async_inserts from '@site/static/images/bestpractices/async_inserts.png'; + +Asynchronous inserts in ClickHouse provide a powerful alternative when client-side batching isn't feasible. This is especially valuable in observability workloads, where hundreds or thousands of agents send data continuously - logs, metrics, traces - often in small, real-time payloads. Buffering data client-side in these environments increases complexity, requiring a centralized queue to ensure sufficiently large batches can be sent. + +:::note +Sending many small batches in synchronous mode is not recommended, as it results in many parts being created. This will lead to poor query performance and ["too many parts"](/knowledgebase/exception-too-many-parts) errors. +::: + +Asynchronous inserts shift batching responsibility from the client to the server by writing incoming data to an in-memory buffer, then flushing it to storage based on configurable thresholds. This approach significantly reduces part creation overhead, lowers CPU usage, and ensures ingestion remains efficient - even under high concurrency. + +The core behavior is controlled via the [`async_insert`](/operations/settings/settings#async_insert) setting. + +Async inserts + +When enabled (set to 1), inserts are buffered and only written to disk once one of the flush conditions is met: + +(1) the buffer reaches a specified size (`async_insert_max_data_size`) +(2) a time threshold elapses (`async_insert_busy_timeout_ms`), or +(3) a maximum number of insert queries accumulate (`async_insert_max_query_number`). + +This batching process is invisible to clients and helps ClickHouse efficiently merge insert traffic from multiple sources. However, until a flush occurs, the data cannot be queried. Importantly, there are multiple buffers per insert shape and settings combination, and in clusters, buffers are maintained per node - enabling fine-grained control across multi-tenant environments. Insert mechanics are otherwise identical to those described for [synchronous inserts](/best-practices/selecting-an-insert-strategy#synchronous-inserts-by-default).
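+
+As a rough sketch, these thresholds can also be tuned per query alongside `async_insert` - the `logs` table, its columns and the inserted values below are hypothetical, and the thresholds shown are illustrative rather than recommended defaults:
+
+```sql
+INSERT INTO logs (timestamp, service, message) SETTINGS
+    async_insert = 1,
+    wait_for_async_insert = 1,
+    async_insert_max_data_size = 1048576,  -- flush once ~1 MiB is buffered
+    async_insert_busy_timeout_ms = 1000    -- or after 1 second, whichever comes first
+VALUES ('2025-01-01 00:00:00', 'app-1', 'example message');
+```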
+ +### Choosing a Return Mode {#choosing-a-return-mode} + +The behavior of asynchronous inserts is further refined using the [`wait_for_async_insert`](/operations/settings/settings#wait_for_async_insert) setting. + +When set to 1 (the default), ClickHouse only acknowledges the insert after the data is successfully flushed to disk. This ensures strong durability guarantees and makes error handling straightforward: if something goes wrong during the flush, the error is returned to the client. This mode is recommended for most production scenarios, especially when insert failures must be tracked reliably. + +[Benchmarks](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse) show it scales well with concurrency - whether you're running 200 or 500 clients - thanks to adaptive inserts and stable part creation behavior. + +Setting `wait_for_async_insert = 0` enables "fire-and-forget" mode. Here, the server acknowledges the insert as soon as the data is buffered, without waiting for it to reach storage. + +This offers ultra-low-latency inserts and maximal throughput, ideal for high-velocity, low-criticality data. However, this comes with trade-offs: there's no guarantee the data will be persisted, errors may only surface during flush, and it's difficult to trace failed inserts. Use this mode only if your workload can tolerate data loss. + +[Benchmarks also demonstrate](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse) substantial part reduction and lower CPU usage when buffer flushes are infrequent (e.g. every 30 seconds), but the risk of silent failure remains. + +Our strong recommendation is to use `async_insert=1,wait_for_async_insert=1` if using asynchronous inserts. Using `wait_for_async_insert=0` is very risky: your INSERT client may not be aware of errors, and it can also cause overload if your client keeps writing quickly while the ClickHouse server needs to slow down writes and create backpressure to ensure reliability of the service. + +### Deduplication and reliability {#deduplication-and-reliability} + +By default, ClickHouse performs automatic deduplication for synchronous inserts, which makes retries safe in failure scenarios. However, this is disabled for asynchronous inserts unless explicitly enabled (this should not be enabled if you have dependent materialized views - [see issue](https://github.com/ClickHouse/ClickHouse/issues/66003)). + +In practice, if deduplication is turned on and the same insert is retried - due to, for instance, a timeout or network drop - ClickHouse can safely ignore the duplicate. This helps maintain idempotency and avoids double-writing data. Still, it's worth noting that insert validation and schema parsing happen only during buffer flush - so errors (like type mismatches) will only surface at that point. + +### Enabling asynchronous inserts {#enabling-asynchronous-inserts} + +Asynchronous inserts can be enabled for a particular user, or for a specific query: + +- Enabling asynchronous inserts at the user level. This example uses the user `default`; if you create a different user, substitute that username: + ```sql + ALTER USER default SETTINGS async_insert = 1 + ``` +- You can specify the asynchronous insert settings by using the SETTINGS clause of insert queries: + ```sql + INSERT INTO YourTable SETTINGS async_insert=1, wait_for_async_insert=1 VALUES (...) 
+ ``` +- You can also specify asynchronous insert settings as connection parameters when using a ClickHouse programming language client. + + As an example, this is how you can do that within a JDBC connection string when you use the ClickHouse Java JDBC driver for connecting to ClickHouse Cloud: + ```bash + "jdbc:ch://HOST.clickhouse.cloud:8443/?user=default&password=PASSWORD&ssl=true&custom_http_params=async_insert=1,wait_for_async_insert=1" + ``` + diff --git a/docs/best-practices/_snippets/_avoid_mutations.md b/docs/best-practices/_snippets/_avoid_mutations.md new file mode 100644 index 00000000000..9bfd2ce741c --- /dev/null +++ b/docs/best-practices/_snippets/_avoid_mutations.md @@ -0,0 +1,13 @@ +In ClickHouse, **mutations** refer to operations that modify or delete existing data in a table - typically using `ALTER TABLE ... DELETE` or `ALTER TABLE ... UPDATE`. While these statements may appear similar to standard SQL operations, they are fundamentally different under the hood. + +Rather than modifying rows in place, mutations in ClickHouse are asynchronous background processes that rewrite entire [data parts](/parts) affected by the change. This approach is necessary due to ClickHouse's column-oriented, immutable storage model, but it can lead to significant I/O and resource usage. + +When a mutation is issued, ClickHouse schedules the creation of new **mutated parts**, leaving the original parts untouched until the new ones are ready. Once ready, the mutated parts atomically replace the originals. However, because the operation rewrites entire parts, even a minor change (such as updating a single row) may result in large-scale rewrites and excessive write amplification. + +For large datasets, this can produce a substantial spike in disk I/O and degrade overall cluster performance. Unlike merges, mutations can't be rolled back once submitted and will continue to execute even after server restarts unless explicitly cancelled - see [`KILL MUTATION`](/sql-reference/statements/kill#kill-mutation). + +Mutations are **totally ordered**: they apply to data inserted before the mutation was issued, while newer data remains unaffected. They do not block inserts but can still overlap with other ongoing queries. A SELECT running during a mutation may read a mix of mutated and unmutated parts, which can lead to inconsistent views of the data during execution. ClickHouse executes mutations in parallel per part, which can further intensify memory and CPU usage, especially when complex subqueries (like `x IN (SELECT ...)`) are involved. + +As a rule, **avoid frequent or large-scale mutations**, especially on high-volume tables. Instead, use alternative table engines such as [ReplacingMergeTree](/guides/replacing-merge-tree) or [CollapsingMergeTree](/engines/table-engines/mergetree-family/collapsingmergetree), which are designed to handle data corrections more efficiently at query time or during merges. If mutations are absolutely necessary, monitor them carefully using the `system.mutations` table and use `KILL MUTATION` if a process is stuck or misbehaving. Misusing mutations can lead to degraded performance, excessive storage churn, and potential service instability - so apply them with caution and sparingly.
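+
+As a minimal sketch of that monitoring workflow - the `default` database, `events` table and mutation ID below are hypothetical - pending mutations can be inspected and, if necessary, cancelled:
+
+```sql
+-- List mutations that have not yet completed, with their latest failure reason
+SELECT database, table, mutation_id, command, parts_to_do, latest_fail_reason
+FROM system.mutations
+WHERE NOT is_done;
+
+-- Cancel a stuck or misbehaving mutation by its ID
+KILL MUTATION WHERE database = 'default' AND table = 'events' AND mutation_id = 'mutation_123.txt';
+```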
+ +For deleting data, users can also consider [Lightweight deletes](/guides/developer/lightweight-delete), or manage data through [partitions](/best-practices/choosing-a-partitioning-key), which allow entire parts to be [dropped efficiently](/sql-reference/statements/alter/partition#drop-partitionpart). diff --git a/docs/cloud/bestpractices/avoidnullablecolumns.md b/docs/best-practices/_snippets/_avoid_nullable_columns.md similarity index 71% rename from docs/cloud/bestpractices/avoidnullablecolumns.md rename to docs/best-practices/_snippets/_avoid_nullable_columns.md index 75e750ca48e..7db65c74183 100644 --- a/docs/cloud/bestpractices/avoidnullablecolumns.md +++ b/docs/best-practices/_snippets/_avoid_nullable_columns.md @@ -1,11 +1,4 @@ ---- -slug: /cloud/bestpractices/avoid-nullable-columns -sidebar_label: 'Avoid Nullable Columns' -title: 'Avoid Nullable Columns' -description: 'Page describing why you should avoid Nullable columns' ---- - -[`Nullable` column](/sql-reference/data-types/nullable/) (e.g. `Nullable(String)`) creates a separate column of `UInt8` type. This additional column has to be processed every time a user works with a nullable column. This leads to additional storage space used and almost always negatively affects performance. +[`Nullable` column](/sql-reference/data-types/nullable/) (e.g. `Nullable(String)`) creates a separate column of `UInt8` type. This additional column has to be processed every time a user works with a Nullable column. This leads to additional storage space used and almost always negatively affects performance. To avoid `Nullable` columns, consider setting a default value for that column. For example, instead of: @@ -32,6 +25,4 @@ ENGINE = MergeTree ORDER BY x ``` -:::note Consider your use case, a default value may be inappropriate. -::: diff --git a/docs/best-practices/_snippets/_avoid_optimize_final.md b/docs/best-practices/_snippets/_avoid_optimize_final.md new file mode 100644 index 00000000000..262f1f8f7d5 --- /dev/null +++ b/docs/best-practices/_snippets/_avoid_optimize_final.md @@ -0,0 +1,44 @@ +import Image from '@theme/IdealImage'; +import simple_merges from '@site/static/images/bestpractices/simple_merges.png'; + + +ClickHouse tables using the **MergeTree engine** store data on disk as **immutable parts**, which are created every time data is inserted. + +Each insert creates a new part containing sorted, compressed column files, along with metadata like indexes and checksums. For a detailed description of part structures and how they are formed, we recommend this [guide](/parts). + +Over time, background processes merge smaller parts into larger ones to reduce fragmentation and improve query performance. + +Simple merges + +While it's tempting to manually trigger this merge using: + +```sql +OPTIMIZE TABLE <table> FINAL; +``` + +**you should avoid this operation in most cases** as it initiates resource-intensive operations which may impact cluster performance. + +## Why Avoid? {#why-avoid} + +### It's expensive {#its-expensive} + +Running `OPTIMIZE FINAL` forces ClickHouse to merge **all** active parts into a **single part**, even if large merges have already occurred. This involves: + +1. **Decompressing** all parts +2. **Merging** the data +3. **Compressing** it again +4. **Writing** the final part to disk or object storage + +These steps are **CPU and I/O-intensive** and can put significant strain on your system, especially when large datasets are involved. 
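+
+If you are still tempted to run it, it can help to first check how many active parts a table actually has - background merges have often already reduced them to a handful. A minimal sketch, assuming your tables live in the `default` database:
+
+```sql
+SELECT
+    table,
+    count() AS active_parts,
+    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
+FROM system.parts
+WHERE active AND database = 'default'
+GROUP BY table
+ORDER BY active_parts DESC;
+```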
+ +### It ignores safety limits {#it-ignores-safety-limits} + +Normally, ClickHouse avoids merging parts larger than ~150 GB (configurable via [max_bytes_to_merge_at_max_space_in_pool](/operations/settings/merge-tree-settings#max_bytes_to_merge_at_max_space_in_pool)). But `OPTIMIZE FINAL` **ignores this safeguard**, which means: + +* It may try to merge **multiple 150 GB parts** into one massive part +* This could result in **long merge times**, **memory pressure**, or even **out-of-memory errors** +* These large parts may become challenging to merge, i.e. attempts to merge them further fail for the reasons stated above. In cases where merges are required for correct query-time behavior, this can have undesired consequences, e.g. [duplicates accumulating for a ReplacingMergeTree](/guides/developer/deduplication#using-replacingmergetree-for-upserts), increasing query times. + +## Let background merges do the work {#let-background-merges-do-the-work} + +ClickHouse already performs smart background merges to optimize storage and query efficiency. These are incremental, resource-aware, and respect configured thresholds. Unless you have a very specific need (e.g., finalizing data before freezing a table or exporting), **you're better off letting ClickHouse manage merges on its own**. diff --git a/docs/best-practices/_snippets/_bulk_inserts.md b/docs/best-practices/_snippets/_bulk_inserts.md new file mode 100644 index 00000000000..8ecbad18078 --- /dev/null +++ b/docs/best-practices/_snippets/_bulk_inserts.md @@ -0,0 +1,11 @@ +The above mechanics illustrate a constant overhead regardless of the insert size, making batch size the single most important optimization for ingest throughput. Batching inserts reduces the overhead as a proportion of total insert time and improves processing efficiency. + +We recommend inserting data in batches of at least 1,000 rows, and ideally between 10,000–100,000 rows. Fewer, larger inserts reduce the number of parts written, minimize merge load, and lower overall system resource usage. + +**For a synchronous insert strategy to be effective, this client-side batching is required.** + +If you're unable to batch data client-side, ClickHouse supports asynchronous inserts that shift batching to the server ([see](/best-practices/selecting-an-insert-strategy#asynchronous-inserts)). + +:::tip +Regardless of the size of your inserts, we recommend keeping the number of insert queries around one insert query per second. The reason for that recommendation is that the created parts are merged to larger parts in the background (in order to optimize your data for read queries), and sending too many insert queries per second can lead to situations where the background merging can't keep up with the number of new parts. However, you can use a higher rate of insert queries per second when you use asynchronous inserts (see asynchronous inserts). 
+::: \ No newline at end of file diff --git a/docs/best-practices/avoid_mutations.md b/docs/best-practices/avoid_mutations.md new file mode 100644 index 00000000000..d8e70dcf802 --- /dev/null +++ b/docs/best-practices/avoid_mutations.md @@ -0,0 +1,11 @@ +--- +slug: /best-practices/avoid-mutations +sidebar_position: 10 +sidebar_label: 'Avoid Mutations' +title: 'Avoid Mutations' +description: 'Page describing why to avoid mutations in ClickHouse' +--- + +import Content from '@site/docs/best-practices/_snippets/_avoid_mutations.md'; + + diff --git a/docs/best-practices/avoid_optimize_final.md b/docs/best-practices/avoid_optimize_final.md new file mode 100644 index 00000000000..5079e09ccf9 --- /dev/null +++ b/docs/best-practices/avoid_optimize_final.md @@ -0,0 +1,11 @@ +--- +slug: /best-practices/avoid-optimize-final +sidebar_position: 10 +sidebar_label: 'Avoid Optimize Final' +title: 'Avoid Optimize Final' +description: 'Page describing why to avoid Optimize Final in ClickHouse' +--- + +import Content from '@site/docs/best-practices/_snippets/_avoid_optimize_final.md'; + + diff --git a/docs/best-practices/choosing_a_primary_key.md b/docs/best-practices/choosing_a_primary_key.md new file mode 100644 index 00000000000..28c7530dd4a --- /dev/null +++ b/docs/best-practices/choosing_a_primary_key.md @@ -0,0 +1,175 @@ +--- +slug: /best-practices/choosing-a-primary-key +sidebar_position: 10 +sidebar_label: 'Choosing a Primary Key' +title: 'Choosing a Primary Key' +description: 'Page describing how to choose a primary key in ClickHouse' +--- + +import Image from '@theme/IdealImage'; +import create_primary_key from '@site/static/images/bestpractices/create_primary_key.gif'; +import primary_key from '@site/static/images/bestpractices/primary_key.gif'; + + +> We interchangeably use the term "ordering key" to refer to the "primary key" on this page. Strictly, [these differ in ClickHouse](/engines/table-engines/mergetree-family/mergetree#choosing-a-primary-key-that-differs-from-the-sorting-key), but for the purposes of this document, readers can use them interchangeably, with the ordering key referring to the columns specified in the table `ORDER BY`. + +Note that a ClickHouse primary key works [very differently](/migrations/postgresql/designing-schemas#how-are-clickhouse-primary-keys-different) to those familiar with similar terms in OLTP databases such as Postgres. + +Choosing an effective primary key in ClickHouse is crucial for query performance and storage efficiency. ClickHouse organizes data into parts, each containing its own sparse primary index. This index significantly speeds up queries by reducing the volume of data scanned. Additionally, because the primary key determines the physical order of data on disk, it directly impacts compression efficiency. Optimally ordered data compresses more effectively, which further enhances performance by reducing I/O. + + +1. When selecting an ordering key, prioritize columns frequently used in query filters (i.e. the `WHERE` clause), especially those that exclude large numbers of rows. +2. Columns highly correlated with other data in the table are also beneficial, as contiguous storage improves compression ratios and memory efficiency during `GROUP BY` and `ORDER BY` operations. +
+Some simple rules can be applied to help choose an ordering key. The following can sometimes be in conflict, so consider these in order. **Users can identify a number of keys from this process, with 4-5 typically sufficient**: + +:::note Important +Ordering keys must be defined on table creation and cannot be added. Additional ordering can be added to a table after (or before) data insertion through a feature known as projections. Be aware these result in data duplication. Further details [here](/sql-reference/statements/alter/projection). +::: + +## Example {#example} + +Consider the following `posts_unordered` table. This contains a row per Stack Overflow post. + +This table has no primary key - as indicated by `ORDER BY tuple()`. + +```sql +CREATE TABLE posts_unordered +( + `Id` Int32, + `PostTypeId` Enum('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, + 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8), + `AcceptedAnswerId` UInt32, + `CreationDate` DateTime, + `Score` Int32, + `ViewCount` UInt32, + `Body` String, + `OwnerUserId` Int32, + `OwnerDisplayName` String, + `LastEditorUserId` Int32, + `LastEditorDisplayName` String, + `LastEditDate` DateTime, + `LastActivityDate` DateTime, + `Title` String, + `Tags` String, + `AnswerCount` UInt16, + `CommentCount` UInt8, + `FavoriteCount` UInt8, + `ContentLicense`LowCardinality(String), + `ParentId` String, + `CommunityOwnedDate` DateTime, + `ClosedDate` DateTime +) +ENGINE = MergeTree +ORDER BY tuple() +``` + +Suppose a user wishes to compute the number of questions submitted after 2024, with this representing their most common access pattern. + +```sql +SELECT count() +FROM stackoverflow.posts_unordered +WHERE (CreationDate >= '2024-01-01') AND (PostTypeId = 'Question') + +┌─count()─┐ +│ 192611 │ +└─────────┘ +--highlight-next-line +1 row in set. Elapsed: 0.055 sec. Processed 59.82 million rows, 361.34 MB (1.09 billion rows/s., 6.61 GB/s.) +``` + +Note the number of rows and bytes read by this query. Without a primary key, queries must scan the entire dataset. + +Using `EXPLAIN indexes=1` confirms a full table scan due to lack of indexing. + +```sql +EXPLAIN indexes = 1 +SELECT count() +FROM stackoverflow.posts_unordered +WHERE (CreationDate >= '2024-01-01') AND (PostTypeId = 'Question') + +┌─explain───────────────────────────────────────────────────┐ +│ Expression ((Project names + Projection)) │ +│ Aggregating │ +│ Expression (Before GROUP BY) │ +│ Expression │ +│ ReadFromMergeTree (stackoverflow.posts_unordered) │ +└───────────────────────────────────────────────────────────┘ + +5 rows in set. Elapsed: 0.003 sec. +``` + +Assume a table `posts_ordered`, containing the same data, is defined with an `ORDER BY` defined as `(PostTypeId, toDate(CreationDate))` i.e. + +```sql +CREATE TABLE posts_ordered +( + `Id` Int32, + `PostTypeId` Enum('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, + 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8), +... +) +ENGINE = MergeTree +ORDER BY (PostTypeId, toDate(CreationDate)) +``` + +`PostTypeId` has a cardinality of 8 and represents the logical choice for the first entry in our ordering key. Recognizing date granularity filtering is likely to be sufficient (it will still benefit datetime filters) so we use `toDate(CreationDate)` as the 2nd component of our key. This will also produce a smaller index as a date can be represented by 16 bits, speeding up filtering. 
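+
+As a quick sanity check when deciding the key order, the cardinality of candidate key columns can be measured directly - a small sketch against the same dataset:
+
+```sql
+SELECT uniq(PostTypeId) AS post_type_cardinality, uniq(toDate(CreationDate)) AS date_cardinality
+FROM stackoverflow.posts_unordered;
+```
+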
+ +The following animation shows how an optimized sparse primary index is created for the Stack Overflow posts table. Instead of indexing individual rows, the index targets blocks of rows: + + + +If the same query is repeated on a table with this ordering key: + +```sql +SELECT count() +FROM stackoverflow.posts_ordered +WHERE (CreationDate >= '2024-01-01') AND (PostTypeId = 'Question') + +┌─count()─┐ +│ 192611 │ +└─────────┘ +--highlight-next-line +1 row in set. Elapsed: 0.013 sec. Processed 196.53 thousand rows, 1.77 MB (14.64 million rows/s., 131.78 MB/s.) +``` + +This query now leverages sparse indexing, significantly reducing the amount of data read and speeding up the execution time by 4x - note the reduction of rows and bytes read. + +The use of the index can be confirmed with an `EXPLAIN indexes=1`. + +```sql +EXPLAIN indexes = 1 +SELECT count() +FROM stackoverflow.posts_ordered +WHERE (CreationDate >= '2024-01-01') AND (PostTypeId = 'Question') + +┌─explain─────────────────────────────────────────────────────────────────────────────────────┐ +│ Expression ((Project names + Projection)) │ +│ Aggregating │ +│ Expression (Before GROUP BY) │ +│ Expression │ +│ ReadFromMergeTree (stackoverflow.posts_ordered) │ +│ Indexes: │ +│ PrimaryKey │ +│ Keys: │ +│ PostTypeId │ +│ toDate(CreationDate) │ +│ Condition: and((PostTypeId in [1, 1]), (toDate(CreationDate) in [19723, +Inf))) │ +│ Parts: 14/14 │ +│ Granules: 39/7578 │ +└─────────────────────────────────────────────────────────────────────────────────────────────┘ + +13 rows in set. Elapsed: 0.004 sec. +``` + +Additionally, we visualize how the sparse index prunes all row blocks that can't possibly contain matches for our example query: + + + +:::note +All columns in a table will be sorted based on the value of the specified ordering key, regardless of whether they are included in the key itself. For instance, if `CreationDate` is used as the key, the order of values in all other columns will correspond to the order of values in the `CreationDate` column. Multiple ordering keys can be specified - this will order with the same semantics as an `ORDER BY` clause in a `SELECT` query. +::: + +A complete advanced guide on choosing primary keys can be found [here](/guides/best-practices/sparse-primary-indexes). + +For deeper insights into how ordering keys improve compression and further optimize storage, explore the official guides on [Compression in ClickHouse](/data-compression/compression-in-clickhouse) and [Column Compression Codecs](/data-compression/compression-in-clickhouse#choosing-the-right-column-compression-codec). diff --git a/docs/best-practices/index.md b/docs/best-practices/index.md new file mode 100644 index 00000000000..4f672250e63 --- /dev/null +++ b/docs/best-practices/index.md @@ -0,0 +1,24 @@ +--- +slug: /best-practices +keywords: ['Cloud', 'Primary key', 'Ordering key', 'Materialized Views', 'Best Practices', 'Bulk Inserts', 'Asynchronous Inserts', 'Avoid Mutations', 'Avoid Nullable Columns', 'Avoid Optimize Final', 'Partitioning Key'] +title: 'Overview' +hide_title: true +description: 'Landing page for Best Practices section in ClickHouse' +--- + +# Best Practices in ClickHouse {#best-practices-in-clickhouse} + +This section provides the best practices you will want to follow to get the most out of ClickHouse. 
+ +| Page | Description | +|----------------------------------------------------------------------|--------------------------------------------------------------------------| +| [Choosing a Primary Key](/best-practices/choosing-a-primary-key) | Guidance on selecting an effective Primary Key in ClickHouse. | +| [Select Data Types](/best-practices/select-data-types) | Recommendations for choosing appropriate data types. | +| [Use Materialized Views](/best-practices/use-materialized-views) | When and how to benefit from materialized views. | +| [Minimize and Optimize JOINs](/best-practices/minimize-optimize-joins)| Best practices for minimizing and optimizing JOIN operations. | +| [Choosing a Partitioning Key](/best-practices/choosing-a-partitioning-key) | How to choose and apply partitioning keys effectively. | +| [Selecting an Insert Strategy](/best-practices/selecting-an-insert-strategy) | Strategies for efficient data insertion in ClickHouse. | +| [Data Skipping Indices](/best-practices/use-data-skipping-indices-where-appropriate) | When to apply data skipping indices for performance gains. | +| [Avoid Mutations](/best-practices/avoid-mutations) | Reasons to avoid mutations and how to design without them. | +| [Avoid OPTIMIZE FINAL](/best-practices/avoid-optimize-final) | Why `OPTIMIZE FINAL` can be costly and how to work around it. | +| [Use JSON where appropriate](/best-practices/use-json-where-appropriate) | Considerations for using JSON columns in ClickHouse. | diff --git a/docs/best-practices/json_type.md b/docs/best-practices/json_type.md new file mode 100644 index 00000000000..d3ac39065f5 --- /dev/null +++ b/docs/best-practices/json_type.md @@ -0,0 +1,316 @@ +--- +slug: /best-practices/use-json-where-appropriate +sidebar_position: 10 +sidebar_label: 'Using JSON' +title: 'Use JSON where appropriate' +description: 'Page describing when to use JSON' +--- + +ClickHouse now offers a native JSON column type designed for semi-structured and dynamic data. It's important to clarify that **this is a column type, not a data format**—you can insert JSON into ClickHouse as a string or via supported formats like [JSONEachRow](/docs/interfaces/formats/JSONEachRow), but that does not imply using the JSON column type. Users should only use the JSON type when the structure of their data is dynamic, not when they simply happen to store JSON. + +## When to use the JSON type {#when-to-use-the-json-type} + +Use the JSON type when your data: + +* Has **unpredictable keys** that can change over time. +* Contains **values with varying types** (e.g., a path might sometimes contain a string, sometimes a number). +* Requires schema flexibility where strict typing isn't viable. + +If your data structure is known and consistent, there is rarely a need for the JSON type, even if your data is in JSON format. Specifically, if your data has: + +* **A flat structure with known keys**: use standard column types e.g. String. +* **Predictable nesting**: use Tuple, Array, or Nested types for these structures. +* **Predictable structure with varying types**: consider Dynamic or Variant types instead. + +You can also mix approaches - for example, use static columns for predictable top-level fields and a single JSON column for a dynamic section of the payload. + +## Considerations and tips for using JSON {#considerations-and-tips-for-using-json} + +The JSON type enables efficient columnar storage by flattening paths into subcolumns. But with flexibility comes responsibility. 
To use it effectively: + +* **Specify path types** using [hints in the column definition](/sql-reference/data-types/newjson) for known subcolumns, avoiding unnecessary type inference. +* **Skip paths** if you don't need the values, with [SKIP and SKIP REGEXP](/sql-reference/data-types/newjson) to reduce storage and improve performance. +* **Avoid setting [`max_dynamic_paths`](/sql-reference/data-types/newjson#reaching-the-limit-of-dynamic-paths-inside-json) too high** - large values increase resource consumption and reduce efficiency. As a rule of thumb, keep it below 10,000. + +:::note Type hints +Type hints offer more than just a way to avoid unnecessary type inference - they eliminate storage and processing indirection entirely. JSON paths with type hints are always stored just like traditional columns, bypassing the need for [**discriminator columns**](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#storage-extension-for-dynamically-changing-data) or dynamic resolution during query time. This means that with well-defined type hints, nested JSON fields achieve the same performance and efficiency as if they were modeled as top-level fields from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline. +::: + +## Advanced Features {#advanced-features} + +* JSON columns **can be used in primary keys** like any other column. Codecs cannot be specified for a sub-column. +* They support introspection via functions like [`JSONAllPathsWithTypes()` and `JSONDynamicPaths()`](/sql-reference/data-types/newjson#introspection-functions). +* You can read nested sub-objects using the `.^` syntax. +* Query syntax may differ from standard SQL and may require special casting or operators for nested fields. + +For additional guidance, see the [ClickHouse JSON documentation](/sql-reference/data-types/newjson) or explore our blog post [A New Powerful JSON Data Type for ClickHouse](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse). + +## Examples {#examples} + +Consider the following JSON sample, representing a row from the [Python PyPI dataset](https://clickpy.clickhouse.com/): + +```json +{ + "date": "2022-11-15", + "country_code": "ES", + "project": "clickhouse-connect", + "type": "bdist_wheel", + "installer": "pip", + "python_minor": "3.9", + "system": "Linux", + "version": "0.3.0" +} +``` + +Let's assume this schema is static and the types can be well defined. Even if the data is in NDJSON format (JSON row per line), there is no need to use the JSON type for such a schema. Simply define the schema with classic types. + +```sql +CREATE TABLE pypi ( + `date` Date, + `country_code` String, + `project` String, + `type` String, + `installer` String, + `python_minor` String, + `system` String, + `version` String +) +ENGINE = MergeTree +ORDER BY (project, date) +``` + +and insert JSON rows: + +```sql +INSERT INTO pypi FORMAT JSONEachRow +{"date":"2022-11-15","country_code":"ES","project":"clickhouse-connect","type":"bdist_wheel","installer":"pip","python_minor":"3.9","system":"Linux","version":"0.3.0"} +``` + +Consider the [arXiv dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv?resource=download) containing 2.5m scholarly papers. Each row in this dataset, distributed as NDJSON, represents a published academic paper. 
An example row is shown below: + +```json +{ + "id": "2101.11408", + "submitter": "Daniel Lemire", + "authors": "Daniel Lemire", + "title": "Number Parsing at a Gigabyte per Second", + "comments": "Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/", + "journal-ref": "Software: Practice and Experience 51 (8), 2021", + "doi": "10.1002/spe.2984", + "report-no": null, + "categories": "cs.DS cs.MS", + "license": "http://creativecommons.org/licenses/by/4.0/", + "abstract": "With disks and networks providing gigabytes per second ....\n", + "versions": [ + { + "created": "Mon, 11 Jan 2021 20:31:27 GMT", + "version": "v1" + }, + { + "created": "Sat, 30 Jan 2021 23:57:29 GMT", + "version": "v2" + } + ], + "update_date": "2022-11-07", + "authors_parsed": [ + [ + "Lemire", + "Daniel", + "" + ] + ] +} +``` + +While the JSON here is complex, with nested structures, it is predictable. The number and type of the fields will not change. While we could use the JSON type for this example, we can also just define the structure explicitly using [Tuples](/sql-reference/data-types/tuple) and [Nested](/sql-reference/data-types/nested-data-structures/nested) types: + +```sql +CREATE TABLE arxiv +( + `id` String, + `submitter` String, + `authors` String, + `title` String, + `comments` String, + `journal-ref` String, + `doi` String, + `report-no` String, + `categories` String, + `license` String, + `abstract` String, + `versions` Array(Tuple(created String, version String)), + `update_date` Date, + `authors_parsed` Array(Array(String)) +) +ENGINE = MergeTree +ORDER BY update_date +``` + +Again we can insert the data as JSON: + +```sql +INSERT INTO arxiv FORMAT JSONEachRow +{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]]} +``` + +Suppose another column called `tags` is added. If this was simply a list of strings we could model as an `Array(String)`, but let's assume users can add arbitrary tag structures with mixed types (notice score is a string or integer). 
Our modified JSON document: + +```sql +{ + "id": "2101.11408", + "submitter": "Daniel Lemire", + "authors": "Daniel Lemire", + "title": "Number Parsing at a Gigabyte per Second", + "comments": "Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/", + "journal-ref": "Software: Practice and Experience 51 (8), 2021", + "doi": "10.1002/spe.2984", + "report-no": null, + "categories": "cs.DS cs.MS", + "license": "http://creativecommons.org/licenses/by/4.0/", + "abstract": "With disks and networks providing gigabytes per second ....\n", + "versions": [ + { + "created": "Mon, 11 Jan 2021 20:31:27 GMT", + "version": "v1" + }, + { + "created": "Sat, 30 Jan 2021 23:57:29 GMT", + "version": "v2" + } + ], + "update_date": "2022-11-07", + "authors_parsed": [ + [ + "Lemire", + "Daniel", + "" + ] + ], + "tags": { + "tag_1": { + "name": "ClickHouse user", + "score": "A+", + "comment": "A good read, applicable to ClickHouse" + }, + "28_03_2025": { + "name": "professor X", + "score": 10, + "comment": "Didn't learn much", + "updates": [ + { + "name": "professor X", + "comment": "Wolverine found more interesting" + } + ] + } + } +} +``` + +In this case, we could model the arXiv documents as either all JSON or simply add a JSON `tags` column. We provide both examples below: + + +```sql +CREATE TABLE arxiv +( + `doc` JSON(update_date Date) +) +ENGINE = MergeTree +ORDER BY doc.update_date +``` + +:::note +We specify the `update_date` in the JSON definition as we use it in the ordering/primary key. This helps ClickHouse know this column won't be null. If not specified, the user must explicitly allow nullable primary keys (not recommended for performance reasons) via the setting [`allow_nullable_key=1`](/operations/settings/merge-tree-settings#allow_nullable_key) +::: + +We can insert into this table and view the subsequently inferred schema using the [`JSONAllPathsWithTypes`](/sql-reference/functions/json-functions#jsonallpathswithtypes) function and [`PrettyJSONEachRow`](/interfaces/formats/PrettyJSONEachRow) output format: + +```sql +INSERT INTO arxiv FORMAT JSONAsObject +{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}} +``` + +```sql +SELECT JSONAllPathsWithTypes(doc) +FROM arxiv +FORMAT PrettyJSONEachRow + +{ + "JSONAllPathsWithTypes(doc)": { + "abstract": "String", + "authors": "String", + "authors_parsed": "Array(Array(Nullable(String)))", + "categories": "String", + "comments": "String", + "doi": "String", + "id": "String", + "journal-ref": "String", + "license": "String", + "submitter": "String", + "tags.28_03_2025.comment": "String", + 
"tags.28_03_2025.name": "String", + "tags.28_03_2025.score": "Int64", + "tags.28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))", + "tags.tag_1.comment": "String", + "tags.tag_1.name": "String", + "tags.tag_1.score": "String", + "title": "String", + "update_date": "Date", + "versions": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))" + } +} + +1 row in set. Elapsed: 0.003 sec. +``` + +Alternatively, we could model this using our earlier schema and a JSON `tags` column. This is generally preferred, minimizing the inference required by ClickHouse: + +```sql +CREATE TABLE arxiv +( + `id` String, + `submitter` String, + `authors` String, + `title` String, + `comments` String, + `journal-ref` String, + `doi` String, + `report-no` String, + `categories` String, + `license` String, + `abstract` String, + `versions` Array(Tuple(created String, version String)), + `update_date` Date, + `authors_parsed` Array(Array(String)), + `tags` JSON() +) +ENGINE = MergeTree +ORDER BY update_date +``` + +```sql +INSERT INTO arxiv FORMAT JSONEachRow +{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}} +``` + +We can now infer the types of the sub column tags. + +```sql +SELECT JSONAllPathsWithTypes(tags) +FROM arxiv +FORMAT PrettyJSONEachRow + +{ + "JSONAllPathsWithTypes(tags)": { + "28_03_2025.comment": "String", + "28_03_2025.name": "String", + "28_03_2025.score": "Int64", + "28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))", + "tag_1.comment": "String", + "tag_1.name": "String", + "tag_1.score": "String" + } +} + +1 row in set. Elapsed: 0.002 sec. +``` diff --git a/docs/best-practices/minimize_optimize_joins.md b/docs/best-practices/minimize_optimize_joins.md new file mode 100644 index 00000000000..c59c6ec13e3 --- /dev/null +++ b/docs/best-practices/minimize_optimize_joins.md @@ -0,0 +1,69 @@ +--- +slug: /best-practices/minimize-optimize-joins +sidebar_position: 10 +sidebar_label: 'Minimize and Optimize JOINs' +title: 'Minimize and Optimize JOINs' +description: 'Page describing best practices for JOINs' +--- + +import Image from '@theme/IdealImage'; +import joins from '@site/static/images/bestpractices/joins-speed-memory.png'; + + + +ClickHouse supports a wide variety of JOIN types and algorithms, and JOIN performance has improved significantly in recent releases. However, JOINs are inherently more expensive than querying from a single, denormalized table. Denormalization shifts computational work from query time to insert or pre-processing time, which often results in significantly lower latency at runtime. 
For real-time or latency-sensitive analytical queries, **denormalization is strongly recommended**. + +In general, denormalize when: + +- Tables change infrequently or when batch refreshes are acceptable. +- Relationships are not many-to-many or not excessively high in cardinality. +- Only a limited subset of the columns will be queried, i.e. certain columns can be excluded from denormalization. +- You have the capability to shift processing out of ClickHouse into upstream systems like Flink, where real-time enrichment or flattening can be managed. + +Not all data needs to be denormalized - focus on the attributes that are frequently queried. Also consider [materialized views](/best-practices/use-materialized-views) to incrementally compute aggregates instead of duplicating entire sub-tables. When schema updates are rare and latency is critical, denormalization offers the best performance trade-off. + +For a full guide on denormalizing data in ClickHouse see [here](/data-modeling/denormalization). + +## When JOINs are required {#when-joins-are-required} + +When JOINs are required, ensure you’re using **at least version 24.12 and preferably the latest version**, as JOIN performance continues to improve with each new release. As of ClickHouse 24.12, the query planner now automatically places the smaller table on the right side of the join for optimal performance - a task that previously had to be done manually. Even more enhancements are coming soon, including more aggressive filter pushdown and automatic re-ordering of multiple joins. + +Follow these best practices to improve JOIN performance: + +* **Avoid cartesian products**: If a value on the left-hand side matches multiple values on the right-hand side, the JOIN will return multiple rows - the so-called cartesian product. If your use case doesn't need all matches from the right-hand side but just any single match, you can use `ANY` JOINs (e.g. `LEFT ANY JOIN`). They are faster and use less memory than regular JOINs. +* **Reduce the sizes of JOINed tables**: The runtime and memory consumption of JOINs grows proportionally with the sizes of the left and right tables. To reduce the amount of processed data by the JOIN, add additional filter conditions in the `WHERE` or `JOIN ON` clauses of the query. ClickHouse pushes filter conditions as deep as possible down in the query plan, usually before JOINs. If the filters are not pushed down automatically (for any reason), rewrite one side of the JOIN as a sub-query to force pushdown. +* **Use direct JOINs via dictionaries if appropriate**: Standard JOINs in ClickHouse are executed in two phases: a build phase which iterates the right-hand side to build a hash table, followed by a probe phase which iterates the left-hand side to find matching join partners via hash table lookups. If the right-hand side is a [dictionary](/dictionary) or another table engine with key-value characteristics (e.g. [EmbeddedRocksDB](/engines/table-engines/integrations/embedded-rocksdb) or the [Join table engine](/engines/table-engines/special/join)), then ClickHouse can use the "direct" join algorithm, which effectively removes the need to build a hash table, speeding up query processing. This works for `INNER` and `LEFT OUTER` JOINs and is preferred for real-time analytical workloads. +* **Utilize the table sorting for JOINs**: Each table in ClickHouse is sorted by the table's primary key columns. 
It is possible to exploit the table sorting by so-called sort-merge JOIN algorithms like `full_sorting_merge` and `partial_merge`. Unlike standard JOIN algorithms based on hash tables (see below, `parallel_hash`, `hash`, `grace_hash`), sort-merge JOIN algorithms first sort and then merge both tables. If the query JOINs both tables by their respective primary key columns, then sort-merge has an optimization which omits the sort step, saving processing time and overhead. +* **Avoid disk-spilling JOINs**: Intermediate states of JOINs (e.g. hash tables) can become so big that they no longer fit into main memory. In this situation, ClickHouse will return an out-of-memory error by default. Some join algorithms (see below), for example [`grace_hash`](https://clickhouse.com/blog/clickhouse-fully-supports-joins-hash-joins-part2), [`partial_merge`](https://clickhouse.com/blog/clickhouse-fully-supports-joins-full-sort-partial-merge-part3) and [`full_sorting_merge`](https://clickhouse.com/blog/clickhouse-fully-supports-joins-full-sort-partial-merge-part3), are able to spill intermediate states to disk and continue query execution. These join algorithms should nevertheless be used with care as disk access can significantly slow down join processing. We instead recommend optimizing the JOIN query in other ways to reduce the size of intermediate states. +* **Default values as no-match markers in outer JOINs**: Left/right/full outer joins include all values from the left/right/both tables. If no join partner is found in the other table for some value, ClickHouse replaces the join partner by a special marker. The SQL standard mandates that databases use NULL as such a marker. In ClickHouse, this requires wrapping the result column in Nullable, creating an additional memory and performance overhead. As an alternative, you can configure the setting `join_use_nulls = 0` and use the default value of the result column data type as the marker. + + +:::note Use dictionaries carefully +When using dictionaries for JOINs in ClickHouse, it's important to understand that dictionaries, by design, do not allow duplicate keys. During data loading, any duplicate keys are silently deduplicated - only the last loaded value for a given key is retained. This behavior makes dictionaries ideal for one-to-one or many-to-one relationships where only the latest or authoritative value is needed. However, using a dictionary for a one-to-many or many-to-many relationship (e.g. joining roles to actors where an actor can have multiple roles) will result in silent data loss, as all but one of the matching rows will be discarded. As a result, dictionaries are not suitable for scenarios requiring full relational fidelity across multiple matches. +::: + +## Choosing the right JOIN Algorithm {#choosing-the-right-join-algorithm} + +ClickHouse supports several JOIN algorithms that trade off between speed and memory: + +* **Parallel Hash JOIN (default):** Fast for small-to-medium right-hand tables that fit in memory. +* **Direct JOIN:** Ideal when using dictionaries (or other table engines with key-value characteristics) with `INNER` or `LEFT ANY JOIN` - the fastest method for point lookups as it eliminates the need to build a hash table. +* **Full Sorting Merge JOIN:** Efficient when both tables are sorted on the join key. +* **Partial Merge JOIN:** Minimizes memory but is slower - best for joining large tables with limited memory. 
+* **Grace Hash JOIN:** Flexible and memory-tunable, good for large datasets with adjustable performance characteristics. + + + +:::note +Each algorithm has varying support for JOIN types. A full list of supported join types for each algorithm can be found [here](/guides/joining-tables#choosing-a-join-algorithm). +::: + +You can let ClickHouse choose the best algorithm by setting `join_algorithm = 'auto'` (the default), or explicitly control it based on your workload. If you need to select a join algorithm to optimize for performance or memory overhead, we recommend [this guide](/guides/joining-tables#choosing-a-join-algorithm). + +For optimal performance: + +* Keep JOINs to a minimum in high-performance workloads. +* Avoid more than 3–4 joins per query. +* Benchmark different algorithms on real data - performance varies based on JOIN key distribution and data size. + +For more on JOIN optimization strategies, JOIN algorithms, and how to tune them, refer to the[ ClickHouse documentation](/guides/joining-tables) and this [blog series](https://clickhouse.com/blog/clickhouse-fully-supports-joins-part1). \ No newline at end of file diff --git a/docs/best-practices/partionning_keys.md b/docs/best-practices/partionning_keys.md new file mode 100644 index 00000000000..c13c1d31e6f --- /dev/null +++ b/docs/best-practices/partionning_keys.md @@ -0,0 +1,67 @@ +--- +slug: /best-practices/choosing-a-partitioning-key +sidebar_position: 10 +sidebar_label: 'Choosing a Partitioning Key' +title: 'Choosing a Partitioning Key' +description: 'Page describing how to choose a partitioning key' +--- + +import Image from '@theme/IdealImage'; +import partitions from '@site/static/images/bestpractices/partitions.png'; +import merges_with_partitions from '@site/static/images/bestpractices/merges_with_partitions.png'; + +:::note A data management technique +Partitioning is primarily a data management technique and not a query optimization tool, and while it can improve performance in specific workloads, it should not be the first mechanism used to accelerate queries; the partitioning key must be chosen carefully, with a clear understanding of its implications, and only applied when it aligns with data life cycle needs or well-understood access patterns. +::: + +In ClickHouse, partitioning organizes data into logical segments based on a specified key. This is defined using the `PARTITION BY` clause at table creation time and is commonly used to group rows by time intervals, categories, or other business-relevant dimensions. Each unique value of the partitioning expression forms its own physical partition on disk, and ClickHouse stores data in separate parts for each of these values. Partitioning improves data management, simplifies retention policies, and can help with certain query patterns. + +For example, consider the following UK price paid dataset table with a partitioning key of `toStartOfMonth(date)`. 
+
+```sql
+CREATE TABLE uk.uk_price_paid_simple_partitioned
+(
+    date Date,
+    town LowCardinality(String),
+    street LowCardinality(String),
+    price UInt32
+)
+ENGINE = MergeTree
+ORDER BY (town, street)
+PARTITION BY toStartOfMonth(date)
+```
+
+Whenever a set of rows is inserted into the table, instead of creating (at [least](/operations/settings/settings#max_insert_block_size)) one single data part containing all the inserted rows (as described [here](/parts)), ClickHouse creates one new data part for each unique partition key value among the inserted rows:
+
+
+
+The ClickHouse server first splits the rows from the example insert (4 rows, sketched in the diagram above) by their partition key value `toStartOfMonth(date)`. Then, for each identified partition, the rows are processed as [usual](/parts) by performing several sequential steps (① Sorting, ② Splitting into columns, ③ Compression, ④ Writing to Disk).
+
+For a more detailed explanation of partitioning, we recommend [this guide](/partitions).
+
+With partitioning enabled, ClickHouse only [merges](/merges) data parts within, but not across partitions. We sketch this for our example table from above:
+
+
+## Applications of partitioning {#applications-of-partitioning}
+
+Partitioning is a powerful tool for managing large datasets in ClickHouse, especially in observability and analytics use cases. It enables efficient data life cycle operations by allowing entire partitions, often aligned with time or business logic, to be dropped, moved, or archived in a single metadata operation. This is significantly faster and less resource-intensive than row-level delete or copy operations. Partitioning also integrates cleanly with ClickHouse features like TTL and tiered storage, making it possible to implement retention policies or hot/cold storage strategies without custom orchestration. For example, recent data can be kept on fast SSD-backed storage, while older partitions are automatically moved to cheaper object storage.
+
+While partitioning can improve query performance for some workloads, it can also negatively impact response time.
+
+If the partitioning key is not in the primary key and you are filtering by it, you may see an improvement in query performance with partitioning. See [here](/partitions#query-optimization) for an example.
+
+Conversely, if queries need to read across partitions, performance may be negatively impacted due to a higher number of total parts. For this reason, users should understand their access patterns before considering partitioning as a query optimization technique.
+
+In summary, users should primarily think of partitioning as a data management technique. For an example of managing data, see ["Managing Data"](/observability/managing-data) from the observability use-case guide and ["What are table partitions used for?"](/partitions#data-management) from Core Concepts - Table partitions.
+
+## Choose a low cardinality partitioning key {#choose-a-low-cardinality-partitioning-key}
+
+Importantly, a higher number of parts will negatively affect query performance. ClickHouse will therefore respond to inserts with a [“too many parts”](/knowledgebase/exception-too-many-parts) error if the number of parts exceeds specified limits either in [total](/operations/settings/merge-tree-settings#max_parts_in_total) or [per partition](/operations/settings/merge-tree-settings#parts_to_throw_insert).
+
+Choosing the right **cardinality** for the partitioning key is critical.
A high-cardinality partitioning key - where the number of distinct partition values is large - can lead to a proliferation of data parts. Since ClickHouse does not merge parts across partitions, too many partitions will result in too many unmerged parts, eventually triggering the “Too many parts” error. [Merges are essential](/merges) for reducing storage fragmentation and optimizing query speed, but with high-cardinality partitions, that merge potential is lost. + +By contrast, a **low-cardinality partitioning key**—with fewer than 100 - 1,000 distinct values - is usually optimal. It enables efficient part merging, keeps metadata overhead low, and avoids excessive object creation in storage. In addition, ClickHouse automatically builds MinMax indexes on partition columns, which can significantly speed up queries that filter on those columns. For example, filtering by month when the table is partitioned by `toStartOfMonth(date)` allows the engine to skip irrelevant partitions and their parts entirely. + +While partitioning can improve performance in some query patterns, it's primarily a data management feature. In many cases, querying across all partitions can be slower than using a non-partitioned table due to increased data fragmentation and more parts being scanned. Use partitioning judiciously, and always ensure that the chosen key is low-cardinality and aligns with your data life cycle policies (e.g., retention via TTL). If you're unsure whether partitioning is necessary, you may want to start without it and optimize later based on observed access patterns. diff --git a/docs/best-practices/select_data_type.md b/docs/best-practices/select_data_type.md new file mode 100644 index 00000000000..317931a60ca --- /dev/null +++ b/docs/best-practices/select_data_type.md @@ -0,0 +1,140 @@ +--- +slug: /best-practices/select-data-types +sidebar_position: 10 +sidebar_label: 'Select Data Types' +title: 'Select Data Types' +description: 'Page describing how to choose data types in ClickHouse' +--- + +import NullableColumns from '@site/docs/best-practices/_snippets/_avoid_nullable_columns.md'; + +One of the core reasons for ClickHouse's query performance is its efficient data compression. Less data on disk results in faster queries and inserts by minimizing I/O overhead. ClickHouse's column-oriented architecture naturally arranges similar data adjacently, enabling compression algorithms and codecs to reduce data size dramatically. To maximize these compression benefits, it's essential to carefully choose appropriate data types. + +Compression efficiency in ClickHouse depends mainly on three factors: the ordering key, data types, and codecs, all defined through the table schema. Choosing optimal data types yields immediate improvements in both storage and query performance. + +Some straightforward guidelines can significantly enhance the schema: + + +* **Use Strict Types:** Always select the correct data type for columns. Numeric and date fields should use appropriate numeric and date types rather than general-purpose String types. This ensures correct semantics for filtering and aggregations. + +* **Avoid Nullable Columns:** Nullable columns introduce additional overhead by maintaining separate columns for tracking null values. Only use Nullable if explicitly required to distinguish between empty and null states. Otherwise, default or zero-equivalent values typically suffice. 
For further information on why this type should be avoided unless needed, see [Avoid Nullable Columns](/best-practices/select-data-types#avoid-nullable-columns). + +* **Minimize Numeric Precision:** Select numeric types with minimal bit-width that still accommodate the expected data range. For instance, prefer [UInt16 over Int32](/sql-reference/data-types/int-uint) if negative values aren't needed, and the range fits within 0–65535. + +* **Optimize Date and Time Precision:** Choose the most coarse-grained date or datetime type that meets query requirements. Use Date or Date32 for date-only fields, and prefer DateTime over DateTime64 unless millisecond or finer precision is essential. + +* **Leverage LowCardinality and Specialized Types:** For columns with fewer than approximately 10,000 unique values, use LowCardinality types to significantly reduce storage through dictionary encoding. Similarly, use FixedString only when the column values are strictly fixed-length strings (e.g., country or currency codes), and prefer Enum types for columns with a finite set of possible values to enable efficient storage and built-in data validation. + +* **Enums for data validation:** The Enum type can be used to efficiently encode enumerated types. Enums can either be 8 or 16 bits, depending on the number of unique values they are required to store. Consider using this if you need either the associated validation at insert time (undeclared values will be rejected) or wish to perform queries which exploit a natural ordering in the Enum values e.g. imagine a feedback column containing user responses Enum(':(' = 1, ':|' = 2, ':)' = 3). + +## Example {#example} + +ClickHouse offers built-in tools to streamline type optimization. For example, schema inference can automatically identify initial types. Consider the Stack Overflow dataset, publicly available in Parquet format. Running a simple schema inference via the [`DESCRIBE`](/sql-reference/statements/describe-table) command provides an initial non-optimized schema. + +:::note +By default, ClickHouse maps these to equivalent Nullable types. This is preferred as the schema is based on a sample of the rows only. +::: + + +```sql +DESCRIBE TABLE s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/posts/*.parquet') +SETTINGS describe_compact_output = 1 + +┌─name───────────────────────┬─type──────────────────────────────┐ +│ Id │ Nullable(Int64) │ +│ PostTypeId │ Nullable(Int64) │ +│ AcceptedAnswerId │ Nullable(Int64) │ +│ CreationDate │ Nullable(DateTime64(3, 'UTC')) │ +│ Score │ Nullable(Int64) │ +│ ViewCount │ Nullable(Int64) │ +│ Body │ Nullable(String) │ +│ OwnerUserId │ Nullable(Int64) │ +│ OwnerDisplayName │ Nullable(String) │ +│ LastEditorUserId │ Nullable(Int64) │ +│ LastEditorDisplayName │ Nullable(String) │ +│ LastEditDate │ Nullable(DateTime64(3, 'UTC')) │ +│ LastActivityDate │ Nullable(DateTime64(3, 'UTC')) │ +│ Title │ Nullable(String) │ +│ Tags │ Nullable(String) │ +│ AnswerCount │ Nullable(Int64) │ +│ CommentCount │ Nullable(Int64) │ +│ FavoriteCount │ Nullable(Int64) │ +│ ContentLicense │ Nullable(String) │ +│ ParentId │ Nullable(String) │ +│ CommunityOwnedDate │ Nullable(DateTime64(3, 'UTC')) │ +│ ClosedDate │ Nullable(DateTime64(3, 'UTC')) │ +└────────────────────────────┴───────────────────────────────────┘ + +22 rows in set. Elapsed: 0.130 sec. +``` + +:::note +Note below we use the glob pattern *.parquet to read all files in the stackoverflow/parquet/posts folder. 
+::: + +By applying our early simple rules to our posts table, we can identify an optimal type for each column: + +| Column | Is Numeric | Min, Max | Unique Values | Nulls | Comment | Optimized Type | +|------------------------|------------|------------------------------------------------------------------------|----------------|--------|----------------------------------------------------------------------------------------------|------------------------------------------| +| `PostTypeId` | Yes | 1, 8 | 8 | No | | `Enum('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8)` | +| `AcceptedAnswerId` | Yes | 0, 78285170 | 12282094 | Yes | Differentiate Null with 0 value | UInt32 | +| `CreationDate` | No | 2008-07-31 21:42:52.667000000, 2024-03-31 23:59:17.697000000 | - | No | Millisecond granularity is not required, use DateTime | DateTime | +| `Score` | Yes | -217, 34970 | 3236 | No | | Int32 | +| `ViewCount` | Yes | 2, 13962748 | 170867 | No | | UInt32 | +| `Body` | No | - | - | No | | String | +| `OwnerUserId` | Yes | -1, 4056915 | 6256237 | Yes | | Int32 | +| `OwnerDisplayName` | No | - | 181251 | Yes | Consider Null to be empty string | String | +| `LastEditorUserId` | Yes | -1, 9999993 | 1104694 | Yes | 0 is an unused value can be used for Nulls | Int32 | +| `LastEditorDisplayName` | No | - | 70952 | Yes | Consider Null to be an empty string. Tested LowCardinality and no benefit | String | +| `LastEditDate` | No | 2008-08-01 13:24:35.051000000, 2024-04-06 21:01:22.697000000 | - | No | Millisecond granularity is not required, use DateTime | DateTime | +| `LastActivityDate` | No | 2008-08-01 12:19:17.417000000, 2024-04-06 21:01:22.697000000 | - | No | Millisecond granularity is not required, use DateTime | DateTime | +| `Title` | No | - | - | No | Consider Null to be an empty string | String | +| `Tags` | No | - | - | No | Consider Null to be an empty string | String | +| `AnswerCount` | Yes | 0, 518 | 216 | No | Consider Null and 0 to same | UInt16 | +| `CommentCount` | Yes | 0, 135 | 100 | No | Consider Null and 0 to same | UInt8 | +| `FavoriteCount` | Yes | 0, 225 | 6 | Yes | Consider Null and 0 to same | UInt8 | +| `ContentLicense` | No | - | 3 | No | LowCardinality outperforms FixedString | LowCardinality(String) | +| `ParentId` | No | - | 20696028 | Yes | Consider Null to be an empty string | String | +| `CommunityOwnedDate` | No | 2008-08-12 04:59:35.017000000, 2024-04-01 05:36:41.380000000 | - | Yes | Consider default 1970-01-01 for Nulls. Millisecond granularity is not required, use DateTime | DateTime | +| `ClosedDate` | No | 2008-09-04 20:56:44, 2024-04-06 18:49:25.393000000 | - | Yes | Consider default 1970-01-01 for Nulls. Millisecond granularity is not required, use DateTime | DateTime | + +:::note tip +Identifying the type for a column relies on understanding its numeric range and number of unique values. To find the range of all columns, and the number of distinct values, users can use the simple query `SELECT * APPLY min, * APPLY max, * APPLY uniq FROM table FORMAT Vertical`. We recommend performing this over a smaller subset of the data as this can be expensive. 
+:::
+
+This results in the following optimized schema (with respect to types):
+
+```sql
+CREATE TABLE posts
+(
+    Id Int32,
+    PostTypeId Enum('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5,
+    'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8),
+    AcceptedAnswerId UInt32,
+    CreationDate DateTime,
+    Score Int32,
+    ViewCount UInt32,
+    Body String,
+    OwnerUserId Int32,
+    OwnerDisplayName String,
+    LastEditorUserId Int32,
+    LastEditorDisplayName String,
+    LastEditDate DateTime,
+    LastActivityDate DateTime,
+    Title String,
+    Tags String,
+    AnswerCount UInt16,
+    CommentCount UInt8,
+    FavoriteCount UInt8,
+    ContentLicense LowCardinality(String),
+    ParentId String,
+    CommunityOwnedDate DateTime,
+    ClosedDate DateTime
+)
+ENGINE = MergeTree
+ORDER BY tuple()
+```
+
+## Avoid Nullable columns {#avoid-nullable-columns}
+
+
+
diff --git a/docs/best-practices/selecting_an_insert_strategy.md b/docs/best-practices/selecting_an_insert_strategy.md
new file mode 100644
index 00000000000..15a0fd96df1
--- /dev/null
+++ b/docs/best-practices/selecting_an_insert_strategy.md
@@ -0,0 +1,150 @@
+---
+slug: /best-practices/selecting-an-insert-strategy
+sidebar_position: 10
+sidebar_label: 'Selecting an Insert Strategy'
+title: 'Selecting an Insert Strategy'
+description: 'Page describing how to choose an insert strategy in ClickHouse'
+---
+
+import Image from '@theme/IdealImage';
+import insert_process from '@site/static/images/bestpractices/insert_process.png';
+import async_inserts from '@site/static/images/bestpractices/async_inserts.png';
+import AsyncInserts from '@site/docs/best-practices/_snippets/_async_inserts.md';
+import BulkInserts from '@site/docs/best-practices/_snippets/_bulk_inserts.md';
+
+Efficient data ingestion forms the basis of high-performance ClickHouse deployments. Selecting the right insert strategy can dramatically impact throughput, cost, and reliability. This section outlines best practices, tradeoffs, and configuration options to help you make the right decision for your workload.
+
+:::note
+The following assumes you are pushing data to ClickHouse via a client. If you are pulling data into ClickHouse, e.g. using built-in table functions such as [s3](/sql-reference/table-functions/s3) and [gcs](/sql-reference/table-functions/gcs), we recommend our guide ["Optimizing for S3 Insert and Read Performance"](/integrations/s3/performance).
+:::
+
+## Synchronous inserts by default {#synchronous-inserts-by-default}
+
+By default, inserts into ClickHouse are synchronous. Each insert query immediately creates a storage part on disk, including metadata and indexes.
+
+:::note Use synchronous inserts if you can batch the data client side
+If not, see [Asynchronous inserts](#asynchronous-inserts) below.
+:::
+
+We briefly review ClickHouse's MergeTree insert mechanics below:
+
+
+
+#### Client-side steps {#client-side-steps}
+
+For optimal performance, data must be ① [batched](https://clickhouse.com/blog/asynchronous-data-inserts-in-clickhouse#data-needs-to-be-batched-for-optimal-performance), making batch size the **first decision**.
+
+ClickHouse stores inserted data on disk, [ordered](/guides/best-practices/sparse-primary-indexes#data-is-stored-on-disk-ordered-by-primary-key-columns) by the table's primary key column(s). The **second decision** is whether to ② pre-sort the data before transmission to the server.
If a batch arrives pre-sorted by primary key column(s), ClickHouse can [skip](https://github.com/ClickHouse/ClickHouse/blob/94ce8e95404e991521a5608cd9d636ff7269743d/src/Storages/MergeTree/MergeTreeDataWriter.cpp#L595) the ⑨ sorting step, speeding up ingestion. + +If the data to be ingested has no predefined format, the **key decision** is choosing a format. ClickHouse supports inserting data in [over 70 formats](/interfaces/formats). However, when using the ClickHouse command-line client or programming language clients, this choice is often handled automatically. If needed, this automatic selection can also be overridden explicitly. + +The next **major decision** is ④ whether to compress data before transmission to the ClickHouse server. Compression reduces transfer size and improves network efficiency, leading to faster data transfers and lower bandwidth usage, especially for large datasets. + +The data is ⑤ transmitted to a ClickHouse network interface—either the [native](/interfaces/tcp) or[ HTTP](/interfaces/http) interface (which we [compare](https://clickhouse.com/blog/clickhouse-input-format-matchup-which-is-fastest-most-efficient#clickhouse-client-defaults) later in this post). + +#### Server-side steps {#server-side-steps} + +After ⑥ receiving the data, ClickHouse ⑦ decompresses it if compression was used, then ⑧ parses it from the originally sent format. + +Using the values from that formatted data and the target table's [DDL](/sql-reference/statements/create/table) statement, ClickHouse ⑨ builds an in-memory [block](/development/architecture#block) in the MergeTree format, ⑩ [sorts](/parts#what-are-table-parts-in-clickhouse) rows by the primary key columns if they are not already pre-sorted, ⑪ creates a [sparse primary index](/guides/best-practices/sparse-primary-indexes), ⑫ applies [per-column compression](/parts#what-are-table-parts-in-clickhouse), and ⑬ writes the data as a new ⑭ [data part](/parts) to disk. + + +### Batch inserts if synchronous {#batch-inserts-if-synchronous} + + + +### Ensure idempotent retries {#ensure-idempotent-retries} + +Synchronous inserts are also **idempotent**. When using MergeTree engines, ClickHouse will deduplicate inserts by default. This protects against ambiguous failure cases, such as: + +* The insert succeeded but the client never received an acknowledgment due to a network interruption. +* The insert failed server-side and timed out. + +In both cases, it's safe to **retry the insert** - as long as the batch contents and order remain identical. For this reason, it's critical that clients retry consistently, without modifying or reordering data. + +### Choose the right insert target {#choose-the-right-insert-target} + +For sharded clusters, you have two options: + +* Insert directly into a **MergeTree** or **ReplicatedMergeTree** table. This is the most efficient option when the client can perform load balancing across shards. With `internal_replication = true`, ClickHouse handles replication transparently. +* Insert into a [Distributed table](/engines/table-engines/special/distributed). This allows clients to send data to any node and let ClickHouse forward it to the correct shard. This is simpler but slightly less performant due to the extra forwarding step. `internal_replication = true` is still recommended. + +**In ClickHouse Cloud all nodes read and write to the same single shard. Inserts are automatically balanced across nodes. 
Users can simply send inserts to the exposed endpoint.** + +### Choose the right format {#choose-the-right-format} + +Choosing the right input format is crucial for efficient data ingestion in ClickHouse. With over 70 supported formats, selecting the most performant option can significantly impact insert speed, CPU and memory usage, and overall system efficiency. + +While flexibility is useful for data engineering and file-based imports, **applications should prioritize performance-oriented formats**: + +* **Native format** (recommended): Most efficient. Column-oriented, minimal parsing required server-side. Used by default in Go and Python clients. +* **RowBinary**: Efficient row-based format, ideal if columnar transformation is hard client-side. Used by the Java client. +* **JSONEachRow**: Easy to use but expensive to parse. Suitable for low-volume use cases or quick integrations. + +### Use compression {#use-compression} + +Compression plays a critical role in reducing network overhead, speeding up inserts, and lowering storage costs in ClickHouse. Used effectively, it enhances ingestion performance without requiring changes to data format or schema. + +Compressing insert data reduces the size of the payload sent over the network, minimizing bandwidth usage and accelerating transmission. + +For inserts, compression is especially effective when used with the Native format, which already matches ClickHouse's internal columnar storage model. In this setup, the server can efficiently decompress and directly store the data with minimal transformation. + +#### Use LZ4 for speed, ZSTD for compression ratio {#use-lz4-for-speed-zstd-for-compression-ratio} + +ClickHouse supports several compression codecs during data transmission. Two common options are: + +* **LZ4**: Fast and lightweight. It reduces data size significantly with minimal CPU overhead, making it ideal for high-throughput inserts and default in most ClickHouse clients. +* **ZSTD**: Higher compression ratio but more CPU-intensive. It's useful when network transfer costs are high—such as in cross-region or cloud provider scenarios—though it increases client-side compute and server-side decompression time slightly. + +Best practice: Use LZ4 unless you have constrained bandwidth or incur data egress costs - then consider ZSTD. + +:::note +In tests from the [FastFormats benchmark](https://clickhouse.com/blog/clickhouse-input-format-matchup-which-is-fastest-most-efficient), LZ4-compressed Native inserts reduced data size by more than 50%, cutting ingestion time from 150s to 131s for a 5.6 GiB dataset. Switching to ZSTD compressed the same dataset down to 1.69 GiB, but increased server-side processing time slightly. +::: + +#### Compression reduces resource usage {#compression-reduces-resource-usage} + +Compression not only reduces network traffic—it also improves CPU and memory efficiency on the server. With compressed data, ClickHouse receives fewer bytes and spends less time parsing large inputs. This benefit is especially important when ingesting from multiple concurrent clients, such as in observability scenarios. + +The impact of compression on CPU and memory is modest for LZ4, and moderate for ZSTD. Even under load, server-side efficiency improves due to the reduced data volume. + +**Combining compression with batching and an efficient input format (like Native) yields the best ingestion performance.** + +When using the native interface (e.g. [clickhouse-client](/interfaces/cli)), LZ4 compression is enabled by default. 
You can optionally switch to ZSTD via settings. + +With the [HTTP interface](/interfaces/http), use the Content-Encoding header to apply compression (e.g. Content-Encoding: lz4). The entire payload must be compressed before sending. + +### Pre-sort if low cost {#pre-sort-if-low-cost} + +Pre-sorting data by primary key before insertion can improve ingestion efficiency in ClickHouse, particularly for large batches. + +When data arrives pre-sorted, ClickHouse can skip or simplify the internal sorting step during part creation, reducing CPU usage and accelerating the insert process. Pre-sorting also improves compression efficiency, since similar values are grouped together - enabling codecs like LZ4 or ZSTD to achieve a better compression ratio. This is especially beneficial when combined with large batch inserts and compression, as it reduces both the processing overhead and the amount of data transferred. + +**That said, pre-sorting is an optional optimization—not a requirement.** ClickHouse sorts data highly efficiently using parallel processing, and in many cases, server-side sorting is faster or more convenient than pre-sorting client-side. + +**We recommend pre-sorting only if the data is already nearly ordered or if client-side resources (CPU, memory) are sufficient and underutilized.** In latency-sensitive or high-throughput use cases, such as observability, where data arrives out of order or from many agents, it's often better to skip pre-sorting and rely on ClickHouse's built-in performance. + +## Asynchronous inserts {#asynchronous-inserts} + + + +## Choose an interface - HTTP or Native {#choose-an-interface} + +### Native {#choose-an-interface-native} + +ClickHouse offers two main interfaces for data ingestion: the **native interface** and the **HTTP interface** - each with trade-offs between performance and flexibility. The native interface, used by [clickhouse-client](/interfaces/cli) and select language clients like Go and C++, is purpose-built for performance. It always transmits data in ClickHouse's highly efficient Native format, supports block-wise compression with LZ4 or ZSTD, and minimizes server-side processing by offloading work such as parsing and format conversion to the client. + +It even enables client-side computation of MATERIALIZED and DEFAULT column values, allowing the server to skip these steps entirely. This makes the native interface ideal for high-throughput ingestion scenarios where efficiency is critical. + +### HTTP {#choose-an-interface-http} + +Unlike many traditional databases, ClickHouse also supports an HTTP interface. **This, by contrast, prioritizes compatibility and flexibility.** It allows data to be sent in [any supported format](/integrations/data-formats) - including JSON, CSV, Parquet, and others - and is widely supported across most ClickHouse clients, including Python, Java, JavaScript, and Rust. + +This is often preferable to ClickHouse's native protocol as it allows traffic to be easily switched with load balancers. We expect small differences in insert performance with the native protocol, which incurs a little less overhead. + +However, it lacks the native protocol's deeper integration and cannot perform client-side optimizations like materialized value computation or automatic conversion to Native format. While HTTP inserts can still be compressed using standard HTTP headers (e.g. `Content-Encoding: lz4`), the compression is applied to the entire payload rather than individual data blocks. 
This interface is often preferred in environments where protocol simplicity, load balancing, or broad format compatibility is more important than raw performance. + +For a more detailed description of these interfaces see [here](/interfaces/overview). + + + diff --git a/docs/best-practices/use_materialized_views.md b/docs/best-practices/use_materialized_views.md new file mode 100644 index 00000000000..a4fb905207a --- /dev/null +++ b/docs/best-practices/use_materialized_views.md @@ -0,0 +1,82 @@ +--- +slug: /best-practices/use-materialized-views +sidebar_position: 10 +sidebar_label: 'Use Materialized Views' +title: 'Use Materialized Views' +description: 'Page describing Materialized Views' +--- + +import Image from '@theme/IdealImage'; +import incremental_materialized_view from '@site/static/images/bestpractices/incremental_materialized_view.gif'; +import refreshable_materialized_view from '@site/static/images/bestpractices/refreshable_materialized_view.gif'; + + +ClickHouse supports two types of materialized views: [**incremental**](/materialized-view/incremental-materialized-view) and [**refreshable**](/materialized-view/refreshable-materialized-view). While both are designed to accelerate queries by pre-computing and storing results, they differ significantly in how and when the underlying queries are executed, what workloads they are suited for, and how data freshness is handled. + +**Users should consider materialized views for specific query patterns which need to be accelerated, assuming previous best practices [regarding type](/best-practices/select-data-types) and [primary key optimization](/best-practices/choosing-a-primary-key) have been performed.** + + +**Incremental materialized views** are updated in real-time. As new data is inserted into the source table, ClickHouse automatically applies the materialized view's query to the new data block and writes the results to a separate target table. Over time, ClickHouse merges these partial results to produce a complete, up-to-date view. This approach is highly efficient because it shifts the computational cost to insert time and only processes new data. As a result, `SELECT` queries against the target table are fast and lightweight. Incremental views support all aggregation functions and scale well—even to petabytes of data—because each query operates on a small, recent subset of the dataset being inserted. + + + +**Refreshable materialized views**, by contrast, are updated on a schedule. These views periodically re-execute their full query and overwrite the result in the target table. This is similar to materialized views in traditional OLTP databases like Postgres. + + + +The choice between incremental and refreshable materialized views depends largely on the nature of the query, how frequently data changes, and whether updates to the view must reflect every row as it is inserted, or if a periodic refresh is acceptable. Understanding these trade-offs is key to designing performant, scalable materialized views in ClickHouse. + +## When to Use Incremental Materialized Views {#when-to-use-incremental-materialized-views} + +Incremental materialized views are generally preferred, as they update automatically in real-time whenever the source tables receive new data. They support all aggregation functions and are particularly effective for aggregations over a single table. 
By computing results incrementally at insert-time, queries run against significantly smaller data subsets, allowing these views to scale effortlessly even to petabytes of data. In most cases they will have no appreciable impact on overall cluster performance. + +Use incremental materialized views when: + +- You require real-time query results updated with every insert. +- You're aggregating or filtering large volumes of data frequently. +- Your queries involve straightforward transformations or aggregations on single tables. + +For examples of incremental materialized views see [here](/materialized-view/incremental-materialized-view). + +## When to Use Refreshable Materialized Views {#when-to-use-refreshable-materialized-views} + +Refreshable materialized views execute their queries periodically rather than incrementally, storing the query result set for rapid retrieval. + +They are most useful when query performance is critical (e.g. sub-millisecond latency) and slightly stale results are acceptable. Since the query is re-run in full, refreshable views are best suited to queries that are either relatively fast to compute or which can be computed at infrequent intervals (e.g. hourly), such as caching “top N” results or lookup tables. + +Execution frequency should be tuned carefully to avoid excessive load on the system. Extremely complex queries which consume significant resources should be scheduled cautiously - these can cause overall cluster performance to degrade by impacting caches and consuming CPU and memory. The query should run relatively quickly compared to the refresh interval to avoid overloading your cluster. For example, do not schedule a view to be updated every 10 seconds if the query itself takes at least 10 seconds to compute. + +## Summary {#summary} + +In summary, use refreshable materialized views when: + +- You need cached query results available instantly, and minor delays in freshness are acceptable. +- You need the top N for a query result set. +- The size of the result set does not grow unbounded over time. This will cause performance of the target view to degrade. +- You're performing complex joins or denormalization involving multiple tables, requiring updates whenever any source table changes. +- You're building batch workflows, denormalization tasks, or creating view dependencies similar to DBT DAGs. + +For examples of refreshable materialized views see [here](/materialized-view/refreshable-materialized-view). + +### APPEND vs REPLACE Mode {#append-vs-replace-mode} + +Refreshable materialized views support two modes for writing data to the target table: `APPEND` and `REPLACE`. These modes define how the result of the view's query is written when the view is refreshed. + +`REPLACE` is the default behavior. Each time the view is refreshed, the previous contents of the target table are completely overwritten with the latest query result. This is suitable for use cases where the view should always reflect the latest state, such as caching a result set. + +`APPEND`, by contrast, allows new rows to be added to the end of the target table instead of replacing its contents. This enables additional use cases, such as capturing periodic snapshots. `APPEND` is particularly useful when each refresh represents a distinct point-in-time or when historical accumulation of results is desired. + +Choose `APPEND` mode when: + +- You want to keep a history of past refreshes. +- You're building periodic snapshots or reports. 
+- You need to incrementally collect refreshed results over time. + +Choose `REPLACE` mode when: + +- You only need the most recent result. +- Stale data should be discarded entirely. +- The view represents a current state or lookup. + +Users can find an application of the `APPEND` functionality if building a [Medallion architecture](https://clickhouse.com/blog/building-a-medallion-architecture-for-bluesky-json-data-with-clickhouse). + diff --git a/docs/best-practices/using_data_skipping_indices.md b/docs/best-practices/using_data_skipping_indices.md new file mode 100644 index 00000000000..09a4c67a327 --- /dev/null +++ b/docs/best-practices/using_data_skipping_indices.md @@ -0,0 +1,250 @@ +--- +slug: /best-practices/use-data-skipping-indices-where-appropriate +sidebar_position: 10 +sidebar_label: 'Data Skipping Indices' +title: 'Use Data Skipping Indices where Appropriate' +description: 'Page describing how and when to use data skipping indices' +--- + +import Image from '@theme/IdealImage'; +import building_skipping_indices from '@site/static/images/bestpractices/building_skipping_indices.gif'; +import using_skipping_indices from '@site/static/images/bestpractices/using_skipping_indices.gif'; + +Data skipping indices should be considered when previous best practices have been followed i.e. types are optimized, a good primary key has been selected and materialized views have been exploited. + +These types of indices can be used to accelerate query performance if used carefully with an understanding of how they work. + +ClickHouse provides a powerful mechanism called **data skipping indices** that can dramatically reduce the amount of data scanned during query execution - particularly when the primary key isn't helpful for a specific filter condition. Unlike traditional databases that rely on row-based secondary indexes (like B-trees), ClickHouse is a column-store and doesn't store row locations in a way that supports such structures. Instead, it uses skip indexes, which help it avoid reading blocks of data guaranteed not to match a query's filtering conditions. + +Skip indexes work by storing metadata about blocks of data - such as min/max values, value sets, or Bloom filter representations- and using this metadata during query execution to determine which data blocks can be skipped entirely. They apply only to the [MergeTree family](/engines/table-engines/mergetree-family/mergetree) of table engines and are defined using an expression, an index type, a name, and a granularity that defines the size of each indexed block. These indexes are stored alongside the table data and are consulted when the query filter matches the index expression. + +There are several types of data skipping indexes, each suited to different types of queries and data distributions: + +* **minmax**: Tracks the minimum and maximum value of an expression per block. Ideal for range queries on loosely sorted data. +* **set(N)**: Tracks a set of values up to a specified size N for each block. Effective on columns with low cardinality per blocks. +* **bloom_filter**: Probabilistically determines if a value exists in a block, allowing fast approximate filtering for set membership. Effective for optimizing queries looking for the “needle in a haystack”, where a positive match is needed. +* **tokenbf_v1 / ngrambf_v1**: Specialized Bloom filter variants designed for searching tokens or character sequences in strings - particularly useful for log data or text search use cases. 
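+To make the syntax concrete, here is a minimal sketch of how a couple of these index types might be declared at table creation time. The `logs` table and its columns are hypothetical and serve only to illustrate the declarations - adapt the expressions and `GRANULARITY` values to your own schema:
+
+```sql
+-- Hypothetical table: a set index for a column with low per-block cardinality
+-- and a token Bloom filter for token search over log messages.
+CREATE TABLE logs
+(
+    timestamp DateTime,
+    error_code LowCardinality(String),
+    message String,
+    INDEX error_code_idx error_code TYPE set(100) GRANULARITY 4,
+    INDEX message_tokens_idx message TYPE tokenbf_v1(32768, 3, 0) GRANULARITY 4
+)
+ENGINE = MergeTree
+ORDER BY timestamp
+```
+
+Skip indexes can also be added to an existing table with `ALTER TABLE ... ADD INDEX`, as shown in the example section below.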
+
+While powerful, skip indexes must be used with care. They only provide benefit when they eliminate a meaningful number of data blocks, and can actually introduce overhead if the query or data structure doesn't align. If even a single matching value exists in a block, that entire block must still be read.
+
+**Effective skip index usage often depends on a strong correlation between the indexed column and the table's primary key, or inserting data in a way that groups similar values together.**
+
+In general, data skipping indices are best applied after ensuring proper primary key design and type optimization. They are particularly useful for:
+
+* Columns with high overall cardinality but low cardinality within a block.
+* Rare values that are critical for search (e.g. error codes, specific IDs).
+* Cases where filtering occurs on non-primary key columns with localized distribution.
+
+Always:
+
+1. Test skip indexes on real data with realistic queries. Try different index types and granularity values.
+2. Evaluate their impact using tools like `send_logs_level='trace'` and `EXPLAIN indexes=1` to view index effectiveness.
+3. Evaluate the size of each index and how it is affected by granularity. Reducing the granularity often improves performance up to a point, because more granules can be filtered out and fewer need to be scanned. However, the index itself grows as granularity is lowered, and performance can degrade again once reading the larger index outweighs the savings. Measure performance and index size across a range of granularity values (a query for inspecting index sizes is sketched after this list). This is particularly pertinent for Bloom filter indexes.
+
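+One way to measure index size, as referenced in point 3 above, is to query the `system.data_skipping_indices` system table. The sketch below assumes a table named `posts` with skip indexes already defined - substitute your own table name:
+
+```sql
+-- Report the on-disk size of every skip index on the table.
+-- Re-run after changing GRANULARITY to weigh index size against query speed.
+SELECT
+    name,
+    type,
+    granularity,
+    formatReadableSize(data_compressed_bytes) AS compressed,
+    formatReadableSize(data_uncompressed_bytes) AS uncompressed
+FROM system.data_skipping_indices
+WHERE table = 'posts'
+```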

+**When used appropriately, skip indexes can provide a substantial performance boost - when used blindly, they can add unnecessary cost.** + +For a more detailed guide on Data Skipping Indices see [here](/sql-reference/statements/alter/skipping-index). + +## Example {#example} + +Consider the following optimized table. This contains Stack Overflow data with a row per post. + +```sql +CREATE TABLE stackoverflow.posts +( + `Id` Int32 CODEC(Delta(4), ZSTD(1)), + `PostTypeId` Enum8('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8), + `AcceptedAnswerId` UInt32, + `CreationDate` DateTime64(3, 'UTC'), + `Score` Int32, + `ViewCount` UInt32 CODEC(Delta(4), ZSTD(1)), + `Body` String, + `OwnerUserId` Int32, + `OwnerDisplayName` String, + `LastEditorUserId` Int32, + `LastEditorDisplayName` String, + `LastEditDate` DateTime64(3, 'UTC') CODEC(Delta(8), ZSTD(1)), + `LastActivityDate` DateTime64(3, 'UTC'), + `Title` String, + `Tags` String, + `AnswerCount` UInt16 CODEC(Delta(2), ZSTD(1)), + `CommentCount` UInt8, + `FavoriteCount` UInt8, + `ContentLicense` LowCardinality(String), + `ParentId` String, + `CommunityOwnedDate` DateTime64(3, 'UTC'), + `ClosedDate` DateTime64(3, 'UTC') +) +ENGINE = MergeTree +PARTITION BY toYear(CreationDate) +ORDER BY (PostTypeId, toDate(CreationDate)) +``` + +This table is optimized for queries which filter and aggregate by post type and date. Suppose we wished to count the number of posts with over 10,000,000 views published after 2009. + +```sql +SELECT count() +FROM stackoverflow.posts +WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000) + +┌─count()─┐ +│ 5 │ +└─────────┘ + +1 row in set. Elapsed: 0.720 sec. Processed 59.55 million rows, 230.23 MB (82.66 million rows/s., 319.56 MB/s.) +``` + +This query is able to exclude some of the rows (and granules) using the primary index. However, the majority of rows still need to be read as indicated by the above response and following `EXPLAIN indexes=1`: + +```sql +EXPLAIN indexes = 1 +SELECT count() +FROM stackoverflow.posts +WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000) +LIMIT 1 + +┌─explain──────────────────────────────────────────────────────────┐ +│ Expression ((Project names + Projection)) │ +│ Limit (preliminary LIMIT (without OFFSET)) │ +│ Aggregating │ +│ Expression (Before GROUP BY) │ +│ Expression │ +│ ReadFromMergeTree (stackoverflow.posts) │ +│ Indexes: │ +│ MinMax │ +│ Keys: │ +│ CreationDate │ +│ Condition: (CreationDate in ('1230768000', +Inf)) │ +│ Parts: 123/128 │ +│ Granules: 8513/8545 │ +│ Partition │ +│ Keys: │ +│ toYear(CreationDate) │ +│ Condition: (toYear(CreationDate) in [2009, +Inf)) │ +│ Parts: 123/123 │ +│ Granules: 8513/8513 │ +│ PrimaryKey │ +│ Keys: │ +│ toDate(CreationDate) │ +│ Condition: (toDate(CreationDate) in [14245, +Inf)) │ +│ Parts: 123/123 │ +│ Granules: 8513/8513 │ +└──────────────────────────────────────────────────────────────────┘ + +25 rows in set. Elapsed: 0.070 sec. +``` + +A simple analysis shows that `ViewCount` is correlated with the `CreationDate` (a primary key) as one might expect - the longer a post exists, the more time it has to be viewed. + +```sql +SELECT toDate(CreationDate) as day, avg(ViewCount) as view_count FROM stackoverflow.posts WHERE day > '2009-01-01' GROUP BY day +``` + +This therefore makes a logical choice for a data skipping index. Given the numeric type, a min_max index makes sense. 
We add an index using the following `ALTER TABLE` commands - first adding it, then "materializing it". + +```sql +ALTER TABLE stackoverflow.posts + (ADD INDEX view_count_idx ViewCount TYPE minmax GRANULARITY 1); + +ALTER TABLE stackoverflow.posts MATERIALIZE INDEX view_count_idx; +``` + +This index could have also been added during initial table creation. The schema with the min max index defined as part of the DDL: + +```sql +CREATE TABLE stackoverflow.posts +( + `Id` Int32 CODEC(Delta(4), ZSTD(1)), + `PostTypeId` Enum8('Question' = 1, 'Answer' = 2, 'Wiki' = 3, 'TagWikiExcerpt' = 4, 'TagWiki' = 5, 'ModeratorNomination' = 6, 'WikiPlaceholder' = 7, 'PrivilegeWiki' = 8), + `AcceptedAnswerId` UInt32, + `CreationDate` DateTime64(3, 'UTC'), + `Score` Int32, + `ViewCount` UInt32 CODEC(Delta(4), ZSTD(1)), + `Body` String, + `OwnerUserId` Int32, + `OwnerDisplayName` String, + `LastEditorUserId` Int32, + `LastEditorDisplayName` String, + `LastEditDate` DateTime64(3, 'UTC') CODEC(Delta(8), ZSTD(1)), + `LastActivityDate` DateTime64(3, 'UTC'), + `Title` String, + `Tags` String, + `AnswerCount` UInt16 CODEC(Delta(2), ZSTD(1)), + `CommentCount` UInt8, + `FavoriteCount` UInt8, + `ContentLicense` LowCardinality(String), + `ParentId` String, + `CommunityOwnedDate` DateTime64(3, 'UTC'), + `ClosedDate` DateTime64(3, 'UTC'), + INDEX view_count_idx ViewCount TYPE minmax GRANULARITY 1 --index here +) +ENGINE = MergeTree +PARTITION BY toYear(CreationDate) +ORDER BY (PostTypeId, toDate(CreationDate)) +``` + +The following animation illustrates how our minmax skipping index is built for the example table, tracking the minimum and maximum `ViewCount` values for each block of rows (granule) in the table: + + + +Repeating our earlier query shows significant performance improvements. Notice all the reduced number of rows scanned: + +```sql +SELECT count() +FROM stackoverflow.posts +WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000) + +┌─count()─┐ +│ 5 │ +└─────────┘ + +1 row in set. Elapsed: 0.012 sec. Processed 39.11 thousand rows, 321.39 KB (3.40 million rows/s., 27.93 MB/s.) +``` + +An `EXPLAIN indexes=1` confirms use of the index. + +```sql +EXPLAIN indexes = 1 +SELECT count() +FROM stackoverflow.posts +WHERE (CreationDate > '2009-01-01') AND (ViewCount > 10000000) + +┌─explain────────────────────────────────────────────────────────────┐ +│ Expression ((Project names + Projection)) │ +│ Aggregating │ +│ Expression (Before GROUP BY) │ +│ Expression │ +│ ReadFromMergeTree (stackoverflow.posts) │ +│ Indexes: │ +│ MinMax │ +│ Keys: │ +│ CreationDate │ +│ Condition: (CreationDate in ('1230768000', +Inf)) │ +│ Parts: 123/128 │ +│ Granules: 8513/8545 │ +│ Partition │ +│ Keys: │ +│ toYear(CreationDate) │ +│ Condition: (toYear(CreationDate) in [2009, +Inf)) │ +│ Parts: 123/123 │ +│ Granules: 8513/8513 │ +│ PrimaryKey │ +│ Keys: │ +│ toDate(CreationDate) │ +│ Condition: (toDate(CreationDate) in [14245, +Inf)) │ +│ Parts: 123/123 │ +│ Granules: 8513/8513 │ +│ Skip │ +│ Name: view_count_idx │ +│ Description: minmax GRANULARITY 1 │ +│ Parts: 5/123 │ +│ Granules: 23/8513 │ +└────────────────────────────────────────────────────────────────────┘ + +29 rows in set. Elapsed: 0.211 sec. 
+``` + +We also show an animation how the minmax skipping index prunes all row blocks that cannot possibly contain matches for the `ViewCount` > 10,000,000 predicate in our example query: + + diff --git a/docs/cloud/bestpractices/asyncinserts.md b/docs/cloud/bestpractices/asyncinserts.md deleted file mode 100644 index 5b12e49db0e..00000000000 --- a/docs/cloud/bestpractices/asyncinserts.md +++ /dev/null @@ -1,65 +0,0 @@ ---- -slug: /cloud/bestpractices/asynchronous-inserts -sidebar_label: 'Asynchronous Inserts' -title: 'Asynchronous Inserts (async_insert)' -description: 'Describes how to use asynchronous inserts into ClickHouse as an alternative best practice to batching' ---- - -import Image from '@theme/IdealImage'; -import asyncInsert01 from '@site/static/images/cloud/bestpractices/async-01.png'; -import asyncInsert02 from '@site/static/images/cloud/bestpractices/async-02.png'; -import asyncInsert03 from '@site/static/images/cloud/bestpractices/async-03.png'; - -Inserting data into ClickHouse in large batches is a best practice. It saves compute cycles and disk I/O, and therefore it saves money. If your use case allows you to batch your inserts external to ClickHouse, then that is one option. If you would like ClickHouse to create the batches, then you can use the asynchronous INSERT mode described here. - -Use asynchronous inserts as an alternative to both batching data on the client-side and keeping the insert rate at around one insert query per second by enabling the [async_insert](/operations/settings/settings.md/#async_insert) setting. This causes ClickHouse to handle the batching on the server-side. - -By default, ClickHouse is writing data synchronously. -Each insert sent to ClickHouse causes ClickHouse to immediately create a part containing the data from the insert. -This is the default behavior when the async_insert setting is set to its default value of 0: - - - -By setting async_insert to 1, ClickHouse first stores the incoming inserts into an in-memory buffer before flushing them regularly to disk. - -There are two possible conditions that can cause ClickHouse to flush the buffer to disk: -- buffer size has reached N bytes in size (N is configurable via [async_insert_max_data_size](/operations/settings/settings.md/#async_insert_max_data_size)) -- at least N ms has passed since the last buffer flush (N is configurable via [async_insert_busy_timeout_max_ms](/operations/settings/settings.md/#async_insert_busy_timeout_max_ms)) - -Any time any of the conditions above are met, ClickHouse will flush its in-memory buffer to disk. - -:::note -Your data is available for read queries once the data is written to a part on storage. Keep this in mind for when you want to modify the `async_insert_busy_timeout_ms` (set as 1 second by default) or the `async_insert_max_data_size` (set as 10 MiB by default) settings. -::: - -With the [wait_for_async_insert](/operations/settings/settings.md/#wait_for_async_insert) setting, you can configure if you want an insert statement to return with an acknowledgment either immediately after the data got inserted into the buffer (wait_for_async_insert = 0) or by default, after the data got written to a part after flushing from buffer (wait_for_async_insert = 1). 
- -The following two diagrams illustrate the two settings for async_insert and wait_for_async_insert: - - - - - -### Enabling asynchronous inserts {#enabling-asynchronous-inserts} - -Asynchronous inserts can be enabled for a particular user, or for a specific query: - -- Enabling asynchronous inserts at the user level. This example uses the user `default`, if you create a different user then substitute that username: - ```sql - ALTER USER default SETTINGS async_insert = 1 - ``` -- You can specify the asynchronous insert settings by using the SETTINGS clause of insert queries: - ```sql - INSERT INTO YourTable SETTINGS async_insert=1, wait_for_async_insert=1 VALUES (...) - ``` -- You can also specify asynchronous insert settings as connection parameters when using a ClickHouse programming language client. - - As an example, this is how you can do that within a JDBC connection string when you use the ClickHouse Java JDBC driver for connecting to ClickHouse Cloud : - ```bash - "jdbc:ch://HOST.clickhouse.cloud:8443/?user=default&password=PASSWORD&ssl=true&custom_http_params=async_insert=1,wait_for_async_insert=1" - ``` -Our strong recommendation is to use async_insert=1,wait_for_async_insert=1 if using asynchronous inserts. Using wait_for_async_insert=0 is very risky because your INSERT client may not be aware if there are errors, and also can cause potential overload if your client continues to write quickly in a situation where the ClickHouse server needs to slow down the writes and create some backpressure in order to ensure reliability of the service. - -:::note Automatic deduplication is disabled by default when using asynchronous inserts -Manual batching (see [bulk insert](/cloud/bestpractices/bulkinserts.md))) has the advantage that it supports the [built-in automatic deduplication](/engines/table-engines/mergetree-family/replication.md) of table data if (exactly) the same insert statement is sent multiple times to ClickHouse Cloud, for example, because of an automatic retry in client software because of some temporary network connection issues. -::: diff --git a/docs/cloud/bestpractices/avoidmutations.md b/docs/cloud/bestpractices/avoidmutations.md deleted file mode 100644 index 8a042a4367f..00000000000 --- a/docs/cloud/bestpractices/avoidmutations.md +++ /dev/null @@ -1,15 +0,0 @@ ---- -slug: /cloud/bestpractices/avoid-mutations -sidebar_label: 'Avoid Mutations' -title: 'Avoid Mutations' -description: 'Page describing why you should avoid mutations, ALTER queries that manipulate table data through deletion or updates' ---- - -Mutations refers to [ALTER](/sql-reference/statements/alter/) queries that manipulate table data through deletion or updates. Most notably they are queries like ALTER TABLE ... DELETE, UPDATE, etc. Performing such queries will produce new mutated versions of the data parts. This means that such statements would trigger a rewrite of whole data parts for all data that was inserted before the mutation, translating to a large amount of write requests. - -For updates, you can avoid these large amounts of write requests by using specialised table engines like [ReplacingMergeTree](/engines/table-engines/mergetree-family/replacingmergetree.md) or [CollapsingMergeTree](/engines/table-engines/mergetree-family/collapsingmergetree.md) instead of the default MergeTree table engine. 
- - -## Related content {#related-content} - -- Blog: [Handling Updates and Deletes in ClickHouse](https://clickhouse.com/blog/handling-updates-and-deletes-in-clickhouse) diff --git a/docs/cloud/bestpractices/avoidoptimizefinal.md b/docs/cloud/bestpractices/avoidoptimizefinal.md deleted file mode 100644 index 0cf27f3a7ae..00000000000 --- a/docs/cloud/bestpractices/avoidoptimizefinal.md +++ /dev/null @@ -1,34 +0,0 @@ ---- -slug: /cloud/bestpractices/avoid-optimize-final -sidebar_label: 'Avoid Optimize Final' -title: 'Avoid Optimize Final' -keywords: ['OPTIMIZE TABLE', 'FINAL', 'unscheduled merge'] -description: 'Page describing the behaviour of OPTIMIZE TABLE...FINAL, and why you should avoid it' ---- - -Using the [`OPTIMIZE TABLE ... FINAL`](/sql-reference/statements/optimize/) query initiates an unscheduled merge of data parts for a specific table into one single data part. -During this process, ClickHouse performs the following steps: - -- Data parts are read. -- The parts get uncompressed. -- The parts get merged. -- They are compressed into a single part. -- The part is then written back into the object store. - -The operations described above are resource intensive, consuming significant CPU and disk I/O. -It is important to note that using this optimization will force a rewrite of a part, -even if merging to a single part has already occurred. - -Additionally, use of the `OPTIMIZE TABLE ... FINAL` query may disregard -setting [`max_bytes_to_merge_at_max_space_in_pool`](/operations/settings/merge-tree-settings#max_bytes_to_merge_at_max_space_in_pool) which controls the maximum size of parts -that ClickHouse will typically merge by itself in the background. - -The [`max_bytes_to_merge_at_max_space_in_pool`](/operations/settings/merge-tree-settings#max_bytes_to_merge_at_max_space_in_pool) setting is by default set to 150 GB. -When running `OPTIMIZE TABLE ... FINAL`, -the steps outlined above will be performed resulting in a single part after merge. -This remaining single part could exceed the 150 GB specified by the default of this setting. -This is another important consideration and reason why you should avoid use of this statement, -since merging a large number of 150 GB parts into a single part could require a significant amount of time and/or memory. - - - diff --git a/docs/cloud/bestpractices/bulkinserts.md b/docs/cloud/bestpractices/bulkinserts.md deleted file mode 100644 index 07a22ed0321..00000000000 --- a/docs/cloud/bestpractices/bulkinserts.md +++ /dev/null @@ -1,18 +0,0 @@ ---- -slug: /cloud/bestpractices/bulk-inserts -sidebar_position: 63 -sidebar_label: 'Use Bulk Inserts' -title: 'Bulk Inserts' -description: 'Page describing why you should ingest data in bulk in ClickHouse' ---- - - -## Ingest data in bulk {#ingest-data-in-bulk} -By default, each insert sent to ClickHouse causes ClickHouse to immediately create a part on storage containing the data from the insert together with other metadata that needs to be stored. -Therefore sending a smaller amount of inserts that each contain more data, compared to sending a larger amount of inserts that each contain less data, will reduce the number of writes required. Generally, we recommend inserting data in fairly large batches of at least 1,000 rows at a time, and ideally between 10,000 to 100,000 rows. 
To achieve this, consider implementing a buffer mechanism such as using the [Buffer table Engine](/engines/table-engines/special/buffer.md) to enable batch inserts, or use asynchronous inserts (see [asynchronous inserts](/cloud/bestpractices/asyncinserts.md)). - -:::tip -Regardless of the size of your inserts, we recommend keeping the number of insert queries around one insert query per second. -The reason for that recommendation is that the created parts are merged to larger parts in the background (in order to optimize your data for read queries), and sending too many insert queries per second can lead to situations where the background merging can't keep up with the number of new parts. -However, you can use a higher rate of insert queries per second when you use asynchronous inserts (see [asynchronous inserts](/cloud/bestpractices/asyncinserts.md)). -::: diff --git a/docs/cloud/bestpractices/index.md b/docs/cloud/bestpractices/index.md index 20b8bda1fd4..c1d00a2b712 100644 --- a/docs/cloud/bestpractices/index.md +++ b/docs/cloud/bestpractices/index.md @@ -1,22 +1,31 @@ --- slug: /cloud/bestpractices -keywords: ['Cloud', 'Best Practices', 'Bulk Inserts', 'Asynchronous Inserts', 'Avoid Mutations', 'Avoid Nullable Columns', 'Avoid Optimize Final', 'Low Cardinality Partitioning Key', 'Multi Tenancy'] +keywords: ['Cloud', 'Best Practices', 'Bulk Inserts', 'Asynchronous Inserts', 'Avoid Mutations', 'Avoid Nullable Columns', 'Avoid Optimize Final', 'Low Cardinality Partitioning Key', 'Multi Tenancy', 'Usage Limits'] title: 'Overview' hide_title: true -description: 'Landing page for Best Practices section in ClickHouse' +description: 'Landing page for Best Practices section in ClickHouse Cloud' --- -# Best Practices in ClickHouse +# Best Practices in ClickHouse Cloud {#best-practices-in-clickhouse-cloud} -This section provides six best practices you will want to follow to get the most out of ClickHouse Cloud. +This section provides best practices you will want to follow to get the most out of ClickHouse Cloud. | Page | Description | |----------------------------------------------------------|----------------------------------------------------------------------------| -| [Use Bulk Inserts](/cloud/bestpractices/bulk-inserts) | Learn why you should ingest data in bulk in ClickHouse | -| [Asynchronous Inserts](/cloud/bestpractices/asynchronous-inserts) | Learn how to asynchronously insert data if bulk inserts are not an option. | -| [Avoid Mutations](/cloud/bestpractices/avoid-mutations) | Learn why you should avoid mutations which trigger rewrites. | -| [Avoid Nullable Columns](/cloud/bestpractices/avoid-nullable-columns) | Learn why you should ideally avoid Nullable columns | -| [Avoid Optimize Final](/cloud/bestpractices/avoid-optimize-final) | Learn why you should avoid `OPTIMIZE TABLE ... FINAL` | -| [Choose a Low Cardinality Partitioning Key](/cloud/bestpractices/low-cardinality-partitioning-key) | Learn how to choose a low cardinality partitioning key. | | [Usage Limits](/cloud/bestpractices/usage-limits)| Explore the limits of ClickHouse. | | [Multi tenancy](/cloud/bestpractices/multi-tenancy)| Learn about different strategies to implement multi-tenancy. | + +These are in addition to the standard best practices which apply to all deployments of ClickHouse. 
+ +| Page | Description | +|----------------------------------------------------------------------|--------------------------------------------------------------------------| +| [Choosing a Primary Key](/best-practices/choosing-a-primary-key) | Guidance on selecting an effective Primary Key in ClickHouse. | +| [Select Data Types](/best-practices/select-data-types) | Recommendations for choosing appropriate data types. | +| [Use Materialized Views](/best-practices/use-materialized-views) | When and how to benefit from materialized views. | +| [Minimize and Optimize JOINs](/best-practices/minimize-optimize-joins)| Best practices for minimizing and optimizing JOIN operations. | +| [Choosing a Partitioning Key](/best-practices/choosing-a-partitioning-key) | How to choose and apply partitioning keys effectively. | +| [Selecting an Insert Strategy](/best-practices/selecting-an-insert-strategy) | Strategies for efficient data insertion in ClickHouse. | +| [Data Skipping Indices](/best-practices/use-data-skipping-indices-where-appropriate) | When to apply data skipping indices for performance gains. | +| [Avoid Mutations](/best-practices/avoid-mutations) | Reasons to avoid mutations and how to design without them. | +| [Avoid OPTIMIZE FINAL](/best-practices/avoid-optimize-final) | Why `OPTIMIZE FINAL` can be costly and how to work around it. | +| [Use JSON where appropriate](/best-practices/use-json-where-appropriate) | Considerations for using JSON columns in ClickHouse. | diff --git a/docs/cloud/bestpractices/multitenancy.md b/docs/cloud/bestpractices/multitenancy.md index 26ebe397e86..c0a82c725c4 100644 --- a/docs/cloud/bestpractices/multitenancy.md +++ b/docs/cloud/bestpractices/multitenancy.md @@ -11,7 +11,7 @@ Depending on the requirements, there are different ways to implement multi-tenan ## Shared table {#shared-table} -In this approach, data from all tenants is stored in a single shared table, with a field (or set of fields) used to identify each tenant’s data. To maximize performance, this field should be included in the [primary key](/sql-reference/statements/create/table#primary-key). To ensure that users can only access data belonging to their respective tenants we use [role-based access control](/operations/access-rights), implemented through [row policies](/operations/access-rights#row-policy-management). +In this approach, data from all tenants is stored in a single shared table, with a field (or set of fields) used to identify each tenant's data. To maximize performance, this field should be included in the [primary key](/sql-reference/statements/create/table#primary-key). To ensure that users can only access data belonging to their respective tenants we use [role-based access control](/operations/access-rights), implemented through [row policies](/operations/access-rights#row-policy-management). > **We recommend this approach as this is the simplest to manage, particularly when all tenants share the same data schema and data volumes are moderate (< TBs)** @@ -107,11 +107,11 @@ FROM events ## Separate tables {#separate-tables} -In this approach, each tenant’s data is stored in a separate table within the same database, eliminating the need for a specific field to identify tenants. User access is enforced using a [GRANT statement](/sql-reference/statements/grant), ensuring that each user can access only tables containing their tenants' data. 
+In this approach, each tenant's data is stored in a separate table within the same database, eliminating the need for a specific field to identify tenants. User access is enforced using a [GRANT statement](/sql-reference/statements/grant), ensuring that each user can access only tables containing their tenants' data. > **Using separate tables is a good choice when tenants have different data schemas.** -For scenarios involving a few tenants with very large datasets where query performance is critical, this approach may outperform a shared table model. Since there is no need to filter out other tenants’ data, queries can be more efficient. Additionally, primary keys can be further optimized, as there is no need to include an extra field (such as a tenant ID) in the primary key. +For scenarios involving a few tenants with very large datasets where query performance is critical, this approach may outperform a shared table model. Since there is no need to filter out other tenants' data, queries can be more efficient. Additionally, primary keys can be further optimized, as there is no need to include an extra field (such as a tenant ID) in the primary key. Note that this approach doesn't scale to thousands of tenants. See [usage limits](/cloud/bestpractices/usage-limits). @@ -199,7 +199,7 @@ FROM default.events_tenant_1 ## Separate databases {#separate-databases} -Each tenant’s data is stored in a separate database within the same ClickHouse service. +Each tenant's data is stored in a separate database within the same ClickHouse service. > **This approach is useful if each tenant requires a large number of tables and possibly materialized views, and has a different data schema. However, it may become challenging to manage if the number of tenants is large.** @@ -311,7 +311,7 @@ The most radical approach is to use a different ClickHouse service per tenant. > **This less common method would be a solution if tenants' data is required to be stored in different regions - for legal, security or proximity reasons.** -A user account must be created on each service where the user can access their respective tenant’s data. +A user account must be created on each service where the user can access their respective tenant's data. This approach is harder to manage and brings overhead with each service, as each requires its own infrastructure to run. Services can be managed via the [ClickHouse Cloud API](/cloud/manage/api/api-overview) with orchestration also possible via the [official Terraform provider](https://registry.terraform.io/providers/ClickHouse/clickhouse/latest/docs).
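To illustrate the access model for the separate tables approach described above, here is a minimal sketch (the user name and password are placeholders) that restricts a tenant's user to its own table with `GRANT`:

```sql
-- Hypothetical tenant-specific user for the "separate tables" approach.
CREATE USER tenant_1_user IDENTIFIED WITH sha256_password BY 'CHANGE_ME';

-- The user may only read its own tenant's table.
GRANT SELECT ON default.events_tenant_1 TO tenant_1_user;
```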
diff --git a/docs/cloud/bestpractices/partitioningkey.md b/docs/cloud/bestpractices/partitioningkey.md deleted file mode 100644 index 9cff6c2c793..00000000000 --- a/docs/cloud/bestpractices/partitioningkey.md +++ /dev/null @@ -1,24 +0,0 @@ ---- -slug: /cloud/bestpractices/low-cardinality-partitioning-key -sidebar_label: 'Choose a Low Cardinality Partitioning Key' -title: 'Choose a Low Cardinality Partitioning Key' -description: 'Page describing why you should choose a low cardinality partitioning key as a best practice' ---- - -import Image from '@theme/IdealImage'; -import partitioning01 from '@site/static/images/cloud/bestpractices/partitioning-01.png'; -import partitioning02 from '@site/static/images/cloud/bestpractices/partitioning-02.png'; - -When you send an insert statement (that should contain many rows - see [section above](/optimize/bulk-inserts)) to a table in ClickHouse Cloud, and that -table is not using a [partitioning key](/engines/table-engines/mergetree-family/custom-partitioning-key.md) then all row data from that insert is written into a new part on storage: - - - -However, when you send an insert statement to a table in ClickHouse Cloud, and that table has a partitioning key, then ClickHouse: -- checks the partitioning key values of the rows contained in the insert -- creates one new part on storage per distinct partitioning key value -- places the rows in the corresponding parts by partitioning key value - - - -Therefore, to minimize the number of write requests to the ClickHouse Cloud object storage, use a low cardinality partitioning key or avoid using any partitioning key for your table. diff --git a/docs/cloud/get-started/cloud-quick-start.md b/docs/cloud/get-started/cloud-quick-start.md index 444843cef90..c984024ea81 100644 --- a/docs/cloud/get-started/cloud-quick-start.md +++ b/docs/cloud/get-started/cloud-quick-start.md @@ -189,7 +189,7 @@ You can use the familiar [`INSERT INTO TABLE`](../../sql-reference/statements/in :::tip ClickHouse best practice Insert a large number of rows per batch - tens of thousands or even millions of -rows at once. Don't worry - ClickHouse can easily handle that type of volume - and it will [save you money](/cloud/bestpractices/bulkinserts.md) by sending fewer write requests to your service. +rows at once. Don't worry - ClickHouse can easily handle that type of volume - and it will [save you money](/best-practices/selecting-an-insert-strategy#batch-inserts-if-synchronous) by sending fewer write requests to your service. :::
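As a minimal sketch of the batching advice above (the table and rows are purely illustrative), a single multi-row `INSERT` produces one part on storage rather than one part per row:

```sql
-- Illustrative table: the point is that a single INSERT carries many rows,
-- so ClickHouse creates one part instead of one part per row.
INSERT INTO events (user_id, event_type, ts) VALUES
    (1, 'view',  now()),
    (2, 'click', now()),
    (3, 'view',  now());
-- In practice, aim for tens of thousands of rows (or more) per batch.
```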
diff --git a/docs/cloud/get-started/index.md b/docs/cloud/get-started/index.md index bc9901b0f98..c632a4e6f96 100644 --- a/docs/cloud/get-started/index.md +++ b/docs/cloud/get-started/index.md @@ -14,5 +14,5 @@ Welcome to ClickHouse Cloud! Explore the pages below to learn more about what Cl | [SQL Console](/cloud/get-started/sql-console) | Learn about the interactive SQL console available in Cloud | | [Query Insights](/cloud/get-started/query-insights) | Learn about how Cloud's Query Insights feature makes ClickHouse's built-in query log easier to use through various visualizations and tables. | | [Query Endpoints](/cloud/get-started/query-endpoints) | Learn about the Query API Endpoints feature which allows you to create an API endpoint directly from any saved SQL query in the ClickHouse Cloud console. | -| [Dashboards](/cloud/manage/dashboards) | Learn about how SQL Console’s dashboards feature allows you to collect and share visualizations from saved queries. | +| [Dashboards](/cloud/manage/dashboards) | Learn about how SQL Console's dashboards feature allows you to collect and share visualizations from saved queries. | | [Cloud Support](/cloud/support) | Learn more about Support Services for ClickHouse Cloud users and customers. | \ No newline at end of file diff --git a/docs/cloud/get-started/sql-console.md b/docs/cloud/get-started/sql-console.md index 83fd11447ea..5a8955e651d 100644 --- a/docs/cloud/get-started/sql-console.md +++ b/docs/cloud/get-started/sql-console.md @@ -70,7 +70,7 @@ Click on a table in the list to open it in a new tab. In the Table View, data ca ### Inspecting Cell Data {#inspecting-cell-data} -The Cell Inspector tool can be used to view large amounts of data contained within a single cell. To open it, right-click on a cell and select ‘Inspect Cell’. The contents of the cell inspector can be copied by clicking the copy icon in the top right corner of the inspector contents. +The Cell Inspector tool can be used to view large amounts of data contained within a single cell. To open it, right-click on a cell and select 'Inspect Cell'. The contents of the cell inspector can be copied by clicking the copy icon in the top right corner of the inspector contents. @@ -78,42 +78,42 @@ The Cell Inspector tool can be used to view large amounts of data contained with ### Sorting a table {#sorting-a-table} -To sort a table in the SQL console, open a table and select the ‘Sort’ button in the toolbar. This button will open a menu that will allow you to configure your sort. You can choose a column by which you want to sort and configure the ordering of the sort (ascending or descending). Select ‘Apply’ or press Enter to sort your table +To sort a table in the SQL console, open a table and select the 'Sort' button in the toolbar. This button will open a menu that will allow you to configure your sort. You can choose a column by which you want to sort and configure the ordering of the sort (ascending or descending). Select 'Apply' or press Enter to sort your table -The SQL console also allows you to add multiple sorts to a table. Click the ‘Sort’ button again to add another sort. +The SQL console also allows you to add multiple sorts to a table. Click the 'Sort' button again to add another sort. :::note -Sorts are applied in the order that they appear in the sort pane (top to bottom). To remove a sort, simply click the ‘x’ button next to the sort. +Sorts are applied in the order that they appear in the sort pane (top to bottom). 
To remove a sort, simply click the 'x' button next to the sort. ::: ### Filtering a table {#filtering-a-table} -To filter a table in the SQL console, open a table and select the ‘Filter’ button. Just like sorting, this button will open a menu that will allow you to configure your filter. You can choose a column by which to filter and select the necessary criteria. The SQL console intelligently displays filter options that correspond to the type of data contained in the column. +To filter a table in the SQL console, open a table and select the 'Filter' button. Just like sorting, this button will open a menu that will allow you to configure your filter. You can choose a column by which to filter and select the necessary criteria. The SQL console intelligently displays filter options that correspond to the type of data contained in the column. -When you’re happy with your filter, you can select ‘Apply’ to filter your data. You can also add additional filters as shown below. +When you're happy with your filter, you can select 'Apply' to filter your data. You can also add additional filters as shown below. -Similar to the sort functionality, click the ‘x’ button next to a filter to remove it. +Similar to the sort functionality, click the 'x' button next to a filter to remove it. ### Filtering and sorting together {#filtering-and-sorting-together} -The SQL console allows you to filter and sort a table at the same time. To do this, add all desired filters and sorts using the steps described above and click the ‘Apply’ button. +The SQL console allows you to filter and sort a table at the same time. To do this, add all desired filters and sorts using the steps described above and click the 'Apply' button. ### Creating a query from filters and sorts {#creating-a-query-from-filters-and-sorts} -The SQL console can convert your sorts and filters directly into queries with one click. Simply select the ‘Create Query’ button from the toolbar with the sort and filter parameters of your choosing. After clicking ‘Create query’, a new query tab will open pre-populated with the SQL command corresponding to the data contained in your table view. +The SQL console can convert your sorts and filters directly into queries with one click. Simply select the 'Create Query' button from the toolbar with the sort and filter parameters of your choosing. After clicking 'Create query', a new query tab will open pre-populated with the SQL command corresponding to the data contained in your table view. :::note -Filters and sorts are not mandatory when using the ‘Create Query’ feature. +Filters and sorts are not mandatory when using the 'Create Query' feature. ::: You can learn more about querying in the SQL console by reading the (link) query documentation. @@ -124,14 +124,14 @@ You can learn more about querying in the SQL console by reading the (link) query There are two ways to create a new query in the SQL console. -- Click the ‘+’ button in the tab bar -- Select the ‘New Query’ button from the left sidebar query list +- Click the '+' button in the tab bar +- Select the 'New Query' button from the left sidebar query list ### Running a Query {#running-a-query} -To run a query, type your SQL command(s) into the SQL Editor and click the ‘Run’ button or use the shortcut `cmd / ctrl + enter`. To write and run multiple commands sequentially, make sure to add a semicolon after each command. +To run a query, type your SQL command(s) into the SQL Editor and click the 'Run' button or use the shortcut `cmd / ctrl + enter`. 
To write and run multiple commands sequentially, make sure to add a semicolon after each command. Query Execution Options By default, clicking the run button will run all commands contained in the SQL Editor. The SQL console supports two other query execution options: @@ -139,17 +139,17 @@ By default, clicking the run button will run all commands contained in the SQL E - Run selected command(s) - Run command at the cursor -To run selected command(s), highlight the desired command or sequence of commands and click the ‘Run’ button (or use the `cmd / ctrl + enter` shortcut). You can also select ‘Run selected’ from the SQL Editor context menu (opened by right-clicking anywhere within the editor) when a selection is present. +To run selected command(s), highlight the desired command or sequence of commands and click the 'Run' button (or use the `cmd / ctrl + enter` shortcut). You can also select 'Run selected' from the SQL Editor context menu (opened by right-clicking anywhere within the editor) when a selection is present. Running the command at the current cursor position can be achieved in two ways: -- Select ‘At Cursor’ from the extended run options menu (or use the corresponding `cmd / ctrl + shift + enter` keyboard shortcut +- Select 'At Cursor' from the extended run options menu (or use the corresponding `cmd / ctrl + shift + enter` keyboard shortcut - - Selecting ‘Run at cursor’ from the SQL Editor context menu + - Selecting 'Run at cursor' from the SQL Editor context menu @@ -159,7 +159,7 @@ The command present at the cursor position will flash yellow on execution. ### Canceling a Query {#canceling-a-query} -While a query is running, the ‘Run’ button in the Query Editor toolbar will be replaced with a ‘Cancel’ button. Simply click this button or press `Esc` to cancel the query. Note: Any results that have already been returned will persist after cancellation. +While a query is running, the 'Run' button in the Query Editor toolbar will be replaced with a 'Cancel' button. Simply click this button or press `Esc` to cancel the query. Note: Any results that have already been returned will persist after cancellation. @@ -222,11 +222,11 @@ Values for any parameters that may exist in a query are automatically added to t ### Searching query results {#searching-query-results} -After a query is executed, you can quickly search through the returned result set using the search input in the result pane. This feature assists in previewing the results of an additional `WHERE` clause or simply checking to ensure that specific data is included in the result set. After inputting a value into the search input, the result pane will update and return records containing an entry that matches the inputted value. In this example, we’ll look for all instances of `breakfast` in the `hackernews` table for comments that contain `ClickHouse` (case-insensitive): +After a query is executed, you can quickly search through the returned result set using the search input in the result pane. This feature assists in previewing the results of an additional `WHERE` clause or simply checking to ensure that specific data is included in the result set. After inputting a value into the search input, the result pane will update and return records containing an entry that matches the inputted value. In this example, we'll look for all instances of `breakfast` in the `hackernews` table for comments that contain `ClickHouse` (case-insensitive): -Note: Any field matching the inputted value will be returned. 
For example, the third record in the above screenshot does not match ‘breakfast’ in the `by` field, but the `text` field does: +Note: Any field matching the inputted value will be returned. For example, the third record in the above screenshot does not match 'breakfast' in the `by` field, but the `text` field does: @@ -242,13 +242,13 @@ Selecting a page size will immediately apply pagination to the result set and na ### Exporting query result data {#exporting-query-result-data} -Query result sets can be easily exported to CSV format directly from the SQL console. To do so, open the `•••` menu on the right side of the result pane toolbar and select ‘Download as CSV’. +Query result sets can be easily exported to CSV format directly from the SQL console. To do so, open the `•••` menu on the right side of the result pane toolbar and select 'Download as CSV'. ## Visualizing Query Data {#visualizing-query-data} -Some data can be more easily interpreted in chart form. You can quickly create visualizations from query result data directly from the SQL console in just a few clicks. As an example, we’ll use a query that calculates weekly statistics for NYC taxi trips: +Some data can be more easily interpreted in chart form. You can quickly create visualizations from query result data directly from the SQL console in just a few clicks. As an example, we'll use a query that calculates weekly statistics for NYC taxi trips: ```sql select @@ -266,19 +266,19 @@ order by -Without visualization, these results are difficult to interpret. Let’s turn them into a chart. +Without visualization, these results are difficult to interpret. Let's turn them into a chart. ### Creating charts {#creating-charts} -To begin building your visualization, select the ‘Chart’ option from the query result pane toolbar. A chart configuration pane will appear: +To begin building your visualization, select the 'Chart' option from the query result pane toolbar. A chart configuration pane will appear: -We’ll start by creating a simple bar chart tracking `trip_total` by `week`. To accomplish this, we’ll drag the `week` field to the x-axis and the `trip_total` field to the y-axis: +We'll start by creating a simple bar chart tracking `trip_total` by `week`. To accomplish this, we'll drag the `week` field to the x-axis and the `trip_total` field to the y-axis: -Most chart types support multiple fields on numeric axes. To demonstrate, we’ll drag the fare_total field onto the y-axis: +Most chart types support multiple fields on numeric axes. To demonstrate, we'll drag the fare_total field onto the y-axis: @@ -292,7 +292,7 @@ Chart titles match the name of the query supplying the data. Updating the name o -A number of more advanced chart characteristics can also be adjusted in the ‘Advanced’ section of the chart configuration pane. To begin, we’ll adjust the following settings: +A number of more advanced chart characteristics can also be adjusted in the 'Advanced' section of the chart configuration pane. To begin, we'll adjust the following settings: - Subtitle - Axis titles @@ -302,6 +302,6 @@ Our chart will be updated accordingly: -In some scenarios, it may be necessary to adjust the axis scales for each field independently. This can also be accomplished in the ‘Advanced’ section of the chart configuration pane by specifying min and max values for an axis range. 
As an example, the above chart looks good, but in order to demonstrate the correlation between our `trip_total` and `fare_total` fields, the axis ranges need some adjustment: +In some scenarios, it may be necessary to adjust the axis scales for each field independently. This can also be accomplished in the 'Advanced' section of the chart configuration pane by specifying min and max values for an axis range. As an example, the above chart looks good, but in order to demonstrate the correlation between our `trip_total` and `fare_total` fields, the axis ranges need some adjustment: diff --git a/docs/cloud/manage/dashboards.md b/docs/cloud/manage/dashboards.md index 22ddc0c13c1..ff2783f260c 100644 --- a/docs/cloud/manage/dashboards.md +++ b/docs/cloud/manage/dashboards.md @@ -2,7 +2,7 @@ sidebar_label: 'Dashboards' slug: /cloud/manage/dashboards title: 'Dashboards' -description: 'The SQL Console’s dashboards feature allows you to collect and share visualizations from saved queries.' +description: 'The SQL Console's dashboards feature allows you to collect and share visualizations from saved queries.' --- import BetaBadge from '@theme/badges/BetaBadge'; @@ -22,7 +22,7 @@ import dashboards_11 from '@site/static/images/cloud/dashboards/11_dashboards.pn -The SQL Console’s dashboards feature allows you to collect and share visualizations from saved queries. Get started by saving and visualizing queries, adding query visualizations to a dashboard, and making the dashboard interactive using query parameters. +The SQL Console's dashboards feature allows you to collect and share visualizations from saved queries. Get started by saving and visualizing queries, adding query visualizations to a dashboard, and making the dashboard interactive using query parameters. ## Core Concepts {#core-concepts} @@ -38,7 +38,7 @@ You can toggle the query parameter input via the **Global** filters side pane by ## Quick Start {#quick-start} -Let’s create a dashboard to monitor our ClickHouse service using the [query\_log](/operations/system-tables/query_log) system table. +Let's create a dashboard to monitor our ClickHouse service using the [query\_log](/operations/system-tables/query_log) system table. ## Quick Start {#quick-start-1} @@ -46,7 +46,7 @@ Let’s create a dashboard to monitor our ClickHouse service using the [query\_l If you already have saved queries to visualize, you can skip this step. -Open a new query tab. Let’s write a query to count query volume by day on a service using ClickHouse system tables: +Open a new query tab. Let's write a query to count query volume by day on a service using ClickHouse system tables: @@ -56,21 +56,21 @@ We can view the results of the query in table format or start building visualiza More documentation around saved queries can be found in the [Saving a Query section](/cloud/get-started/sql-console#saving-a-query). -We can create and save another query, `query count by query kind`, to count the number of queries by query kind. Here’s a bar chart visualization of the data in the SQL console. +We can create and save another query, `query count by query kind`, to count the number of queries by query kind. Here's a bar chart visualization of the data in the SQL console. -Now that there’s two queries, let’s create a dashboard to visualize and collect these queries. +Now that there's two queries, let's create a dashboard to visualize and collect these queries. ### Create a dashboard {#create-a-dashboard} -Navigate to the Dashboards panel, and hit “New Dashboard”. 
After you assign a name, you’ll have successfully created your first dashboard! +Navigate to the Dashboards panel, and hit “New Dashboard”. After you assign a name, you'll have successfully created your first dashboard! ### Add a visualization {#add-a-visualization} -There’s two saved queries, `queries over time` and `query count by query kind`. Let’s visualize the first as a line chart. Give your visualization a title and subtitle, and select the query to visualize. Next, select the “Line” chart type, and assign an x and y axis. +There are two saved queries, `queries over time` and `query count by query kind`. Let's visualize the first as a line chart. Give your visualization a title and subtitle, and select the query to visualize. Next, select the “Line” chart type, and assign an x and y axis. @@ -80,23 +80,23 @@ Next, let's visualize the second query as a table, and position it below the lin -You’ve created your first dashboard by visualizing two saved queries! +You've created your first dashboard by visualizing two saved queries! ### Configure a filter {#configure-a-filter} -Let’s make this dashboard interactive by adding a filter on query kind so you can display just the trends related to Insert queries. We’ll accomplish this task using [query parameters](/sql-reference/syntax#defining-and-using-query-parameters). +Let's make this dashboard interactive by adding a filter on query kind so you can display just the trends related to Insert queries. We'll accomplish this task using [query parameters](/sql-reference/syntax#defining-and-using-query-parameters). Click on the three dots next to the line chart, and click on the pencil button next to the query to open the in-line query editor. Here, we can edit the underlying saved query directly from the dashboard. -Now, when the yellow run query button is pressed, you’ll see the same query from earlier filtered on just insert queries. Click on the save button to update the query. When you return to the chart settings, you’ll be able to filter the line chart. +Now, when the yellow run query button is pressed, you'll see the same query from earlier filtered on just insert queries. Click on the save button to update the query. When you return to the chart settings, you'll be able to filter the line chart. Now, using Global Filters on the top ribbon, you can toggle the filter by changing the input. -Suppose you want to link the line chart’s filter to the table. You can do this by going back to the visualization settings, and changing the query_kind query parameter’ value source to a table, and selecting the query_kind column as the field to link. +Suppose you want to link the line chart's filter to the table. You can do this by going back to the visualization settings, changing the query_kind query parameter's value source to a table, and selecting the query_kind column as the field to link. diff --git a/docs/cloud/manage/jan2025_faq/new_tiers.md b/docs/cloud/manage/jan2025_faq/new_tiers.md index db610374949..704e3e442f4 100644 --- a/docs/cloud/manage/jan2025_faq/new_tiers.md +++ b/docs/cloud/manage/jan2025_faq/new_tiers.md @@ -62,6 +62,6 @@ The enterprise tier will support standard profiles (1:4 vCPU:memory ratio), as w - **Scheduled upgrades:** Users can select the day of the week/time window for upgrades, both database and cloud releases. - **HIPAA Compliance:** The customer must sign a Business Associate Agreement (BAA) through Legal before we enable HIPAA-compliant regions for them.
- **Private Regions:** It is not self-serve enabled and will need users to route requests through Sales sales@clickhouse.com. -- **Export Backups** to the customer’s cloud account. +- **Export Backups** to the customer's cloud account. diff --git a/docs/cloud/manage/replica-aware-routing.md b/docs/cloud/manage/replica-aware-routing.md index 99f068f0928..a32f5de78c5 100644 --- a/docs/cloud/manage/replica-aware-routing.md +++ b/docs/cloud/manage/replica-aware-routing.md @@ -7,7 +7,7 @@ keywords: ['cloud', 'sticky endpoints', 'sticky', 'endpoints', 'sticky routing', # Replica-aware routing (Private Preview) -Replica-aware routing (also known as sticky sessions, sticky routing, or session affinity) utilizes [Envoy proxy’s ring hash load balancing](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/load_balancers#ring-hash). The main purpose of replica-aware routing is to increase the chance of cache reuse. It does not guarantee isolation. +Replica-aware routing (also known as sticky sessions, sticky routing, or session affinity) utilizes [Envoy proxy's ring hash load balancing](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/load_balancers#ring-hash). The main purpose of replica-aware routing is to increase the chance of cache reuse. It does not guarantee isolation. When enabling replica-aware routing for a service, we allow a wildcard subdomain on top of the service hostname. For a service with the host name `abcxyz123.us-west-2.aws.clickhouse.cloud`, you can use any hostname which matches `*.sticky.abcxyz123.us-west-2.aws.clickhouse.cloud` to visit the service: diff --git a/docs/cloud/reference/byoc.md b/docs/cloud/reference/byoc.md index a40e8d4ad1d..247ef30e1f2 100644 --- a/docs/cloud/reference/byoc.md +++ b/docs/cloud/reference/byoc.md @@ -33,7 +33,7 @@ BYOC is designed specifically for large-scale deployments, and requires customer ## Glossary {#glossary} - **ClickHouse VPC:** The VPC owned by ClickHouse Cloud. -- **Customer BYOC VPC:** The VPC, owned by the customer’s cloud account, is provisioned and managed by ClickHouse Cloud and dedicated to a ClickHouse Cloud BYOC deployment. +- **Customer BYOC VPC:** The VPC, owned by the customer's cloud account, is provisioned and managed by ClickHouse Cloud and dedicated to a ClickHouse Cloud BYOC deployment. - **Customer VPC** Other VPCs owned by the customer cloud account used for applications that need to connect to the Customer BYOC VPC. ## Architecture {#architecture} diff --git a/docs/cloud/reference/cloud-compatibility.md b/docs/cloud/reference/cloud-compatibility.md index 64859e327c2..8250ca62d7f 100644 --- a/docs/cloud/reference/cloud-compatibility.md +++ b/docs/cloud/reference/cloud-compatibility.md @@ -16,8 +16,8 @@ These benefits come as a result of architectural choices underlying ClickHouse C - Compute and storage are separated and thus can be automatically scaled along separate dimensions, so you do not have to over-provision either storage or compute in static instance configurations. - Tiered storage on top of object store and multi-level caching provides virtually limitless scaling and good price/performance ratio, so you do not have to size your storage partition upfront and worry about high storage costs. - High availability is on by default and replication is transparently managed, so you can focus on building your applications or analyzing your data. 
-- Automatic scaling for variable continuous workloads is on by default, so you don’t have to size your service upfront, scale up your servers when your workload increases, or manually scale down your servers when you have less activity -- Seamless hibernation for intermittent workloads is on by default. We automatically pause your compute resources after a period of inactivity and transparently start it again when a new query arrives, so you don’t have to pay for idle resources. +- Automatic scaling for variable continuous workloads is on by default, so you don't have to size your service upfront, scale up your servers when your workload increases, or manually scale down your servers when you have less activity +- Seamless hibernation for intermittent workloads is on by default. We automatically pause your compute resources after a period of inactivity and transparently start it again when a new query arrives, so you don't have to pay for idle resources. - Advanced scaling controls provide the ability to set an auto-scaling maximum for additional cost control or an auto-scaling minimum to reserve compute resources for applications with specialized performance requirements. ## Capabilities {#capabilities} diff --git a/docs/cloud/reference/shared-merge-tree.md b/docs/cloud/reference/shared-merge-tree.md index e1d028ac937..d9250e2e130 100644 --- a/docs/cloud/reference/shared-merge-tree.md +++ b/docs/cloud/reference/shared-merge-tree.md @@ -33,7 +33,7 @@ As you can see, even though the data stored in the ReplicatedMergeTree are in ob -Unlike ReplicatedMergeTree, SharedMergeTree doesn't require replicas to communicate with each other. Instead, all communication happens through shared storage and clickhouse-keeper. SharedMergeTree implements asynchronous leaderless replication and uses clickhouse-keeper for coordination and metadata storage. This means that metadata doesn’t need to be replicated as your service scales up and down. This leads to faster replication, mutation, merges and scale-up operations. SharedMergeTree allows for hundreds of replicas for each table, making it possible to dynamically scale without shards. A distributed query execution approach is used in ClickHouse Cloud to utilize more compute resources for a query. +Unlike ReplicatedMergeTree, SharedMergeTree doesn't require replicas to communicate with each other. Instead, all communication happens through shared storage and clickhouse-keeper. SharedMergeTree implements asynchronous leaderless replication and uses clickhouse-keeper for coordination and metadata storage. This means that metadata doesn't need to be replicated as your service scales up and down. This leads to faster replication, mutation, merges and scale-up operations. SharedMergeTree allows for hundreds of replicas for each table, making it possible to dynamically scale without shards. A distributed query execution approach is used in ClickHouse Cloud to utilize more compute resources for a query. ## Introspection {#introspection} @@ -52,7 +52,7 @@ This table is the alternative to `system.replicated_fetches` SharedMergeTree. It `SharedMergeTree` is enabled by default. -For services that support SharedMergeTree table engine, you don’t need to enable anything manually. You can create tables the same way as you did before and it will automatically use a SharedMergeTree-based table engine corresponding to the engine specified in your CREATE TABLE query. +For services that support SharedMergeTree table engine, you don't need to enable anything manually. 
You can create tables the same way as you did before and it will automatically use a SharedMergeTree-based table engine corresponding to the engine specified in your CREATE TABLE query. ```sql CREATE TABLE my_table( @@ -65,7 +65,7 @@ ORDER BY key This will create the table `my_table` using the SharedMergeTree table engine. -You don’t need to specify `ENGINE=MergeTree` as `default_table_engine=MergeTree` in ClickHouse Cloud. The following query is identical to the query above. +You don't need to specify `ENGINE=MergeTree` as `default_table_engine=MergeTree` in ClickHouse Cloud. The following query is identical to the query above. ```sql CREATE TABLE my_table( diff --git a/docs/cloud/security/gcp-private-service-connect.md b/docs/cloud/security/gcp-private-service-connect.md index 8b52a5b7e6c..45135092a10 100644 --- a/docs/cloud/security/gcp-private-service-connect.md +++ b/docs/cloud/security/gcp-private-service-connect.md @@ -65,7 +65,7 @@ Code examples are provided below to show how to set up Private Service Connect w - GCP VPC in customer GCP project: `default` ::: -You’ll need to retrieve information about your ClickHouse Cloud service. You can do this either via the ClickHouse Cloud Console or the ClickHouse API. If you are going to use the ClickHouse API, please set the following environment variables before proceeding: +You'll need to retrieve information about your ClickHouse Cloud service. You can do this either via the ClickHouse Cloud Console or the ClickHouse API. If you are going to use the ClickHouse API, please set the following environment variables before proceeding: ```shell REGION= diff --git a/docs/cloud/security/personal-data-access.md b/docs/cloud/security/personal-data-access.md index eea104fb95b..8682c52eb60 100644 --- a/docs/cloud/security/personal-data-access.md +++ b/docs/cloud/security/personal-data-access.md @@ -18,7 +18,7 @@ Depending on where you are located, applicable law may also provide you addition **Scope of Personal Data** -Please review ClickHouse’s Privacy Policy for details on personal data that ClickHouse collects and how it may be used. +Please review ClickHouse's Privacy Policy for details on personal data that ClickHouse collects and how it may be used. ## Self Service {#self-service} @@ -47,7 +47,7 @@ Please be sure to include the following details in your support case: | Field | Text to include in your request | |-------------|---------------------------------------------------| | Subject | Data Subject Access Request (DSAR) | -| Description | Detailed description of the information you’d like ClickHouse to look for, collect, and/or provide. | +| Description | Detailed description of the information you'd like ClickHouse to look for, collect, and/or provide. | diff --git a/docs/concepts/glossary.md b/docs/concepts/glossary.md index 8568ed07454..4b36ca9f57f 100644 --- a/docs/concepts/glossary.md +++ b/docs/concepts/glossary.md @@ -25,7 +25,7 @@ A dictionary is a mapping of key-value pairs that is useful for various types of ## Parts {#parts} -A physical file on a disk that stores a portion of the table's data. This is different from a partition, which is a logical division of a table’s data that is created using a partition key. +A physical file on a disk that stores a portion of the table's data. This is different from a partition, which is a logical division of a table's data that is created using a partition key. 
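To see the difference in practice, here is a small illustrative query (the database and table names are assumptions) that lists each physical part of a table alongside the logical partition it belongs to:

```sql
-- Each row is a physical part on disk; the `partition` column shows the
-- logical partition that the part belongs to.
SELECT partition, name, rows, bytes_on_disk
FROM system.parts
WHERE database = 'default' AND table = 'my_table' AND active;
```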
## Replica {#replica} diff --git a/docs/concepts/olap.md b/docs/concepts/olap.md index 1aec431e107..74472dd0622 100644 --- a/docs/concepts/olap.md +++ b/docs/concepts/olap.md @@ -20,7 +20,7 @@ slug: /concepts/olap ## OLAP from the Business Perspective {#olap-from-the-business-perspective} -In recent years business people started to realize the value of data. Companies who make their decisions blindly more often than not fail to keep up with the competition. The data-driven approach of successful companies forces them to collect all data that might be even remotely useful for making business decisions, and imposes on them a need for mechanisms which allow them to analyze this data in a timely manner. Here’s where OLAP database management systems (DBMS) come in. +In recent years business people started to realize the value of data. Companies who make their decisions blindly more often than not fail to keep up with the competition. The data-driven approach of successful companies forces them to collect all data that might be even remotely useful for making business decisions, and imposes on them a need for mechanisms which allow them to analyze this data in a timely manner. Here's where OLAP database management systems (DBMS) come in. In a business sense, OLAP allows companies to continuously plan, analyze, and report operational activities, thus maximizing efficiency, reducing expenses, and ultimately conquering the market share. It could be done either in an in-house system or outsourced to SaaS providers like web/mobile analytics services, CRM services, etc. OLAP is the technology behind many BI applications (Business Intelligence). @@ -36,5 +36,5 @@ Even if a DBMS started out as a pure OLAP or pure OLTP, it is forced to move in The fundamental trade-off between OLAP and OLTP systems remains: -- To build analytical reports efficiently it’s crucial to be able to read columns separately, thus most OLAP databases are [columnar](https://clickhouse.com/engineering-resources/what-is-columnar-database), +- To build analytical reports efficiently it's crucial to be able to read columns separately, thus most OLAP databases are [columnar](https://clickhouse.com/engineering-resources/what-is-columnar-database), - While storing columns separately increases costs of operations on rows, like append or in-place modification, proportionally to the number of columns (which can be huge if the systems try to collect all details of an event just in case). Thus, most OLTP systems store data arranged by rows. diff --git a/docs/concepts/why-clickhouse-is-so-fast.md b/docs/concepts/why-clickhouse-is-so-fast.md index 3246c0453a0..980ffcf1d18 100644 --- a/docs/concepts/why-clickhouse-is-so-fast.md +++ b/docs/concepts/why-clickhouse-is-so-fast.md @@ -75,7 +75,7 @@ All three techniques aim to skip as many rows during full-column reads as possib -Besides that, ClickHouse’s storage layer additionally (and optionally) compresses the raw table data using different codecs. +Besides that, ClickHouse's storage layer additionally (and optionally) compresses the raw table data using different codecs. Column-stores are particularly well suited for such compression as values of the same type and data distribution are located together. @@ -108,7 +108,7 @@ If a single node becomes too small to hold the table data, further nodes can be What sets ClickHouse [apart](https://www.youtube.com/watch?v=CAS2otEoerM) is its meticulous attention to low-level optimization. 
Building a database that simply works is one thing, but engineering it to deliver speed across diverse query types, data structures, distributions, and index configurations is where the "[freak system](https://youtu.be/Vy2t_wZx4Is?si=K7MyzsBBxgmGcuGU&t=3579)" artistry shines. -**Hash Tables.** Let’s take a hash table as an example. Hash tables are central data structures used by joins and aggregations. As a programmer, one needs to consider these design decisions: +**Hash Tables.** Let's take a hash table as an example. Hash tables are central data structures used by joins and aggregations. As a programmer, one needs to consider these design decisions: * The hash function to choose, * The collision resolution: [open addressing](https://en.wikipedia.org/wiki/Open_addressing) or [chaining](https://en.wikipedia.org/wiki/Hash_table#Separate_chaining), diff --git a/docs/data-modeling/schema-design.md b/docs/data-modeling/schema-design.md index 348b714408e..d953c2a8c4e 100644 --- a/docs/data-modeling/schema-design.md +++ b/docs/data-modeling/schema-design.md @@ -247,7 +247,7 @@ LIMIT 3 Peak memory usage: 429.38 MiB. ``` -> The query here is very fast even though all 60m rows have been linearly scanned - ClickHouse is just fast :) You’ll have to trust us ordering keys is worth it at TB and PB scale! +> The query here is very fast even though all 60m rows have been linearly scanned - ClickHouse is just fast :) You'll have to trust us ordering keys is worth it at TB and PB scale! Lets select the columns `PostTypeId` and `CreationDate` as our ordering keys. @@ -317,7 +317,7 @@ In the other guides listed below, we will explore a number of techniques to rest The following approaches all aim to minimize the need to use JOINs to optimize reads and improve query performance. While JOINs are fully supported in ClickHouse, we recommend they are used sparingly (2 to 3 tables in a JOIN query is fine) to achieve optimal performance. -> ClickHouse has no notion of foreign keys. This does not prohibit joins but means referential integrity is left to the user to manage at an application level. In OLAP systems like ClickHouse, data integrity is often managed at the application level or during the data ingestion process rather than being enforced by the database itself where it incurs a significant overhead. This approach allows for more flexibility and faster data insertion. This aligns with ClickHouse’s focus on speed and scalability of read and insert queries with very large datasets. +> ClickHouse has no notion of foreign keys. This does not prohibit joins but means referential integrity is left to the user to manage at an application level. In OLAP systems like ClickHouse, data integrity is often managed at the application level or during the data ingestion process rather than being enforced by the database itself where it incurs a significant overhead. This approach allows for more flexibility and faster data insertion. This aligns with ClickHouse's focus on speed and scalability of read and insert queries with very large datasets. 
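As a minimal sketch of the ordering key chosen above (the column set and types are assumptions for illustration), the table definition would look roughly like this:

```sql
-- Assumed columns; the key point is the compound ordering key,
-- with the lower-cardinality PostTypeId placed first.
CREATE TABLE posts
(
    Id           Int32,
    PostTypeId   UInt8,
    CreationDate DateTime,
    Score        Int32,
    Body         String
)
ENGINE = MergeTree
ORDER BY (PostTypeId, CreationDate);
```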
In order to minimize the use of Joins at query time, users have several tools/approaches: diff --git a/docs/dictionary/index.md b/docs/dictionary/index.md index a40d982c641..7eb5e1e8655 100644 --- a/docs/dictionary/index.md +++ b/docs/dictionary/index.md @@ -101,7 +101,7 @@ GROUP BY table └─────────────────┴─────────────────┴───────────────────┴───────┘ ``` -Data will be stored uncompressed in our dictionary, so we need at least 4GB of memory if we were to store all columns (we won’t) in a dictionary. The dictionary will be replicated across our cluster, so this amount of memory needs to be reserved *per node*. +Data will be stored uncompressed in our dictionary, so we need at least 4GB of memory if we were to store all columns (we won't) in a dictionary. The dictionary will be replicated across our cluster, so this amount of memory needs to be reserved *per node*. > In the example below the data for our dictionary originates from a ClickHouse table. While this represents the most common source of dictionaries, [a number of sources](/sql-reference/dictionaries#dictionary-sources) are supported including files, http and databases including [Postgres](/sql-reference/dictionaries#postgresql). As we'll show, dictionaries can be automatically refreshed providing an ideal way to ensure small datasets subject to frequent changes are available for direct joins. diff --git a/docs/faq/general/dbms-naming.md b/docs/faq/general/dbms-naming.md index e8081d9ad83..8376ba4cfdf 100644 --- a/docs/faq/general/dbms-naming.md +++ b/docs/faq/general/dbms-naming.md @@ -8,7 +8,7 @@ description: 'Learn about What does "ClickHouse" mean?' # What Does "ClickHouse" Mean? {#what-does-clickhouse-mean} -It’s a combination of "**Click**stream" and "Data ware**House**". It comes from the original use case at Yandex.Metrica, where ClickHouse was supposed to keep records of all clicks by people from all over the Internet, and it still does the job. You can read more about this use case on [ClickHouse history](../../about-us/history.md) page. +It's a combination of "**Click**stream" and "Data ware**House**". It comes from the original use case at Yandex.Metrica, where ClickHouse was supposed to keep records of all clicks by people from all over the Internet, and it still does the job. You can read more about this use case on [ClickHouse history](../../about-us/history.md) page. This two-part meaning has two consequences: diff --git a/docs/faq/general/index.md b/docs/faq/general/index.md index ee031680f59..0ed232eaeba 100644 --- a/docs/faq/general/index.md +++ b/docs/faq/general/index.md @@ -20,7 +20,7 @@ description: 'Index page listing general questions about ClickHouse' - [Why not use something like MapReduce?](../../faq/general/mapreduce.md) - [How do I contribute code to ClickHouse?](/knowledgebase/how-do-i-contribute-code-to-clickhouse) -:::info Don’t see what you're looking for? +:::info Don't see what you're looking for? Check out our [Knowledge Base](/knowledgebase/) and also browse the many helpful articles found here in the documentation. ::: diff --git a/docs/faq/general/mapreduce.md b/docs/faq/general/mapreduce.md index 0fcec115a1e..139dc1c503e 100644 --- a/docs/faq/general/mapreduce.md +++ b/docs/faq/general/mapreduce.md @@ -10,6 +10,6 @@ description: 'This page explains why you would use ClickHouse over MapReduce' We can refer to systems like MapReduce as distributed computing systems in which the reduce operation is based on distributed sorting. 
The most common open-source solution in this class is [Apache Hadoop](http://hadoop.apache.org). -These systems aren’t appropriate for online queries due to their high latency. In other words, they can’t be used as the back-end for a web interface. These types of systems aren’t useful for real-time data updates. Distributed sorting isn’t the best way to perform reduce operations if the result of the operation and all the intermediate results (if there are any) are located in the RAM of a single server, which is usually the case for online queries. In such a case, a hash table is an optimal way to perform reduce operations. A common approach to optimizing map-reduce tasks is pre-aggregation (partial reduce) using a hash table in RAM. The user performs this optimization manually. Distributed sorting is one of the main causes of reduced performance when running simple map-reduce tasks. +These systems aren't appropriate for online queries due to their high latency. In other words, they can't be used as the back-end for a web interface. These types of systems aren't useful for real-time data updates. Distributed sorting isn't the best way to perform reduce operations if the result of the operation and all the intermediate results (if there are any) are located in the RAM of a single server, which is usually the case for online queries. In such a case, a hash table is an optimal way to perform reduce operations. A common approach to optimizing map-reduce tasks is pre-aggregation (partial reduce) using a hash table in RAM. The user performs this optimization manually. Distributed sorting is one of the main causes of reduced performance when running simple map-reduce tasks. Most MapReduce implementations allow you to execute arbitrary code on a cluster. But a declarative query language is better suited to OLAP to run experiments quickly. For example, Hadoop has Hive and Pig. Also consider Cloudera Impala or Shark (outdated) for Spark, as well as Spark SQL, Presto, and Apache Drill. Performance when running such tasks is highly sub-optimal compared to specialized systems, but relatively high latency makes it unrealistic to use these systems as the backend for a web interface. diff --git a/docs/faq/general/ne-tormozit.md b/docs/faq/general/ne-tormozit.md index 0231f9d1b01..020f06b8a05 100644 --- a/docs/faq/general/ne-tormozit.md +++ b/docs/faq/general/ne-tormozit.md @@ -10,7 +10,7 @@ description: 'This page explains what "Не тормозит" means' We often get this question when people see vintage (limited production) ClickHouse t-shirts. They have the words **"ClickHouse не тормозит"** written in big bold text on the front. -Before ClickHouse became open-source, it was developed as an in-house storage system by a large European IT company, [Yandex](https://yandex.com/company/). That’s why it initially got its slogan in Cyrillic, which is "не тормозит" (pronounced as "ne tormozit"). After the open-source release, we first produced some of those t-shirts for local events, and it was a no-brainer to use the slogan as-is. +Before ClickHouse became open-source, it was developed as an in-house storage system by a large European IT company, [Yandex](https://yandex.com/company/). That's why it initially got its slogan in Cyrillic, which is "не тормозит" (pronounced as "ne tormozit"). After the open-source release, we first produced some of those t-shirts for local events, and it was a no-brainer to use the slogan as-is. 
A second batch of these t-shirts was supposed to be given away at international events, and we tried to make an English version of the slogan. Unfortunately, we just couldn't come up with a punchy equivalent in English. The original phrase is elegant in its expression while being succinct, and restrictions on space on the t-shirt meant that we failed to come up with a good enough translation as most options appeared to be either too long or inaccurate. @@ -21,7 +21,7 @@ So, what does it mean? Here are some ways to translate *"не тормозит"* - If you translate it literally, it sounds something like *"ClickHouse does not press the brake pedal"*. - Shorter, but less precise translations might be *"ClickHouse is not slow"*, *"ClickHouse does not lag"* or just *"ClickHouse is fast"*. -If you haven’t seen one of those t-shirts in person, you can check them out online in many ClickHouse-related videos. For example, this one: +If you haven't seen one of those t-shirts in person, you can check them out online in many ClickHouse-related videos. For example, this one:

diff --git a/docs/faq/general/olap.md b/docs/faq/general/olap.md
index 26f97e55cdf..7160156a04c 100644
--- a/docs/faq/general/olap.md
+++ b/docs/faq/general/olap.md
@@ -21,7 +21,7 @@ Online

## OLAP from the Business Perspective {#olap-from-the-business-perspective}

-In recent years, business people started to realize the value of data. Companies who make their decisions blindly, more often than not fail to keep up with the competition. The data-driven approach of successful companies forces them to collect all data that might be remotely useful for making business decisions and need mechanisms to timely analyze them. Here’s where OLAP database management systems (DBMS) come in.
+In recent years, business people have started to realize the value of data. Companies that make their decisions blindly more often than not fail to keep up with the competition. The data-driven approach of successful companies forces them to collect all data that might be remotely useful for making business decisions, and they need mechanisms to analyze it in a timely manner. Here's where OLAP database management systems (DBMS) come in.

In a business sense, OLAP allows companies to continuously plan, analyze, and report operational activities, thus maximizing efficiency, reducing expenses, and ultimately conquering the market share. It could be done either in an in-house system or outsourced to SaaS providers like web/mobile analytics services, CRM services, etc. OLAP is the technology behind many BI applications (Business Intelligence).

@@ -31,11 +31,11 @@ ClickHouse is an OLAP database management system that is pretty often used as a

All database management systems could be classified into two groups: OLAP (Online **Analytical** Processing) and OLTP (Online **Transactional** Processing). Former focuses on building reports, each based on large volumes of historical data, but doing it not so frequently. While the latter usually handle a continuous stream of transactions, constantly modifying the current state of data.

-In practice OLAP and OLTP are not categories, it’s more like a spectrum. Most real systems usually focus on one of them but provide some solutions or workarounds if the opposite kind of workload is also desired. This situation often forces businesses to operate multiple storage systems integrated, which might be not so big deal but having more systems make it more expensive to maintain. So the trend of recent years is HTAP (**Hybrid Transactional/Analytical Processing**) when both kinds of the workload are handled equally well by a single database management system.
+In practice, OLAP and OLTP are not strict categories; it's more like a spectrum. Most real systems focus on one of them but provide some solutions or workarounds if the opposite kind of workload is also desired. This situation often forces businesses to operate multiple storage systems that have to be integrated, which is not a big deal by itself, but having more systems makes them more expensive to maintain. So the trend of recent years is HTAP (**Hybrid Transactional/Analytical Processing**), where both kinds of workload are handled equally well by a single database management system.

Even if a DBMS started as a pure OLAP or pure OLTP, they are forced to move towards that HTAP direction to keep up with their competition. 
And ClickHouse is no exception, initially, it has been designed as [fast-as-possible OLAP system](../../concepts/why-clickhouse-is-so-fast.md) and it still does not have full-fledged transaction support, but some features like consistent read/writes and mutations for updating/deleting data had to be added. The fundamental trade-off between OLAP and OLTP systems remains: -- To build analytical reports efficiently it’s crucial to be able to read columns separately, thus most OLAP databases are [columnar](../../faq/general/columnar-database.md), +- To build analytical reports efficiently it's crucial to be able to read columns separately, thus most OLAP databases are [columnar](../../faq/general/columnar-database.md), - While storing columns separately increases costs of operations on rows, like append or in-place modification, proportionally to the number of columns (which can be huge if the systems try to collect all details of an event just in case). Thus, most OLTP systems store data arranged by rows. diff --git a/docs/faq/general/who-is-using-clickhouse.md b/docs/faq/general/who-is-using-clickhouse.md index 0bb1ba03077..677078b5668 100644 --- a/docs/faq/general/who-is-using-clickhouse.md +++ b/docs/faq/general/who-is-using-clickhouse.md @@ -8,14 +8,14 @@ description: 'Describes who is using ClickHouse' # Who Is Using ClickHouse? {#who-is-using-clickhouse} -Being an open-source product makes this question not so straightforward to answer. You do not have to tell anyone if you want to start using ClickHouse, you just go grab source code or pre-compiled packages. There’s no contract to sign and the [Apache 2.0 license](https://github.com/ClickHouse/ClickHouse/blob/master/LICENSE) allows for unconstrained software distribution. +Being an open-source product makes this question not so straightforward to answer. You do not have to tell anyone if you want to start using ClickHouse, you just go grab source code or pre-compiled packages. There's no contract to sign and the [Apache 2.0 license](https://github.com/ClickHouse/ClickHouse/blob/master/LICENSE) allows for unconstrained software distribution. -Also, the technology stack is often in a grey zone of what’s covered by an NDA. Some companies consider technologies they use as a competitive advantage even if they are open-source and do not allow employees to share any details publicly. Some see some PR risks and allow employees to share implementation details only with their PR department approval. +Also, the technology stack is often in a grey zone of what's covered by an NDA. Some companies consider technologies they use as a competitive advantage even if they are open-source and do not allow employees to share any details publicly. Some see some PR risks and allow employees to share implementation details only with their PR department approval. So how to tell who is using ClickHouse? -One way is to **ask around**. If it’s not in writing, people are much more willing to share what technologies are used in their companies, what the use cases are, what kind of hardware is used, data volumes, etc. We’re talking with users regularly on [ClickHouse Meetups](https://www.youtube.com/channel/UChtmrD-dsdpspr42P_PyRAw/playlists) all over the world and have heard stories about 1000+ companies that use ClickHouse. Unfortunately, that’s not reproducible and we try to treat such stories as if they were told under NDA to avoid any potential troubles. But you can come to any of our future meetups and talk with other users on your own. 
There are multiple ways how meetups are announced, for example, you can subscribe to [our Twitter](http://twitter.com/ClickHouseDB/). +One way is to **ask around**. If it's not in writing, people are much more willing to share what technologies are used in their companies, what the use cases are, what kind of hardware is used, data volumes, etc. We're talking with users regularly on [ClickHouse Meetups](https://www.youtube.com/channel/UChtmrD-dsdpspr42P_PyRAw/playlists) all over the world and have heard stories about 1000+ companies that use ClickHouse. Unfortunately, that's not reproducible and we try to treat such stories as if they were told under NDA to avoid any potential troubles. But you can come to any of our future meetups and talk with other users on your own. There are multiple ways how meetups are announced, for example, you can subscribe to [our Twitter](http://twitter.com/ClickHouseDB/). -The second way is to look for companies **publicly saying** that they use ClickHouse. It’s more substantial because there’s usually some hard evidence like a blog post, talk video recording, slide deck, etc. We collect the collection of links to such evidence on our **[Adopters](../../about-us/adopters.md)** page. Feel free to contribute the story of your employer or just some links you’ve stumbled upon (but try not to violate your NDA in the process). +The second way is to look for companies **publicly saying** that they use ClickHouse. It's more substantial because there's usually some hard evidence like a blog post, talk video recording, slide deck, etc. We collect the collection of links to such evidence on our **[Adopters](../../about-us/adopters.md)** page. Feel free to contribute the story of your employer or just some links you've stumbled upon (but try not to violate your NDA in the process). You can find names of very large companies in the adopters list, like Bloomberg, Cisco, China Telecom, Tencent, or Lyft, but with the first approach, we found that there are many more. For example, if you take [the list of largest IT companies by Forbes (2020)](https://www.forbes.com/sites/hanktucker/2020/05/13/worlds-largest-technology-companies-2020-apple-stays-on-top-zoom-and-uber-debut/) over half of them are using ClickHouse in some way. Also, it would be unfair not to mention [Yandex](../../about-us/history.md), the company which initially open-sourced ClickHouse in 2016 and happens to be one of the largest IT companies in Europe. diff --git a/docs/faq/integration/index.md b/docs/faq/integration/index.md index 27676369415..76939b7bfb3 100644 --- a/docs/faq/integration/index.md +++ b/docs/faq/integration/index.md @@ -17,6 +17,6 @@ description: 'Landing page listing questions related to integrating ClickHouse w - [Can ClickHouse read tables from PostgreSQL](/integrations/data-ingestion/dbms/postgresql/connecting-to-postgresql.md) - [What if I have a problem with encodings when connecting to Oracle via ODBC?](/faq/integration/oracle-odbc.md) -:::info Don’t see what you're looking for? +:::info Don't see what you're looking for? Check out our [Knowledge Base](/knowledgebase/) and also browse the many helpful articles found here in the documentation. ::: diff --git a/docs/faq/operations/delete-old-data.md b/docs/faq/operations/delete-old-data.md index 862efa64df6..4cbd8b65f08 100644 --- a/docs/faq/operations/delete-old-data.md +++ b/docs/faq/operations/delete-old-data.md @@ -44,12 +44,12 @@ More details on [mutations](/sql-reference/statements/alter#mutations). 
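To make this concrete, here is a minimal sketch of the mutation-based delete referenced above and of the partition drop covered in the next section. The `events` table, its columns, the 90-day retention window, and the partition value are illustrative assumptions rather than something this page defines:

```sql
-- Hypothetical table, partitioned by month so that whole partitions can be dropped cheaply
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Mutation-based clean-up: rewrites the affected parts asynchronously
ALTER TABLE events DELETE WHERE event_date < today() - 90;

-- Partition-based clean-up (see DROP PARTITION below): removes a whole month at once
ALTER TABLE events DROP PARTITION 202401;
```

Either statement has to be issued by an external scheduler if the clean-up is needed regularly, which is the caveat noted for both mutations and `DROP PARTITION`.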
## DROP PARTITION {#drop-partition} -`ALTER TABLE ... DROP PARTITION` provides a cost-efficient way to drop a whole partition. It’s not that flexible and needs proper partitioning scheme configured on table creation, but still covers most common cases. Like mutations need to be executed from an external system for regular use. +`ALTER TABLE ... DROP PARTITION` provides a cost-efficient way to drop a whole partition. It's not that flexible and needs proper partitioning scheme configured on table creation, but still covers most common cases. Like mutations need to be executed from an external system for regular use. More details on [manipulating partitions](/sql-reference/statements/alter/partition). ## TRUNCATE {#truncate} -It’s rather radical to drop all data from a table, but in some cases it might be exactly what you need. +It's rather radical to drop all data from a table, but in some cases it might be exactly what you need. More details on [table truncation](/sql-reference/statements/truncate.md). diff --git a/docs/faq/operations/index.md b/docs/faq/operations/index.md index 320975f4478..2253a55fc7a 100644 --- a/docs/faq/operations/index.md +++ b/docs/faq/operations/index.md @@ -17,7 +17,7 @@ description: 'Landing page for questions about operating ClickHouse servers and - [Can you update or delete rows in ClickHouse?](/guides/developer/mutations.md) - [Does ClickHouse support multi-region replication?](/faq/operations/multi-region-replication.md) -:::info Don’t see what you're looking for? +:::info Don't see what you're looking for? Check out our [Knowledge Base](/knowledgebase/) and also browse the many helpful articles found here in the documentation. ::: diff --git a/docs/faq/operations/production.md b/docs/faq/operations/production.md index c89cda5345d..14505193002 100644 --- a/docs/faq/operations/production.md +++ b/docs/faq/operations/production.md @@ -8,34 +8,34 @@ description: 'This page provides guidance on which ClickHouse version to use in # Which ClickHouse Version to Use in Production? {#which-clickhouse-version-to-use-in-production} -First of all, let’s discuss why people ask this question in the first place. There are two key reasons: +First of all, let's discuss why people ask this question in the first place. There are two key reasons: 1. ClickHouse is developed with pretty high velocity, and usually there are 10+ stable releases per year. That makes a wide range of releases to choose from, which is not so trivial of a choice. 2. Some users want to avoid spending time figuring out which version works best for their use case and just follow someone else's advice. -The second reason is more fundamental, so we’ll start with that one and then get back to navigating through various ClickHouse releases. +The second reason is more fundamental, so we'll start with that one and then get back to navigating through various ClickHouse releases. ## Which ClickHouse Version Do You Recommend? {#which-clickhouse-version-do-you-recommend} -It’s tempting to hire consultants or trust some known experts to get rid of responsibility for your production environment. You install some specific ClickHouse version that someone else recommended; if there’s some issue with it - it’s not your fault, it’s someone else's. This line of reasoning is a big trap. No external person knows better than you what’s going on in your company’s production environment. +It's tempting to hire consultants or trust some known experts to get rid of responsibility for your production environment. 
You install some specific ClickHouse version that someone else recommended; if there's some issue with it - it's not your fault, it's someone else's. This line of reasoning is a big trap. No external person knows better than you what's going on in your company's production environment. -So how do you properly choose which ClickHouse version to upgrade to? Or how do you choose your first ClickHouse version? First of all, you need to invest in setting up a **realistic pre-production environment**. In an ideal world, it could be a completely identical shadow copy, but that’s usually expensive. +So how do you properly choose which ClickHouse version to upgrade to? Or how do you choose your first ClickHouse version? First of all, you need to invest in setting up a **realistic pre-production environment**. In an ideal world, it could be a completely identical shadow copy, but that's usually expensive. Here are some key points to get reasonable fidelity in a pre-production environment with not-so-high costs: - Pre-production environment needs to run an as close of a set of queries as you intend to run in production: - - Don’t make it read-only with some frozen data. - - Don’t make it write-only with just copying data without building some typical reports. - - Don’t wipe it clean instead of applying schema migrations. -- Use a sample of real production data and queries. Try to choose a sample that’s still representative and makes `SELECT` queries return reasonable results. Use obfuscation if your data is sensitive and internal policies do not allow it to leave the production environment. + - Don't make it read-only with some frozen data. + - Don't make it write-only with just copying data without building some typical reports. + - Don't wipe it clean instead of applying schema migrations. +- Use a sample of real production data and queries. Try to choose a sample that's still representative and makes `SELECT` queries return reasonable results. Use obfuscation if your data is sensitive and internal policies do not allow it to leave the production environment. - Make sure that pre-production is covered by your monitoring and alerting software the same way as your production environment does. - If your production spans across multiple datacenters or regions, make your pre-production do the same. - If your production uses complex features like replication, distributed tables and cascading materialized views, make sure they are configured similarly in pre-production. -- There’s a trade-off on using the roughly same number of servers or VMs in pre-production as in production but of smaller size, or much less of them but of the same size. The first option might catch extra network-related issues, while the latter is easier to manage. +- There's a trade-off on using the roughly same number of servers or VMs in pre-production as in production but of smaller size, or much less of them but of the same size. The first option might catch extra network-related issues, while the latter is easier to manage. -The second area to invest in is **automated testing infrastructure**. Don’t assume that if some kind of query has executed successfully once, it’ll continue to do so forever. It’s OK to have some unit tests where ClickHouse is mocked, but make sure your product has a reasonable set of automated tests that are run against real ClickHouse and check that all important use cases are still working as expected. +The second area to invest in is **automated testing infrastructure**. 
Don't assume that if some kind of query has executed successfully once, it'll continue to do so forever. It's OK to have some unit tests where ClickHouse is mocked, but make sure your product has a reasonable set of automated tests that are run against real ClickHouse and check that all important use cases are still working as expected.

-An extra step forward could be contributing those automated tests to [ClickHouse’s open-source test infrastructure](https://github.com/ClickHouse/ClickHouse/tree/master/tests) that are continuously used in its day-to-day development. It definitely will take some additional time and effort to learn [how to run it](../../development/tests.md) and then how to adapt your tests to this framework, but it’ll pay off by ensuring that ClickHouse releases are already tested against them when they are announced stable, instead of repeatedly losing time on reporting the issue after the fact and then waiting for a bugfix to be implemented, backported and released. Some companies even have such test contributions to infrastructure by its use as an internal policy, (called [Beyonce's Rule](https://www.oreilly.com/library/view/software-engineering-at/9781492082781/ch01.html#policies_that_scale_well) at Google).
+An extra step forward could be contributing those automated tests to [ClickHouse's open-source test infrastructure](https://github.com/ClickHouse/ClickHouse/tree/master/tests), which is continuously used in its day-to-day development. It will definitely take some additional time and effort to learn [how to run it](../../development/tests.md) and then how to adapt your tests to this framework, but it'll pay off by ensuring that ClickHouse releases are already tested against them when they are announced stable, instead of repeatedly losing time on reporting the issue after the fact and then waiting for a bugfix to be implemented, backported and released. Some companies even have an internal policy that requires contributing such tests for the infrastructure they use (called [Beyonce's Rule](https://www.oreilly.com/library/view/software-engineering-at/9781492082781/ch01.html#policies_that_scale_well) at Google).

When you have your pre-production environment and testing infrastructure in place, choosing the best version is straightforward:

@@ -44,11 +44,11 @@ When you have your pre-production environment and testing infrastructure in plac
3. Report any issues you discovered to [ClickHouse GitHub Issues](https://github.com/ClickHouse/ClickHouse/issues).
4. If there were no major issues, it should be safe to start deploying ClickHouse release to your production environment. Investing in gradual release automation that implements an approach similar to [canary releases](https://martinfowler.com/bliki/CanaryRelease.html) or [green-blue deployments](https://martinfowler.com/bliki/BlueGreenDeployment.html) might further reduce the risk of issues in production.

-As you might have noticed, there’s nothing specific to ClickHouse in the approach described above - people do that for any piece of infrastructure they rely on if they take their production environment seriously.
+As you might have noticed, there's nothing specific to ClickHouse in the approach described above - people do that for any piece of infrastructure they rely on if they take their production environment seriously.

## How to Choose Between ClickHouse Releases? 
{#how-to-choose-between-clickhouse-releases} -If you look into the contents of the ClickHouse package repository, you’ll see two kinds of packages: +If you look into the contents of the ClickHouse package repository, you'll see two kinds of packages: 1. `stable` 2. `lts` (long-term support) @@ -60,8 +60,8 @@ Here is some guidance on how to choose between them: - Your company has some internal policies that do not allow for frequent upgrades or using non-LTS software. - You are using ClickHouse in some secondary products that either do not require any complex ClickHouse features or do not have enough resources to keep it updated. -Many teams who initially think that `lts` is the way to go often switch to `stable` anyway because of some recent feature that’s important for their product. +Many teams who initially think that `lts` is the way to go often switch to `stable` anyway because of some recent feature that's important for their product. :::tip -One more thing to keep in mind when upgrading ClickHouse: we’re always keeping an eye on compatibility across releases, but sometimes it’s not reasonable to keep and some minor details might change. So make sure you check the [changelog](/whats-new/changelog/index.md) before upgrading to see if there are any notes about backward-incompatible changes. +One more thing to keep in mind when upgrading ClickHouse: we're always keeping an eye on compatibility across releases, but sometimes it's not reasonable to keep and some minor details might change. So make sure you check the [changelog](/whats-new/changelog/index.md) before upgrading to see if there are any notes about backward-incompatible changes. ::: diff --git a/docs/faq/use-cases/index.md b/docs/faq/use-cases/index.md index d90bdbfb21a..e86365bdc28 100644 --- a/docs/faq/use-cases/index.md +++ b/docs/faq/use-cases/index.md @@ -11,7 +11,7 @@ description: 'Landing page listing common questions about ClickHouse use cases' - [Can I use ClickHouse as a time-series database?](/knowledgebase/time-series) - [Can I use ClickHouse as a key-value storage?](/knowledgebase/key-value) -:::info Don’t see what you're looking for? +:::info Don't see what you're looking for? Check out our [Knowledge Base](/knowledgebase/) and also browse the many helpful articles found here in the documentation. ::: diff --git a/docs/faq/use-cases/key-value.md b/docs/faq/use-cases/key-value.md index 2bcfb5ceaad..49e5e134e6a 100644 --- a/docs/faq/use-cases/key-value.md +++ b/docs/faq/use-cases/key-value.md @@ -8,12 +8,12 @@ description: 'Answers the frequently asked question of whether or not ClickHouse # Can I Use ClickHouse As a Key-Value Storage? {#can-i-use-clickhouse-as-a-key-value-storage} -The short answer is **"no"**. The key-value workload is among top positions in the list of cases when **NOT** to use ClickHouse. It’s an [OLAP](../../faq/general/olap.md) system after all, while there are many excellent key-value storage systems out there. +The short answer is **"no"**. The key-value workload is among top positions in the list of cases when **NOT** to use ClickHouse. It's an [OLAP](../../faq/general/olap.md) system after all, while there are many excellent key-value storage systems out there. -However, there might be situations where it still makes sense to use ClickHouse for key-value-like queries. 
Usually, it’s some low-budget products where the main workload is analytical in nature and fits ClickHouse well, but there’s also some secondary process that needs a key-value pattern with not so high request throughput and without strict latency requirements. If you had an unlimited budget, you would have installed a secondary key-value database for this secondary workload, but in reality, there’s an additional cost of maintaining one more storage system (monitoring, backups, etc.) which might be desirable to avoid. +However, there might be situations where it still makes sense to use ClickHouse for key-value-like queries. Usually, it's some low-budget products where the main workload is analytical in nature and fits ClickHouse well, but there's also some secondary process that needs a key-value pattern with not so high request throughput and without strict latency requirements. If you had an unlimited budget, you would have installed a secondary key-value database for this secondary workload, but in reality, there's an additional cost of maintaining one more storage system (monitoring, backups, etc.) which might be desirable to avoid. If you decide to go against recommendations and run some key-value-like queries against ClickHouse, here are some tips: -- The key reason why point queries are expensive in ClickHouse is its sparse primary index of main [MergeTree table engine family](../..//engines/table-engines/mergetree-family/mergetree.md). This index can’t point to each specific row of data, instead, it points to each N-th and the system has to scan from the neighboring N-th row to the desired one, reading excessive data along the way. In a key-value scenario, it might be useful to reduce the value of N with the `index_granularity` setting. +- The key reason why point queries are expensive in ClickHouse is its sparse primary index of main [MergeTree table engine family](../..//engines/table-engines/mergetree-family/mergetree.md). This index can't point to each specific row of data, instead, it points to each N-th and the system has to scan from the neighboring N-th row to the desired one, reading excessive data along the way. In a key-value scenario, it might be useful to reduce the value of N with the `index_granularity` setting. - ClickHouse keeps each column in a separate set of files, so to assemble one complete row it needs to go through each of those files. Their count increases linearly with the number of columns, so in the key-value scenario, it might be worth avoiding using many columns and put all your payload in a single `String` column encoded in some serialization format like JSON, Protobuf, or whatever makes sense. -- There’s an alternative approach that uses [Join](../../engines/table-engines/special/join.md) table engine instead of normal `MergeTree` tables and [joinGet](../../sql-reference/functions/other-functions.md#joinget) function to retrieve the data. It can provide better query performance but might have some usability and reliability issues. Here’s an [usage example](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/00800_versatile_storage_join.sql#L49-L51). +- There's an alternative approach that uses [Join](../../engines/table-engines/special/join.md) table engine instead of normal `MergeTree` tables and [joinGet](../../sql-reference/functions/other-functions.md#joinget) function to retrieve the data. It can provide better query performance but might have some usability and reliability issues. 
Here's an [usage example](https://github.com/ClickHouse/ClickHouse/blob/master/tests/queries/0_stateless/00800_versatile_storage_join.sql#L49-L51). diff --git a/docs/faq/use-cases/time-series.md b/docs/faq/use-cases/time-series.md index cc5183beb1d..d51aca80189 100644 --- a/docs/faq/use-cases/time-series.md +++ b/docs/faq/use-cases/time-series.md @@ -10,13 +10,13 @@ description: 'Page describing how to use ClickHouse as a time-series database' _Note: Please see the blog [Working with Time series data in ClickHouse](https://clickhouse.com/blog/working-with-time-series-data-and-functions-ClickHouse) for additional examples of using ClickHouse for time series analysis._ -ClickHouse is a generic data storage solution for [OLAP](../../faq/general/olap.md) workloads, while there are many specialized [time-series database management systems](https://clickhouse.com/engineering-resources/what-is-time-series-database). Nevertheless, ClickHouse’s [focus on query execution speed](../../concepts/why-clickhouse-is-so-fast.md) allows it to outperform specialized systems in many cases. There are many independent benchmarks on this topic out there, so we’re not going to conduct one here. Instead, let’s focus on ClickHouse features that are important to use if that’s your use case. +ClickHouse is a generic data storage solution for [OLAP](../../faq/general/olap.md) workloads, while there are many specialized [time-series database management systems](https://clickhouse.com/engineering-resources/what-is-time-series-database). Nevertheless, ClickHouse's [focus on query execution speed](../../concepts/why-clickhouse-is-so-fast.md) allows it to outperform specialized systems in many cases. There are many independent benchmarks on this topic out there, so we're not going to conduct one here. Instead, let's focus on ClickHouse features that are important to use if that's your use case. First of all, there are **[specialized codecs](../../sql-reference/statements/create/table.md#specialized-codecs)** which make typical time-series. Either common algorithms like `DoubleDelta` and `Gorilla` or specific to ClickHouse like `T64`. Second, time-series queries often hit only recent data, like one day or one week old. It makes sense to use servers that have both fast NVMe/SSD drives and high-capacity HDD drives. ClickHouse [TTL](/engines/table-engines/mergetree-family/mergetree#table_engine-mergetree-ttl) feature allows to configure keeping fresh hot data on fast drives and gradually move it to slower drives as it ages. Rollup or removal of even older data is also possible if your requirements demand it. -Even though it’s against ClickHouse philosophy of storing and processing raw data, you can use [materialized views](../../sql-reference/statements/create/view.md) to fit into even tighter latency or costs requirements. +Even though it's against ClickHouse philosophy of storing and processing raw data, you can use [materialized views](../../sql-reference/statements/create/view.md) to fit into even tighter latency or costs requirements. ## Related Content {#related-content} diff --git a/docs/fast-release-24-2.md b/docs/fast-release-24-2.md index e30b80edab8..9a8d9ac8107 100644 --- a/docs/fast-release-24-2.md +++ b/docs/fast-release-24-2.md @@ -28,7 +28,7 @@ keywords: ['changelog'] * Add generate_series as a table function. This function generates table with an arithmetic progression with natural numbers. [#59390](https://github.com/ClickHouse/ClickHouse/pull/59390) ([divanik](https://github.com/divanik)). 
* Added query `ALTER TABLE table FORGET PARTITION partition` that removes ZooKeeper nodes, related to an empty partition. [#59507](https://github.com/ClickHouse/ClickHouse/pull/59507) ([Sergei Trifonov](https://github.com/serxa)). * Support reading and writing backups as tar archives. [#59535](https://github.com/ClickHouse/ClickHouse/pull/59535) ([josh-hildred](https://github.com/josh-hildred)). -* Provides new aggregate function ‘groupArrayIntersect’. Follows up: [#49862](https://github.com/ClickHouse/ClickHouse/issues/49862). [#59598](https://github.com/ClickHouse/ClickHouse/pull/59598) ([Yarik Briukhovetskyi](https://github.com/yariks5s)). +* Provides new aggregate function 'groupArrayIntersect'. Follows up: [#49862](https://github.com/ClickHouse/ClickHouse/issues/49862). [#59598](https://github.com/ClickHouse/ClickHouse/pull/59598) ([Yarik Briukhovetskyi](https://github.com/yariks5s)). * Implemented system.dns_cache table, which can be useful for debugging DNS issues. [#59856](https://github.com/ClickHouse/ClickHouse/pull/59856) ([Kirill Nikiforov](https://github.com/allmazz)). * Implemented support for S3Express buckets. [#59965](https://github.com/ClickHouse/ClickHouse/pull/59965) ([Nikita Taranov](https://github.com/nickitat)). * The codec `LZ4HC` will accept a new level 2, which is faster than the previous minimum level 3, at the expense of less compression. In previous versions, `LZ4HC(2)` and less was the same as `LZ4HC(3)`. Author: [Cyan4973](https://github.com/Cyan4973). [#60090](https://github.com/ClickHouse/ClickHouse/pull/60090) ([Alexey Milovidov](https://github.com/alexey-milovidov)). diff --git a/docs/guides/best-practices/asyncinserts.md b/docs/guides/best-practices/asyncinserts.md index ffc1dbc8d84..49c33ecb515 100644 --- a/docs/guides/best-practices/asyncinserts.md +++ b/docs/guides/best-practices/asyncinserts.md @@ -5,6 +5,6 @@ title: 'Asynchronous Inserts (async_insert)' description: 'Use asynchronous inserts as an alternative to batching data.' --- -import Content from '@site/docs/cloud/bestpractices/asyncinserts.md'; +import Content from '@site/docs/best-practices/_snippets/_async_inserts.md'; diff --git a/docs/guides/best-practices/avoidmutations.md b/docs/guides/best-practices/avoidmutations.md index 23e867f91ea..f59327ce3f1 100644 --- a/docs/guides/best-practices/avoidmutations.md +++ b/docs/guides/best-practices/avoidmutations.md @@ -5,6 +5,7 @@ title: 'Avoid Mutations' description: 'Mutations refers to ALTER queries that manipulate table data' --- -import Content from '@site/docs/cloud/bestpractices/avoidmutations.md'; +import Content from '@site/docs/best-practices/_snippets/_avoid_mutations.md'; + diff --git a/docs/guides/best-practices/avoidnullablecolumns.md b/docs/guides/best-practices/avoidnullablecolumns.md index 53ec1a53ea4..bcd9f6073a1 100644 --- a/docs/guides/best-practices/avoidnullablecolumns.md +++ b/docs/guides/best-practices/avoidnullablecolumns.md @@ -2,9 +2,9 @@ slug: /optimize/avoid-nullable-columns sidebar_label: 'Avoid Nullable Columns' title: 'Avoid Nullable Columns' -description: 'Nullable columns (e.g. Nullable(String)) create a separate column of UInt8 type.' 
+description: 'Why Nullable Columns should be avoided in ClickHouse' --- -import Content from '@site/docs/cloud/bestpractices/avoidnullablecolumns.md'; +import Content from '@site/docs/best-practices/_snippets/_avoid_nullable_columns.md'; diff --git a/docs/guides/best-practices/avoidoptimizefinal.md b/docs/guides/best-practices/avoidoptimizefinal.md index f198a26a629..20c8daa5e5d 100644 --- a/docs/guides/best-practices/avoidoptimizefinal.md +++ b/docs/guides/best-practices/avoidoptimizefinal.md @@ -5,6 +5,7 @@ title: 'Avoid Optimize Final' description: 'Using the OPTIMIZE TABLE ... FINAL query will initiate an unscheduled merge of data parts.' --- -import Content from '@site/docs/cloud/bestpractices/avoidoptimizefinal.md'; +import Content from '@site/docs/best-practices/_snippets/_avoid_optimize_final.md'; + diff --git a/docs/guides/best-practices/bulkinserts.md b/docs/guides/best-practices/bulkinserts.md index e2a6129c70f..d9b5ec1d45d 100644 --- a/docs/guides/best-practices/bulkinserts.md +++ b/docs/guides/best-practices/bulkinserts.md @@ -5,6 +5,6 @@ title: 'Bulk Inserts' description: 'Sending a smaller amount of inserts that each contain more data will reduce the number of writes required.' --- -import Content from '@site/docs/cloud/bestpractices/bulkinserts.md'; +import Content from '@site/docs/best-practices/_snippets/_bulk_inserts.md'; diff --git a/docs/guides/best-practices/partitioningkey.md b/docs/guides/best-practices/partitioningkey.md index ae28e853a32..86e2a4c9e3f 100644 --- a/docs/guides/best-practices/partitioningkey.md +++ b/docs/guides/best-practices/partitioningkey.md @@ -5,6 +5,6 @@ title: 'Choose a Low Cardinality Partitioning Key' description: 'Use a low cardinality partitioning key or avoid using any partitioning key for your table.' --- -import Content from '@site/docs/cloud/bestpractices/partitioningkey.md'; +import Content from '@site/docs/best-practices/partionning_keys.md'; diff --git a/docs/guides/best-practices/query-optimization.md b/docs/guides/best-practices/query-optimization.md index c4100f85138..4ed659e9c7b 100644 --- a/docs/guides/best-practices/query-optimization.md +++ b/docs/guides/best-practices/query-optimization.md @@ -15,9 +15,9 @@ This section aims to illustrate through common scenarios how to use different pe ## Understand query performance {#understand-query-performance} -The best moment to think about performance optimization is when you’re setting up your [data schema](/data-modeling/schema-design) before ingesting data into ClickHouse for the first time.  +The best moment to think about performance optimization is when you're setting up your [data schema](/data-modeling/schema-design) before ingesting data into ClickHouse for the first time.  -But let’s be honest; it is difficult to predict how much your data will grow or what types of queries will be executed.  +But let's be honest; it is difficult to predict how much your data will grow or what types of queries will be executed.  If you have an existing deployment with a few queries that you want to improve, the first step is understanding how those queries perform and why some execute in a few milliseconds while others take longer. @@ -27,7 +27,7 @@ In this section, we will look at those tools and how to use them.  ## General considerations {#general-considerations} -To understand query performance, let’s look at what happens in ClickHouse when a query is executed.  +To understand query performance, let's look at what happens in ClickHouse when a query is executed.  
The following part is deliberately simplified and takes some shortcuts; the idea here is not to drown you with details but to get you up to speed with the basic concepts. For more information you can read about [query analyzer](/operations/analyzer).  @@ -51,7 +51,7 @@ The results are merged, sorted, and formatted into a final result before being s In reality, many [optimizations](/concepts/why-clickhouse-is-so-fast) are taking place, and we will discuss them a bit more in this guide, but for now, those main concepts give us a good understanding of what is happening behind the scenes when ClickHouse executes a query.  -With this high-level understanding, let’s examine the tooling ClickHouse provides and how we can use it to track the metrics that affect query performance.  +With this high-level understanding, let's examine the tooling ClickHouse provides and how we can use it to track the metrics that affect query performance.  ## Dataset {#dataset} @@ -113,7 +113,7 @@ For each executed query, ClickHouse logs statistics such as query execution time Therefore, the query log is a good place to start when investigating slow queries. You can easily spot the queries that take a long time to execute and display the resource usage information for each one.  -Let’s find the top five long-running queries on our NYC taxi dataset. +Let's find the top five long-running queries on our NYC taxi dataset. ```sql -- Find top 5 long running queries from nyc_taxi database in the last 1 hour @@ -236,7 +236,7 @@ ORDER BY memory_usage DESC LIMIT 30 ``` -Let’s isolate the long-running queries we found and rerun them a few times to understand the response time.  +Let's isolate the long-running queries we found and rerun them a few times to understand the response time.  At this point, it is essential to turn off the filesystem cache by setting the `enable_filesystem_cache` setting to 0 to improve reproducibility. @@ -401,7 +401,7 @@ If you know which user, database, or tables are having issues, you can use the f Once you identify the queries you want to optimize, you can start working on them to optimize. One common mistake developers make at this stage is changing multiple things simultaneously, running ad-hoc experiments, and usually ending up with mixed results, but, more importantly, missing a good understanding of what made the query faster.  -Query optimization requires structure. I’m not talking about advanced benchmarking, but having a simple process in place to understand how your changes affect query performance can go a long way.  +Query optimization requires structure. I'm not talking about advanced benchmarking, but having a simple process in place to understand how your changes affect query performance can go a long way.  Start by identifying your slow queries from query logs, then investigate potential improvements in isolation. When testing the query, make sure you disable the filesystem cache.  @@ -411,7 +411,7 @@ Once you have identified potential optimizations, it is recommended that you imp -_Finally, be cautious of outliers; it’s pretty common that a query might run slowly, either because a user tried an ad-hoc expensive query or the system was under stress for another reason. You can group by the field normalized_query_hash to identify expensive queries that are being executed regularly. 
Those are probably the ones you want to investigate._ +_Finally, be cautious of outliers; it's pretty common that a query might run slowly, either because a user tried an ad-hoc expensive query or the system was under stress for another reason. You can group by the field normalized_query_hash to identify expensive queries that are being executed regularly. Those are probably the ones you want to investigate._ ## Basic optimization {#basic-optimization} @@ -419,11 +419,11 @@ Now that we have our framework to test, we can start optimizing. The best place to start is to look at how the data is stored. As for any database, the less data we read, the faster the query will be executed.  -Depending on how you ingested your data, you might have leveraged ClickHouse [capabilities](/interfaces/schema-inference) to infer the table schema based on the ingested data. While this is very practical to get started, if you want to optimize your query performance, you’ll need to review the data schema to best fit your use case. +Depending on how you ingested your data, you might have leveraged ClickHouse [capabilities](/interfaces/schema-inference) to infer the table schema based on the ingested data. While this is very practical to get started, if you want to optimize your query performance, you'll need to review the data schema to best fit your use case. ### Nullable {#nullable} -As described in the [best practices documentation](/cloud/bestpractices/avoid-nullable-columns), avoid nullable columns wherever possible. It is tempting to use them often, as they make the data ingestion mechanism more flexible, but they negatively affect performance as an additional column has to be processed every time. +As described in the [best practices documentation](/best-practices/select-data-types#avoid-nullable-columns), avoid nullable columns wherever possible. It is tempting to use them often, as they make the data ingestion mechanism more flexible, but they negatively affect performance as an additional column has to be processed every time. Running an SQL query that counts the rows with a NULL value can easily reveal the columns in your tables that actually need a Nullable value. @@ -517,11 +517,11 @@ Query id: 4306a8e1-2a9c-4b06-97b4-4d902d2233eb └───────────────────┴───────────────────┘ ``` -For dates, you should pick a precision that matches your dataset and is best suited to answering the queries you’re planning to run. +For dates, you should pick a precision that matches your dataset and is best suited to answering the queries you're planning to run. ### Apply the optimizations {#apply-the-optimizations} -Let’s create a new table to use the optimized schema and re-ingest the data. +Let's create a new table to use the optimized schema and re-ingest the data. ```sql -- Create table with optimized data @@ -743,7 +743,7 @@ We then rerun our queries. We compile the results from the three experiments to We can see significant improvement across the board in execution time and memory used.  -Query 2 benefits most from the primary key. Let’s have a look at how the query plan generated is different from before. +Query 2 benefits most from the primary key. Let's have a look at how the query plan generated is different from before. 
```sql EXPLAIN indexes = 1 diff --git a/docs/guides/best-practices/sparse-primary-indexes.md b/docs/guides/best-practices/sparse-primary-indexes.md index eda5564f86e..b706ea0dbd2 100644 --- a/docs/guides/best-practices/sparse-primary-indexes.md +++ b/docs/guides/best-practices/sparse-primary-indexes.md @@ -38,7 +38,7 @@ import Image from '@theme/IdealImage'; In this guide we are going to do a deep dive into ClickHouse indexing. We will illustrate and discuss in detail: - [how indexing in ClickHouse is different from traditional relational database management systems](#an-index-design-for-massive-data-scales) -- [how ClickHouse is building and using a table’s sparse primary index](#a-table-with-a-primary-key) +- [how ClickHouse is building and using a table's sparse primary index](#a-table-with-a-primary-key) - [what some of the best practices are for indexing in ClickHouse](#using-multiple-primary-indexes) You can optionally execute all ClickHouse SQL statements and queries given in this guide by yourself on your own machine. @@ -106,7 +106,7 @@ Ok. ``` -ClickHouse client’s result output shows us that the statement above inserted 8.87 million rows into the table. +ClickHouse client's result output shows us that the statement above inserted 8.87 million rows into the table. Lastly, in order to simplify the discussions later on in this guide and to make the diagrams and results reproducible, we [optimize](/sql-reference/statements/optimize.md) the table using the FINAL keyword: @@ -152,9 +152,9 @@ Processed 8.87 million rows, 70.45 MB (398.53 million rows/s., 3.17 GB/s.) ``` -ClickHouse client’s result output indicates that ClickHouse executed a full table scan! Each single row of the 8.87 million rows of our table was streamed into ClickHouse. That doesn’t scale. +ClickHouse client's result output indicates that ClickHouse executed a full table scan! Each single row of the 8.87 million rows of our table was streamed into ClickHouse. That doesn't scale. -To make this (way) more efficient and (much) faster, we need to use a table with a appropriate primary key. This will allow ClickHouse to automatically (based on the primary key’s column(s)) create a sparse primary index which can then be used to significantly speed up the execution of our example query. +To make this (way) more efficient and (much) faster, we need to use a table with a appropriate primary key. This will allow ClickHouse to automatically (based on the primary key's column(s)) create a sparse primary index which can then be used to significantly speed up the execution of our example query. ### Related content {#related-content} - Blog: [Super charging your ClickHouse queries](https://clickhouse.com/blog/clickhouse-faster-queries-with-projections-and-primary-indexes) @@ -166,7 +166,7 @@ To make this (way) more efficient and (much) faster, we need to use a table with In traditional relational database management systems, the primary index would contain one entry per table row. This would result in the primary index containing 8.87 million entries for our data set. Such an index allows the fast location of specific rows, resulting in high efficiency for lookup queries and point updates. Searching an entry in a `B(+)-Tree` data structure has an average time complexity of `O(log n)`; more precisely, `log_b n = log_2 n / log_2 b` where `b` is the branching factor of the `B(+)-Tree` and `n` is the number of indexed rows. 
Because `b` is typically between several hundred and several thousand, `B(+)-Trees` are very shallow structures, and few disk-seeks are required to locate records. With 8.87 million rows and a branching factor of 1000, 2.3 disk seeks are needed on average. This capability comes at a cost: additional disk and memory overheads, higher insertion costs when adding new rows to the table and entries to the index, and sometimes rebalancing of the B-Tree. -Considering the challenges associated with B-Tree indexes, table engines in ClickHouse utilise a different approach. The ClickHouse [MergeTree Engine Family](/engines/table-engines/mergetree-family/index.md) has been designed and optimized to handle massive data volumes. These tables are designed to receive millions of row inserts per second and store very large (100s of Petabytes) volumes of data. Data is quickly written to a table [part by part](/engines/table-engines/mergetree-family/mergetree.md/#mergetree-data-storage), with rules applied for merging the parts in the background. In ClickHouse each part has its own primary index. When parts are merged, then the merged part’s primary indexes are also merged. At the very large scale that ClickHouse is designed for, it is paramount to be very disk and memory efficient. Therefore, instead of indexing every row, the primary index for a part has one index entry (known as a ‘mark’) per group of rows (called ‘granule’) - this technique is called **sparse index**. +Considering the challenges associated with B-Tree indexes, table engines in ClickHouse utilise a different approach. The ClickHouse [MergeTree Engine Family](/engines/table-engines/mergetree-family/index.md) has been designed and optimized to handle massive data volumes. These tables are designed to receive millions of row inserts per second and store very large (100s of Petabytes) volumes of data. Data is quickly written to a table [part by part](/engines/table-engines/mergetree-family/mergetree.md/#mergetree-data-storage), with rules applied for merging the parts in the background. In ClickHouse each part has its own primary index. When parts are merged, then the merged part's primary indexes are also merged. At the very large scale that ClickHouse is designed for, it is paramount to be very disk and memory efficient. Therefore, instead of indexing every row, the primary index for a part has one index entry (known as a 'mark') per group of rows (called 'granule') - this technique is called **sparse index**. Sparse indexing is possible because ClickHouse is storing the rows for a part on disk ordered by the primary key column(s). Instead of directly locating single rows (like a B-Tree based index), the sparse primary index allows it to quickly (via a binary search over index entries) identify groups of rows that could possibly match the query. The located groups of potentially matching rows (granules) are then in parallel streamed into the ClickHouse engine in order to find the matches. This index design allows for the primary index to be small (it can, and must, completely fit into the main memory), whilst still significantly speeding up query execution times: especially for range queries that are typical in data analytics use cases. 
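A concrete sketch of such a table follows. The compound key and the 8192-row granules match the discussion in this guide, while the table name and the column types are assumptions of the sketch:

```sql
-- One sparse index entry ("mark") is stored per granule of 8192 rows,
-- so the whole primary index is small enough to be kept in main memory
CREATE TABLE hits_UserID_URL
(
    UserID    UInt32,
    URL       String,
    EventTime DateTime
)
ENGINE = MergeTree
PRIMARY KEY (UserID, URL)
ORDER BY (UserID, URL, EventTime)
SETTINGS index_granularity = 8192;
```

With 8.87 million rows this works out to roughly 8,870,000 / 8192 ≈ 1083 marks, which matches the primary index size reported for the example table further down.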
@@ -293,12 +293,12 @@ bytes_on_disk: 207.07 MiB The output of the ClickHouse client shows: -- The table’s data is stored in [wide format](/engines/table-engines/mergetree-family/mergetree.md/#mergetree-data-storage) in a specific directory on disk meaning that there will be one data file (and one mark file) per table column inside that directory. +- The table's data is stored in [wide format](/engines/table-engines/mergetree-family/mergetree.md/#mergetree-data-storage) in a specific directory on disk meaning that there will be one data file (and one mark file) per table column inside that directory. - The table has 8.87 million rows. - The uncompressed data size of all rows together is 733.28 MB. - The compressed size on disk of all rows together is 206.94 MB. -- The table has a primary index with 1083 entries (called ‘marks’) and the size of the index is 96.93 KB. -- In total, the table’s data and mark files and primary index file together take 207.07 MB on disk. +- The table has a primary index with 1083 entries (called 'marks') and the size of the index is 96.93 KB. +- In total, the table's data and mark files and primary index file together take 207.07 MB on disk. ### Data is stored on disk ordered by primary key column(s) {#data-is-stored-on-disk-ordered-by-primary-key-columns} @@ -311,7 +311,7 @@ Our table that we created above has - In order to be memory efficient we explicitly specified a primary key that only contains columns that our queries are filtering on. The primary index that is based on the primary key is completely loaded into the main memory. -- In order to have consistency in the guide’s diagrams and in order to maximise compression ratio we defined a separate sorting key that includes all of our table's columns (if in a column similar data is placed close to each other, for example via sorting, then that data will be compressed better). +- In order to have consistency in the guide's diagrams and in order to maximise compression ratio we defined a separate sorting key that includes all of our table's columns (if in a column similar data is placed close to each other, for example via sorting, then that data will be compressed better). - The primary key needs to be a prefix of the sorting key if both are specified. ::: @@ -383,8 +383,8 @@ The primary index is created based on the granules shown in the diagram above. T The diagram below shows that the index stores the primary key column values (the values marked in orange in the diagram above) for each first row for each granule. Or in other words: the primary index stores the primary key column values from each 8192nd row of the table (based on the physical row order defined by the primary key columns). For example -- the first index entry (‘mark 0’ in the diagram below) is storing the key column values of the first row of granule 0 from the diagram above, -- the second index entry (‘mark 1’ in the diagram below) is storing the key column values of the first row of granule 1 from the diagram above, and so on. +- the first index entry ('mark 0' in the diagram below) is storing the key column values of the first row of granule 0 from the diagram above, +- the second index entry ('mark 1' in the diagram below) is storing the key column values of the first row of granule 1 from the diagram above, and so on. @@ -473,7 +473,7 @@ The primary key entries are called index marks because each index entry is marki - UserID index marks: The stored `UserID` values in the primary index are sorted in ascending order.
- ‘mark 1’ in the diagram above thus indicates that the `UserID` values of all table rows in granule 1, and in all following granules, are guaranteed to be greater than or equal to 4.073.710. + 'mark 1' in the diagram above thus indicates that the `UserID` values of all table rows in granule 1, and in all following granules, are guaranteed to be greater than or equal to 4.073.710. [As we will see later](#the-primary-index-is-used-for-selecting-granules), this global order enables ClickHouse to use a binary search algorithm over the index marks for the first key column when a query is filtering on the first column of the primary key. @@ -617,7 +617,7 @@ The following diagram illustrates a part of the primary index file for our table -As discussed above, via a binary search over the index’s 1083 UserID marks, mark 176 was identified. Its corresponding granule 176 can therefore possibly contain rows with a UserID column value of 749.927.693. +As discussed above, via a binary search over the index's 1083 UserID marks, mark 176 was identified. Its corresponding granule 176 can therefore possibly contain rows with a UserID column value of 749.927.693.
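For reference, here is a sketch of the kind of point query being traced, reusing the assumed table definition from the earlier sketch; `EXPLAIN indexes = 1` shows how many marks and granules survive the primary index analysis:

```sql
-- Filter on the first primary key column: the sparse index is binary-searched
-- and only the candidate granules (such as granule 176) are streamed and scanned
SELECT URL, count() AS cnt
FROM hits_UserID_URL
WHERE UserID = 749927693
GROUP BY URL
ORDER BY cnt DESC
LIMIT 10;

-- Inspect which parts, marks, and granules the primary index selected
EXPLAIN indexes = 1
SELECT URL, count()
FROM hits_UserID_URL
WHERE UserID = 749927693
GROUP BY URL;
```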
@@ -635,7 +635,7 @@ To achieve this, ClickHouse needs to know the physical location of granule 176. In ClickHouse the physical locations of all granules for our table are stored in mark files. Similar to data files, there is one mark file per table column. -The following diagram shows the three mark files `UserID.mrk`, `URL.mrk`, and `EventTime.mrk` that store the physical locations of the granules for the table’s `UserID`, `URL`, and `EventTime` columns. +The following diagram shows the three mark files `UserID.mrk`, `URL.mrk`, and `EventTime.mrk` that store the physical locations of the granules for the table's `UserID`, `URL`, and `EventTime` columns. @@ -820,7 +820,7 @@ When the UserID has high cardinality then it is unlikely that the same UserID va As we can see in the diagram above, all shown marks whose URL values are smaller than W3 are getting selected for streaming its associated granule's rows into the ClickHouse engine. -This is because whilst all index marks in the diagram fall into scenario 1 described above, they do not satisfy the mentioned exclusion-precondition that *the directly succeeding index mark has the same UserID value as the current mark* and thus can’t be excluded. +This is because whilst all index marks in the diagram fall into scenario 1 described above, they do not satisfy the mentioned exclusion-precondition that *the directly succeeding index mark has the same UserID value as the current mark* and thus can't be excluded. For example, consider index mark 0 for which the **URL value is smaller than W3 and for which the URL value of the directly succeeding index mark is also smaller than W3**. This can *not* be excluded because the directly succeeding index mark 1 does *not* have the same UserID value as the current mark 0. @@ -852,9 +852,9 @@ ClickHouse now created an additional index that is storing - per group of 4 cons -The first index entry (‘mark 0’ in the diagram above) is storing the minimum and maximum URL values for the [rows belonging to the first 4 granules of our table](#data-is-organized-into-granules-for-parallel-data-processing). +The first index entry ('mark 0' in the diagram above) is storing the minimum and maximum URL values for the [rows belonging to the first 4 granules of our table](#data-is-organized-into-granules-for-parallel-data-processing). -The second index entry (‘mark 1’) is storing the minimum and maximum URL values for the rows belonging to the next 4 granules of our table, and so on. +The second index entry ('mark 1') is storing the minimum and maximum URL values for the rows belonging to the next 4 granules of our table, and so on. (ClickHouse also created a special [mark file](#mark-files-are-used-for-locating-granules) for to the data skipping index for [locating](#mark-files-are-used-for-locating-granules) the groups of granules associated with the index marks.) @@ -1251,7 +1251,7 @@ The primary index of our [table with compound primary key (UserID, URL)](#a-tabl And vice versa: The primary index of our [table with compound primary key (URL, UserID)](/guides/best-practices/sparse-primary-indexes#option-1-secondary-tables) was speeding up a [query filtering on URL](/guides/best-practices/sparse-primary-indexes#secondary-key-columns-can-not-be-inefficient), but didn't provide much support for a [query filtering on UserID](#the-primary-index-is-used-for-selecting-granules). 
-Because of the similarly high cardinality of the primary key columns UserID and URL, a query that filters on the second key column [doesn’t benefit much from the second key column being in the index](#generic-exclusion-search-algorithm). +Because of the similarly high cardinality of the primary key columns UserID and URL, a query that filters on the second key column [doesn't benefit much from the second key column being in the index](#generic-exclusion-search-algorithm). Therefore it makes sense to remove the second key column from the primary index (resulting in less memory consumption of the index) and to [use multiple primary indexes](/guides/best-practices/sparse-primary-indexes#using-multiple-primary-indexes) instead. diff --git a/docs/guides/developer/deduplication.md b/docs/guides/developer/deduplication.md index 25cb94dc702..af77b949b72 100644 --- a/docs/guides/developer/deduplication.md +++ b/docs/guides/developer/deduplication.md @@ -338,7 +338,7 @@ FROM hackernews_views_vcmt A `VersionedCollapsingMergeTree` table is quite handy when you want to implement deduplication while inserting rows from multiple clients and/or threads. -## Why aren’t my rows being deduplicated? {#why-arent-my-rows-being-deduplicated} +## Why aren't my rows being deduplicated? {#why-arent-my-rows-being-deduplicated} One reason inserted rows may not be deduplicated is if you are using a non-idempotent function or expression in your `INSERT` statement. For example, if you are inserting rows with the column `createdAt DateTime64(3) DEFAULT now()`, your rows are guaranteed to be unique because each row will have a unique default value for the `createdAt` column. The MergeTree / ReplicatedMergeTree table engine will not know to deduplicate the rows as each inserted row will generate a unique checksum. diff --git a/docs/guides/inserting-data.md b/docs/guides/inserting-data.md index fd155eb2ff0..6ee77501064 100644 --- a/docs/guides/inserting-data.md +++ b/docs/guides/inserting-data.md @@ -87,7 +87,7 @@ It should be noted however that this approach is a little less performant as wri There are scenarios where client-side batching is not feasible e.g. an observability use case with 100s or 1000s of single-purpose agents sending logs, metrics, traces, etc. In this scenario real-time transport of that data is key to detect issues and anomalies as quickly as possible. Furthermore, there is a risk of event spikes in the observed systems, which could potentially cause large memory spikes and related issues when trying to buffer observability data client-side. -If large batches cannot be inserted, users can delegate batching to ClickHouse using [asynchronous inserts](/cloud/bestpractices/asynchronous-inserts). +If large batches cannot be inserted, users can delegate batching to ClickHouse using [asynchronous inserts](/best-practices/selecting-an-insert-strategy#asynchronous-inserts). With asynchronous inserts, data is inserted into a buffer first and then written to the database storage later in 3 steps, as illustrated by the diagram below: @@ -135,7 +135,7 @@ The [JSONEachRow](/interfaces/formats/JSONEachRow) format can be considered for Unlike many traditional databases, ClickHouse supports an HTTP interface. Users can use this for both inserting and querying data, using any of the above formats. -This is often preferable to ClickHouse’s native protocol as it allows traffic to be easily switched with load balancers. 
+This is often preferable to ClickHouse's native protocol as it allows traffic to be easily switched with load balancers. We expect small differences in insert performance with the native protocol, which incurs a little less overhead. Existing clients use either of these protocols ( in some cases both e.g. the Go client). The native protocol does allow query progress to be easily tracked. diff --git a/docs/guides/sizing-and-hardware-recommendations.md b/docs/guides/sizing-and-hardware-recommendations.md index 3c19228153c..aba7d723a31 100644 --- a/docs/guides/sizing-and-hardware-recommendations.md +++ b/docs/guides/sizing-and-hardware-recommendations.md @@ -10,7 +10,7 @@ description: 'This guide discusses our general recommendations regarding hardwar This guide discusses our general recommendations regarding hardware, compute, memory, and disk configurations for open-source users. If you would like to simplify your setup, we recommend using [ClickHouse Cloud](https://clickhouse.com/cloud) as it automatically scales and adapts to your workloads while minimizing costs pertaining to infrastructure management. -The configuration of your ClickHouse cluster is highly dependent on your application’s use case and workload patterns. When planning your architecture, you must consider the following factors: +The configuration of your ClickHouse cluster is highly dependent on your application's use case and workload patterns. When planning your architecture, you must consider the following factors: - Concurrency (requests per second) - Throughput (rows processed per second) diff --git a/docs/guides/sre/keeper/index.md b/docs/guides/sre/keeper/index.md index 4d20f838a3b..5ce376f8f82 100644 --- a/docs/guides/sre/keeper/index.md +++ b/docs/guides/sre/keeper/index.md @@ -639,7 +639,7 @@ Keeper can expose metrics data for scraping from [Prometheus](https://prometheus Settings: -- `endpoint` – HTTP endpoint for scraping metrics by the Prometheus server. Start from ‘/’. +- `endpoint` – HTTP endpoint for scraping metrics by the Prometheus server. Start from '/'. - `port` – Port for `endpoint`. - `metrics` – Flag that sets to expose metrics from the [system.metrics](/operations/system-tables/metrics) table. - `events` – Flag that sets to expose metrics from the [system.events](/operations/system-tables/events) table. diff --git a/docs/guides/sre/user-management/index.md b/docs/guides/sre/user-management/index.md index ac6bafb51f6..c25d07c17ca 100644 --- a/docs/guides/sre/user-management/index.md +++ b/docs/guides/sre/user-management/index.md @@ -29,7 +29,7 @@ You can configure access entities using: We recommend using SQL-driven workflow. Both of the configuration methods work simultaneously, so if you use the server configuration files for managing accounts and access rights, you can smoothly switch to SQL-driven workflow. :::note -You can’t manage the same access entity by both configuration methods simultaneously. +You can't manage the same access entity by both configuration methods simultaneously. ::: :::note @@ -45,7 +45,7 @@ By default, the ClickHouse server provides the `default` user account which is n If you just started using ClickHouse, consider the following scenario: 1. [Enable](#enabling-access-control) SQL-driven access control and account management for the `default` user. -2. Log in to the `default` user account and create all the required users. Don’t forget to create an administrator account (`GRANT ALL ON *.* TO admin_user_account WITH GRANT OPTION`). +2. 
Log in to the `default` user account and create all the required users. Don't forget to create an administrator account (`GRANT ALL ON *.* TO admin_user_account WITH GRANT OPTION`).
3. [Restrict permissions](/operations/settings/permissions-for-queries) for the `default` user and disable SQL-driven access control and account management for it.

### Properties of Current Solution {#access-control-properties}

diff --git a/docs/guides/troubleshooting.md b/docs/guides/troubleshooting.md
index 10fe07cbd23..b54089cedad 100644
--- a/docs/guides/troubleshooting.md
+++ b/docs/guides/troubleshooting.md
@@ -134,7 +134,7 @@ Revision: 54413

#### See system.d logs {#see-systemd-logs}

-If you do not find any useful information in `clickhouse-server` logs or there aren’t any logs, you can view `system.d` logs using the command:
+If you do not find any useful information in `clickhouse-server` logs or there aren't any logs, you can view `system.d` logs using the command:

```shell
sudo journalctl -u clickhouse-server
diff --git a/docs/integrations/data-ingestion/apache-spark/spark-native-connector.md b/docs/integrations/data-ingestion/apache-spark/spark-native-connector.md
index c8760f04bb6..d0de1834ccb 100644
--- a/docs/integrations/data-ingestion/apache-spark/spark-native-connector.md
+++ b/docs/integrations/data-ingestion/apache-spark/spark-native-connector.md
@@ -53,7 +53,7 @@ catalog feature, it is now possible to add and work with multiple catalogs in a
## Installation & Setup {#installation--setup}

For integrating ClickHouse with Spark, there are multiple installation options to suit different project setups.
-You can add the ClickHouse Spark connector as a dependency directly in your project’s build file (such as in `pom.xml`
+You can add the ClickHouse Spark connector as a dependency directly in your project's build file (such as in `pom.xml`
for Maven or `build.sbt` for SBT).
Alternatively, you can put the required JAR files in your `$SPARK_HOME/jars/` folder, or pass them directly as a Spark
option using the `--jars` flag in the `spark-submit` command.
diff --git a/docs/integrations/data-ingestion/clickpipes/postgres/index.md b/docs/integrations/data-ingestion/clickpipes/postgres/index.md
index 0ccf38424b4..b24621903e0 100644
--- a/docs/integrations/data-ingestion/clickpipes/postgres/index.md
+++ b/docs/integrations/data-ingestion/clickpipes/postgres/index.md
@@ -135,7 +135,7 @@ You can configure the Advanced settings if needed. A brief description of each s

 :::warning

- If you are defining a Ordering Key in ClickHouse differently from the Primary Key in Postgres, please don’t forget to read all the [considerations](https://docs.peerdb.io/mirror/ordering-key-different) around it!
+ If you are defining an Ordering Key in ClickHouse differently from the Primary Key in Postgres, please don't forget to read all the [considerations](https://docs.peerdb.io/mirror/ordering-key-different) around it!

 :::

diff --git a/docs/integrations/data-ingestion/clickpipes/postgres/source/rds.md b/docs/integrations/data-ingestion/clickpipes/postgres/source/rds.md
index 0aa8626bb44..811a9a988fe 100644
--- a/docs/integrations/data-ingestion/clickpipes/postgres/source/rds.md
+++ b/docs/integrations/data-ingestion/clickpipes/postgres/source/rds.md
@@ -110,7 +110,7 @@ To connect to your RDS instance through a private network, you can use AWS Priva

### Workarounds for RDS Proxy {#workarounds-for-rds-proxy}
RDS Proxy does not support logical replication connections.
If you have dynamic IP addresses in RDS and cannot use DNS name or a lambda, here are some alternatives: -1. Using a cron job, resolve the RDS endpoint’s IP periodically and update the NLB if it has changed. +1. Using a cron job, resolve the RDS endpoint's IP periodically and update the NLB if it has changed. 2. Using RDS Event Notifications with EventBridge/SNS: Trigger updates automatically using AWS RDS event notifications 3. Stable EC2: Deploy an EC2 instance to act as a polling service or IP-based proxy 4. Automate IP address management using tools like Terraform or CloudFormation. diff --git a/docs/integrations/data-ingestion/data-formats/binary.md b/docs/integrations/data-ingestion/data-formats/binary.md index 5a293ad0dcc..94ed7032a1a 100644 --- a/docs/integrations/data-ingestion/data-formats/binary.md +++ b/docs/integrations/data-ingestion/data-formats/binary.md @@ -195,11 +195,11 @@ SETTINGS format_schema = 'schema:MessageType' This saves data to the [proto.bin](assets/proto.bin) file. ClickHouse also supports importing Protobuf data as well as nested messages. Consider using [ProtobufSingle](/interfaces/formats.md/#protobufsingle) to work with a single Protocol Buffer message (length delimiters will be omitted in this case). -## Cap’n Proto {#capn-proto} +## Cap'n Proto {#capn-proto} -Another popular binary serialization format supported by ClickHouse is [Cap’n Proto](https://capnproto.org/). Similarly to `Protobuf` format, we have to define a schema file ([`schema.capnp`](assets/schema.capnp)) in our example: +Another popular binary serialization format supported by ClickHouse is [Cap'n Proto](https://capnproto.org/). Similarly to `Protobuf` format, we have to define a schema file ([`schema.capnp`](assets/schema.capnp)) in our example: ```response @0xec8ff1a10aa10dbe; diff --git a/docs/integrations/data-ingestion/data-formats/csv-tsv.md b/docs/integrations/data-ingestion/data-formats/csv-tsv.md index 72882bcda82..bb70d67df5a 100644 --- a/docs/integrations/data-ingestion/data-formats/csv-tsv.md +++ b/docs/integrations/data-ingestion/data-formats/csv-tsv.md @@ -12,7 +12,7 @@ ClickHouse supports importing data from and exporting to CSV. Since CSV files ca ## Importing data from a CSV file {#importing-data-from-a-csv-file} -Before importing data, let’s create a table with a relevant structure: +Before importing data, let's create a table with a relevant structure: ```sql CREATE TABLE sometable @@ -33,7 +33,7 @@ To import data from the [CSV file](assets/data_small.csv) to the `sometable` tab clickhouse-client -q "INSERT INTO sometable FORMAT CSV" < data_small.csv ``` -Note that we use [FORMAT CSV](/interfaces/formats.md/#csv) to let ClickHouse know we’re ingesting CSV formatted data. Alternatively, we can load data from a local file using the [FROM INFILE](/sql-reference/statements/insert-into.md/#inserting-data-from-a-file) clause: +Note that we use [FORMAT CSV](/interfaces/formats.md/#csv) to let ClickHouse know we're ingesting CSV formatted data. 
Alternatively, we can load data from a local file using the [FROM INFILE](/sql-reference/statements/insert-into.md/#inserting-data-from-a-file) clause: ```sql @@ -96,7 +96,7 @@ Sometimes, we might skip a certain number of lines while importing data from a C SET input_format_csv_skip_first_lines = 10 ``` -In this case, we’re going to skip the first ten lines from the CSV file: +In this case, we're going to skip the first ten lines from the CSV file: ```sql SELECT count(*) FROM file('data-small.csv', CSV) @@ -107,7 +107,7 @@ SELECT count(*) FROM file('data-small.csv', CSV) └─────────┘ ``` -The [file](assets/data_small.csv) has 1k rows, but ClickHouse loaded only 990 since we’ve asked to skip the first 10. +The [file](assets/data_small.csv) has 1k rows, but ClickHouse loaded only 990 since we've asked to skip the first 10. :::tip When using the `file()` function, with ClickHouse Cloud you will need to run the commands in `clickhouse client` on the machine where the file resides. Another option is to use [`clickhouse-local`](/operations/utilities/clickhouse-local.md) to explore files locally. @@ -170,7 +170,7 @@ clickhouse-client -q "INSERT INTO sometable FORMAT TabSeparated" < data_small.ts ``` -There’s also a [TabSeparatedWithNames](/interfaces/formats.md/#tabseparatedwithnames) format to allow working with TSV files that have headers. And, like for CSV, we can skip the first X lines using the [input_format_tsv_skip_first_lines](/operations/settings/settings-formats.md/#input_format_tsv_skip_first_lines) option. +There's also a [TabSeparatedWithNames](/interfaces/formats.md/#tabseparatedwithnames) format to allow working with TSV files that have headers. And, like for CSV, we can skip the first X lines using the [input_format_tsv_skip_first_lines](/operations/settings/settings-formats.md/#input_format_tsv_skip_first_lines) option. ### Raw TSV {#raw-tsv} @@ -282,7 +282,7 @@ DESCRIBE file('data-small.csv', CSV) ``` -Here, ClickHouse could guess column types for our CSV file efficiently. If we don’t want ClickHouse to guess, we can disable this with the following option: +Here, ClickHouse could guess column types for our CSV file efficiently. If we don't want ClickHouse to guess, we can disable this with the following option: ```sql diff --git a/docs/integrations/data-ingestion/data-formats/intro.md b/docs/integrations/data-ingestion/data-formats/intro.md index 2cea6db2f95..ac167a9ab12 100644 --- a/docs/integrations/data-ingestion/data-formats/intro.md +++ b/docs/integrations/data-ingestion/data-formats/intro.md @@ -13,7 +13,7 @@ In this section of the docs, you can find examples for loading from various file ### [**Binary**](/integrations/data-ingestion/data-formats/binary.md) {#binary} -Export and load binary formats such as ClickHouse Native, MessagePack, Protocol Buffers and Cap’n Proto. +Export and load binary formats such as ClickHouse Native, MessagePack, Protocol Buffers and Cap'n Proto. 
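As a hedged sketch of what such a binary round trip looks like when run from `clickhouse-client` (the table name `sometable` and the file name are assumptions for illustration only):

```sql
-- Export a table to ClickHouse Native format, then load it back.
-- `sometable` and 'data.native' are illustrative names.
SELECT *
FROM sometable
INTO OUTFILE 'data.native'
FORMAT Native;

INSERT INTO sometable
FROM INFILE 'data.native'
FORMAT Native;
```

Native is usually the most compact and fastest of these options because it matches ClickHouse's internal columnar representation.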
### [**CSV and TSV**](/integrations/data-ingestion/data-formats/csv-tsv.md) {#csv-and-tsv} diff --git a/docs/integrations/data-ingestion/data-formats/json/formats.md b/docs/integrations/data-ingestion/data-formats/json/formats.md index 6034cce4580..15acb085592 100644 --- a/docs/integrations/data-ingestion/data-formats/json/formats.md +++ b/docs/integrations/data-ingestion/data-formats/json/formats.md @@ -30,7 +30,7 @@ One of the most popular forms of JSON data is having a list of JSON objects in a ] ``` -Let’s create a table for this kind of data: +Let's create a table for this kind of data: ```sql CREATE TABLE sometable @@ -132,7 +132,7 @@ SELECT * FROM sometable; ### Specifying parent object key values {#specifying-parent-object-key-values} -Let’s say we also want to save values in parent object keys to the table. In this case, we can use the [following option](/operations/settings/settings-formats.md/#format_json_object_each_row_column_for_object_name) to define the name of the column we want key values to be saved to: +Let's say we also want to save values in parent object keys to the table. In this case, we can use the [following option](/operations/settings/settings-formats.md/#format_json_object_each_row_column_for_object_name) to define the name of the column we want key values to be saved to: ```sql SET format_json_object_each_row_column_for_object_name = 'id' @@ -312,7 +312,7 @@ This way we can flatten nested JSON objects or use some nested values to save th ## Skipping unknown columns {#skipping-unknown-columns} -By default, ClickHouse will ignore unknown columns when importing JSON data. Let’s try to import the original file into the table without the `month` column: +By default, ClickHouse will ignore unknown columns when importing JSON data. Let's try to import the original file into the table without the `month` column: ```sql CREATE TABLE shorttable @@ -356,7 +356,7 @@ ClickHouse will throw exceptions in cases of inconsistent JSON and table columns ClickHouse allows exporting to and importing data from [BSON](https://bsonspec.org/) encoded files. This format is used by some DBMSs, e.g. [MongoDB](https://github.com/mongodb/mongo) database. -To import BSON data, we use the [BSONEachRow](/interfaces/formats.md/#bsoneachrow) format. Let’s import data from [this BSON file](../assets/data.bson): +To import BSON data, we use the [BSONEachRow](/interfaces/formats.md/#bsoneachrow) format. Let's import data from [this BSON file](../assets/data.bson): ```sql @@ -379,4 +379,4 @@ INTO OUTFILE 'out.bson' FORMAT BSONEachRow ``` -After that, we’ll have our data exported to the `out.bson` file. +After that, we'll have our data exported to the `out.bson` file. diff --git a/docs/integrations/data-ingestion/data-formats/json/schema.md b/docs/integrations/data-ingestion/data-formats/json/schema.md index 29dda407708..31c79d3b332 100644 --- a/docs/integrations/data-ingestion/data-formats/json/schema.md +++ b/docs/integrations/data-ingestion/data-formats/json/schema.md @@ -358,7 +358,7 @@ FORMAT PrettyJSONEachRow ``` :::note Differentiating empty and null -If users need to differentiate between a value being empty and not provided, the [Nullable](/sql-reference/data-types/nullable) type can be used. This [should be avoided](/cloud/bestpractices/avoid-nullable-columns) unless absolutely required, as it will negatively impact storage and query performance on these columns. 
+If users need to differentiate between a value being empty and not provided, the [Nullable](/sql-reference/data-types/nullable) type can be used. This [should be avoided](/best-practices/select-data-types#avoid-nullable-columns) unless absolutely required, as it will negatively impact storage and query performance on these columns. ::: ### Handling new columns {#handling-new-columns} @@ -646,7 +646,7 @@ The above uses the `simpleJSONExtractString` to extract the `created` key, explo If an object is used to store arbitrary keys of mostly one type, consider using the `Map` type. Ideally, the number of unique keys should not exceed several hundred. We recommend the `Map` type be used for labels and tags e.g. Kubernetes pod labels in log data. While a simple way to represent nested structures, `Map`s have some notable limitations: - The fields must be of all the same type. -- Accessing sub-columns requires a special map syntax since the fields don’t exist as columns; the entire object is a column. +- Accessing sub-columns requires a special map syntax since the fields don't exist as columns; the entire object is a column. - Accessing a subcolumn loads the entire `Map` value i.e. all siblings and their respective values. For larger maps, this can result in a significant performance penalty. :::note String keys diff --git a/docs/integrations/data-ingestion/emqx/index.md b/docs/integrations/data-ingestion/emqx/index.md index a857ce6cd47..731dbe74151 100644 --- a/docs/integrations/data-ingestion/emqx/index.md +++ b/docs/integrations/data-ingestion/emqx/index.md @@ -125,7 +125,7 @@ Now click the panel to go to the cluster view. On this dashboard, you will see t EMQX Cloud does not allow anonymous connections by default,so you need add a client credential so you can use the MQTT client tool to send data to this broker. -Click ‘Authentication & ACL’ on the left menu and click ‘Authentication’ in the submenu. Click the ‘Add’ button on the right and give a username and password for the MQTT connection later. Here we will use `emqx` and `xxxxxx` for the username and password. +Click 'Authentication & ACL' on the left menu and click 'Authentication' in the submenu. Click the 'Add' button on the right and give a username and password for the MQTT connection later. Here we will use `emqx` and `xxxxxx` for the username and password. @@ -203,8 +203,8 @@ Now click on the "NEXT" button. This step is to tell EMQX Cloud how to insert re ### Add a response action {#add-a-response-action} -If you have only one resource, you don’t need to modify the ‘Resource’ and ‘Action Type’. -You only need to set the SQL template. Here’s the example used for this tutorial: +If you have only one resource, you don't need to modify the 'Resource' and 'Action Type'. +You only need to set the SQL template. Here's the example used for this tutorial: ```bash INSERT INTO temp_hum (client_id, timestamp, topic, temp, hum) VALUES ('${client_id}', ${timestamp}, '${topic}', ${temp}, ${hum}) @@ -282,4 +282,4 @@ SELECT * FROM emqx.temp_hum; ### Summary {#summary} -You didn’t write any piece of code, and now have the MQTT data move from EMQX cloud to ClickHouse Cloud. With EMQX Cloud and ClickHouse Cloud, you don’t need to manage the infra and just focus on writing you IoT applications with data storied securely in ClickHouse Cloud. +You didn't write any piece of code, and now have the MQTT data move from EMQX cloud to ClickHouse Cloud. 
With EMQX Cloud and ClickHouse Cloud, you don't need to manage the infra and can just focus on writing your IoT applications with data stored securely in ClickHouse Cloud.
diff --git a/docs/integrations/data-ingestion/etl-tools/dbt/index.md b/docs/integrations/data-ingestion/etl-tools/dbt/index.md
index 9bbc51057b0..08c94da724a 100644
--- a/docs/integrations/data-ingestion/etl-tools/dbt/index.md
+++ b/docs/integrations/data-ingestion/etl-tools/dbt/index.md
@@ -31,7 +31,7 @@ Dbt is compatible with ClickHouse through a [ClickHouse-supported plugin](https:

## Concepts {#concepts}

-dbt introduces the concept of a model. This is defined as a SQL statement, potentially joining many tables. A model can be "materialized" in a number of ways. A materialization represents a build strategy for the model’s select query. The code behind a materialization is boilerplate SQL that wraps your SELECT query in a statement in order to create a new or update an existing relation.
+dbt introduces the concept of a model. This is defined as a SQL statement, potentially joining many tables. A model can be "materialized" in a number of ways. A materialization represents a build strategy for the model's select query. The code behind a materialization is boilerplate SQL that wraps your SELECT query in a statement in order to create a new or update an existing relation.

dbt provides 4 types of materialization:

@@ -40,7 +40,7 @@ dbt provides 4 types of materialization:
* **ephemeral**: The model is not directly built in the database but is instead pulled into dependent models as common table expressions.
* **incremental**: The model is initially materialized as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table.

-Additional syntax and clauses define how these models should be updated if their underlying data changes. dbt generally recommends starting with the view materialization until performance becomes a concern. The table materialization provides a query time performance improvement by capturing the results of the model’s query as a table at the expense of increased storage. The incremental approach builds on this further to allow subsequent updates to the underlying data to be captured in the target table.
+Additional syntax and clauses define how these models should be updated if their underlying data changes. dbt generally recommends starting with the view materialization until performance becomes a concern. The table materialization provides a query time performance improvement by capturing the results of the model's query as a table at the expense of increased storage. The incremental approach builds on this further to allow subsequent updates to the underlying data to be captured in the target table.

The[ current plugin](https://github.com/silentsokolov/dbt-clickhouse) for ClickHouse supports the **view**, **table,**, **ephemeral** and **incremental** materializations. The plugin also supports dbt[ snapshots](https://docs.getdbt.com/docs/building-a-dbt-project/snapshots#check-strategy) and [seeds](https://docs.getdbt.com/docs/building-a-dbt-project/seeds) which we explore in this guide.

@@ -515,7 +515,7 @@ In the previous example, our model was materialized as a view. While this might

The previous example created a table to materialize the model. This table will be reconstructed for each dbt execution. This may be infeasible and extremely costly for larger result sets or complex transformations.
To address this challenge and reduce the build time, dbt offers Incremental materializations. This allows dbt to insert or update records into a table since the last execution, making it appropriate for event-style data. Under the hood a temporary table is created with all the updated records and then all the untouched records as well as the updated records are inserted into a new target table. This results in similar [limitations](#limitations) for large result sets as for the table model. -To overcome these limitations for large sets, the plugin supports ‘inserts_only‘ mode, where all the updates are inserted into the target table without creating a temporary table (more about it below). +To overcome these limitations for large sets, the plugin supports 'inserts_only' mode, where all the updates are inserted into the target table without creating a temporary table (more about it below). To illustrate this example, we will add the actor "Clicky McClickHouse", who will appear in an incredible 910 movies - ensuring he has appeared in more films than even [Mel Blanc](https://en.wikipedia.org/wiki/Mel_Blanc). @@ -689,7 +689,7 @@ To illustrate this example, we will add the actor "Clicky McClickHouse", who wil ### Internals {#internals} -We can identify the statements executed to achieve the above incremental update by querying ClickHouse’s query log. +We can identify the statements executed to achieve the above incremental update by querying ClickHouse's query log. ```sql SELECT event_time, query FROM system.query_log WHERE type='QueryStart' AND query LIKE '%dbt%' @@ -713,7 +713,7 @@ This strategy may encounter challenges on very large models. For further details ### Append Strategy (inserts-only mode) {#append-strategy-inserts-only-mode} To overcome the limitations of large datasets in incremental models, the plugin uses the dbt configuration parameter `incremental_strategy`. This can be set to the value `append`. When set, updated rows are inserted directly into the target table (a.k.a `imdb_dbt.actor_summary`) and no temporary table is created. -Note: Append only mode requires your data to be immutable or for duplicates to be acceptable. If you want an incremental table model that supports altered rows don’t use this mode! +Note: Append only mode requires your data to be immutable or for duplicates to be acceptable. If you want an incremental table model that supports altered rows don't use this mode! To illustrate this mode, we will add another new actor and re-execute dbt run with `incremental_strategy='append'`. @@ -723,7 +723,7 @@ To illustrate this mode, we will add another new actor and re-execute dbt run wi {{ config(order_by='(updated_at, id, name)', engine='MergeTree()', materialized='incremental', unique_key='id', incremental_strategy='append') }} ``` -2. Let’s add another famous actor - Danny DeBito +2. 
Let's add another famous actor - Danny DeBito ```sql INSERT INTO imdb.actors VALUES (845467, 'Danny', 'DeBito', 'M'); diff --git a/docs/integrations/data-ingestion/google-dataflow/templates/bigquery-to-clickhouse.md b/docs/integrations/data-ingestion/google-dataflow/templates/bigquery-to-clickhouse.md index 7a2af502e28..02c2e9f8f58 100644 --- a/docs/integrations/data-ingestion/google-dataflow/templates/bigquery-to-clickhouse.md +++ b/docs/integrations/data-ingestion/google-dataflow/templates/bigquery-to-clickhouse.md @@ -137,7 +137,7 @@ job: ### Monitor the Job {#monitor-the-job} Navigate to the [Dataflow Jobs tab](https://console.cloud.google.com/dataflow/jobs) in your Google Cloud Console to -monitor the status of the job. You’ll find the job details, including progress and any errors: +monitor the status of the job. You'll find the job details, including progress and any errors: diff --git a/docs/integrations/data-ingestion/kafka/kafka-table-engine.md b/docs/integrations/data-ingestion/kafka/kafka-table-engine.md index d325d9b4bd3..a627de3d09b 100644 --- a/docs/integrations/data-ingestion/kafka/kafka-table-engine.md +++ b/docs/integrations/data-ingestion/kafka/kafka-table-engine.md @@ -148,7 +148,7 @@ The dataset contains 200,000 rows, so it should be ingested in just a few second ##### 5. Create the Kafka table engine {#5-create-the-kafka-table-engine} -The below example creates a table engine with the same schema as the merge tree table. This isn’t strictly required, as you can have an alias or ephemeral columns in the target table. The settings are important; however - note the use of `JSONEachRow` as the data type for consuming JSON from a Kafka topic. The values `github` and `clickhouse` represent the name of the topic and consumer group names, respectively. The topics can actually be a list of values. +The below example creates a table engine with the same schema as the merge tree table. This isn't strictly required, as you can have an alias or ephemeral columns in the target table. The settings are important; however - note the use of `JSONEachRow` as the data type for consuming JSON from a Kafka topic. The values `github` and `clickhouse` represent the name of the topic and consumer group names, respectively. The topics can actually be a list of values. ```sql CREATE TABLE github_queue @@ -231,7 +231,7 @@ ATTACH TABLE github_queue; ##### Adding Kafka Metadata {#adding-kafka-metadata} -It can be useful to keep track of the metadata from the original Kafka messages after it's been ingested into ClickHouse. For example, we may want to know how much of a specific topic or partition we have consumed. For this purpose, the Kafka table engine exposes several [virtual columns](../../../engines/table-engines/index.md#table_engines-virtual_columns). These can be persisted as columns in our target table by modifying our schema and materialized view’s select statement. +It can be useful to keep track of the metadata from the original Kafka messages after it's been ingested into ClickHouse. For example, we may want to know how much of a specific topic or partition we have consumed. For this purpose, the Kafka table engine exposes several [virtual columns](../../../engines/table-engines/index.md#table_engines-virtual_columns). These can be persisted as columns in our target table by modifying our schema and materialized view's select statement. First, we perform the stop operation described above before adding columns to our target table. 
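A hedged sketch of those steps follows, assuming the `github_queue` table from this example; the target table name `github` and the materialized view name `github_mv` are illustrative, so adjust them to your setup.

```sql
-- Stop consumption, extend the target table, then recreate the materialized
-- view so it also selects the Kafka virtual columns _topic and _partition.
-- Target table and view names are assumptions for illustration.
DETACH TABLE github_queue;

ALTER TABLE github
    ADD COLUMN topic String,
    ADD COLUMN partition UInt64;

DROP TABLE IF EXISTS github_mv;
CREATE MATERIALIZED VIEW github_mv TO github AS
SELECT *, _topic AS topic, _partition AS partition
FROM github_queue;

ATTACH TABLE github_queue;
```

Once the Kafka engine table is re-attached, newly consumed rows carry the topic and partition they came from, while rows ingested earlier keep the default values for the new columns.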
@@ -310,7 +310,7 @@ Errors such as authentication issues are not reported in responses to Kafka engi Kafka is often used as a "dumping ground" for data. This leads to topics containing mixed message formats and inconsistent field names. Avoid this and utilize Kafka features such Kafka Streams or ksqlDB to ensure messages are well-formed and consistent before insertion into Kafka. If these options are not possible, ClickHouse has some features that can help. * Treat the message field as strings. Functions can be used in the materialized view statement to perform cleansing and casting if required. This should not represent a production solution but might assist in one-off ingestion. -* If you’re consuming JSON from a topic, using the JSONEachRow format, use the setting [`input_format_skip_unknown_fields`](/operations/settings/formats#input_format_skip_unknown_fields). When writing data, by default, ClickHouse throws an exception if input data contains columns that do not exist in the target table. However, if this option is enabled, these excess columns will be ignored. Again this is not a production-level solution and might confuse others. +* If you're consuming JSON from a topic, using the JSONEachRow format, use the setting [`input_format_skip_unknown_fields`](/operations/settings/formats#input_format_skip_unknown_fields). When writing data, by default, ClickHouse throws an exception if input data contains columns that do not exist in the target table. However, if this option is enabled, these excess columns will be ignored. Again this is not a production-level solution and might confuse others. * Consider the setting `kafka_skip_broken_messages`. This requires the user to specify the level of tolerance per block for malformed messages - considered in the context of kafka_max_block_size. If this tolerance is exceeded (measured in absolute messages) the usual exception behaviour will revert, and other messages will be skipped. ##### Delivery Semantics and challenges with duplicates {#delivery-semantics-and-challenges-with-duplicates} @@ -319,7 +319,7 @@ The Kafka table engine has at-least-once semantics. Duplicates are possible in s ##### Quorum based Inserts {#quorum-based-inserts} -You may need [quorum-based inserts](/operations/settings/settings#insert_quorum) for cases where higher delivery guarantees are required in ClickHouse. This can’t be set on the materialized view or the target table. It can, however, be set for user profiles e.g. +You may need [quorum-based inserts](/operations/settings/settings#insert_quorum) for cases where higher delivery guarantees are required in ClickHouse. This can't be set on the materialized view or the target table. It can, however, be set for user profiles e.g. ```xml @@ -479,7 +479,7 @@ Consider the following when looking to increase Kafka Engine table throughput pe * The number of consumers for a table engine can be increased using kafka_num_consumers. However, by default, inserts will be linearized in a single thread unless kafka_thread_per_consumer is changed from the default value of 1. Set this to 1 to ensure flushes are performed in parallel. Note that creating a Kafka engine table with N consumers (and kafka_thread_per_consumer=1) is logically equivalent to creating N Kafka engines, each with a materialized view and kafka_thread_per_consumer=0. * Increasing consumers is not a free operation. Each consumer maintains its own buffers and threads, increasing the overhead on the server. 
Be conscious of the overhead of consumers and scale linearly across your cluster first and if possible. * If the throughput of Kafka messages is variable and delays are acceptable, consider increasing the stream_flush_interval_ms to ensure larger blocks are flushed. -* [background_message_broker_schedule_pool_size](/operations/server-configuration-parameters/settings#background_message_broker_schedule_pool_size) sets the number of threads performing background tasks. These threads are used for Kafka streaming. This setting is applied at the ClickHouse server start and can’t be changed in a user session, defaulting to 16. If you see timeouts in the logs, it may be appropriate to increase this. +* [background_message_broker_schedule_pool_size](/operations/server-configuration-parameters/settings#background_message_broker_schedule_pool_size) sets the number of threads performing background tasks. These threads are used for Kafka streaming. This setting is applied at the ClickHouse server start and can't be changed in a user session, defaulting to 16. If you see timeouts in the logs, it may be appropriate to increase this. * For communication with Kafka, the librdkafka library is used, which itself creates threads. Large numbers of Kafka tables, or consumers, can thus result in large numbers of context switches. Either distribute this load across the cluster, only replicating the target tables if possible, or consider using a table engine to read from multiple topics - a list of values is supported. Multiple materialized views can be read from a single table, each filtering to the data from a specific topic. Any settings changes should be tested. We recommend monitoring Kafka consumer lags to ensure you are properly scaled. diff --git a/docs/integrations/data-ingestion/s3/index.md b/docs/integrations/data-ingestion/s3/index.md index e20ad5f6510..ae19651496d 100644 --- a/docs/integrations/data-ingestion/s3/index.md +++ b/docs/integrations/data-ingestion/s3/index.md @@ -213,7 +213,7 @@ clickhouse-local --query "SELECT * FROM s3('https://datasets-documentation.s3.eu ### Inserting Data from S3 {#inserting-data-from-s3} To exploit the full capabilities of ClickHouse, we next read and insert the data into our instance. -We combine our `s3` function with a simple `INSERT` statement to achieve this. Note that we aren’t required to list our columns because our target table provides the required structure. This requires the columns to appear in the order specified in the table DDL statement: columns are mapped according to their position in the `SELECT` clause. The insertion of all 10m rows can take a few minutes depending on the ClickHouse instance. Below we insert 1M rows to ensure a prompt response. Adjust the `LIMIT` clause or column selection to import subsets as required: +We combine our `s3` function with a simple `INSERT` statement to achieve this. Note that we aren't required to list our columns because our target table provides the required structure. This requires the columns to appear in the order specified in the table DDL statement: columns are mapped according to their position in the `SELECT` clause. The insertion of all 10m rows can take a few minutes depending on the ClickHouse instance. Below we insert 1M rows to ensure a prompt response. Adjust the `LIMIT` clause or column selection to import subsets as required: ```sql @@ -254,7 +254,7 @@ FROM trips LIMIT 10000; ``` -Note here how the format of the file is inferred from the extension. 
We also don’t need to specify the columns in the `s3` function - this can be inferred from the `SELECT`. +Note here how the format of the file is inferred from the extension. We also don't need to specify the columns in the `s3` function - this can be inferred from the `SELECT`. ### Splitting Large Files {#splitting-large-files} diff --git a/docs/integrations/data-visualization/embeddable-and-clickhouse.md b/docs/integrations/data-visualization/embeddable-and-clickhouse.md index 7a1eb26bcb0..ba8d5c44558 100644 --- a/docs/integrations/data-visualization/embeddable-and-clickhouse.md +++ b/docs/integrations/data-visualization/embeddable-and-clickhouse.md @@ -17,7 +17,7 @@ In [Embeddable](https://embeddable.com/) you define [Data Models](https://docs.e The end result is the ability to deliver fast, interactive customer-facing analytics directly in your product; designed by your product team; built by your engineering team; maintained by your customer-facing and data teams. Exactly the way it should be. -Built-in row-level security means that every user only ever sees exactly the data they’re allowed to see. And two levels of fully-configurable caching mean you can deliver fast, real time analytics at scale. +Built-in row-level security means that every user only ever sees exactly the data they're allowed to see. And two levels of fully-configurable caching mean you can deliver fast, real time analytics at scale. ## 1. Gather your connection details {#1-gather-your-connection-details} diff --git a/docs/integrations/data-visualization/tableau/tableau-analysis-tips.md b/docs/integrations/data-visualization/tableau/tableau-analysis-tips.md index 1e3b76c8929..dc12501081a 100644 --- a/docs/integrations/data-visualization/tableau/tableau-analysis-tips.md +++ b/docs/integrations/data-visualization/tableau/tableau-analysis-tips.md @@ -35,7 +35,7 @@ ClickHouse has a huge number of functions that can be used for data analysis — - **`FORMAT_READABLE_QUANTITY([my_integer])`** *(added in v0.2.1)* — Returns a rounded number with a suffix (thousand, million, billion, etc.) as a string. It is useful for reading big numbers by human. Equivalent of [`formatReadableQuantity()`](/sql-reference/functions/other-functions#formatreadablequantity). - **`FORMAT_READABLE_TIMEDELTA([my_integer_timedelta_sec], [optional_max_unit])`** *(added in v0.2.1)* — Accepts the time delta in seconds. Returns a time delta with (year, month, day, hour, minute, second) as a string. `optional_max_unit` is maximum unit to show. Acceptable values: `seconds`, `minutes`, `hours`, `days`, `months`, `years`. Equivalent of [`formatReadableTimeDelta()`](/sql-reference/functions/other-functions/#formatreadabletimedelta). - **`GET_SETTING([my_setting_name])`** *(added in v0.2.1)* — Returns the current value of a custom setting. Equivalent of [`getSetting()`](/sql-reference/functions/other-functions#getsetting). -- **`HEX([my_string])`** *(added in v0.2.1)* — Returns a string containing the argument’s hexadecimal representation. Equivalent of [`hex()`](/sql-reference/functions/encoding-functions/#hex). +- **`HEX([my_string])`** *(added in v0.2.1)* — Returns a string containing the argument's hexadecimal representation. Equivalent of [`hex()`](/sql-reference/functions/encoding-functions/#hex). - **`KURTOSIS([my_number])`** — Computes the sample kurtosis of a sequence. Equivalent of [`kurtSamp()`](/sql-reference/aggregate-functions/reference/kurtsamp). - **`KURTOSISP([my_number])`** — Computes the kurtosis of a sequence. 
The equivalent of [`kurtPop()`](/sql-reference/aggregate-functions/reference/kurtpop). - **`MEDIAN_EXACT([my_number])`** *(added in v0.1.3)* — Exactly computes the median of a numeric data sequence. Equivalent of [`quantileExact(0.5)(...)`](/sql-reference/aggregate-functions/reference/quantileexact/#quantileexact). diff --git a/docs/integrations/index.mdx b/docs/integrations/index.mdx index d9a4791c6ea..bd4c0489660 100644 --- a/docs/integrations/index.mdx +++ b/docs/integrations/index.mdx @@ -236,7 +236,7 @@ We are actively compiling this list of ClickHouse integrations below, so it's no |QuickSight||Data visualization|Amazon QuickSight powers data-driven organizations with unified business intelligence (BI).|[Documentation](/integrations/quicksight)| |RabbitMQ||Data ingestion|Allows ClickHouse to connect [RabbitMQ](https://www.rabbitmq.com/).|[Documentation](/engines/table-engines/integrations/rabbitmq)| |Redis||Data ingestion|Allows ClickHouse to use [Redis](https://redis.io/) as a dictionary source.|[Documentation](/sql-reference/dictionaries/index.md#redis)| -|Redpanda||Data ingestion|Redpanda is the streaming data platform for developers. It’s API-compatible with Apache Kafka, but 10x faster, much easier to use, and more cost effective|[Blog](https://redpanda.com/blog/real-time-olap-database-clickhouse-redpanda)| +|Redpanda||Data ingestion|Redpanda is the streaming data platform for developers. It's API-compatible with Apache Kafka, but 10x faster, much easier to use, and more cost effective|[Blog](https://redpanda.com/blog/real-time-olap-database-clickhouse-redpanda)| |Rust||Language client|A typed client for ClickHouse|[Documentation](/integrations/language-clients/rust.md)| |SQLite||Data ingestion|Allows to import and export data to SQLite and supports queries to SQLite tables directly from ClickHouse.|[Documentation](/engines/table-engines/integrations/sqlite)| |Superset||Data visualization|Explore and visualize your ClickHouse data with Apache Superset.|[Documentation](/integrations/data-visualization/superset-and-clickhouse.md)| diff --git a/docs/integrations/sql-clients/sql-console.md b/docs/integrations/sql-clients/sql-console.md index 3b281565ebd..fcb93b5cee8 100644 --- a/docs/integrations/sql-clients/sql-console.md +++ b/docs/integrations/sql-clients/sql-console.md @@ -74,7 +74,7 @@ Click on a table in the list to open it in a new tab. In the Table View, data ca ### Inspecting Cell Data {#inspecting-cell-data} -The Cell Inspector tool can be used to view large amounts of data contained within a single cell. To open it, right-click on a cell and select ‘Inspect Cell’. The contents of the cell inspector can be copied by clicking the copy icon in the top right corner of the inspector contents. +The Cell Inspector tool can be used to view large amounts of data contained within a single cell. To open it, right-click on a cell and select 'Inspect Cell'. The contents of the cell inspector can be copied by clicking the copy icon in the top right corner of the inspector contents. @@ -82,38 +82,38 @@ The Cell Inspector tool can be used to view large amounts of data contained with ### Sorting a table {#sorting-a-table} -To sort a table in the SQL console, open a table and select the ‘Sort’ button in the toolbar. This button will open a menu that will allow you to configure your sort. You can choose a column by which to sort and configure the ordering of the sort (ascending or descending). 
Select ‘Apply’ or press Enter to sort your table +To sort a table in the SQL console, open a table and select the 'Sort' button in the toolbar. This button will open a menu that will allow you to configure your sort. You can choose a column by which to sort and configure the ordering of the sort (ascending or descending). Select 'Apply' or press Enter to sort your table -The SQL console also allows you to add multiple sorts to a table. Click the ‘Sort’ button again to add another sort. Note: sorts are applied in the order that they appear in the sort pane (top to bottom). To remove a sort, simply click the ‘x’ button next to the sort. +The SQL console also allows you to add multiple sorts to a table. Click the 'Sort' button again to add another sort. Note: sorts are applied in the order that they appear in the sort pane (top to bottom). To remove a sort, simply click the 'x' button next to the sort. ### Filtering a table {#filtering-a-table} -To filter a table in the SQL console, open a table and select the ‘Filter’ button. Just like sorting, this button will open a menu that will allow you to configure your filter. You can choose a column by which to filter and select the necessary criteria. The SQL console intelligently displays filter options that correspond to the type of data contained in the column. +To filter a table in the SQL console, open a table and select the 'Filter' button. Just like sorting, this button will open a menu that will allow you to configure your filter. You can choose a column by which to filter and select the necessary criteria. The SQL console intelligently displays filter options that correspond to the type of data contained in the column. -When you’re happy with your filter, you can select ‘Apply’ to filter your data. You can also add additional filters as shown below. +When you're happy with your filter, you can select 'Apply' to filter your data. You can also add additional filters as shown below. -Similar to the sort functionality, click the ‘x’ button next to a filter to remove it. +Similar to the sort functionality, click the 'x' button next to a filter to remove it. ### Filtering and sorting together {#filtering-and-sorting-together} -The SQL console allows you to filter and sort a table at the same time. To do this, add all desired filters and sorts using the steps described above and click the ‘Apply’ button. +The SQL console allows you to filter and sort a table at the same time. To do this, add all desired filters and sorts using the steps described above and click the 'Apply' button. ### Creating a query from filters and sorts {#creating-a-query-from-filters-and-sorts} -The SQL console can convert your sorts and filters directly into queries with one click. Simply select the ‘Create Query’ button from the toolbar with the sort and filter parameters of your choosing. After clicking ‘Create query’, a new query tab will open pre-populated with the SQL command corresponding to the data contained in your table view. +The SQL console can convert your sorts and filters directly into queries with one click. Simply select the 'Create Query' button from the toolbar with the sort and filter parameters of your choosing. After clicking 'Create query', a new query tab will open pre-populated with the SQL command corresponding to the data contained in your table view. :::note -Filters and sorts are not mandatory when using the ‘Create Query’ feature. +Filters and sorts are not mandatory when using the 'Create Query' feature. 
::: You can learn more about querying in the SQL console by reading the (link) query documentation. @@ -124,14 +124,14 @@ You can learn more about querying in the SQL console by reading the (link) query There are two ways to create a new query in the SQL console. -- Click the ‘+’ button in the tab bar -- Select the ‘New Query’ button from the left sidebar query list +- Click the '+' button in the tab bar +- Select the 'New Query' button from the left sidebar query list ### Running a Query {#running-a-query} -To run a query, type your SQL command(s) into the SQL Editor and click the ‘Run’ button or use the shortcut `cmd / ctrl + enter`. To write and run multiple commands sequentially, make sure to add a semicolon after each command. +To run a query, type your SQL command(s) into the SQL Editor and click the 'Run' button or use the shortcut `cmd / ctrl + enter`. To write and run multiple commands sequentially, make sure to add a semicolon after each command. Query Execution Options By default, clicking the run button will run all commands contained in the SQL Editor. The SQL console supports two other query execution options: @@ -139,17 +139,17 @@ By default, clicking the run button will run all commands contained in the SQL E - Run selected command(s) - Run command at the cursor -To run selected command(s), highlight the desired command or sequence of commands and click the ‘Run’ button (or use the `cmd / ctrl + enter` shortcut). You can also select ‘Run selected’ from the SQL Editor context menu (opened by right-clicking anywhere within the editor) when a selection is present. +To run selected command(s), highlight the desired command or sequence of commands and click the 'Run' button (or use the `cmd / ctrl + enter` shortcut). You can also select 'Run selected' from the SQL Editor context menu (opened by right-clicking anywhere within the editor) when a selection is present. Running the command at the current cursor position can be achieved in two ways: -- Select ‘At Cursor’ from the extended run options menu (or use the corresponding `cmd / ctrl + shift + enter` keyboard shortcut +- Select 'At Cursor' from the extended run options menu (or use the corresponding `cmd / ctrl + shift + enter` keyboard shortcut - - Selecting ‘Run at cursor’ from the SQL Editor context menu + - Selecting 'Run at cursor' from the SQL Editor context menu @@ -159,13 +159,13 @@ The command present at the cursor position will flash yellow on execution. ### Canceling a Query {#canceling-a-query} -While a query is running, the ‘Run’ button in the Query Editor toolbar will be replaced with a ‘Cancel’ button. Simply click this button or press `Esc` to cancel the query. Note: Any results that have already been returned will persist after cancellation. +While a query is running, the 'Run' button in the Query Editor toolbar will be replaced with a 'Cancel' button. Simply click this button or press `Esc` to cancel the query. Note: Any results that have already been returned will persist after cancellation. ### Saving a Query {#saving-a-query} -If not previously named, your query should be called ‘Untitled Query’. Click on the query name to change it. Renaming a query will cause the query to be saved. +If not previously named, your query should be called 'Untitled Query'. Click on the query name to change it. Renaming a query will cause the query to be saved. @@ -301,11 +301,11 @@ Keep in mind that GenAI is an experimental feature. 
Use caution when running Gen ### Searching query results {#searching-query-results} -After a query is executed, you can quickly search through the returned result set using the search input in the result pane. This feature assists in previewing the results of an additional `WHERE` clause or simply checking to ensure that specific data is included in the result set. After inputting a value into the search input, the result pane will update and return records containing an entry that matches the inputted value. In this example, we’ll look for all instances of `breakfast` in the `hackernews` table for comments that contain `ClickHouse` (case-insensitive): +After a query is executed, you can quickly search through the returned result set using the search input in the result pane. This feature assists in previewing the results of an additional `WHERE` clause or simply checking to ensure that specific data is included in the result set. After inputting a value into the search input, the result pane will update and return records containing an entry that matches the inputted value. In this example, we'll look for all instances of `breakfast` in the `hackernews` table for comments that contain `ClickHouse` (case-insensitive): -Note: Any field matching the inputted value will be returned. For example, the third record in the above screenshot does not match ‘breakfast’ in the `by` field, but the `text` field does: +Note: Any field matching the inputted value will be returned. For example, the third record in the above screenshot does not match 'breakfast' in the `by` field, but the `text` field does: @@ -321,13 +321,13 @@ Selecting a page size will immediately apply pagination to the result set and na ### Exporting query result data {#exporting-query-result-data} -Query result sets can be easily exported to CSV format directly from the SQL console. To do so, open the `•••` menu on the right side of the result pane toolbar and select ‘Download as CSV’. +Query result sets can be easily exported to CSV format directly from the SQL console. To do so, open the `•••` menu on the right side of the result pane toolbar and select 'Download as CSV'. ## Visualizing Query Data {#visualizing-query-data} -Some data can be more easily interpreted in chart form. You can quickly create visualizations from query result data directly from the SQL console in just a few clicks. As an example, we’ll use a query that calculates weekly statistics for NYC taxi trips: +Some data can be more easily interpreted in chart form. You can quickly create visualizations from query result data directly from the SQL console in just a few clicks. As an example, we'll use a query that calculates weekly statistics for NYC taxi trips: ```sql select @@ -345,19 +345,19 @@ order by -Without visualization, these results are difficult to interpret. Let’s turn them into a chart. +Without visualization, these results are difficult to interpret. Let's turn them into a chart. ### Creating charts {#creating-charts} -To begin building your visualization, select the ‘Chart’ option from the query result pane toolbar. A chart configuration pane will appear: +To begin building your visualization, select the 'Chart' option from the query result pane toolbar. A chart configuration pane will appear: -We’ll start by creating a simple bar chart tracking `trip_total` by `week`. To accomplish this, we’ll drag the `week` field to the x-axis and the `trip_total` field to the y-axis: +We'll start by creating a simple bar chart tracking `trip_total` by `week`. 
To accomplish this, we'll drag the `week` field to the x-axis and the `trip_total` field to the y-axis: -Most chart types support multiple fields on numeric axes. To demonstrate, we’ll drag the fare_total field onto the y-axis: +Most chart types support multiple fields on numeric axes. To demonstrate, we'll drag the fare_total field onto the y-axis: @@ -371,7 +371,7 @@ Chart titles match the name of the query supplying the data. Updating the name o -A number of more advanced chart characteristics can also be adjusted in the ‘Advanced’ section of the chart configuration pane. To begin, we’ll adjust the following settings: +A number of more advanced chart characteristics can also be adjusted in the 'Advanced' section of the chart configuration pane. To begin, we'll adjust the following settings: - Subtitle - Axis titles @@ -401,7 +401,7 @@ A dialog will open, allowing you to share the query with all members of a team. -In some scenarios, it may be necessary to adjust the axis scales for each field independently. This can also be accomplished in the ‘Advanced’ section of the chart configuration pane by specifying min and max values for an axis range. As an example, the above chart looks good, but in order to demonstrate the correlation between our `trip_total` and `fare_total` fields, the axis ranges need some adjustment: +In some scenarios, it may be necessary to adjust the axis scales for each field independently. This can also be accomplished in the 'Advanced' section of the chart configuration pane by specifying min and max values for an axis range. As an example, the above chart looks good, but in order to demonstrate the correlation between our `trip_total` and `fare_total` fields, the axis ranges need some adjustment: diff --git a/docs/intro.md b/docs/intro.md index aab2c772667..f7473e88d93 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -54,7 +54,7 @@ As you can see in the stats section in the above diagram, the query processed 10 **Row-oriented DBMS** -In a row-oriented database, even though the query above only processes a few out of the existing columns, the system still needs to load the data from other existing columns from disk to memory. The reason for that is that data is stored on disk in chunks called [blocks](https://en.wikipedia.org/wiki/Block_(data_storage)) (usually fixed sizes, e.g., 4 KB or 8 KB). Blocks are the smallest units of data read from disk to memory. When an application or database requests data, the operating system’s disk I/O subsystem reads the required blocks from the disk. Even if only part of a block is needed, the entire block is read into memory (this is due to disk and file system design): +In a row-oriented database, even though the query above only processes a few out of the existing columns, the system still needs to load the data from other existing columns from disk to memory. The reason for that is that data is stored on disk in chunks called [blocks](https://en.wikipedia.org/wiki/Block_(data_storage)) (usually fixed sizes, e.g., 4 KB or 8 KB). Blocks are the smallest units of data read from disk to memory. When an application or database requests data, the operating system's disk I/O subsystem reads the required blocks from the disk. Even if only part of a block is needed, the entire block is read into memory (this is due to disk and file system design): @@ -83,7 +83,7 @@ ClickHouse provides ways to trade accuracy for performance. 
For example, some of ## Adaptive join algorithms {#adaptive-join-algorithms} -ClickHouse chooses the join algorithm adaptively, it starts with fast hash joins and falls back to merge joins if there’s more than one large table. +ClickHouse chooses the join algorithm adaptively, it starts with fast hash joins and falls back to merge joins if there's more than one large table. ## Superior query performance {#superior-query-performance} @@ -164,7 +164,7 @@ The higher the load on the system, the more important it is to customize the sys - Queries extract a large number of rows, but only a small subset of columns. - For simple queries, latencies around 50ms are allowed. - There is one large table per query; all tables are small, except for one. -- A query result is significantly smaller than the source data. In other words, data is filtered or aggregated, so the result fits in a single server’s RAM. +- A query result is significantly smaller than the source data. In other words, data is filtered or aggregated, so the result fits in a single server's RAM. - Queries are relatively rare (usually hundreds of queries per server or less per second). - Inserts happen in fairly large batches (\> 1000 rows), not by single rows. - Transactions are not necessary. diff --git a/docs/managing-data/core-concepts/merges.md b/docs/managing-data/core-concepts/merges.md index 2b975ffd693..c58de7a4f1e 100644 --- a/docs/managing-data/core-concepts/merges.md +++ b/docs/managing-data/core-concepts/merges.md @@ -142,7 +142,7 @@ The diagram below illustrates how parts in a standard [MergeTree](/engines/table The DDL statement in the diagram above creates a `MergeTree` table with a sorting key `(town, street)`, [meaning](/parts#what-are-table-parts-in-clickhouse) data on disk is sorted by these columns, and a sparse primary index is generated accordingly. -The ① decompressed, pre-sorted table columns are ② merged while preserving the table’s global sorting order defined by the table’s sorting key, ③ a new sparse primary index is generated, and ④ the merged column files and index are compressed and stored as a new data part on disk. +The ① decompressed, pre-sorted table columns are ② merged while preserving the table's global sorting order defined by the table's sorting key, ③ a new sparse primary index is generated, and ④ the merged column files and index are compressed and stored as a new data part on disk. ### Replacing merges {#replacing-merges} diff --git a/docs/managing-data/core-concepts/parts.md b/docs/managing-data/core-concepts/parts.md index dc243a29654..61f7e8ff4a1 100644 --- a/docs/managing-data/core-concepts/parts.md +++ b/docs/managing-data/core-concepts/parts.md @@ -41,21 +41,21 @@ A data part is created whenever a set of rows is inserted into the table. The fo When a ClickHouse server processes the example insert with 4 rows (e.g., via an [INSERT INTO statement](/sql-reference/statements/insert-into)) sketched in the diagram above, it performs several steps: -① **Sorting**: The rows are sorted by the table’s sorting key `(town, street)`, and a [sparse primary index](/guides/best-practices/sparse-primary-indexes) is generated for the sorted rows. +① **Sorting**: The rows are sorted by the table's sorting key `(town, street)`, and a [sparse primary index](/guides/best-practices/sparse-primary-indexes) is generated for the sorted rows. ② **Splitting**: The sorted data is split into columns. 
③ **Compression**: Each column is [compressed](https://clickhouse.com/blog/optimize-clickhouse-codecs-compression-schema). -④ **Writing to Disk**: The compressed columns are saved as binary column files within a new directory representing the insert’s data part. The sparse primary index is also compressed and stored in the same directory. +④ **Writing to Disk**: The compressed columns are saved as binary column files within a new directory representing the insert's data part. The sparse primary index is also compressed and stored in the same directory. -Depending on the table’s specific engine, additional transformations [may](/operations/settings/settings) take place alongside sorting. +Depending on the table's specific engine, additional transformations [may](/operations/settings/settings) take place alongside sorting. Data parts are self-contained, including all metadata needed to interpret their contents without requiring a central catalog. Beyond the sparse primary index, parts contain additional metadata, such as secondary [data skipping indexes](/optimize/skipping-indexes), [column statistics](https://clickhouse.com/blog/clickhouse-release-23-11#column-statistics-for-prewhere), checksums, min-max indexes (if [partitioning](/partitions) is used), and [more](https://github.com/ClickHouse/ClickHouse/blob/a065b11d591f22b5dd50cb6224fab2ca557b4989/src/Storages/MergeTree/MergeTreeData.h#L104). ## Part merges {#part-merges} -To manage the number of parts per table, a [background merge](/merges) job periodically combines smaller parts into larger ones until they reach a [configurable](/operations/settings/merge-tree-settings#max_bytes_to_merge_at_max_space_in_pool) compressed size (typically ~150 GB). Merged parts are marked as inactive and deleted after a [configurable](/operations/settings/merge-tree-settings#old_parts_lifetime) time interval. Over time, this process creates a hierarchical structure of merged parts, which is why it’s called a MergeTree table: +To manage the number of parts per table, a [background merge](/merges) job periodically combines smaller parts into larger ones until they reach a [configurable](/operations/settings/merge-tree-settings#max_bytes_to_merge_at_max_space_in_pool) compressed size (typically ~150 GB). Merged parts are marked as inactive and deleted after a [configurable](/operations/settings/merge-tree-settings#old_parts_lifetime) time interval. Over time, this process creates a hierarchical structure of merged parts, which is why it's called a MergeTree table: diff --git a/docs/managing-data/core-concepts/shards.md b/docs/managing-data/core-concepts/shards.md index 96e41b60b37..431c0d06270 100644 --- a/docs/managing-data/core-concepts/shards.md +++ b/docs/managing-data/core-concepts/shards.md @@ -14,12 +14,12 @@ import Image from '@theme/IdealImage';
:::note -This topic doesn’t apply to ClickHouse Cloud, where [Parallel Replicas](/docs/deployment-guides/parallel-replicas) function like multiple shards in traditional shared-nothing ClickHouse clusters, and object storage [replaces](https://clickhouse.com/blog/clickhouse-cloud-boosts-performance-with-sharedmergetree-and-lightweight-updates#shared-object-storage-for-data-availability) replicas, ensuring high availability and fault tolerance. +This topic doesn't apply to ClickHouse Cloud, where [Parallel Replicas](/docs/deployment-guides/parallel-replicas) function like multiple shards in traditional shared-nothing ClickHouse clusters, and object storage [replaces](https://clickhouse.com/blog/clickhouse-cloud-boosts-performance-with-sharedmergetree-and-lightweight-updates#shared-object-storage-for-data-availability) replicas, ensuring high availability and fault tolerance. ::: ## What are table shards in ClickHouse? {#what-are-table-shards-in-clickhouse} -In traditional [shared-nothing](https://en.wikipedia.org/wiki/Shared-nothing_architecture) ClickHouse clusters, sharding is used when ① the data is too large for a single server or ② a single server is too slow for processing the data. The next figure illustrates case ①, where the [uk_price_paid_simple](/parts) table exceeds a single machine’s capacity: +In traditional [shared-nothing](https://en.wikipedia.org/wiki/Shared-nothing_architecture) ClickHouse clusters, sharding is used when ① the data is too large for a single server or ② a single server is too slow for processing the data. The next figure illustrates case ①, where the [uk_price_paid_simple](/parts) table exceeds a single machine's capacity: @@ -31,7 +31,7 @@ In such a case the data can be split over multiple ClickHouse servers in the for
-Each shard holds a subset of the data and functions as a regular ClickHouse table that can be queried independently. However, queries will only process that subset, which may be a valid use case depending on data distribution. Typically, a [distributed table](/docs/engines/table-engines/special/distributed) (often per server) provides a unified view of the full dataset. It doesn’t store data itself but forwards **SELECT** queries to all shards, assembles the results, and routes **INSERTS** to distribute data evenly. +Each shard holds a subset of the data and functions as a regular ClickHouse table that can be queried independently. However, queries will only process that subset, which may be a valid use case depending on data distribution. Typically, a [distributed table](/docs/engines/table-engines/special/distributed) (often per server) provides a unified view of the full dataset. It doesn't store data itself but forwards **SELECT** queries to all shards, assembles the results, and routes **INSERTS** to distribute data evenly. ## Distributed table creation {#distributed-table-creation} diff --git a/docs/materialized-view/refreshable-materialized-view.md b/docs/materialized-view/refreshable-materialized-view.md index b81ba3d9d80..1d797380ac7 100644 --- a/docs/materialized-view/refreshable-materialized-view.md +++ b/docs/materialized-view/refreshable-materialized-view.md @@ -8,7 +8,7 @@ keywords: ['refreshable materialized view', 'refresh', 'materialized views', 'sp import refreshableMaterializedViewDiagram from '@site/static/images/materialized-view/refreshable-materialized-view-diagram.png'; import Image from '@theme/IdealImage'; -[Refreshable materialized views](/sql-reference/statements/create/view#refreshable-materialized-view) are conceptually similar to materialized views in traditional OLTP databases, storing the result of a specified query for quick retrieval and reducing the need to repeatedly execute resource-intensive queries. Unlike ClickHouse’s [incremental materialized views](/materialized-view/incremental-materialized-view), this requires the periodic execution of the query over the full dataset - the results of which are stored in a target table for querying. This result set should, in theory, be smaller than the original dataset, allowing the subsequent query to execute faster. +[Refreshable materialized views](/sql-reference/statements/create/view#refreshable-materialized-view) are conceptually similar to materialized views in traditional OLTP databases, storing the result of a specified query for quick retrieval and reducing the need to repeatedly execute resource-intensive queries. Unlike ClickHouse's [incremental materialized views](/materialized-view/incremental-materialized-view), this requires the periodic execution of the query over the full dataset - the results of which are stored in a target table for querying. This result set should, in theory, be smaller than the original dataset, allowing the subsequent query to execute faster. The diagram explains how Refreshable Materialized Views work: @@ -86,7 +86,7 @@ Once you've done that, you can use [When was a refreshable materialized view las The `APPEND` functionality allows you to add new rows to the end of the table instead of replacing the whole view. -One use of this feature is to capture snapshots of values at a point in time. 
For example, let’s imagine that we have an `events` table populated by a stream of messages from [Kafka](https://kafka.apache.org/), [Redpanda](https://www.redpanda.com/), or another streaming data platform. +One use of this feature is to capture snapshots of values at a point in time. For example, let's imagine that we have an `events` table populated by a stream of messages from [Kafka](https://kafka.apache.org/), [Redpanda](https://www.redpanda.com/), or another streaming data platform. ```sql SELECT * @@ -134,7 +134,7 @@ LIMIT 10 └──────┴─────────┘ ``` -Let’s say we want to capture the count for each `uuid` every 10 seconds and store it in a new table called `events_snapshot`. The schema of `events_snapshot` would look like this: +Let's say we want to capture the count for each `uuid` every 10 seconds and store it in a new table called `events_snapshot`. The schema of `events_snapshot` would look like this: ```sql CREATE TABLE events_snapshot ( @@ -365,7 +365,7 @@ FROM imdb.movies LIMIT 10000, 910; ``` -Less than 60 seconds later, our target table is updated to reflect the prolific nature of Clicky’s acting: +Less than 60 seconds later, our target table is updated to reflect the prolific nature of Clicky's acting: ```sql SELECT * diff --git a/docs/migrations/bigquery/equivalent-concepts.md b/docs/migrations/bigquery/equivalent-concepts.md index 6fe5f98e5c0..99209efe020 100644 --- a/docs/migrations/bigquery/equivalent-concepts.md +++ b/docs/migrations/bigquery/equivalent-concepts.md @@ -112,7 +112,7 @@ Like BigQuery, ClickHouse uses table partitioning to enhance the performance and With clustering, BigQuery automatically sorts table data based on the values of a few specified columns and colocates them in optimally sized blocks. Clustering improves query performance, allowing BigQuery to better estimate the cost of running the query. With clustered columns, queries also eliminate scans of unnecessary data. -In ClickHouse, data is automatically [clustered on disk](/guides/best-practices/sparse-primary-indexes#optimal-compression-ratio-of-data-files) based on a table’s primary key columns and logically organized in blocks that can be quickly located or pruned by queries utilizing the primary index data structure. +In ClickHouse, data is automatically [clustered on disk](/guides/best-practices/sparse-primary-indexes#optimal-compression-ratio-of-data-files) based on a table's primary key columns and logically organized in blocks that can be quickly located or pruned by queries utilizing the primary index data structure. ## Materialized views {#materialized-views} @@ -150,7 +150,7 @@ Compared to BigQuery, ClickHouse supports significantly more file formats and da ## SQL language features {#sql-language-features} -ClickHouse provides standard SQL with many extensions and improvements that make it more friendly for analytical tasks. E.g. ClickHouse SQL [supports lambda functions](/sql-reference/functions/overview#arrow-operator-and-lambda) and higher order functions, so you don’t have to unnest/explode arrays when applying transformations. This is a big advantage over other systems like BigQuery. +ClickHouse provides standard SQL with many extensions and improvements that make it more friendly for analytical tasks. E.g. ClickHouse SQL [supports lambda functions](/sql-reference/functions/overview#arrow-operator-and-lambda) and higher order functions, so you don't have to unnest/explode arrays when applying transformations. This is a big advantage over other systems like BigQuery. 
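For illustration (a small sketch using standard ClickHouse array functions, not text from the migration guide itself), a lambda passed to a higher-order function such as `arrayMap` or `arrayFilter` operates on the array directly, with no unnest/explode step:

```sql
-- Transform each element in place with a lambda
SELECT arrayMap(x -> x * x, [1, 2, 3]) AS squares;

-- Filter and aggregate without exploding the array into rows
SELECT arraySum(arrayFilter(x -> x % 2 = 0, [1, 2, 3, 4, 5])) AS even_sum;
```

The first query returns `[1,4,9]` and the second returns `6`, both computed on the array values themselves.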
## Arrays {#arrays} diff --git a/docs/migrations/postgres/data-modeling-techniques.md b/docs/migrations/postgres/data-modeling-techniques.md index e4f653f1b36..4263ac571bf 100644 --- a/docs/migrations/postgres/data-modeling-techniques.md +++ b/docs/migrations/postgres/data-modeling-techniques.md @@ -93,7 +93,7 @@ Users should consider partitioning a data management technique. It is ideal when ## Materialized views vs projections {#materialized-views-vs-projections} -Postgres allows for the creation of multiple indices on a single table, enabling optimization for a variety of access patterns. This flexibility allows administrators and developers to tailor database performance to specific queries and operational needs. ClickHouse’s concept of projections, while not fully analogous to this, allows users to specify multiple `ORDER BY` clauses for a table. +Postgres allows for the creation of multiple indices on a single table, enabling optimization for a variety of access patterns. This flexibility allows administrators and developers to tailor database performance to specific queries and operational needs. ClickHouse's concept of projections, while not fully analogous to this, allows users to specify multiple `ORDER BY` clauses for a table. In ClickHouse [data modeling docs](/data-modeling/schema-design), we explore how materialized views can be used in ClickHouse to pre-compute aggregations, transform rows, and optimize queries for different access patterns. diff --git a/docs/migrations/postgres/replacing-merge-tree.md b/docs/migrations/postgres/replacing-merge-tree.md index d2258375810..9b8f3f79ca9 100644 --- a/docs/migrations/postgres/replacing-merge-tree.md +++ b/docs/migrations/postgres/replacing-merge-tree.md @@ -314,7 +314,7 @@ As shown, partitioning has significantly improved query performance in this case ## Merge Behavior Considerations {#merge-behavior-considerations} -ClickHouse’s merge selection mechanism goes beyond simple merging of parts. Below, we examine this behavior in the context of ReplacingMergeTree, including configuration options for enabling more aggressive merging of older data and considerations for larger parts. +ClickHouse's merge selection mechanism goes beyond simple merging of parts. Below, we examine this behavior in the context of ReplacingMergeTree, including configuration options for enabling more aggressive merging of older data and considerations for larger parts. ### Merge Selection Logic {#merge-selection-logic} diff --git a/docs/use-cases/data_lake/glue_catalog.md b/docs/use-cases/data_lake/glue_catalog.md index 38edaa69115..9f01454e198 100644 --- a/docs/use-cases/data_lake/glue_catalog.md +++ b/docs/use-cases/data_lake/glue_catalog.md @@ -66,7 +66,7 @@ SHOW TABLES; ``` You can see above that some tables above are not Iceberg tables, for instance -`iceberg-benchmark.hitsparquet`. You won’t be able to query these as only Iceberg +`iceberg-benchmark.hitsparquet`. You won't be able to query these as only Iceberg is currently supported. To query a table: @@ -76,7 +76,7 @@ SELECT count(*) FROM `iceberg-benchmark.hitsiceberg`; ``` :::note -Backticks are required because ClickHouse doesn’t support more than one namespace. +Backticks are required because ClickHouse doesn't support more than one namespace. 
::: To inspect the table DDL, run the following query: diff --git a/docs/use-cases/observability/integrating-opentelemetry.md b/docs/use-cases/observability/integrating-opentelemetry.md index 7a88cb0fbff..1567fbd62b4 100644 --- a/docs/use-cases/observability/integrating-opentelemetry.md +++ b/docs/use-cases/observability/integrating-opentelemetry.md @@ -605,7 +605,7 @@ We recommend users use the [batch processor](https://github.com/open-telemetry/o Typically, users are forced to send smaller batches when the throughput of a collector is low, and yet they still expect data to reach ClickHouse within a minimum end-to-end latency. In this case, small batches are sent when the `timeout` of the batch processor expires. This can cause problems and is when asynchronous inserts are required. This case typically arises when **collectors in the agent role are configured to send directly to ClickHouse**. Gateways, by acting as aggregators, can alleviate this problem - see [Scaling with Gateways](#scaling-with-gateways). -If large batches cannot be guaranteed, users can delegate batching to ClickHouse using [Asynchronous Inserts](/cloud/bestpractices/asynchronous-inserts). With asynchronous inserts, data is inserted into a buffer first and then written to the database storage later or asynchronously respectively. +If large batches cannot be guaranteed, users can delegate batching to ClickHouse using [Asynchronous Inserts](/best-practices/selecting-an-insert-strategy#asynchronous-inserts). With asynchronous inserts, data is inserted into a buffer first and then written to the database storage later or asynchronously respectively. diff --git a/docs/use-cases/observability/introduction.md b/docs/use-cases/observability/introduction.md index caaf0daa365..125af1f10ca 100644 --- a/docs/use-cases/observability/introduction.md +++ b/docs/use-cases/observability/introduction.md @@ -35,7 +35,7 @@ Due to its performance and cost efficiency, ClickHouse has become the de facto s More specifically, the following means ClickHouse is ideally suited for the storage of observability data: -- **Compression** - Observability data typically contains fields for which the values are taken from a distinct set e.g. HTTP codes or service names. ClickHouse’s column-oriented storage, where values are stored sorted, means this data compresses extremely well - especially when combined with a range of specialized codecs for time-series data. Unlike other data stores, which require as much storage as the original data size of the data, typically in JSON format, ClickHouse compresses logs and traces on average up to 14x. Beyond providing significant storage savings for large Observability installations, this compression assists in accelerating queries as less data needs to be read from disk. +- **Compression** - Observability data typically contains fields for which the values are taken from a distinct set e.g. HTTP codes or service names. ClickHouse's column-oriented storage, where values are stored sorted, means this data compresses extremely well - especially when combined with a range of specialized codecs for time-series data. Unlike other data stores, which require as much storage as the original data size of the data, typically in JSON format, ClickHouse compresses logs and traces on average up to 14x. Beyond providing significant storage savings for large Observability installations, this compression assists in accelerating queries as less data needs to be read from disk. 
- **Fast Aggregations** - Observability solutions typically heavily involve the visualization of data through charts e.g. lines showing error rates or bar charts showing traffic sources. Aggregations, or GROUP BYs, are fundamental to powering these charts which must also be fast and responsive when applying filters in workflows for issue diagnosis. ClickHouse's column-oriented format combined with a vectorized query execution engine is ideal for fast aggregations, with sparse indexing allowing rapid filtering of data in response to users' actions. - **Fast Linear scans** - While alternative technologies rely on inverted indices for fast querying of logs, these invariably result in high disk and resource utilization. While ClickHouse provides inverted indices as an additional optional index type, linear scans are highly parallelized and use all of the available cores on a machine (unless configured otherwise). This potentially allows 10s of GB/s per second (compressed) to be scanned for matches with [highly optimized text-matching operators](/sql-reference/functions/string-search-functions). - **Familiarity of SQL** - SQL is the ubiquitous language with which all engineers are familiar. With over 50 years of development, it has proven itself as the de facto language for data analytics and remains the [3rd most popular programming language](https://clickhouse.com/blog/the-state-of-sql-based-observability#lingua-franca). Observability is just another data problem for which SQL is ideal. diff --git a/docs/use-cases/observability/schema-design.md b/docs/use-cases/observability/schema-design.md index 0078d65a174..5da95e6a6b5 100644 --- a/docs/use-cases/observability/schema-design.md +++ b/docs/use-cases/observability/schema-design.md @@ -659,7 +659,7 @@ select count() from geoip_url; └─────────┘ ``` -Because our `ip_trie` dictionary requires IP address ranges to be expressed in CIDR notation, we’ll need to transform `ip_range_start` and `ip_range_end`. +Because our `ip_trie` dictionary requires IP address ranges to be expressed in CIDR notation, we'll need to transform `ip_range_start` and `ip_range_end`. This CIDR for each range can be succinctly computed with the following query: @@ -1058,7 +1058,7 @@ We can imagine this might be a common line chart users plot with Grafana. This q This query would be 10x faster if we used the `otel_logs_v2` table, which results from our earlier materialized view, which extracts the size key from the `LogAttributes` map. We use the raw data here for illustrative purposes only and would recommend using the earlier view if this is a common query. ::: -We need a table to receive the results if we want to compute this at insert time using a Materialized view. This table should only keep 1 row per hour. If an update is received for an existing hour, the other columns should be merged into the existing hour’s row. For this merge of incremental states to happen, partial states must be stored for the other columns. +We need a table to receive the results if we want to compute this at insert time using a Materialized view. This table should only keep 1 row per hour. If an update is received for an existing hour, the other columns should be merged into the existing hour's row. For this merge of incremental states to happen, partial states must be stored for the other columns. This requires a special engine type in ClickHouse: The SummingMergeTree. This replaces all the rows with the same ordering key with one row which contains summed values for the numeric columns. 
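For illustration, a minimal sketch of such a target table might look like the following (the table and column names are placeholders rather than the schema the guide goes on to define):

```sql
-- Placeholder schema: one row per ordering-key value; numeric columns are summed when parts merge
CREATE TABLE bytes_per_hour_example
(
    Hour       DateTime,
    TotalBytes UInt64
)
ENGINE = SummingMergeTree
ORDER BY Hour;

-- Merges are asynchronous, so reads should still aggregate explicitly
SELECT Hour, sum(TotalBytes) AS total_bytes
FROM bytes_per_hour_example
GROUP BY Hour
ORDER BY Hour;
```

Because merging happens in the background, rows sharing the same `Hour` can coexist until a merge runs, which is why the read query still applies `sum()`.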
The following table will merge any rows with the same date, summing any numerical columns. diff --git a/docs/use-cases/time-series/basic-operations.md b/docs/use-cases/time-series/basic-operations.md index dcb711759ce..823ef0f172b 100644 --- a/docs/use-cases/time-series/basic-operations.md +++ b/docs/use-cases/time-series/basic-operations.md @@ -14,7 +14,7 @@ This section covers the fundamental operations commonly used when working with t Common operations include grouping data by time intervals, handling gaps in time series data, and calculating changes between time periods. These operations can be performed using standard SQL syntax combined with ClickHouse's built-in time functions. -We’re going to explore ClickHouse time-series querying capabilities with the Wikistat (Wikipedia pageviews data) dataset: +We're going to explore ClickHouse time-series querying capabilities with the Wikistat (Wikipedia pageviews data) dataset: ```sql CREATE TABLE wikistat @@ -29,7 +29,7 @@ ENGINE = MergeTree ORDER BY (time); ``` -Let’s populate this table with 1 billion records: +Let's populate this table with 1 billion records: ```sql INSERT INTO wikistat @@ -62,7 +62,7 @@ LIMIT 5; └────────────┴──────────┘ ``` -We’ve used the [`toDate()`](/sql-reference/functions/type-conversion-functions#todate) function here, which converts the specified time to a date type. Alternatively, we can batch by an hour and filter on the specific date: +We've used the [`toDate()`](/sql-reference/functions/type-conversion-functions#todate) function here, which converts the specified time to a date type. Alternatively, we can batch by an hour and filter on the specific date: ```sql @@ -93,7 +93,7 @@ You can also group by year, quarter, month, or day. We can even group by arbitrary intervals, e.g., 5 minutes using the [`toStartOfInterval()`](/docs/sql-reference/functions/date-time-functions#tostartofinterval) function. -Let’s say we want to group by 4-hour intervals. +Let's say we want to group by 4-hour intervals. We can specify the grouping interval using the [`INTERVAL`](/docs/sql-reference/data-types/special-data-types/interval) clause: ```sql @@ -135,7 +135,7 @@ Either way, we get the following results: ## Filling empty groups {#time-series-filling-empty-groups} -In a lot of cases we deal with sparse data with some absent intervals. This results in empty buckets. Let’s take the following example where we group data by 1-hour intervals. This will output the following stats with some hours missing values: +In a lot of cases we deal with sparse data with some absent intervals. This results in empty buckets. Let's take the following example where we group data by 1-hour intervals. This will output the following stats with some hours missing values: ```sql SELECT @@ -215,10 +215,10 @@ ORDER BY hour ASC WITH FILL STEP toIntervalHour(1); ## Rolling time windows {#time-series-rolling-time-windows} -Sometimes, we don’t want to deal with the start of intervals (like the start of the day or an hour) but window intervals. -Let’s say we want to understand the total hits for a window, not based on days but on a 24-hour period offset from 6 pm. +Sometimes, we don't want to deal with the start of intervals (like the start of the day or an hour) but window intervals. +Let's say we want to understand the total hits for a window, not based on days but on a 24-hour period offset from 6 pm. 
-We can use the [`date_diff()`](/docs/sql-reference/functions/date-time-functions#date_diff) function to calculate the difference between a reference time and each record’s time. +We can use the [`date_diff()`](/docs/sql-reference/functions/date-time-functions#date_diff) function to calculate the difference between a reference time and each record's time. In this case, the `day` column will represent the difference in days (e.g., 1 day ago, 2 days ago, etc.): ```sql diff --git a/docs/use-cases/time-series/query-performance.md b/docs/use-cases/time-series/query-performance.md index cd4feb47f0a..3c05bbf6f15 100644 --- a/docs/use-cases/time-series/query-performance.md +++ b/docs/use-cases/time-series/query-performance.md @@ -15,7 +15,7 @@ We'll see how these approaches can reduce query times from seconds to millisecon ## Optimize ORDER BY keys {#time-series-optimize-order-by} Before attempting other optimizations, you should optimize their ordering key to ensure ClickHouse produces the fastest possible results. -Choosing the key right largely depends on the queries you’re going to run. Suppose most of our queries filter by `project` and `subproject` columns. +Choosing the right key largely depends on the queries you're going to run. Suppose most of our queries filter by `project` and `subproject` columns. In this case, its a good idea to add them to the ordering key - as well as the time column since we query on time as well: Let's create another version of the table that has the same column types as `wikistat`, but is ordered by `(project, subproject, time)`. @@ -33,7 +33,7 @@ ENGINE = MergeTree ORDER BY (project, subproject, time); ``` -Let’s now compare multiple queries to get an idea of how essential our ordering key expression is to performance. Note that we have haven’t applied our previous data type and codec optimizations, so any query performance differences are only based on the sort order. +Let's now compare multiple queries to get an idea of how essential our ordering key expression is to performance. Note that we haven't applied our previous data type and codec optimizations, so any query performance differences are only based on the sort order.
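As a sketch of the kind of comparison involved (using the placeholder name `wikistat_ordered` for the re-ordered table and placeholder filter values), the same filtered aggregation can be run against both sort orders:

```sql
-- Table ordered by (time) only: this filter cannot use the primary index
SELECT sum(hits)
FROM wikistat
WHERE project = 'en' AND subproject = 'm';

-- Table ordered by (project, subproject, time): granules can be pruned for the same filter
SELECT sum(hits)
FROM wikistat_ordered
WHERE project = 'en' AND subproject = 'm';
```

Only the second table can skip granules via its primary index for this predicate; the first has to scan the full table, which is the difference the timing comparison is meant to surface.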
@@ -170,7 +170,7 @@ GROUP BY path, month; This destination table will only be populated when new records are inserted into the `wikistat` table, so we need to do some [backfilling](/docs/data-modeling/backfilling). -The easiest way to do this is using an [`INSERT INTO SELECT`](/docs/sql-reference/statements/insert-into#inserting-the-results-of-select) statement to insert directly into the materialized view’s target table [using](https://github.com/ClickHouse/examples/tree/main/ClickHouse_vs_ElasticSearch/DataAnalytics#variant-1---directly-inserting-into-the-target-table-by-using-the-materialized-views-transformation-query) the view's SELECT query (transformation) : +The easiest way to do this is using an [`INSERT INTO SELECT`](/docs/sql-reference/statements/insert-into#inserting-the-results-of-select) statement to insert directly into the materialized view's target table [using](https://github.com/ClickHouse/examples/tree/main/ClickHouse_vs_ElasticSearch/DataAnalytics#variant-1---directly-inserting-into-the-target-table-by-using-the-materialized-views-transformation-query) the view's SELECT query (transformation) : ```sql INSERT INTO wikistat_top @@ -189,7 +189,7 @@ Depending on the cardinality of the raw data set (we have 1 billion rows!), this * Using an INSERT INTO SELECT query, copying all data from the raw data set into that temporary table * Dropping the temporary table and the temporary materialized view. -With that approach, rows from the raw data set are copied block-wise into the temporary table (which doesn’t store any of these rows), and for each block of rows, a partial state is calculated and written to the target table, where these states are incrementally merged in the background. +With that approach, rows from the raw data set are copied block-wise into the temporary table (which doesn't store any of these rows), and for each block of rows, a partial state is calculated and written to the target table, where these states are incrementally merged in the background. ```sql diff --git a/docs/use-cases/time-series/storage-efficiency.md b/docs/use-cases/time-series/storage-efficiency.md index addbc10fa85..5f039915dc2 100644 --- a/docs/use-cases/time-series/storage-efficiency.md +++ b/docs/use-cases/time-series/storage-efficiency.md @@ -14,7 +14,7 @@ This section demonstrates practical techniques to reduce storage requirements wh ## Type optimization {#time-series-type-optimization} The general approach to optimizing storage efficiency is using optimal data types. -Let’s take the `project` and `subproject` columns. These columns are of type String, but have a relatively small amount of unique values: +Let's take the `project` and `subproject` columns. These columns are of type String, but have a relatively small amount of unique values: ```sql SELECT @@ -38,7 +38,7 @@ MODIFY COLUMN `project` LowCardinality(String), MODIFY COLUMN `subproject` LowCardinality(String) ``` -We’ve also used UInt64 type for the hits column, which takes 8 bytes, but has a relatively small max value: +We've also used UInt64 type for the hits column, which takes 8 bytes, but has a relatively small max value: ```sql SELECT max(hits) @@ -70,7 +70,7 @@ ALTER TABLE wikistat MODIFY COLUMN `time` CODEC(Delta, ZSTD); ``` -We’ve used the Delta codec for time column, which is a good fit for time series data. +We've used the Delta codec for time column, which is a good fit for time series data. The right ordering key can also save disk space. 
Since we usually want to filter by a path, we will add `path` to the sorting key. diff --git a/docs/whats-new/changelog/2017.md b/docs/whats-new/changelog/2017.md index 02b6978f55f..65fbd254699 100644 --- a/docs/whats-new/changelog/2017.md +++ b/docs/whats-new/changelog/2017.md @@ -38,7 +38,7 @@ This release contains bug fixes for the previous release 1.1.54310: - Max size of the IP trie dictionary is increased to 128M entries. - Added the getSizeOfEnumType function. - Added the sumWithOverflow aggregate function. -- Added support for the Cap’n Proto input format. +- Added support for the Cap'n Proto input format. - You can now customize compression level when using the zstd algorithm. #### Backward Incompatible Changes: {#backward-incompatible-changes} @@ -115,13 +115,13 @@ This release contains bug fixes for the previous release 1.1.54310: - Support for `DROP TABLE` for temporary tables. - Support for reading `DateTime` values in Unix timestamp format from the `CSV` and `JSONEachRow` formats. - Lagging replicas in distributed queries are now excluded by default (the default threshold is 5 minutes). -- FIFO locking is used during ALTER: an ALTER query isn’t blocked indefinitely for continuously running queries. +- FIFO locking is used during ALTER: an ALTER query isn't blocked indefinitely for continuously running queries. - Option to set `umask` in the config file. - Improved performance for queries with `DISTINCT` . #### Bug Fixes: {#bug-fixes-3} -- Improved the process for deleting old nodes in ZooKeeper. Previously, old nodes sometimes didn’t get deleted if there were very frequent inserts, which caused the server to be slow to shut down, among other things. +- Improved the process for deleting old nodes in ZooKeeper. Previously, old nodes sometimes didn't get deleted if there were very frequent inserts, which caused the server to be slow to shut down, among other things. - Fixed randomization when choosing hosts for the connection to ZooKeeper. - Fixed the exclusion of lagging replicas in distributed queries if the replica is localhost. - Fixed an error where a data part in a `ReplicatedMergeTree` table could be broken after running `ALTER MODIFY` on an element in a `Nested` structure. @@ -152,7 +152,7 @@ This release contains bug fixes for the previous release 1.1.54310: This release contains bug fixes for the previous release 1.1.54276: - Fixed `DB::Exception: Assertion violation: !_path.empty()` when inserting into a Distributed table. -- Fixed parsing when inserting in RowBinary format if input data starts with’;’. +- Fixed parsing when inserting in RowBinary format if input data starts with';'. - Errors during runtime compilation of certain aggregate functions (e.g. `groupArray()`). ### ClickHouse Release 1.1.54276, 2017-08-16 {#clickhouse-release-1-1-54276-2017-08-16} diff --git a/docs/whats-new/changelog/2018.md b/docs/whats-new/changelog/2018.md index f92512499ec..8f730d6baf7 100644 --- a/docs/whats-new/changelog/2018.md +++ b/docs/whats-new/changelog/2018.md @@ -56,7 +56,7 @@ description: 'Changelog for 2018' - Fixed bugs in some cases of `VIEW` and subqueries that omit the database. [Winter Zhang](https://github.com/ClickHouse/ClickHouse/pull/3521) - Fixed a race condition when simultaneously reading from a `MATERIALIZED VIEW` and deleting a `MATERIALIZED VIEW` due to not locking the internal `MATERIALIZED VIEW`. 
[#3404](https://github.com/ClickHouse/ClickHouse/pull/3404) [#3694](https://github.com/ClickHouse/ClickHouse/pull/3694) - Fixed the error `Lock handler cannot be nullptr.` [#3689](https://github.com/ClickHouse/ClickHouse/pull/3689) -- Fixed query processing when the `compile_expressions` option is enabled (it’s enabled by default). Nondeterministic constant expressions like the `now` function are no longer unfolded. [#3457](https://github.com/ClickHouse/ClickHouse/pull/3457) +- Fixed query processing when the `compile_expressions` option is enabled (it's enabled by default). Nondeterministic constant expressions like the `now` function are no longer unfolded. [#3457](https://github.com/ClickHouse/ClickHouse/pull/3457) - Fixed a crash when specifying a non-constant scale argument in `toDecimal32/64/128` functions. - Fixed an error when trying to insert an array with `NULL` elements in the `Values` format into a column of type `Array` without `Nullable` (if `input_format_values_interpret_expressions` = 1). [#3487](https://github.com/ClickHouse/ClickHouse/pull/3487) [#3503](https://github.com/ClickHouse/ClickHouse/pull/3503) - Fixed continuous error logging in `DDLWorker` if ZooKeeper is not available. [8f50c620](https://github.com/ClickHouse/ClickHouse/commit/8f50c620334988b28018213ec0092fe6423847e2) @@ -101,7 +101,7 @@ description: 'Changelog for 2018' - The `system.metrics` table now has the `VersionInteger` metric, and `system.build_options` has the added line `VERSION_INTEGER`, which contains the numeric form of the ClickHouse version, such as `18016000`. [#3644](https://github.com/ClickHouse/ClickHouse/pull/3644) - Removed the ability to compare the `Date` type with a number to avoid potential errors like `date = 2018-12-17`, where quotes around the date are omitted by mistake. [#3687](https://github.com/ClickHouse/ClickHouse/pull/3687) - Fixed the behavior of stateful functions like `rowNumberInAllBlocks`. They previously output a result that was one number larger due to starting during query analysis. [Amos Bird](https://github.com/ClickHouse/ClickHouse/pull/3729) -- If the `force_restore_data` file can’t be deleted, an error message is displayed. [Amos Bird](https://github.com/ClickHouse/ClickHouse/pull/3794) +- If the `force_restore_data` file can't be deleted, an error message is displayed. [Amos Bird](https://github.com/ClickHouse/ClickHouse/pull/3794) #### Build Improvements: {#build-improvements-1} @@ -267,13 +267,13 @@ description: 'Changelog for 2018' - Fixed an issue with `Dictionary` tables for `range_hashed` dictionaries. This error occurred in version 18.12.17. [#1702](https://github.com/ClickHouse/ClickHouse/pull/1702) - Fixed an error when loading `range_hashed` dictionaries (the message `Unsupported type Nullable (...)`). This error occurred in version 18.12.17. [#3362](https://github.com/ClickHouse/ClickHouse/pull/3362) - Fixed errors in the `pointInPolygon` function due to the accumulation of inaccurate calculations for polygons with a large number of vertices located close to each other. [#3331](https://github.com/ClickHouse/ClickHouse/pull/3331) [#3341](https://github.com/ClickHouse/ClickHouse/pull/3341) -- If after merging data parts, the checksum for the resulting part differs from the result of the same merge in another replica, the result of the merge is deleted and the data part is downloaded from the other replica (this is the correct behavior). 
But after downloading the data part, it couldn’t be added to the working set because of an error that the part already exists (because the data part was deleted with some delay after the merge). This led to cyclical attempts to download the same data. [#3194](https://github.com/ClickHouse/ClickHouse/pull/3194) +- If after merging data parts, the checksum for the resulting part differs from the result of the same merge in another replica, the result of the merge is deleted and the data part is downloaded from the other replica (this is the correct behavior). But after downloading the data part, it couldn't be added to the working set because of an error that the part already exists (because the data part was deleted with some delay after the merge). This led to cyclical attempts to download the same data. [#3194](https://github.com/ClickHouse/ClickHouse/pull/3194) - Fixed incorrect calculation of total memory consumption by queries (because of incorrect calculation, the `max_memory_usage_for_all_queries` setting worked incorrectly and the `MemoryTracking` metric had an incorrect value). This error occurred in version 18.12.13. [Marek Vavruša](https://github.com/ClickHouse/ClickHouse/pull/3344) - Fixed the functionality of `CREATE TABLE ... ON CLUSTER ... AS SELECT ...` This error occurred in version 18.12.13. [#3247](https://github.com/ClickHouse/ClickHouse/pull/3247) - Fixed unnecessary preparation of data structures for `JOIN`s on the server that initiates the query if the `JOIN` is only performed on remote servers. [#3340](https://github.com/ClickHouse/ClickHouse/pull/3340) - Fixed bugs in the `Kafka` engine: deadlocks after exceptions when starting to read data, and locks upon completion [Marek Vavruša](https://github.com/ClickHouse/ClickHouse/pull/3215). - For `Kafka` tables, the optional `schema` parameter was not passed (the schema of the `Cap'n'Proto` format). [Vojtech Splichal](https://github.com/ClickHouse/ClickHouse/pull/3150) -- If the ensemble of ZooKeeper servers has servers that accept the connection but then immediately close it instead of responding to the handshake, ClickHouse chooses to connect another server. Previously, this produced the error `Cannot read all data. Bytes read: 0. Bytes expected: 4.` and the server couldn’t start. [8218cf3a](https://github.com/ClickHouse/ClickHouse/commit/8218cf3a5f39a43401953769d6d12a0bb8d29da9) +- If the ensemble of ZooKeeper servers has servers that accept the connection but then immediately close it instead of responding to the handshake, ClickHouse chooses to connect another server. Previously, this produced the error `Cannot read all data. Bytes read: 0. Bytes expected: 4.` and the server couldn't start. [8218cf3a](https://github.com/ClickHouse/ClickHouse/commit/8218cf3a5f39a43401953769d6d12a0bb8d29da9) - If the ensemble of ZooKeeper servers contains servers for which the DNS query returns an error, these servers are ignored. [17b8e209](https://github.com/ClickHouse/ClickHouse/commit/17b8e209221061325ad7ba0539f03c6e65f87f29) - Fixed type conversion between `Date` and `DateTime` when inserting data in the `VALUES` format (if `input_format_values_interpret_expressions = 1`). Previously, the conversion was performed between the numerical value of the number of days in Unix Epoch time and the Unix timestamp, which led to unexpected results. [#3229](https://github.com/ClickHouse/ClickHouse/pull/3229) - Corrected type conversion between `Decimal` and integer numbers. 
[#3211](https://github.com/ClickHouse/ClickHouse/pull/3211) @@ -319,7 +319,7 @@ description: 'Changelog for 2018' - `Merge` now works correctly on `Distributed` tables. [Winter Zhang](https://github.com/ClickHouse/ClickHouse/pull/3159) - Fixed incompatibility (unnecessary dependency on the `glibc` version) that made it impossible to run ClickHouse on `Ubuntu Precise` and older versions. The incompatibility arose in version 18.12.13. [#3130](https://github.com/ClickHouse/ClickHouse/pull/3130) - Fixed errors in the `enable_optimize_predicate_expression` setting. [Winter Zhang](https://github.com/ClickHouse/ClickHouse/pull/3107) -- Fixed a minor issue with backwards compatibility that appeared when working with a cluster of replicas on versions earlier than 18.12.13 and simultaneously creating a new replica of a table on a server with a newer version (shown in the message `Can not clone replica, because the ... updated to new ClickHouse version`, which is logical, but shouldn’t happen). [#3122](https://github.com/ClickHouse/ClickHouse/pull/3122) +- Fixed a minor issue with backwards compatibility that appeared when working with a cluster of replicas on versions earlier than 18.12.13 and simultaneously creating a new replica of a table on a server with a newer version (shown in the message `Can not clone replica, because the ... updated to new ClickHouse version`, which is logical, but shouldn't happen). [#3122](https://github.com/ClickHouse/ClickHouse/pull/3122) #### Backward Incompatible Changes: {#backward-incompatible-changes-2} @@ -330,13 +330,13 @@ description: 'Changelog for 2018' #### New Features: {#new-features-3} - Added support for `ALTER UPDATE` queries. [#3035](https://github.com/ClickHouse/ClickHouse/pull/3035) -- Added the `allow_ddl` option, which restricts the user’s access to DDL queries. [#3104](https://github.com/ClickHouse/ClickHouse/pull/3104) +- Added the `allow_ddl` option, which restricts the user's access to DDL queries. [#3104](https://github.com/ClickHouse/ClickHouse/pull/3104) - Added the `min_merge_bytes_to_use_direct_io` option for `MergeTree` engines, which allows you to set a threshold for the total size of the merge (when above the threshold, data part files will be handled using O_DIRECT). [#3117](https://github.com/ClickHouse/ClickHouse/pull/3117) - The `system.merges` system table now contains the `partition_id` column. [#3099](https://github.com/ClickHouse/ClickHouse/pull/3099) #### Improvements {#improvements-3} -- If a data part remains unchanged during mutation, it isn’t downloaded by replicas. [#3103](https://github.com/ClickHouse/ClickHouse/pull/3103) +- If a data part remains unchanged during mutation, it isn't downloaded by replicas. [#3103](https://github.com/ClickHouse/ClickHouse/pull/3103) - Autocomplete is available for names of settings when working with `clickhouse-client`. [#3106](https://github.com/ClickHouse/ClickHouse/pull/3106) #### Bug Fixes: {#bug-fixes-12} @@ -385,7 +385,7 @@ description: 'Changelog for 2018' - Improved parsing performance for text formats (`CSV`, `TSV`). [Amos Bird](https://github.com/ClickHouse/ClickHouse/pull/2977) [#2980](https://github.com/ClickHouse/ClickHouse/pull/2980) - Improved performance of reading strings and arrays in binary formats. [Amos Bird](https://github.com/ClickHouse/ClickHouse/pull/2955) - Increased performance and reduced memory consumption for queries to `system.tables` and `system.columns` when there is a very large number of tables on a single server. 
[#2953](https://github.com/ClickHouse/ClickHouse/pull/2953) -- Fixed a performance problem in the case of a large stream of queries that result in an error (the `_dl_addr` function is visible in `perf top`, but the server isn’t using much CPU). [#2938](https://github.com/ClickHouse/ClickHouse/pull/2938) +- Fixed a performance problem in the case of a large stream of queries that result in an error (the `_dl_addr` function is visible in `perf top`, but the server isn't using much CPU). [#2938](https://github.com/ClickHouse/ClickHouse/pull/2938) - Conditions are cast into the View (when `enable_optimize_predicate_expression` is enabled). [Winter Zhang](https://github.com/ClickHouse/ClickHouse/pull/2907) - Improvements to the functionality for the `UUID` data type. [#3074](https://github.com/ClickHouse/ClickHouse/pull/3074) [#2985](https://github.com/ClickHouse/ClickHouse/pull/2985) - The `UUID` data type is supported in The-Alchemist dictionaries. [#2822](https://github.com/ClickHouse/ClickHouse/pull/2822) @@ -434,7 +434,7 @@ description: 'Changelog for 2018' - Safe use of ODBC data sources. Interaction with ODBC drivers uses a separate `clickhouse-odbc-bridge` process. Errors in third-party ODBC drivers no longer cause problems with server stability or vulnerabilities. [#2828](https://github.com/ClickHouse/ClickHouse/pull/2828) [#2879](https://github.com/ClickHouse/ClickHouse/pull/2879) [#2886](https://github.com/ClickHouse/ClickHouse/pull/2886) [#2893](https://github.com/ClickHouse/ClickHouse/pull/2893) [#2921](https://github.com/ClickHouse/ClickHouse/pull/2921) - Fixed incorrect validation of the file path in the `catBoostPool` table function. [#2894](https://github.com/ClickHouse/ClickHouse/pull/2894) -- The contents of system tables (`tables`, `databases`, `parts`, `columns`, `parts_columns`, `merges`, `mutations`, `replicas`, and `replication_queue`) are filtered according to the user’s configured access to databases (`allow_databases`). [Winter Zhang](https://github.com/ClickHouse/ClickHouse/pull/2856) +- The contents of system tables (`tables`, `databases`, `parts`, `columns`, `parts_columns`, `merges`, `mutations`, `replicas`, and `replication_queue`) are filtered according to the user's configured access to databases (`allow_databases`). [Winter Zhang](https://github.com/ClickHouse/ClickHouse/pull/2856) #### Backward Incompatible Changes: {#backward-incompatible-changes-3} @@ -587,7 +587,7 @@ description: 'Changelog for 2018' - Fixed the incorrect result of the `maxIntersection()` function when the boundaries of intervals coincided ([Michael Furmur](https://github.com/ClickHouse/ClickHouse/pull/2657)). - Fixed incorrect transformation of the OR expression chain in a function argument ([chenxing-xc](https://github.com/ClickHouse/ClickHouse/pull/2663)). - Fixed performance degradation for queries containing `IN (subquery)` expressions inside another subquery ([#2571](https://github.com/ClickHouse/ClickHouse/issues/2571)). -- Fixed incompatibility between servers with different versions in distributed queries that use a `CAST` function that isn’t in uppercase letters ([fe8c4d6](https://github.com/ClickHouse/ClickHouse/commit/fe8c4d64e434cacd4ceef34faa9005129f2190a5)). +- Fixed incompatibility between servers with different versions in distributed queries that use a `CAST` function that isn't in uppercase letters ([fe8c4d6](https://github.com/ClickHouse/ClickHouse/commit/fe8c4d64e434cacd4ceef34faa9005129f2190a5)). 
- Added missing quoting of identifiers for queries to an external DBMS ([#2635](https://github.com/ClickHouse/ClickHouse/issues/2635)). #### Backward Incompatible Changes: {#backward-incompatible-changes-6} @@ -636,7 +636,7 @@ description: 'Changelog for 2018' - Fixed a bug when working with ZooKeeper that could result in old nodes not being deleted if the session is interrupted. - Fixed an error in the `quantileTDigest` function for Float arguments (this bug was introduced in version 1.1.54388) ([Mikhail Surin](https://github.com/ClickHouse/ClickHouse/pull/2553)). - Fixed a bug in the index for MergeTree tables if the primary key column is located inside the function for converting types between signed and unsigned integers of the same size ([#2603](https://github.com/ClickHouse/ClickHouse/pull/2603)). -- Fixed segfault if `macros` are used but they aren’t in the config file ([#2570](https://github.com/ClickHouse/ClickHouse/pull/2570)). +- Fixed segfault if `macros` are used but they aren't in the config file ([#2570](https://github.com/ClickHouse/ClickHouse/pull/2570)). - Fixed switching to the default database when reconnecting the client ([#2583](https://github.com/ClickHouse/ClickHouse/pull/2583)). - Fixed a bug that occurred when the `use_index_for_in_with_subqueries` setting was disabled. @@ -700,7 +700,7 @@ description: 'Changelog for 2018' - Table inserts no longer occur if the insert into one of the materialized views is not possible because it has too many parts. - Corrected the discrepancy in the event counters `Query`, `SelectQuery`, and `InsertQuery`. - Expressions like `tuple IN (SELECT tuple)` are allowed if the tuple types match. -- A server with replicated tables can start even if you haven’t configured ZooKeeper. +- A server with replicated tables can start even if you haven't configured ZooKeeper. - When calculating the number of available CPU cores, limits on cgroups are now taken into account ([Atri Sharma](https://github.com/ClickHouse/ClickHouse/pull/2325)). - Added chown for config directories in the systemd config file ([Mikhail Shiryaev](https://github.com/ClickHouse/ClickHouse/pull/2421)). @@ -774,7 +774,7 @@ description: 'Changelog for 2018' - Added information about the size of data parts in uncompressed form in the system table. - Server-to-server encryption support for distributed tables (`1` in the replica config in ``). - Configuration of the table level for the `ReplicatedMergeTree` family in order to minimize the amount of data stored in Zookeeper: : `use_minimalistic_checksums_in_zookeeper = 1` -- Configuration of the `clickhouse-client` prompt. By default, server names are now output to the prompt. The server’s display name can be changed. It’s also sent in the `X-ClickHouse-Display-Name` HTTP header (Kirill Shvakov). +- Configuration of the `clickhouse-client` prompt. By default, server names are now output to the prompt. The server's display name can be changed. It's also sent in the `X-ClickHouse-Display-Name` HTTP header (Kirill Shvakov). - Multiple comma-separated `topics` can be specified for the `Kafka` engine (Tobias Adamson) - When a query is stopped by `KILL QUERY` or `replace_running_query`, the client receives the `Query was canceled` exception instead of an incomplete result. @@ -785,7 +785,7 @@ description: 'Changelog for 2018' - A `query_log` table is recreated on the fly if it was deleted manually (Kirill Shvakov). - The `lengthUTF8` function runs faster (zhang2014). 
- Improved performance of synchronous inserts in `Distributed` tables (`insert_distributed_sync = 1`) when there is a very large number of shards. -- The server accepts the `send_timeout` and `receive_timeout` settings from the client and applies them when connecting to the client (they are applied in reverse order: the server socket’s `send_timeout` is set to the `receive_timeout` value received from the client, and vice versa). +- The server accepts the `send_timeout` and `receive_timeout` settings from the client and applies them when connecting to the client (they are applied in reverse order: the server socket's `send_timeout` is set to the `receive_timeout` value received from the client, and vice versa). - More robust crash recovery for asynchronous insertion into `Distributed` tables. - The return type of the `countEqual` function changed from `UInt32` to `UInt64` (谢磊). @@ -872,7 +872,7 @@ description: 'Changelog for 2018' - Failover is supported in `remote` table functions for cases when some of the replicas are missing the requested table. - Configuration settings can be overridden in the command line when you run `clickhouse-server`. Example: `clickhouse-server -- --logger.level=information`. - Implemented the `empty` function from a `FixedString` argument: the function returns 1 if the string consists entirely of null bytes (zhang2014). -- Added the `listen_try`configuration parameter for listening to at least one of the listen addresses without quitting, if some of the addresses can’t be listened to (useful for systems with disabled support for IPv4 or IPv6). +- Added the `listen_try`configuration parameter for listening to at least one of the listen addresses without quitting, if some of the addresses can't be listened to (useful for systems with disabled support for IPv4 or IPv6). - Added the `VersionedCollapsingMergeTree` table engine. - Support for rows and arbitrary numeric types for the `library` dictionary source. - `MergeTree` tables can be used without a primary key (you need to specify `ORDER BY tuple()`). @@ -910,7 +910,7 @@ description: 'Changelog for 2018' - Fixed the `DROP DATABASE` query for `Dictionary` databases. - Fixed the low precision of `uniqHLL12` and `uniqCombined` functions for cardinalities greater than 100 million items (Alex Bocharov). - Fixed the calculation of implicit default values when necessary to simultaneously calculate default explicit expressions in `INSERT` queries (zhang2014). -- Fixed a rare case when a query to a `MergeTree` table couldn’t finish (chenxing-xc). +- Fixed a rare case when a query to a `MergeTree` table couldn't finish (chenxing-xc). - Fixed a crash that occurred when running a `CHECK` query for `Distributed` tables if all shards are local (chenxing.xc). - Fixed a slight performance regression with functions that use regular expressions. - Fixed a performance regression when creating multidimensional arrays from complex expressions. diff --git a/docs/whats-new/changelog/2019.md b/docs/whats-new/changelog/2019.md index 0d8ffa6245b..21de463a3e5 100644 --- a/docs/whats-new/changelog/2019.md +++ b/docs/whats-new/changelog/2019.md @@ -24,7 +24,7 @@ description: 'Changelog for 2019' - Fixed segfault when `EXISTS` query was used without `TABLE` or `DICTIONARY` qualifier, just like `EXISTS t`. [#8213](https://github.com/ClickHouse/ClickHouse/pull/8213) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fixed return type for functions `rand` and `randConstant` in case of nullable argument. 
Now functions always return `UInt32` and never `Nullable(UInt32)`. [#8204](https://github.com/ClickHouse/ClickHouse/pull/8204) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - Fixed `DROP DICTIONARY IF EXISTS db.dict`, now it does not throw exception if `db` does not exist. [#8185](https://github.com/ClickHouse/ClickHouse/pull/8185) ([Vitaly Baranov](https://github.com/vitlibar)) -- If a table wasn’t completely dropped because of server crash, the server will try to restore and load it [#8176](https://github.com/ClickHouse/ClickHouse/pull/8176) ([tavplubix](https://github.com/tavplubix)) +- If a table wasn't completely dropped because of server crash, the server will try to restore and load it [#8176](https://github.com/ClickHouse/ClickHouse/pull/8176) ([tavplubix](https://github.com/tavplubix)) - Fixed a trivial count query for a distributed table if there are more than two shard local table. [#8164](https://github.com/ClickHouse/ClickHouse/pull/8164) ([小路](https://github.com/nicelulu)) - Fixed bug that lead to a data race in DB::BlockStreamProfileInfo::calculateRowsBeforeLimit() [#8143](https://github.com/ClickHouse/ClickHouse/pull/8143) ([Alexander Kazakov](https://github.com/Akazz)) - Fixed `ALTER table MOVE part` executed immediately after merging the specified part, which could cause moving a part which the specified part merged into. Now it correctly moves the specified part. [#8104](https://github.com/ClickHouse/ClickHouse/pull/8104) ([Vladimir Chebotarev](https://github.com/excitoon)) @@ -40,7 +40,7 @@ description: 'Changelog for 2019' - Fixed the bug that mutations are skipped for some attached parts due to their data_version are larger than the table mutation version. [#7812](https://github.com/ClickHouse/ClickHouse/pull/7812) ([Zhichang Yu](https://github.com/yuzhichang)) - Allow starting the server with redundant copies of parts after moving them to another device. [#7810](https://github.com/ClickHouse/ClickHouse/pull/7810) ([Vladimir Chebotarev](https://github.com/excitoon)) - Fixed the error "Sizes of columns does not match" that might appear when using aggregate function columns. [#7790](https://github.com/ClickHouse/ClickHouse/pull/7790) ([Boris Granveaud](https://github.com/bgranvea)) -- Now an exception will be thrown in case of using WITH TIES alongside LIMIT BY. And now it’s possible to use TOP with LIMIT BY. [#7637](https://github.com/ClickHouse/ClickHouse/pull/7637) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)) +- Now an exception will be thrown in case of using WITH TIES alongside LIMIT BY. And now it's possible to use TOP with LIMIT BY. [#7637](https://github.com/ClickHouse/ClickHouse/pull/7637) ([Nikita Mikhaylov](https://github.com/nikitamikhaylov)) - Fix dictionary reload if it has `invalidate_query`, which stopped updates and some exception on previous update tries. [#8029](https://github.com/ClickHouse/ClickHouse/pull/8029) ([alesapin](https://github.com/alesapin)) ### ClickHouse Release 19.17.4.11, 2019-11-22 {#clickhouse-release-v19-17-4-11-2019-11-22} @@ -102,7 +102,7 @@ description: 'Changelog for 2019' - Support parsing `(X,)` as tuple similar to python. [#7501](https://github.com/ClickHouse/ClickHouse/pull/7501), [#7562](https://github.com/ClickHouse/ClickHouse/pull/7562) ([Amos Bird](https://github.com/amosbird)) - Make `range` function behaviors almost like pythonic one. 
[#7518](https://github.com/ClickHouse/ClickHouse/pull/7518) ([sundyli](https://github.com/sundy-li)) - Add `constraints` columns to table `system.settings` [#7553](https://github.com/ClickHouse/ClickHouse/pull/7553) ([Vitaly Baranov](https://github.com/vitlibar)) -- Better Null format for tcp handler, so that it’s possible to use `select ignore() from table format Null` for perf measure via clickhouse-client [#7606](https://github.com/ClickHouse/ClickHouse/pull/7606) ([Amos Bird](https://github.com/amosbird)) +- Better Null format for tcp handler, so that it's possible to use `select ignore() from table format Null` for perf measure via clickhouse-client [#7606](https://github.com/ClickHouse/ClickHouse/pull/7606) ([Amos Bird](https://github.com/amosbird)) - Queries like `CREATE TABLE ... AS (SELECT (1, 2))` are parsed correctly [#7542](https://github.com/ClickHouse/ClickHouse/pull/7542) ([hcz](https://github.com/hczhcz)) #### Performance Improvement {#performance-improvement} @@ -240,7 +240,7 @@ description: 'Changelog for 2019' - Serialize NULL values correctly in min/max indexes of MergeTree parts. [#7234](https://github.com/ClickHouse/ClickHouse/pull/7234) ([Alexander Kuzmenkov](https://github.com/akuzm)) -- Don’t put virtual columns to .sql metadata when table is created as `CREATE TABLE AS`. +- Don't put virtual columns to .sql metadata when table is created as `CREATE TABLE AS`. [#7183](https://github.com/ClickHouse/ClickHouse/pull/7183) ([Ivan](https://github.com/abyss7)) - Fix segmentation fault in `ATTACH PART` query. [#7185](https://github.com/ClickHouse/ClickHouse/pull/7185) @@ -283,7 +283,7 @@ description: 'Changelog for 2019' side type. Make it work properly for compound types – Array and Tuple. [#7283](https://github.com/ClickHouse/ClickHouse/pull/7283) ([Alexander Kuzmenkov](https://github.com/akuzm)) -- Support missing inequalities for ASOF JOIN. It’s possible to join less-or-equal variant and strict +- Support missing inequalities for ASOF JOIN. It's possible to join less-or-equal variant and strict greater and less variants for ASOF column in ON syntax. [#7282](https://github.com/ClickHouse/ClickHouse/pull/7282) ([Artem Zuikov](https://github.com/4ertus2)) @@ -362,7 +362,7 @@ description: 'Changelog for 2019' - Fix undefined behavior in StoragesInfoStream. [#7384](https://github.com/ClickHouse/ClickHouse/pull/7384) ([tavplubix](https://github.com/tavplubix)) - Fixed constant expressions folding for external database engines (MySQL, ODBC, JDBC). In previous - versions it wasn’t working for multiple constant expressions and was not working at all for Date, + versions it wasn't working for multiple constant expressions and was not working at all for Date, DateTime and UUID. This fixes [#7245](https://github.com/ClickHouse/ClickHouse/issues/7245) [#7252](https://github.com/ClickHouse/ClickHouse/pull/7252) ([alexey-milovidov](https://github.com/alexey-milovidov)) @@ -432,7 +432,7 @@ description: 'Changelog for 2019' [#7351](https://github.com/ClickHouse/ClickHouse/pull/7351) ([Vasily Nemkov](https://github.com/Enmk)) - Wait for all jobs to finish on exception (fixes rare segfaults). [#7350](https://github.com/ClickHouse/ClickHouse/pull/7350) ([tavplubix](https://github.com/tavplubix)) -- Don’t push to MVs when inserting into Kafka table. +- Don't push to MVs when inserting into Kafka table. [#7265](https://github.com/ClickHouse/ClickHouse/pull/7265) ([Ivan](https://github.com/abyss7)) - Disable memory tracker for exception stack. 
[#7264](https://github.com/ClickHouse/ClickHouse/pull/7264) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) @@ -461,7 +461,7 @@ description: 'Changelog for 2019' #### New Feature {#new-feature-3} -- Tiered storage: support to use multiple storage volumes for tables with MergeTree engine. It’s possible to store fresh data on SSD and automatically move old data to HDD. ([example](https://clickhouse.github.io/clickhouse-presentations/meetup30/new_features/#12)). [#4918](https://github.com/ClickHouse/ClickHouse/pull/4918) ([Igr](https://github.com/ObjatieGroba)) [#6489](https://github.com/ClickHouse/ClickHouse/pull/6489) ([alesapin](https://github.com/alesapin)) +- Tiered storage: support to use multiple storage volumes for tables with MergeTree engine. It's possible to store fresh data on SSD and automatically move old data to HDD. ([example](https://clickhouse.github.io/clickhouse-presentations/meetup30/new_features/#12)). [#4918](https://github.com/ClickHouse/ClickHouse/pull/4918) ([Igr](https://github.com/ObjatieGroba)) [#6489](https://github.com/ClickHouse/ClickHouse/pull/6489) ([alesapin](https://github.com/alesapin)) - Add table function `input` for reading incoming data in `INSERT SELECT` query. [#5450](https://github.com/ClickHouse/ClickHouse/pull/5450) ([palasonic1](https://github.com/palasonic1)) [#6832](https://github.com/ClickHouse/ClickHouse/pull/6832) ([Anton Popov](https://github.com/CurtizJ)) - Add a `sparse_hashed` dictionary layout, that is functionally equivalent to the `hashed` layout, but is more memory efficient. It uses about twice as less memory at the cost of slower value retrieval. [#6894](https://github.com/ClickHouse/ClickHouse/pull/6894) ([Azat Khuzhin](https://github.com/azat)) - Implement ability to define list of users for access to dictionaries. Only current connected database using. [#6907](https://github.com/ClickHouse/ClickHouse/pull/6907) ([Guillaume Tassery](https://github.com/YiuRULE)) @@ -578,7 +578,7 @@ description: 'Changelog for 2019' - Fix segfault with enabled `optimize_skip_unused_shards` and missing sharding key. [#6384](https://github.com/ClickHouse/ClickHouse/pull/6384) ([Anton Popov](https://github.com/CurtizJ)) - Fixed wrong code in mutations that may lead to memory corruption. Fixed segfault with read of address `0x14c0` that may happed due to concurrent `DROP TABLE` and `SELECT` from `system.parts` or `system.parts_columns`. Fixed race condition in preparation of mutation queries. Fixed deadlock caused by `OPTIMIZE` of Replicated tables and concurrent modification operations like ALTERs. [#6514](https://github.com/ClickHouse/ClickHouse/pull/6514) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Removed extra verbose logging in MySQL interface [#6389](https://github.com/ClickHouse/ClickHouse/pull/6389) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Return the ability to parse boolean settings from ‘true’ and ‘false’ in the configuration file. [#6278](https://github.com/ClickHouse/ClickHouse/pull/6278) ([alesapin](https://github.com/alesapin)) +- Return the ability to parse boolean settings from 'true' and 'false' in the configuration file. [#6278](https://github.com/ClickHouse/ClickHouse/pull/6278) ([alesapin](https://github.com/alesapin)) - Fix crash in `quantile` and `median` function over `Nullable(Decimal128)`. 
[#6378](https://github.com/ClickHouse/ClickHouse/pull/6378) ([Artem Zuikov](https://github.com/4ertus2)) - Fixed possible incomplete result returned by `SELECT` query with `WHERE` condition on primary key contained conversion to Float type. It was caused by incorrect checking of monotonicity in `toFloat` function. [#6248](https://github.com/ClickHouse/ClickHouse/issues/6248) [#6374](https://github.com/ClickHouse/ClickHouse/pull/6374) ([dimarub2000](https://github.com/dimarub2000)) - Check `max_expanded_ast_elements` setting for mutations. Clear mutations after `TRUNCATE TABLE`. [#6205](https://github.com/ClickHouse/ClickHouse/pull/6205) ([Winter Zhang](https://github.com/zhang2014)) @@ -667,9 +667,9 @@ description: 'Changelog for 2019' - Server exception got while sending insertion data is now being processed in client as well. [#5891](https://github.com/ClickHouse/ClickHouse/issues/5891) [#6711](https://github.com/ClickHouse/ClickHouse/pull/6711) ([dimarub2000](https://github.com/dimarub2000)) - Added a metric `DistributedFilesToInsert` that shows the total number of files in filesystem that are selected to send to remote servers by Distributed tables. The number is summed across all shards. [#6600](https://github.com/ClickHouse/ClickHouse/pull/6600) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Move most of JOINs prepare logic from `ExpressionAction/ExpressionAnalyzer` to `AnalyzedJoin`. [#6785](https://github.com/ClickHouse/ClickHouse/pull/6785) ([Artem Zuikov](https://github.com/4ertus2)) -- Fix TSan [warning](https://clickhouse-test-reports.s3.yandex.net/6399/c1c1d1daa98e199e620766f1bd06a5921050a00d/functional_stateful_tests_(thread).html) ‘lock-order-inversion’. [#6740](https://github.com/ClickHouse/ClickHouse/pull/6740) ([Vasily Nemkov](https://github.com/Enmk)) +- Fix TSan [warning](https://clickhouse-test-reports.s3.yandex.net/6399/c1c1d1daa98e199e620766f1bd06a5921050a00d/functional_stateful_tests_(thread).html) 'lock-order-inversion'. [#6740](https://github.com/ClickHouse/ClickHouse/pull/6740) ([Vasily Nemkov](https://github.com/Enmk)) - Better information messages about lack of Linux capabilities. Logging fatal errors with "fatal" level, that will make it easier to find in `system.text_log`. [#6441](https://github.com/ClickHouse/ClickHouse/pull/6441) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- When enable dumping temporary data to the disk to restrict memory usage during `GROUP BY`, `ORDER BY`, it didn’t check the free disk space. The fix add a new setting `min_free_disk_space`, when the free disk space it smaller then the threshold, the query will stop and throw `ErrorCodes::NOT_ENOUGH_SPACE`. [#6678](https://github.com/ClickHouse/ClickHouse/pull/6678) ([Weiqing Xu](https://github.com/weiqxu)) [#6691](https://github.com/ClickHouse/ClickHouse/pull/6691) ([alexey-milovidov](https://github.com/alexey-milovidov)) +- When dumping temporary data to disk was enabled to restrict memory usage during `GROUP BY` and `ORDER BY`, the free disk space was not checked. The fix adds a new setting `min_free_disk_space`: when the free disk space is smaller than the threshold, the query stops and throws `ErrorCodes::NOT_ENOUGH_SPACE`. [#6678](https://github.com/ClickHouse/ClickHouse/pull/6678) ([Weiqing Xu](https://github.com/weiqxu)) [#6691](https://github.com/ClickHouse/ClickHouse/pull/6691) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Removed recursive rwlock by thread. It makes no sense, because threads are reused between queries.
`SELECT` query may acquire a lock in one thread, hold a lock from another thread and exit from first thread. In the same time, first thread can be reused by `DROP` query. This will lead to false "Attempt to acquire exclusive lock recursively" messages. [#6771](https://github.com/ClickHouse/ClickHouse/pull/6771) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Split `ExpressionAnalyzer.appendJoin()`. Prepare a place in `ExpressionAnalyzer` for `MergeJoin`. [#6524](https://github.com/ClickHouse/ClickHouse/pull/6524) ([Artem Zuikov](https://github.com/4ertus2)) - Added `mysql_native_password` authentication plugin to MySQL compatibility server. [#6194](https://github.com/ClickHouse/ClickHouse/pull/6194) ([Yuriy Baranov](https://github.com/yurriy)) @@ -706,7 +706,7 @@ description: 'Changelog for 2019' #### Build/Testing/Packaging Improvement {#buildtestingpackaging-improvement-4} -- Remove Compiler (runtime template instantiation) because we’ve win over it’s performance. [#6646](https://github.com/ClickHouse/ClickHouse/pull/6646) ([alexey-milovidov](https://github.com/alexey-milovidov)) +- Remove Compiler (runtime template instantiation) because we have surpassed its performance. [#6646](https://github.com/ClickHouse/ClickHouse/pull/6646) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Added performance test to show degradation of performance in gcc-9 in more isolated way. [#6302](https://github.com/ClickHouse/ClickHouse/pull/6302) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Added table function `numbers_mt`, which is multi-threaded version of `numbers`. Updated performance tests with hash functions. [#6554](https://github.com/ClickHouse/ClickHouse/pull/6554) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - Comparison mode in `clickhouse-benchmark` [#6220](https://github.com/ClickHouse/ClickHouse/issues/6220) [#6343](https://github.com/ClickHouse/ClickHouse/pull/6343) ([dimarub2000](https://github.com/dimarub2000)) @@ -743,7 +743,7 @@ description: 'Changelog for 2019' - Support for Oracle Linux in official RPM packages. [#6356](https://github.com/ClickHouse/ClickHouse/issues/6356) [#6585](https://github.com/ClickHouse/ClickHouse/pull/6585) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Changed json perftests from `once` to `loop` type. [#6536](https://github.com/ClickHouse/ClickHouse/pull/6536) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - `odbc-bridge.cpp` defines `main()` so it should not be included in `clickhouse-lib`. [#6538](https://github.com/ClickHouse/ClickHouse/pull/6538) ([Orivej Desh](https://github.com/orivej)) -- Test for crash in `FULL|RIGHT JOIN` with nulls in right table’s keys. [#6362](https://github.com/ClickHouse/ClickHouse/pull/6362) ([Artem Zuikov](https://github.com/4ertus2)) +- Test for crash in `FULL|RIGHT JOIN` with nulls in right table's keys. [#6362](https://github.com/ClickHouse/ClickHouse/pull/6362) ([Artem Zuikov](https://github.com/4ertus2)) - Added a test for the limit on expansion of aliases just in case. [#6442](https://github.com/ClickHouse/ClickHouse/pull/6442) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Switched from `boost::filesystem` to `std::filesystem` where appropriate. [#6253](https://github.com/ClickHouse/ClickHouse/pull/6253) [#6385](https://github.com/ClickHouse/ClickHouse/pull/6385) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Added RPM packages to website.
[#6251](https://github.com/ClickHouse/ClickHouse/pull/6251) ([alexey-milovidov](https://github.com/alexey-milovidov)) @@ -762,7 +762,7 @@ description: 'Changelog for 2019' - Enable back the check of undefined symbols while linking. [#6453](https://github.com/ClickHouse/ClickHouse/pull/6453) ([Ivan](https://github.com/abyss7)) - Avoid rebuilding `hyperscan` every day. [#6307](https://github.com/ClickHouse/ClickHouse/pull/6307) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fixed UBSan report in `ProtobufWriter`. [#6163](https://github.com/ClickHouse/ClickHouse/pull/6163) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Don’t allow to use query profiler with sanitizers because it is not compatible. [#6769](https://github.com/ClickHouse/ClickHouse/pull/6769) ([alexey-milovidov](https://github.com/alexey-milovidov)) +- Don't allow to use query profiler with sanitizers because it is not compatible. [#6769](https://github.com/ClickHouse/ClickHouse/pull/6769) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Add test for reloading a dictionary after fail by timer. [#6114](https://github.com/ClickHouse/ClickHouse/pull/6114) ([Vitaly Baranov](https://github.com/vitlibar)) - Fix inconsistency in `PipelineExecutor::prepareProcessor` argument type. [#6494](https://github.com/ClickHouse/ClickHouse/pull/6494) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - Added a test for bad URIs. [#6493](https://github.com/ClickHouse/ClickHouse/pull/6493) ([alexey-milovidov](https://github.com/alexey-milovidov)) @@ -772,7 +772,7 @@ description: 'Changelog for 2019' - Fixed tests affected by slow stack traces printing. [#6315](https://github.com/ClickHouse/ClickHouse/pull/6315) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Add a test case for crash in `groupUniqArray` fixed in [#6029](https://github.com/ClickHouse/ClickHouse/pull/6029). [#4402](https://github.com/ClickHouse/ClickHouse/issues/4402) [#6129](https://github.com/ClickHouse/ClickHouse/pull/6129) ([akuzm](https://github.com/akuzm)) - Fixed indices mutations tests. [#6645](https://github.com/ClickHouse/ClickHouse/pull/6645) ([Nikita Vasilev](https://github.com/nikvas0)) -- In performance test, do not read query log for queries we didn’t run. [#6427](https://github.com/ClickHouse/ClickHouse/pull/6427) ([akuzm](https://github.com/akuzm)) +- In performance test, do not read query log for queries we didn't run. [#6427](https://github.com/ClickHouse/ClickHouse/pull/6427) ([akuzm](https://github.com/akuzm)) - Materialized view now could be created with any low cardinality types regardless to the setting about suspicious low cardinality types. [#6428](https://github.com/ClickHouse/ClickHouse/pull/6428) ([Olga Khvostikova](https://github.com/stavrolia)) - Updated tests for `send_logs_level` setting. [#6207](https://github.com/ClickHouse/ClickHouse/pull/6207) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - Fix build under gcc-8.2. [#6196](https://github.com/ClickHouse/ClickHouse/pull/6196) ([Max Akhmedov](https://github.com/zlobober)) @@ -1087,21 +1087,21 @@ description: 'Changelog for 2019' - Update librdkafka to version 1.1.0 [#5872](https://github.com/ClickHouse/ClickHouse/pull/5872) ([Ivan](https://github.com/abyss7)) - Add global timeout for integration tests and disable some of them in tests code. [#5741](https://github.com/ClickHouse/ClickHouse/pull/5741) ([alesapin](https://github.com/alesapin)) - Fix some ThreadSanitizer failures. 
[#5854](https://github.com/ClickHouse/ClickHouse/pull/5854) ([akuzm](https://github.com/akuzm)) -- The `--no-undefined` option forces the linker to check all external names for existence while linking. It’s very useful to track real dependencies between libraries in the split build mode. [#5855](https://github.com/ClickHouse/ClickHouse/pull/5855) ([Ivan](https://github.com/abyss7)) +- The `--no-undefined` option forces the linker to check all external names for existence while linking. It's very useful to track real dependencies between libraries in the split build mode. [#5855](https://github.com/ClickHouse/ClickHouse/pull/5855) ([Ivan](https://github.com/abyss7)) - Added performance test for [#5797](https://github.com/ClickHouse/ClickHouse/issues/5797) [#5914](https://github.com/ClickHouse/ClickHouse/pull/5914) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fixed compatibility with gcc-7. [#5840](https://github.com/ClickHouse/ClickHouse/pull/5840) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Added support for gcc-9. This fixes [#5717](https://github.com/ClickHouse/ClickHouse/issues/5717) [#5774](https://github.com/ClickHouse/ClickHouse/pull/5774) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fixed error when libunwind can be linked incorrectly. [#5948](https://github.com/ClickHouse/ClickHouse/pull/5948) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fixed a few warnings found by PVS-Studio. [#5921](https://github.com/ClickHouse/ClickHouse/pull/5921) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Added initial support for `clang-tidy` static analyzer. [#5806](https://github.com/ClickHouse/ClickHouse/pull/5806) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Convert BSD/Linux endian macros( ‘be64toh’ and ‘htobe64’) to the Mac OS X equivalents [#5785](https://github.com/ClickHouse/ClickHouse/pull/5785) ([Fu Chen](https://github.com/fredchenbj)) +- Convert BSD/Linux endian macros( 'be64toh' and 'htobe64') to the Mac OS X equivalents [#5785](https://github.com/ClickHouse/ClickHouse/pull/5785) ([Fu Chen](https://github.com/fredchenbj)) - Improved integration tests guide. [#5796](https://github.com/ClickHouse/ClickHouse/pull/5796) ([Vladimir Chebotarev](https://github.com/excitoon)) - Fixing build at macosx + gcc9 [#5822](https://github.com/ClickHouse/ClickHouse/pull/5822) ([filimonov](https://github.com/filimonov)) - Fix a hard-to-spot typo: aggreAGte -\> aggregate. [#5753](https://github.com/ClickHouse/ClickHouse/pull/5753) ([akuzm](https://github.com/akuzm)) - Fix freebsd build [#5760](https://github.com/ClickHouse/ClickHouse/pull/5760) ([proller](https://github.com/proller)) - Add link to experimental YouTube channel to website [#5845](https://github.com/ClickHouse/ClickHouse/pull/5845) ([Ivan Blinkov](https://github.com/blinkov)) - CMake: add option for coverage flags: WITH_COVERAGE [#5776](https://github.com/ClickHouse/ClickHouse/pull/5776) ([proller](https://github.com/proller)) -- Fix initial size of some inline PODArray’s. [#5787](https://github.com/ClickHouse/ClickHouse/pull/5787) ([akuzm](https://github.com/akuzm)) +- Fix initial size of some inline PODArray's. [#5787](https://github.com/ClickHouse/ClickHouse/pull/5787) ([akuzm](https://github.com/akuzm)) - clickhouse-server.postinst: fix os detection for centos 6 [#5788](https://github.com/ClickHouse/ClickHouse/pull/5788) ([proller](https://github.com/proller)) - Added Arch linux package generation. 
[#5719](https://github.com/ClickHouse/ClickHouse/pull/5719) ([Vladimir Chebotarev](https://github.com/excitoon)) - Split Common/config.h by libs (dbms) [#5715](https://github.com/ClickHouse/ClickHouse/pull/5715) ([proller](https://github.com/proller)) @@ -1126,7 +1126,7 @@ description: 'Changelog for 2019' - Add new column codec: `T64`. Made for (U)IntX/EnumX/Data(Time)/DecimalX columns. It should be good for columns with constant or small range values. Codec itself allows enlarge or shrink data type without re-compression. [#5557](https://github.com/ClickHouse/ClickHouse/pull/5557) ([Artem Zuikov](https://github.com/4ertus2)) - Add database engine `MySQL` that allow to view all the tables in remote MySQL server [#5599](https://github.com/ClickHouse/ClickHouse/pull/5599) ([Winter Zhang](https://github.com/zhang2014)) -- `bitmapContains` implementation. It’s 2x faster than `bitmapHasAny` if the second bitmap contains one element. [#5535](https://github.com/ClickHouse/ClickHouse/pull/5535) ([Zhichang Yu](https://github.com/yuzhichang)) +- `bitmapContains` implementation. It's 2x faster than `bitmapHasAny` if the second bitmap contains one element. [#5535](https://github.com/ClickHouse/ClickHouse/pull/5535) ([Zhichang Yu](https://github.com/yuzhichang)) - Support for `crc32` function (with behaviour exactly as in MySQL or PHP). Do not use it if you need a hash function. [#5661](https://github.com/ClickHouse/ClickHouse/pull/5661) ([Remen Ivan](https://github.com/BHYCHIK)) - Implemented `SYSTEM START/STOP DISTRIBUTED SENDS` queries to control asynchronous inserts into `Distributed` tables. [#4935](https://github.com/ClickHouse/ClickHouse/pull/4935) ([Winter Zhang](https://github.com/zhang2014)) @@ -1135,18 +1135,18 @@ description: 'Changelog for 2019' - Ignore query execution limits and max parts size for merge limits while executing mutations. [#5659](https://github.com/ClickHouse/ClickHouse/pull/5659) ([Anton Popov](https://github.com/CurtizJ)) - Fix bug which may lead to deduplication of normal blocks (extremely rare) and insertion of duplicate blocks (more often). [#5549](https://github.com/ClickHouse/ClickHouse/pull/5549) ([alesapin](https://github.com/alesapin)) - Fix of function `arrayEnumerateUniqRanked` for arguments with empty arrays [#5559](https://github.com/ClickHouse/ClickHouse/pull/5559) ([proller](https://github.com/proller)) -- Don’t subscribe to Kafka topics without intent to poll any messages. [#5698](https://github.com/ClickHouse/ClickHouse/pull/5698) ([Ivan](https://github.com/abyss7)) +- Don't subscribe to Kafka topics without intent to poll any messages. [#5698](https://github.com/ClickHouse/ClickHouse/pull/5698) ([Ivan](https://github.com/abyss7)) - Make setting `join_use_nulls` get no effect for types that cannot be inside Nullable [#5700](https://github.com/ClickHouse/ClickHouse/pull/5700) ([Olga Khvostikova](https://github.com/stavrolia)) - Fixed `Incorrect size of index granularity` errors [#5720](https://github.com/ClickHouse/ClickHouse/pull/5720) ([coraxster](https://github.com/coraxster)) - Fix Float to Decimal convert overflow [#5607](https://github.com/ClickHouse/ClickHouse/pull/5607) ([coraxster](https://github.com/coraxster)) -- Flush buffer when `WriteBufferFromHDFS`’s destructor is called. This fixes writing into `HDFS`. [#5684](https://github.com/ClickHouse/ClickHouse/pull/5684) ([Xindong Peng](https://github.com/eejoin)) +- Flush buffer when `WriteBufferFromHDFS`'s destructor is called. This fixes writing into `HDFS`. 
[#5684](https://github.com/ClickHouse/ClickHouse/pull/5684) ([Xindong Peng](https://github.com/eejoin)) #### Improvement {#improvement-7} - Treat empty cells in `CSV` as default values when the setting `input_format_defaults_for_omitted_fields` is enabled. [#5625](https://github.com/ClickHouse/ClickHouse/pull/5625) ([akuzm](https://github.com/akuzm)) - Non-blocking loading of external dictionaries. [#5567](https://github.com/ClickHouse/ClickHouse/pull/5567) ([Vitaly Baranov](https://github.com/vitlibar)) - Network timeouts can be dynamically changed for already established connections according to the settings. [#4558](https://github.com/ClickHouse/ClickHouse/pull/4558) ([Konstantin Podshumok](https://github.com/podshumok)) -- Using "public_suffix_list" for functions `firstSignificantSubdomain`, `cutToFirstSignificantSubdomain`. It’s using a perfect hash table generated by `gperf` with a list generated from the file: https://publicsuffix.org/list/public_suffix_list.dat. (for example, now we recognize the domain `ac.uk` as non-significant). [#5030](https://github.com/ClickHouse/ClickHouse/pull/5030) ([Guillaume Tassery](https://github.com/YiuRULE)) +- Using "public_suffix_list" for functions `firstSignificantSubdomain`, `cutToFirstSignificantSubdomain`. It's using a perfect hash table generated by `gperf` with a list generated from the file: https://publicsuffix.org/list/public_suffix_list.dat. (for example, now we recognize the domain `ac.uk` as non-significant). [#5030](https://github.com/ClickHouse/ClickHouse/pull/5030) ([Guillaume Tassery](https://github.com/YiuRULE)) - Adopted `IPv6` data type in system tables; unified client info columns in `system.processes` and `system.query_log` [#5640](https://github.com/ClickHouse/ClickHouse/pull/5640) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Using sessions for connections with MySQL compatibility protocol. #5476 [#5646](https://github.com/ClickHouse/ClickHouse/pull/5646) ([Yuriy Baranov](https://github.com/yurriy)) - Support more `ALTER` queries `ON CLUSTER`. [#5593](https://github.com/ClickHouse/ClickHouse/pull/5593) [#5613](https://github.com/ClickHouse/ClickHouse/pull/5613) ([sundyli](https://github.com/sundy-li)) @@ -1210,7 +1210,7 @@ description: 'Changelog for 2019' - Fix bad alloc when truncate Join storage [#5437](https://github.com/ClickHouse/ClickHouse/pull/5437) ([TCeason](https://github.com/TCeason)) - In recent versions of package tzdata some of files are symlinks now. The current mechanism for detecting default timezone gets broken and gives wrong names for some timezones. Now at least we force the timezone name to the contents of TZ if provided. [#5443](https://github.com/ClickHouse/ClickHouse/pull/5443) ([Ivan](https://github.com/abyss7)) - Fix some extremely rare cases with MultiVolnitsky searcher when the constant needles in sum are at least 16KB long. The algorithm missed or overwrote the previous results which can lead to the incorrect result of `multiSearchAny`. [#5588](https://github.com/ClickHouse/ClickHouse/pull/5588) ([Danila Kutenin](https://github.com/danlark1)) -- Fix the issue when settings for ExternalData requests couldn’t use ClickHouse settings. Also, for now, settings `date_time_input_format` and `low_cardinality_allow_in_native_format` cannot be used because of the ambiguity of names (in external data it can be interpreted as table format and in the query it can be a setting). 
[#5455](https://github.com/ClickHouse/ClickHouse/pull/5455) ([Danila Kutenin](https://github.com/danlark1)) +- Fix the issue when settings for ExternalData requests couldn't use ClickHouse settings. Also, for now, settings `date_time_input_format` and `low_cardinality_allow_in_native_format` cannot be used because of the ambiguity of names (in external data it can be interpreted as table format and in the query it can be a setting). [#5455](https://github.com/ClickHouse/ClickHouse/pull/5455) ([Danila Kutenin](https://github.com/danlark1)) - Fix bug when parts were removed only from FS without dropping them from Zookeeper. [#5520](https://github.com/ClickHouse/ClickHouse/pull/5520) ([alesapin](https://github.com/alesapin)) - Remove debug logging from MySQL protocol [#5478](https://github.com/ClickHouse/ClickHouse/pull/5478) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Skip ZNONODE during DDL query processing [#5489](https://github.com/ClickHouse/ClickHouse/pull/5489) ([Azat Khuzhin](https://github.com/azat)) @@ -1265,7 +1265,7 @@ description: 'Changelog for 2019' - Added `max_parts_in_total` setting for MergeTree family of tables (default: 100 000) that prevents unsafe specification of partition key #5166. [#5171](https://github.com/ClickHouse/ClickHouse/pull/5171) ([alexey-milovidov](https://github.com/alexey-milovidov)) - `clickhouse-obfuscator`: derive seed for individual columns by combining initial seed with column name, not column position. This is intended to transform datasets with multiple related tables, so that tables will remain JOINable after transformation. [#5178](https://github.com/ClickHouse/ClickHouse/pull/5178) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Added functions `JSONExtractRaw`, `JSONExtractKeyAndValues`. Renamed functions `jsonExtract` to `JSONExtract`. When something goes wrong these functions return the correspondent values, not `NULL`. Modified function `JSONExtract`, now it gets the return type from its last parameter and does not inject nullables. Implemented fallback to RapidJSON in case AVX2 instructions are not available. Simdjson library updated to a new version. [#5235](https://github.com/ClickHouse/ClickHouse/pull/5235) ([Vitaly Baranov](https://github.com/vitlibar)) -- Now `if` and `multiIf` functions do not rely on the condition’s `Nullable`, but rely on the branches for sql compatibility. [#5238](https://github.com/ClickHouse/ClickHouse/pull/5238) ([Jian Wu](https://github.com/janplus)) +- Now `if` and `multiIf` functions do not rely on the condition's `Nullable`, but rely on the branches for sql compatibility. [#5238](https://github.com/ClickHouse/ClickHouse/pull/5238) ([Jian Wu](https://github.com/janplus)) - `In` predicate now generates `Null` result from `Null` input like the `Equal` function. [#5152](https://github.com/ClickHouse/ClickHouse/pull/5152) ([Jian Wu](https://github.com/janplus)) - Check the time limit every (flush_interval / poll_timeout) number of rows from Kafka. This allows to break the reading from Kafka consumer more frequently and to check the time limits for the top-level streams [#5249](https://github.com/ClickHouse/ClickHouse/pull/5249) ([Ivan](https://github.com/abyss7)) - Link rdkafka with bundled SASL. It should allow to use SASL SCRAM authentication [#5253](https://github.com/ClickHouse/ClickHouse/pull/5253) ([Ivan](https://github.com/abyss7)) @@ -1273,7 +1273,7 @@ description: 'Changelog for 2019' - clickhouse-server: more informative listen error messages. 
[#5268](https://github.com/ClickHouse/ClickHouse/pull/5268) ([proller](https://github.com/proller)) - Support dictionaries in clickhouse-copier for functions in `` [#5270](https://github.com/ClickHouse/ClickHouse/pull/5270) ([proller](https://github.com/proller)) - Add new setting `kafka_commit_every_batch` to regulate Kafka committing policy. - It allows to set commit mode: after every batch of messages is handled, or after the whole block is written to the storage. It’s a trade-off between losing some messages or reading them twice in some extreme situations. [#5308](https://github.com/ClickHouse/ClickHouse/pull/5308) ([Ivan](https://github.com/abyss7)) + It allows to set commit mode: after every batch of messages is handled, or after the whole block is written to the storage. It's a trade-off between losing some messages or reading them twice in some extreme situations. [#5308](https://github.com/ClickHouse/ClickHouse/pull/5308) ([Ivan](https://github.com/abyss7)) - Make `windowFunnel` support other Unsigned Integer Types. [#5320](https://github.com/ClickHouse/ClickHouse/pull/5320) ([sundyli](https://github.com/sundy-li)) - Allow to shadow virtual column `_table` in Merge engine. [#5325](https://github.com/ClickHouse/ClickHouse/pull/5325) ([Ivan](https://github.com/abyss7)) - Make `sequenceMatch` aggregate functions support other unsigned Integer types [#5339](https://github.com/ClickHouse/ClickHouse/pull/5339) ([sundyli](https://github.com/sundy-li)) @@ -1289,7 +1289,7 @@ description: 'Changelog for 2019' - Parallelize processing of parts of non-replicated MergeTree tables in ALTER MODIFY query. [#4639](https://github.com/ClickHouse/ClickHouse/pull/4639) ([Ivan Kush](https://github.com/IvanKush)) - Optimizations in regular expressions extraction. [#5193](https://github.com/ClickHouse/ClickHouse/pull/5193) [#5191](https://github.com/ClickHouse/ClickHouse/pull/5191) ([Danila Kutenin](https://github.com/danlark1)) -- Do not add right join key column to join result if it’s used only in join on section. [#5260](https://github.com/ClickHouse/ClickHouse/pull/5260) ([Artem Zuikov](https://github.com/4ertus2)) +- Do not add right join key column to join result if it's used only in join on section. [#5260](https://github.com/ClickHouse/ClickHouse/pull/5260) ([Artem Zuikov](https://github.com/4ertus2)) - Freeze the Kafka buffer after first empty response. It avoids multiple invokations of `ReadBuffer::next()` for empty result in some row-parsing streams. [#5283](https://github.com/ClickHouse/ClickHouse/pull/5283) ([Ivan](https://github.com/abyss7)) - `concat` function optimization for multiple arguments. [#5357](https://github.com/ClickHouse/ClickHouse/pull/5357) ([Danila Kutenin](https://github.com/danlark1)) - Query optimisation. Allow push down IN statement while rewriting commа/cross join into inner one. [#5396](https://github.com/ClickHouse/ClickHouse/pull/5396) ([Artem Zuikov](https://github.com/4ertus2)) @@ -1343,10 +1343,10 @@ description: 'Changelog for 2019' #### Bug Fixes {#bug-fixes-1} - Fix segfault on `minmax` INDEX with Null value. [#5246](https://github.com/ClickHouse/ClickHouse/pull/5246) ([Nikita Vasilev](https://github.com/nikvas0)) -- Mark all input columns in LIMIT BY as required output. It fixes ‘Not found column’ error in some distributed queries. [#5407](https://github.com/ClickHouse/ClickHouse/pull/5407) ([Constantin S. Pan](https://github.com/kvap)) -- Fix "Column ‘0’ already exists" error in `SELECT .. 
PREWHERE` on column with DEFAULT [#5397](https://github.com/ClickHouse/ClickHouse/pull/5397) ([proller](https://github.com/proller)) +- Mark all input columns in LIMIT BY as required output. It fixes 'Not found column' error in some distributed queries. [#5407](https://github.com/ClickHouse/ClickHouse/pull/5407) ([Constantin S. Pan](https://github.com/kvap)) +- Fix "Column '0' already exists" error in `SELECT .. PREWHERE` on column with DEFAULT [#5397](https://github.com/ClickHouse/ClickHouse/pull/5397) ([proller](https://github.com/proller)) - Fix `ALTER MODIFY TTL` query on `ReplicatedMergeTree`. [#5539](https://github.com/ClickHouse/ClickHouse/pull/5539/commits) ([Anton Popov](https://github.com/CurtizJ)) -- Don’t crash the server when Kafka consumers have failed to start. [#5285](https://github.com/ClickHouse/ClickHouse/pull/5285) ([Ivan](https://github.com/abyss7)) +- Don't crash the server when Kafka consumers have failed to start. [#5285](https://github.com/ClickHouse/ClickHouse/pull/5285) ([Ivan](https://github.com/abyss7)) - Fixed bitmap functions produce wrong result. [#5359](https://github.com/ClickHouse/ClickHouse/pull/5359) ([Andy Yang](https://github.com/andyyzh)) - Fix element_count for hashed dictionary (do not include duplicates) [#5440](https://github.com/ClickHouse/ClickHouse/pull/5440) ([Azat Khuzhin](https://github.com/azat)) - Use contents of environment variable TZ as the name for timezone. It helps to correctly detect default timezone in some cases.[#5443](https://github.com/ClickHouse/ClickHouse/pull/5443) ([Ivan](https://github.com/abyss7)) @@ -1449,7 +1449,7 @@ description: 'Changelog for 2019' - TTL expressions for columns and tables. [#4212](https://github.com/ClickHouse/ClickHouse/pull/4212) ([Anton Popov](https://github.com/CurtizJ)) - Added support for `brotli` compression for HTTP responses (Accept-Encoding: br) [#4388](https://github.com/ClickHouse/ClickHouse/pull/4388) ([Mikhail](https://github.com/fandyushin)) - Added new function `isValidUTF8` for checking whether a set of bytes is correctly utf-8 encoded. [#4934](https://github.com/ClickHouse/ClickHouse/pull/4934) ([Danila Kutenin](https://github.com/danlark1)) -- Add new load balancing policy `first_or_random` which sends queries to the first specified host and if it’s inaccessible send queries to random hosts of shard. Useful for cross-replication topology setups. [#5012](https://github.com/ClickHouse/ClickHouse/pull/5012) ([nvartolomei](https://github.com/nvartolomei)) +- Add new load balancing policy `first_or_random` which sends queries to the first specified host and if it's inaccessible send queries to random hosts of shard. Useful for cross-replication topology setups. [#5012](https://github.com/ClickHouse/ClickHouse/pull/5012) ([nvartolomei](https://github.com/nvartolomei)) #### Experimental Features {#experimental-features-1} @@ -1477,7 +1477,7 @@ description: 'Changelog for 2019' - Fixed potential null pointer dereference in `clickhouse-copier`. [#4900](https://github.com/ClickHouse/ClickHouse/pull/4900) ([proller](https://github.com/proller)) - Fixed error on query with JOIN + ARRAY JOIN [#4938](https://github.com/ClickHouse/ClickHouse/pull/4938) ([Artem Zuikov](https://github.com/4ertus2)) - Fixed hanging on start of the server when a dictionary depends on another dictionary via a database with engine=Dictionary. [#4962](https://github.com/ClickHouse/ClickHouse/pull/4962) ([Vitaly Baranov](https://github.com/vitlibar)) -- Partially fix distributed_product_mode = local. 
It’s possible to allow columns of local tables in where/having/order by/... via table aliases. Throw exception if table does not have alias. There’s not possible to access to the columns without table aliases yet. [#4986](https://github.com/ClickHouse/ClickHouse/pull/4986) ([Artem Zuikov](https://github.com/4ertus2)) +- Partially fix distributed_product_mode = local. It's possible to allow columns of local tables in where/having/order by/... via table aliases. An exception is thrown if the table does not have an alias. It's not yet possible to access the columns without table aliases. [#4986](https://github.com/ClickHouse/ClickHouse/pull/4986) ([Artem Zuikov](https://github.com/4ertus2)) - Fix potentially wrong result for `SELECT DISTINCT` with `JOIN` [#5001](https://github.com/ClickHouse/ClickHouse/pull/5001) ([Artem Zuikov](https://github.com/4ertus2)) - Fixed very rare data race condition that could happen when executing a query with UNION ALL involving at least two SELECTs from system.columns, system.tables, system.parts, system.parts_tables or tables of Merge family and performing ALTER of columns of the related tables concurrently. [#5189](https://github.com/ClickHouse/ClickHouse/pull/5189) ([alexey-milovidov](https://github.com/alexey-milovidov)) @@ -1538,7 +1538,7 @@ description: 'Changelog for 2019' #### Bug Fix {#bug-fix-26} - Avoid `std::terminate` in case of memory allocation failure. Now `std::bad_alloc` exception is thrown as expected. [#4665](https://github.com/ClickHouse/ClickHouse/pull/4665) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Fixes `capnproto` reading from buffer. Sometimes files wasn’t loaded successfully by HTTP. [#4674](https://github.com/ClickHouse/ClickHouse/pull/4674) ([Vladislav](https://github.com/smirnov-vs)) +- Fixes `capnproto` reading from buffer. Sometimes files weren't loaded successfully by HTTP. [#4674](https://github.com/ClickHouse/ClickHouse/pull/4674) ([Vladislav](https://github.com/smirnov-vs)) - Fix error `Unknown log entry type: 0` after `OPTIMIZE TABLE FINAL` query. [#4683](https://github.com/ClickHouse/ClickHouse/pull/4683) ([Amos Bird](https://github.com/amosbird)) - Wrong arguments to `hasAny` or `hasAll` functions may lead to segfault. [#4698](https://github.com/ClickHouse/ClickHouse/pull/4698) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Deadlock may happen while executing `DROP DATABASE dictionary` query. [#4701](https://github.com/ClickHouse/ClickHouse/pull/4701) ([alexey-milovidov](https://github.com/alexey-milovidov)) @@ -1550,7 +1550,7 @@ description: 'Changelog for 2019' - Fixed TSan report on shutdown due to race condition in system logs usage. Fixed potential use-after-free on shutdown when part_log is enabled. [#4758](https://github.com/ClickHouse/ClickHouse/pull/4758) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fix recheck parts in `ReplicatedMergeTreeAlterThread` in case of error. [#4772](https://github.com/ClickHouse/ClickHouse/pull/4772) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - Arithmetic operations on intermediate aggregate function states were not working for constant arguments (such as subquery results). [#4776](https://github.com/ClickHouse/ClickHouse/pull/4776) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Always backquote column names in metadata. Otherwise it’s impossible to create a table with column named `index` (server won’t restart due to malformed `ATTACH` query in metadata).
[#4782](https://github.com/ClickHouse/ClickHouse/pull/4782) ([alexey-milovidov](https://github.com/alexey-milovidov)) +- Always backquote column names in metadata. Otherwise it's impossible to create a table with column named `index` (server won't restart due to malformed `ATTACH` query in metadata). [#4782](https://github.com/ClickHouse/ClickHouse/pull/4782) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fix crash in `ALTER ... MODIFY ORDER BY` on `Distributed` table. [#4790](https://github.com/ClickHouse/ClickHouse/pull/4790) ([TCeason](https://github.com/TCeason)) - Fix segfault in `JOIN ON` with enabled `enable_optimize_predicate_expression`. [#4794](https://github.com/ClickHouse/ClickHouse/pull/4794) ([Winter Zhang](https://github.com/zhang2014)) - Fix bug with adding an extraneous row after consuming a protobuf message from Kafka. [#4808](https://github.com/ClickHouse/ClickHouse/pull/4808) ([Vitaly Baranov](https://github.com/vitlibar)) @@ -1608,7 +1608,7 @@ description: 'Changelog for 2019' #### Bug Fixes {#bug-fixes-7} - Avoid `std::terminate` in case of memory allocation failure. Now `std::bad_alloc` exception is thrown as expected. [#4665](https://github.com/ClickHouse/ClickHouse/pull/4665) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Fixes `capnproto` reading from buffer. Sometimes files wasn’t loaded successfully by HTTP. [#4674](https://github.com/ClickHouse/ClickHouse/pull/4674) ([Vladislav](https://github.com/smirnov-vs)) +- Fixes `capnproto` reading from buffer. Sometimes files wasn't loaded successfully by HTTP. [#4674](https://github.com/ClickHouse/ClickHouse/pull/4674) ([Vladislav](https://github.com/smirnov-vs)) - Fix error `Unknown log entry type: 0` after `OPTIMIZE TABLE FINAL` query. [#4683](https://github.com/ClickHouse/ClickHouse/pull/4683) ([Amos Bird](https://github.com/amosbird)) - Wrong arguments to `hasAny` or `hasAll` functions may lead to segfault. [#4698](https://github.com/ClickHouse/ClickHouse/pull/4698) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Deadlock may happen while executing `DROP DATABASE dictionary` query. [#4701](https://github.com/ClickHouse/ClickHouse/pull/4701) ([alexey-milovidov](https://github.com/alexey-milovidov)) @@ -1620,7 +1620,7 @@ description: 'Changelog for 2019' - Fixed TSan report on shutdown due to race condition in system logs usage. Fixed potential use-after-free on shutdown when part_log is enabled. [#4758](https://github.com/ClickHouse/ClickHouse/pull/4758) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fix recheck parts in `ReplicatedMergeTreeAlterThread` in case of error. [#4772](https://github.com/ClickHouse/ClickHouse/pull/4772) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - Arithmetic operations on intermediate aggregate function states were not working for constant arguments (such as subquery results). [#4776](https://github.com/ClickHouse/ClickHouse/pull/4776) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Always backquote column names in metadata. Otherwise it’s impossible to create a table with column named `index` (server won’t restart due to malformed `ATTACH` query in metadata). [#4782](https://github.com/ClickHouse/ClickHouse/pull/4782) ([alexey-milovidov](https://github.com/alexey-milovidov)) +- Always backquote column names in metadata. Otherwise it's impossible to create a table with column named `index` (server won't restart due to malformed `ATTACH` query in metadata). 
[#4782](https://github.com/ClickHouse/ClickHouse/pull/4782) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fix crash in `ALTER ... MODIFY ORDER BY` on `Distributed` table. [#4790](https://github.com/ClickHouse/ClickHouse/pull/4790) ([TCeason](https://github.com/TCeason)) - Fix segfault in `JOIN ON` with enabled `enable_optimize_predicate_expression`. [#4794](https://github.com/ClickHouse/ClickHouse/pull/4794) ([Winter Zhang](https://github.com/zhang2014)) - Fix bug with adding an extraneous row after consuming a protobuf message from Kafka. [#4808](https://github.com/ClickHouse/ClickHouse/pull/4808) ([Vitaly Baranov](https://github.com/vitlibar)) @@ -1678,7 +1678,7 @@ description: 'Changelog for 2019' - Combine rules for graphite rollup from dedicated aggregation and retention patterns. [#4426](https://github.com/ClickHouse/ClickHouse/pull/4426) ([Mikhail f. Shiryaev](https://github.com/Felixoid)) - Added `max_execution_speed` and `max_execution_speed_bytes` to limit resource usage. Added `min_execution_speed_bytes` setting to complement the `min_execution_speed`. [#4430](https://github.com/ClickHouse/ClickHouse/pull/4430) ([Winter Zhang](https://github.com/zhang2014)) - Implemented function `flatten`. [#4555](https://github.com/ClickHouse/ClickHouse/pull/4555) [#4409](https://github.com/ClickHouse/ClickHouse/pull/4409) ([alexey-milovidov](https://github.com/alexey-milovidov), [kzon](https://github.com/kzon)) -- Added functions `arrayEnumerateDenseRanked` and `arrayEnumerateUniqRanked` (it’s like `arrayEnumerateUniq` but allows to fine tune array depth to look inside multidimensional arrays). [#4475](https://github.com/ClickHouse/ClickHouse/pull/4475) ([proller](https://github.com/proller)) [#4601](https://github.com/ClickHouse/ClickHouse/pull/4601) ([alexey-milovidov](https://github.com/alexey-milovidov)) +- Added functions `arrayEnumerateDenseRanked` and `arrayEnumerateUniqRanked` (it's like `arrayEnumerateUniq` but allows to fine tune array depth to look inside multidimensional arrays). [#4475](https://github.com/ClickHouse/ClickHouse/pull/4475) ([proller](https://github.com/proller)) [#4601](https://github.com/ClickHouse/ClickHouse/pull/4601) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Multiple JOINS with some restrictions: no asterisks, no complex aliases in ON/WHERE/GROUP BY/… [#4462](https://github.com/ClickHouse/ClickHouse/pull/4462) ([Artem Zuikov](https://github.com/4ertus2)) #### Bug Fixes {#bug-fixes-11} @@ -1713,7 +1713,7 @@ description: 'Changelog for 2019' #### Performance Improvements {#performance-improvements-3} - Improved heuristics of "move to PREWHERE" optimization. [#4405](https://github.com/ClickHouse/ClickHouse/pull/4405) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Use proper lookup tables that uses HashTable’s API for 8-bit and 16-bit keys. [#4536](https://github.com/ClickHouse/ClickHouse/pull/4536) ([Amos Bird](https://github.com/amosbird)) +- Use proper lookup tables that uses HashTable's API for 8-bit and 16-bit keys. [#4536](https://github.com/ClickHouse/ClickHouse/pull/4536) ([Amos Bird](https://github.com/amosbird)) - Improved performance of string comparison. [#4564](https://github.com/ClickHouse/ClickHouse/pull/4564) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Cleanup distributed DDL queue in a separate thread so that it does not slow down the main loop that processes distributed DDL tasks. 
[#4502](https://github.com/ClickHouse/ClickHouse/pull/4502) ([Alex Zatelepin](https://github.com/ztlpn)) - When `min_bytes_to_use_direct_io` is set to 1, not every file was opened with O_DIRECT mode because the data size to read was sometimes underestimated by the size of one compressed block. [#4526](https://github.com/ClickHouse/ClickHouse/pull/4526) ([alexey-milovidov](https://github.com/alexey-milovidov)) @@ -1756,7 +1756,7 @@ description: 'Changelog for 2019' #### Bug Fixes {#bug-fixes-14} - When there are more than 1000 threads in a thread pool, `std::terminate` may happen on thread exit. [Azat Khuzhin](https://github.com/azat) [#4485](https://github.com/ClickHouse/ClickHouse/pull/4485) [#4505](https://github.com/ClickHouse/ClickHouse/pull/4505) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Now it’s possible to create `ReplicatedMergeTree*` tables with comments on columns without defaults and tables with columns codecs without comments and defaults. Also fix comparison of codecs. [#4523](https://github.com/ClickHouse/ClickHouse/pull/4523) ([alesapin](https://github.com/alesapin)) +- Now it's possible to create `ReplicatedMergeTree*` tables with comments on columns without defaults and tables with columns codecs without comments and defaults. Also fix comparison of codecs. [#4523](https://github.com/ClickHouse/ClickHouse/pull/4523) ([alesapin](https://github.com/alesapin)) - Fixed crash on JOIN with array or tuple. [#4552](https://github.com/ClickHouse/ClickHouse/pull/4552) ([Artem Zuikov](https://github.com/4ertus2)) - Fixed crash in clickhouse-copier with the message `ThreadStatus not created`. [#4540](https://github.com/ClickHouse/ClickHouse/pull/4540) ([Artem Zuikov](https://github.com/4ertus2)) - Fixed hangup on server shutdown if distributed DDLs were used. [#4472](https://github.com/ClickHouse/ClickHouse/pull/4472) ([Alex Zatelepin](https://github.com/ztlpn)) @@ -1818,7 +1818,7 @@ description: 'Changelog for 2019' - Added `Protobuf` output format. [#4005](https://github.com/ClickHouse/ClickHouse/pull/4005) [#4158](https://github.com/ClickHouse/ClickHouse/pull/4158) ([Vitaly Baranov](https://github.com/vitlibar)) - Added brotli support for HTTP interface for data import (INSERTs). [#4235](https://github.com/ClickHouse/ClickHouse/pull/4235) ([Mikhail](https://github.com/fandyushin)) - Added hints while user make typo in function name or type in command line client. [#4239](https://github.com/ClickHouse/ClickHouse/pull/4239) ([Danila Kutenin](https://github.com/danlark1)) -- Added `Query-Id` to Server’s HTTP Response header. [#4231](https://github.com/ClickHouse/ClickHouse/pull/4231) ([Mikhail](https://github.com/fandyushin)) +- Added `Query-Id` to Server's HTTP Response header. [#4231](https://github.com/ClickHouse/ClickHouse/pull/4231) ([Mikhail](https://github.com/fandyushin)) #### Experimental Features {#experimental-features-2} @@ -1902,7 +1902,7 @@ description: 'Changelog for 2019' - Added `--help/-h` option to `clickhouse-server`. [#4233](https://github.com/ClickHouse/ClickHouse/pull/4233) ([Yuriy Baranov](https://github.com/yurriy)) - Added support for scalar subqueries with aggregate function state result. [#4348](https://github.com/ClickHouse/ClickHouse/pull/4348) ([Nikolai Kochetov](https://github.com/KochetovNicolai)) - Improved server shutdown time and ALTERs waiting time. 
[#4372](https://github.com/ClickHouse/ClickHouse/pull/4372) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Added info about the replicated_can_become_leader setting to system.replicas and add logging if the replica won’t try to become leader. [#4379](https://github.com/ClickHouse/ClickHouse/pull/4379) ([Alex Zatelepin](https://github.com/ztlpn)) +- Added info about the replicated_can_become_leader setting to system.replicas and add logging if the replica won't try to become leader. [#4379](https://github.com/ClickHouse/ClickHouse/pull/4379) ([Alex Zatelepin](https://github.com/ztlpn)) ## ClickHouse Release 19.1 {#clickhouse-release-19-1} @@ -1992,7 +1992,7 @@ This release contains exactly the same set of patches as 19.3.6. - Make `compiled_expression_cache_size` setting limited by default to lower memory consumption. [#4041](https://github.com/ClickHouse/ClickHouse/pull/4041) ([alesapin](https://github.com/alesapin)) - Fix a bug that led to hangups in threads that perform ALTERs of Replicated tables and in the thread that updates configuration from ZooKeeper. [#2947](https://github.com/ClickHouse/ClickHouse/issues/2947) [#3891](https://github.com/ClickHouse/ClickHouse/issues/3891) [#3934](https://github.com/ClickHouse/ClickHouse/pull/3934) ([Alex Zatelepin](https://github.com/ztlpn)) - Fixed a race condition when executing a distributed ALTER task. The race condition led to more than one replica trying to execute the task and all replicas except one failing with a ZooKeeper error. [#3904](https://github.com/ClickHouse/ClickHouse/pull/3904) ([Alex Zatelepin](https://github.com/ztlpn)) -- Fix a bug when `from_zk` config elements weren’t refreshed after a request to ZooKeeper timed out. [#2947](https://github.com/ClickHouse/ClickHouse/issues/2947) [#3947](https://github.com/ClickHouse/ClickHouse/pull/3947) ([Alex Zatelepin](https://github.com/ztlpn)) +- Fix a bug when `from_zk` config elements weren't refreshed after a request to ZooKeeper timed out. [#2947](https://github.com/ClickHouse/ClickHouse/issues/2947) [#3947](https://github.com/ClickHouse/ClickHouse/pull/3947) ([Alex Zatelepin](https://github.com/ztlpn)) - Fix bug with wrong prefix for IPv4 subnet masks. [#3945](https://github.com/ClickHouse/ClickHouse/pull/3945) ([alesapin](https://github.com/alesapin)) - Fixed crash (`std::terminate`) in rare cases when a new thread cannot be created due to exhausted resources. [#3956](https://github.com/ClickHouse/ClickHouse/pull/3956) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fix bug when in `remote` table function execution when wrong restrictions were used for in `getStructureOfRemoteTable`. [#4009](https://github.com/ClickHouse/ClickHouse/pull/4009) ([alesapin](https://github.com/alesapin)) @@ -2004,7 +2004,7 @@ This release contains exactly the same set of patches as 19.3.6. - Fix UB in StorageMerge. [#3910](https://github.com/ClickHouse/ClickHouse/pull/3910) ([Amos Bird](https://github.com/amosbird)) - Fixed segfault in functions `addDays`, `subtractDays`. [#3913](https://github.com/ClickHouse/ClickHouse/pull/3913) ([alexey-milovidov](https://github.com/alexey-milovidov)) - Fixed error: functions `round`, `floor`, `trunc`, `ceil` may return bogus result when executed on integer argument and large negative scale. [#3914](https://github.com/ClickHouse/ClickHouse/pull/3914) ([alexey-milovidov](https://github.com/alexey-milovidov)) -- Fixed a bug induced by ‘kill query sync’ which leads to a core dump. 
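As background for the rounding fix mentioned above: with a negative scale these functions round to tens, hundreds and so on rather than to decimal places. A minimal sketch of the intended, post-fix behaviour (not part of the changelog):

```sql
SELECT
    round(1234, -2) AS r,   -- 1200: nearest multiple of 100
    floor(1234, -2) AS f,   -- 1200: round down to a multiple of 100
    ceil(1234, -2)  AS c,   -- 1300: round up to a multiple of 100
    trunc(1234, -2) AS t;   -- 1200: truncate toward zero
```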
[#3916](https://github.com/ClickHouse/ClickHouse/pull/3916) ([muVulDeePecker](https://github.com/fancyqlx)) +- Fixed a bug induced by 'kill query sync' which leads to a core dump. [#3916](https://github.com/ClickHouse/ClickHouse/pull/3916) ([muVulDeePecker](https://github.com/fancyqlx)) - Fix bug with long delay after empty replication queue. [#3928](https://github.com/ClickHouse/ClickHouse/pull/3928) [#3932](https://github.com/ClickHouse/ClickHouse/pull/3932) ([alesapin](https://github.com/alesapin)) - Fixed excessive memory usage in case of inserting into table with `LowCardinality` primary key. [#3955](https://github.com/ClickHouse/ClickHouse/pull/3955) ([KochetovNicolai](https://github.com/KochetovNicolai)) - Fixed `LowCardinality` serialization for `Native` format in case of empty arrays. [#3907](https://github.com/ClickHouse/ClickHouse/issues/3907) [#4011](https://github.com/ClickHouse/ClickHouse/pull/4011) ([KochetovNicolai](https://github.com/KochetovNicolai)) @@ -2021,7 +2021,7 @@ This release contains exactly the same set of patches as 19.3.6. - Support for `IF NOT EXISTS` in `ALTER TABLE ADD COLUMN` statements along with `IF EXISTS` in `DROP/MODIFY/CLEAR/COMMENT COLUMN`. [#3900](https://github.com/ClickHouse/ClickHouse/pull/3900) ([Boris Granveaud](https://github.com/bgranvea)) - Function `parseDateTimeBestEffort`: support for formats `DD.MM.YYYY`, `DD.MM.YY`, `DD-MM-YYYY`, `DD-Mon-YYYY`, `DD/Month/YYYY` and similar. [#3922](https://github.com/ClickHouse/ClickHouse/pull/3922) ([alexey-milovidov](https://github.com/alexey-milovidov)) - `CapnProtoInputStream` now support jagged structures. [#4063](https://github.com/ClickHouse/ClickHouse/pull/4063) ([Odin Hultgren Van Der Horst](https://github.com/Miniwoffer)) -- Usability improvement: added a check that server process is started from the data directory’s owner. Do not allow to start server from root if the data belongs to non-root user. [#3785](https://github.com/ClickHouse/ClickHouse/pull/3785) ([sergey-v-galtsev](https://github.com/sergey-v-galtsev)) +- Usability improvement: added a check that server process is started from the data directory's owner. Do not allow to start server from root if the data belongs to non-root user. [#3785](https://github.com/ClickHouse/ClickHouse/pull/3785) ([sergey-v-galtsev](https://github.com/sergey-v-galtsev)) - Better logic of checking required columns during analysis of queries with JOINs. [#3930](https://github.com/ClickHouse/ClickHouse/pull/3930) ([Artem Zuikov](https://github.com/4ertus2)) - Decreased the number of connections in case of large number of Distributed tables in a single server. [#3726](https://github.com/ClickHouse/ClickHouse/pull/3726) ([Winter Zhang](https://github.com/zhang2014)) - Supported totals row for `WITH TOTALS` query for ODBC driver. [#3836](https://github.com/ClickHouse/ClickHouse/pull/3836) ([Maksim Koritckiy](https://github.com/nightweb)) @@ -2036,7 +2036,7 @@ This release contains exactly the same set of patches as 19.3.6. - Add a MergeTree setting `use_minimalistic_part_header_in_zookeeper`. If enabled, Replicated tables will store compact part metadata in a single part znode. This can dramatically reduce ZooKeeper snapshot size (especially if the tables have a lot of columns). Note that after enabling this setting you will not be able to downgrade to a version that does not support it. 
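The idempotent `ALTER` clauses and the widened `parseDateTimeBestEffort` formats listed above look like this in practice (a sketch only; the table name `events` and its columns are hypothetical):

```sql
-- IF NOT EXISTS / IF EXISTS make column DDL idempotent: re-running the
-- statement is a no-op instead of an error.
ALTER TABLE events ADD COLUMN IF NOT EXISTS user_agent String;
ALTER TABLE events DROP COLUMN IF EXISTS legacy_flag;

-- parseDateTimeBestEffort now also accepts day-first formats such as
-- DD.MM.YYYY and DD-Mon-YYYY.
SELECT
    parseDateTimeBestEffort('24.12.2019')  AS dotted,
    parseDateTimeBestEffort('24-Dec-2019') AS abbreviated;
-- both: 2019-12-24 00:00:00 (in the server time zone)
```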
[#3960](https://github.com/ClickHouse/ClickHouse/pull/3960) ([Alex Zatelepin](https://github.com/ztlpn)) - Add an DFA-based implementation for functions `sequenceMatch` and `sequenceCount` in case pattern does not contain time. [#4004](https://github.com/ClickHouse/ClickHouse/pull/4004) ([Léo Ercolanelli](https://github.com/ercolanelli-leo)) - Performance improvement for integer numbers serialization. [#3968](https://github.com/ClickHouse/ClickHouse/pull/3968) ([Amos Bird](https://github.com/amosbird)) -- Zero left padding PODArray so that -1 element is always valid and zeroed. It’s used for branchless calculation of offsets. [#3920](https://github.com/ClickHouse/ClickHouse/pull/3920) ([Amos Bird](https://github.com/amosbird)) +- Zero left padding PODArray so that -1 element is always valid and zeroed. It's used for branchless calculation of offsets. [#3920](https://github.com/ClickHouse/ClickHouse/pull/3920) ([Amos Bird](https://github.com/amosbird)) - Reverted `jemalloc` version which lead to performance degradation. [#4018](https://github.com/ClickHouse/ClickHouse/pull/4018) ([alexey-milovidov](https://github.com/alexey-milovidov)) #### Backward Incompatible Changes {#backward-incompatible-changes-2} diff --git a/docs/whats-new/changelog/2022.md b/docs/whats-new/changelog/2022.md index fb16ec2eef5..1b9fc200cc1 100644 --- a/docs/whats-new/changelog/2022.md +++ b/docs/whats-new/changelog/2022.md @@ -216,7 +216,7 @@ Refer to this issue on GitHub for more details: https://github.com/ClickHouse/Cl * Additional bound check was added to LZ4 decompression routine to fix misbehaviour in case of malformed input. [#42868](https://github.com/ClickHouse/ClickHouse/pull/42868) ([Nikita Taranov](https://github.com/nickitat)). * Fix rare possible hang on query cancellation. [#42874](https://github.com/ClickHouse/ClickHouse/pull/42874) ([Azat Khuzhin](https://github.com/azat)). * Fix incorrect behavior with multiple disjuncts in hash join, close [#42832](https://github.com/ClickHouse/ClickHouse/issues/42832). [#42876](https://github.com/ClickHouse/ClickHouse/pull/42876) ([Vladimir C](https://github.com/vdimir)). -* A null pointer will be generated when select if as from ‘three table join’ , For example, this SQL query: [#42883](https://github.com/ClickHouse/ClickHouse/pull/42883) ([zzsmdfj](https://github.com/zzsmdfj)). +* A null pointer will be generated when select if as from 'three table join' , For example, this SQL query: [#42883](https://github.com/ClickHouse/ClickHouse/pull/42883) ([zzsmdfj](https://github.com/zzsmdfj)). * Fix memory sanitizer report in Cluster Discovery, close [#42763](https://github.com/ClickHouse/ClickHouse/issues/42763). [#42905](https://github.com/ClickHouse/ClickHouse/pull/42905) ([Vladimir C](https://github.com/vdimir)). * Improve DateTime schema inference in case of empty string. [#42911](https://github.com/ClickHouse/ClickHouse/pull/42911) ([Kruglov Pavel](https://github.com/Avogar)). * Fix rare NOT_FOUND_COLUMN_IN_BLOCK error when projection is possible to use but there is no projection available. This fixes [#42771](https://github.com/ClickHouse/ClickHouse/issues/42771) . The bug was introduced in https://github.com/ClickHouse/ClickHouse/pull/25563. [#42938](https://github.com/ClickHouse/ClickHouse/pull/42938) ([Amos Bird](https://github.com/amosbird)). 
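The hash-join fix for "multiple disjuncts" above concerns `JOIN ... ON` conditions combined with `OR`, as in this sketch (table and column names are hypothetical):

```sql
-- Each OR'ed equality is one "disjunct"; the hash join must match a row
-- if any of them holds.
SELECT l.id, r.id
FROM left_table  AS l
JOIN right_table AS r
    ON l.user_id = r.user_id OR l.session_id = r.session_id;
```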
@@ -413,7 +413,7 @@ Refer to this issue on GitHub for more details: https://github.com/ClickHouse/Cl #### Improvement {#improvement-3} * During startup and ATTACH call, `ReplicatedMergeTree` tables will be readonly until the ZooKeeper connection is made and the setup is finished. [#40148](https://github.com/ClickHouse/ClickHouse/pull/40148) ([Antonio Andelic](https://github.com/antonio2368)). -* Add `enable_extended_results_for_datetime_functions` option to return results of type Date32 for functions toStartOfYear, toStartOfISOYear, toStartOfQuarter, toStartOfMonth, toStartOfWeek, toMonday and toLastDayOfMonth when argument is Date32 or DateTime64, otherwise results of Date type are returned. For compatibility reasons default value is ‘0’. [#41214](https://github.com/ClickHouse/ClickHouse/pull/41214) ([Roman Vasin](https://github.com/rvasin)). +* Add `enable_extended_results_for_datetime_functions` option to return results of type Date32 for functions toStartOfYear, toStartOfISOYear, toStartOfQuarter, toStartOfMonth, toStartOfWeek, toMonday and toLastDayOfMonth when argument is Date32 or DateTime64, otherwise results of Date type are returned. For compatibility reasons default value is '0'. [#41214](https://github.com/ClickHouse/ClickHouse/pull/41214) ([Roman Vasin](https://github.com/rvasin)). * For security and stability reasons, CatBoost models are no longer evaluated within the ClickHouse server. Instead, the evaluation is now done in the clickhouse-library-bridge, a separate process that loads the catboost library and communicates with the server process via HTTP. Function `modelEvaluate()` was replaced by `catboostEvaluate()`. [#40897](https://github.com/ClickHouse/ClickHouse/pull/40897) ([Robert Schulze](https://github.com/rschu1ze)). [#39629](https://github.com/ClickHouse/ClickHouse/pull/39629) ([Robert Schulze](https://github.com/rschu1ze)). * Add more metrics for on-disk temporary data, close [#40206](https://github.com/ClickHouse/ClickHouse/issues/40206). [#40239](https://github.com/ClickHouse/ClickHouse/pull/40239) ([Vladimir C](https://github.com/vdimir)). * Add config option `warning_supress_regexp`, close [#40330](https://github.com/ClickHouse/ClickHouse/issues/40330). [#40548](https://github.com/ClickHouse/ClickHouse/pull/40548) ([Vladimir C](https://github.com/vdimir)). diff --git a/docs/whats-new/changelog/2024.md b/docs/whats-new/changelog/2024.md index a5cfb7edebd..d74962ebe45 100644 --- a/docs/whats-new/changelog/2024.md +++ b/docs/whats-new/changelog/2024.md @@ -1140,7 +1140,7 @@ description: 'Changelog for 2024' #### Improvement {#improvement-7} * Allow using `clickhouse-local` and its shortcuts `clickhouse` and `ch` with a query or queries file as a positional argument. Examples: `ch "SELECT 1"`, `ch --param_test Hello "SELECT {test:String}"`, `ch query.sql`. This closes [#62361](https://github.com/ClickHouse/ClickHouse/issues/62361). [#63081](https://github.com/ClickHouse/ClickHouse/pull/63081) ([Alexey Milovidov](https://github.com/alexey-milovidov)). * Enable plain_rewritable metadata for local and Azure (azure_blob_storage) object storages. [#63365](https://github.com/ClickHouse/ClickHouse/pull/63365) ([Julia Kartseva](https://github.com/jkartseva)). -* Support English-style Unicode quotes, e.g. “Hello”, ‘world’. This is questionable in general but helpful when you type your query in a word processor, such as Google Docs. This closes [#58634](https://github.com/ClickHouse/ClickHouse/issues/58634). 
[#63381](https://github.com/ClickHouse/ClickHouse/pull/63381) ([Alexey Milovidov](https://github.com/alexey-milovidov)). +* Support English-style Unicode quotes, e.g. “Hello”, 'world'. This is questionable in general but helpful when you type your query in a word processor, such as Google Docs. This closes [#58634](https://github.com/ClickHouse/ClickHouse/issues/58634). [#63381](https://github.com/ClickHouse/ClickHouse/pull/63381) ([Alexey Milovidov](https://github.com/alexey-milovidov)). * Allow trailing commas in the columns list in the INSERT query. For example, `INSERT INTO test (a, b, c, ) VALUES ...`. [#63803](https://github.com/ClickHouse/ClickHouse/pull/63803) ([Alexey Milovidov](https://github.com/alexey-milovidov)). * Better exception messages for the `Regexp` format. [#63804](https://github.com/ClickHouse/ClickHouse/pull/63804) ([Alexey Milovidov](https://github.com/alexey-milovidov)). * Allow trailing commas in the `Values` format. For example, this query is allowed: `INSERT INTO test (a, b, c) VALUES (4, 5, 6,);`. [#63810](https://github.com/ClickHouse/ClickHouse/pull/63810) ([Alexey Milovidov](https://github.com/alexey-milovidov)). diff --git a/docs/whats-new/changelog/index.md b/docs/whats-new/changelog/index.md index b90f6c92f24..e887d4731da 100644 --- a/docs/whats-new/changelog/index.md +++ b/docs/whats-new/changelog/index.md @@ -204,7 +204,7 @@ title: '2025 Changelog' * Don't fail silently if a user executing `SYSTEM DROP REPLICA` doesn't have enough permissions. [#75377](https://github.com/ClickHouse/ClickHouse/pull/75377) ([Bharat Nallan](https://github.com/bharatnc)). * Add a ProfileEvent about the number of times any of the system logs have failed to flush. [#75466](https://github.com/ClickHouse/ClickHouse/pull/75466) ([Alexey Milovidov](https://github.com/alexey-milovidov)). * Add a check and extra logging for decrypting and decompressing. [#75471](https://github.com/ClickHouse/ClickHouse/pull/75471) ([Vitaly Baranov](https://github.com/vitlibar)). -* Added support for the micro sign (U+00B5) in the `parseTimeDelta` function. Now both the micro sign (U+00B5) and the Greek letter mu (U+03BC) are recognized as valid representations for microseconds, aligning ClickHouse's behavior with Go’s implementation ([see time.go](https://github.com/golang/go/blob/ad7b46ee4ac1cee5095d64b01e8cf7fcda8bee5e/src/time/time.go#L983C19-L983C20) and [time/format.go](https://github.com/golang/go/blob/ad7b46ee4ac1cee5095d64b01e8cf7fcda8bee5e/src/time/format.go#L1608-L1609)). [#75472](https://github.com/ClickHouse/ClickHouse/pull/75472) ([Vitaly Orlov](https://github.com/orloffv)). +* Added support for the micro sign (U+00B5) in the `parseTimeDelta` function. Now both the micro sign (U+00B5) and the Greek letter mu (U+03BC) are recognized as valid representations for microseconds, aligning ClickHouse's behavior with Go's implementation ([see time.go](https://github.com/golang/go/blob/ad7b46ee4ac1cee5095d64b01e8cf7fcda8bee5e/src/time/time.go#L983C19-L983C20) and [time/format.go](https://github.com/golang/go/blob/ad7b46ee4ac1cee5095d64b01e8cf7fcda8bee5e/src/time/format.go#L1608-L1609)). [#75472](https://github.com/ClickHouse/ClickHouse/pull/75472) ([Vitaly Orlov](https://github.com/orloffv)). * Replace server setting (`send_settings_to_client`) with client setting (`apply_settings_from_server`) that controls whether client-side code (e.g., parsing INSERT data and formatting query output) should use settings from server's `users.xml` and user profile. 
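Two of the smaller items above are purely syntactic and easiest to show directly (a sketch; the table `test` mirrors the example used in the entries themselves and is assumed to exist):

```sql
-- Trailing commas are now tolerated in both the column list and VALUES.
INSERT INTO test (a, b, c,) VALUES (1, 2, 3,), (4, 5, 6,);

-- parseTimeDelta accepts the micro sign (U+00B5) as well as the Greek mu (U+03BC);
-- both expressions return 0.000001 (seconds).
SELECT parseTimeDelta('1µs') AS micro_sign, parseTimeDelta('1μs') AS greek_mu;
```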
Otherwise, only settings from the client command line, session, and query are used. Note that this only applies to native client (not e.g. HTTP), and doesn't apply to most of query processing (which happens on the server). [#75478](https://github.com/ClickHouse/ClickHouse/pull/75478) ([Michael Kolupaev](https://github.com/al13n321)). * Better error messages for syntax errors. Previously, if the query was too large, and the token whose length exceeds the limit is a very large string literal, the message about the reason was lost in the middle of two examples of this very long token. Fix the issue when a query with UTF-8 was cut incorrectly in the error message. Fix excessive quoting of query fragments. This closes [#75473](https://github.com/ClickHouse/ClickHouse/issues/75473). [#75561](https://github.com/ClickHouse/ClickHouse/pull/75561) ([Alexey Milovidov](https://github.com/alexey-milovidov)). * Add profile events in storage `S3(Azure)Queue`. [#75618](https://github.com/ClickHouse/ClickHouse/pull/75618) ([Kseniia Sumarokova](https://github.com/kssenii)). diff --git a/docs/whats-new/security-changelog.md b/docs/whats-new/security-changelog.md index 24a50f24bdb..1c79dca658c 100644 --- a/docs/whats-new/security-changelog.md +++ b/docs/whats-new/security-changelog.md @@ -98,13 +98,13 @@ Credits: Kiojj (independent researcher) ### CVE-2021-43304 {#cve-2021-43304} -Heap buffer overflow in ClickHouse's LZ4 compression codec when parsing a malicious query. There is no verification that the copy operations in the LZ4::decompressImpl loop and especially the arbitrary copy operation `wildCopy(op, ip, copy_end)`, don’t exceed the destination buffer’s limits. +Heap buffer overflow in ClickHouse's LZ4 compression codec when parsing a malicious query. There is no verification that the copy operations in the LZ4::decompressImpl loop and especially the arbitrary copy operation `wildCopy(op, ip, copy_end)`, don't exceed the destination buffer's limits. Credits: JFrog Security Research Team ### CVE-2021-43305 {#cve-2021-43305} -Heap buffer overflow in ClickHouse's LZ4 compression codec when parsing a malicious query. There is no verification that the copy operations in the LZ4::decompressImpl loop and especially the arbitrary copy operation `wildCopy(op, ip, copy_end)`, don’t exceed the destination buffer’s limits. This issue is very similar to CVE-2021-43304, but the vulnerable copy operation is in a different wildCopy call. +Heap buffer overflow in ClickHouse's LZ4 compression codec when parsing a malicious query. There is no verification that the copy operations in the LZ4::decompressImpl loop and especially the arbitrary copy operation `wildCopy(op, ip, copy_end)`, don't exceed the destination buffer's limits. This issue is very similar to CVE-2021-43304, but the vulnerable copy operation is in a different wildCopy call. Credits: JFrog Security Research Team @@ -214,5 +214,5 @@ Credits: Andrey Krasichkov and Evgeny Sidorov of Yandex Information Security Tea Incorrect configuration in deb package could lead to the unauthorized use of the database. 
-Credits: the UK’s National Cyber Security Centre (NCSC) +Credits: the UK's National Cyber Security Centre (NCSC) diff --git a/scripts/aspell-dict-file.txt b/scripts/aspell-dict-file.txt index bf6e477335b..a5a20abb45a 100644 --- a/scripts/aspell-dict-file.txt +++ b/scripts/aspell-dict-file.txt @@ -946,4 +946,14 @@ kinesis GWLBs NLBs --docs/use-cases/data_lake/glue_catalog.md-- -Databricks \ No newline at end of file +Databricks +--docs/best-practices/_snippets/_avoid_mutations.md-- +unmutated +--docs/best-practices/selecting_an_insert_strategy.md-- +FastFormats +--docs/best-practices/using_data_skipping_indices.md-- +Probabilistically +--docs/best-practices/minimize_optimize_joins.md-- +tunable +--docs/best-practices/use_materialized_views.md-- +DAGs diff --git a/sidebars.js b/sidebars.js index fbcc6024e80..82e35928cf5 100644 --- a/sidebars.js +++ b/sidebars.js @@ -61,6 +61,25 @@ const sidebars = { "guides/developer/mutations", ], }, + { + type: "category", + label: "Best Practices", + collapsed: false, + collapsible: false, + link: { type: "doc", id: "best-practices/index" }, + items: [ + "best-practices/choosing_a_primary_key", + "best-practices/select_data_type", + "best-practices/use_materialized_views", + "best-practices/minimize_optimize_joins", + "best-practices/partionning_keys", + "best-practices/selecting_an_insert_strategy", + "best-practices/using_data_skipping_indices", + "best-practices/avoid_mutations", + "best-practices/avoid_optimize_final", + "best-practices/json_type" + ], + }, { type: "category", label: "Use Case Guides", @@ -230,14 +249,8 @@ const sidebars = { className: "top-nav-item", link: { type: "doc", id: "cloud/bestpractices/index" }, items: [ - "cloud/bestpractices/bulkinserts", - "cloud/bestpractices/asyncinserts", - "cloud/bestpractices/avoidmutations", - "cloud/bestpractices/avoidnullablecolumns", - "cloud/bestpractices/avoidoptimizefinal", - "cloud/bestpractices/partitioningkey", "cloud/bestpractices/usagelimits", - "cloud/bestpractices/multitenancy" + "cloud/bestpractices/multitenancy", ], }, { @@ -1497,6 +1510,12 @@ const sidebars = { description: "An introduction to ClickHouse", href: "/intro" }, + { + type: "link", + label: "Concepts", + description: "Core concepts to know", + href: "/concepts" + }, { type: "link", label: "Starter Guides", @@ -1505,9 +1524,9 @@ const sidebars = { }, { type: "link", - label: "Concepts", - description: "Core concepts to know", - href: "/concepts" + label: "Best Practices", + description: "Follow best practices with ClickHouse", + href: "/best-practices" }, { type: "link", diff --git a/static/images/bestpractices/async_inserts.png b/static/images/bestpractices/async_inserts.png new file mode 100644 index 00000000000..0b4032adfb9 Binary files /dev/null and b/static/images/bestpractices/async_inserts.png differ diff --git a/static/images/bestpractices/building_skipping_indices.gif b/static/images/bestpractices/building_skipping_indices.gif new file mode 100644 index 00000000000..f64532d735f Binary files /dev/null and b/static/images/bestpractices/building_skipping_indices.gif differ diff --git a/static/images/bestpractices/create_primary_key.gif b/static/images/bestpractices/create_primary_key.gif new file mode 100644 index 00000000000..6258b7c093e Binary files /dev/null and b/static/images/bestpractices/create_primary_key.gif differ diff --git a/static/images/bestpractices/incremental_materialized_view.gif b/static/images/bestpractices/incremental_materialized_view.gif new file mode 100644 index 00000000000..5d77fefb6c2 
Binary files /dev/null and b/static/images/bestpractices/incremental_materialized_view.gif differ diff --git a/static/images/bestpractices/insert_process.png b/static/images/bestpractices/insert_process.png new file mode 100644 index 00000000000..b01e5c449d4 Binary files /dev/null and b/static/images/bestpractices/insert_process.png differ diff --git a/static/images/bestpractices/joins-speed-memory.png b/static/images/bestpractices/joins-speed-memory.png new file mode 100644 index 00000000000..26bd293170c Binary files /dev/null and b/static/images/bestpractices/joins-speed-memory.png differ diff --git a/static/images/bestpractices/materialized-view-diagram.png b/static/images/bestpractices/materialized-view-diagram.png new file mode 100644 index 00000000000..4836b0d8ec5 Binary files /dev/null and b/static/images/bestpractices/materialized-view-diagram.png differ diff --git a/static/images/bestpractices/merges_with_partitions.png b/static/images/bestpractices/merges_with_partitions.png new file mode 100644 index 00000000000..8b028444b18 Binary files /dev/null and b/static/images/bestpractices/merges_with_partitions.png differ diff --git a/static/images/bestpractices/partitions.png b/static/images/bestpractices/partitions.png new file mode 100644 index 00000000000..1e28431d803 Binary files /dev/null and b/static/images/bestpractices/partitions.png differ diff --git a/static/images/bestpractices/primary_key.gif b/static/images/bestpractices/primary_key.gif new file mode 100644 index 00000000000..000cbe7d6b6 Binary files /dev/null and b/static/images/bestpractices/primary_key.gif differ diff --git a/static/images/bestpractices/refreshable-materialized-view-diagram.png b/static/images/bestpractices/refreshable-materialized-view-diagram.png new file mode 100644 index 00000000000..f64760e9b94 Binary files /dev/null and b/static/images/bestpractices/refreshable-materialized-view-diagram.png differ diff --git a/static/images/bestpractices/refreshable_materialized_view.gif b/static/images/bestpractices/refreshable_materialized_view.gif new file mode 100644 index 00000000000..384f3b74076 Binary files /dev/null and b/static/images/bestpractices/refreshable_materialized_view.gif differ diff --git a/static/images/bestpractices/simple_merges.png b/static/images/bestpractices/simple_merges.png new file mode 100644 index 00000000000..6ece30bf63b Binary files /dev/null and b/static/images/bestpractices/simple_merges.png differ diff --git a/static/images/bestpractices/using_skipping_indices.gif b/static/images/bestpractices/using_skipping_indices.gif new file mode 100644 index 00000000000..17e599bc81c Binary files /dev/null and b/static/images/bestpractices/using_skipping_indices.gif differ diff --git a/vercel.json b/vercel.json index dd3b9790383..87457604d60 100644 --- a/vercel.json +++ b/vercel.json @@ -3241,6 +3241,36 @@ "destination": "/docs/cloud/reference/warehouses#what-is-compute-compute-separation", "permanent": true }, + { + "source": "/docs/cloud/bestpractices/bulk-inserts", + "destination": "/docs/best-practices/selecting-an-insert-strategy", + "permanent": true + }, + { + "source": "/docs/cloud/bestpractices/asynchronous-inserts", + "destination": "/docs/best-practices/use-materialized-views", + "permanent": true + }, + { + "source": "/docs/cloud/bestpractices/avoid-mutations", + "destination": "/docs/best-practices/avoid-mutations", + "permanent": true + }, + { + "source": "/docs/cloud/bestpractices/avoid-nullable-columns", + "destination": "/docs/best-practices/select-data-types", + 
"permanent": true + }, + { + "source": "/docs/cloud/bestpractices/avoid-optimize-final", + "destination": "/docs/best-practices/avoid-optimize-final", + "permanent": true + }, + { + "source": "/docs/cloud/bestpractices/low-cardinality-partitioning-key", + "destination": "/docs/best-practices/choosing-a-partitioning-key", + "permanent": true + }, { "source": "/docs/en/:path*", "destination": "/docs/:path*"