forked from ClickHouse/ClickHouse
-
Notifications
You must be signed in to change notification settings - Fork 8
Open
Labels
Description
Describe the bug
The following query uses Iceberg metadata to find counts quickly:
SELECT count()FROM ice.`default.sorting`
The following query does not use Iceberg metadata to find counts, even though the group by key is the partition key. The query engine planner does not recognize that each partition contains only one value of created_at.
select created_at, count() from sorting group by created_at order by created_at
Counts on groups are therefore exceedingly slow, often hundreds of times slower than MergeTree.
To Reproduce
- Create a test table in MergeTree with relatively large parts. Dump the data to S3.
DROP TABLE IF EXISTS sorting;
CREATE TABLE sorting (
`id` UInt32,
`created_at` Date,
`timestamp` UInt32,
`c1` UInt32,
`c2` UInt32,
`c3` UInt32
)
ENGINE = MergeTree
PARTITION BY (created_at)
ORDER BY (id, c1, c2, c3)
;
TRUNCATE sorting
;
INSERT INTO sorting SELECT
(number % 50) AS id,
toDate(timestamp) AS created_at,
toUInt32(toDateTime('2025-01-01') + Interval (number / 50) SECOND) AS timestamp,
(number % 50) AS c1,
(number % 1000) AS c2,
(number % 250) AS c3
FROM system.numbers
LIMIT 100000000;
SELECT * FROM sorting ORDER BY timestamp,id LIMIT 100
INSERT INTO FUNCTION s3('s3://your-bucket/default/sorting/data/partition={_partition_id}.parquet', 'Parquet')
PARTITION BY (created_at)
SELECT * FROM default.sorting
ORDER BY id, c1, c2, c3
- Load the data into Iceberg using ice. Example command. Note the use of the --partition to set the partition spec.
ice insert -p default.sorting --no-copy \
's3://your-bucket/default/sorting/data/*.parquet' \
--partition='[{"column":"created_at", "transform": "day"}]'
- Run the following queries to count rows.
SELECT count()FROM ice.`default.sorting`;
SELECT created_at, count() FROM sorting GROUP BY created_at ORDER BY created_at
Expected behavior
The last two queries should complete almost immediately using Iceberg table metadata. Instead, the second query scans all rows in the table and is much slower.
Key information
Provide relevant runtime details.
- Project Antalya Build Version: 25.3.3.20186
- Cloud provider: AWS
- Kubernetes provider, EKS
- Object storage, e.g., AWS S3
- Iceberg catalog, e.g., Ice REST Catalog
Additional context
Add any other context about the problem here.