ClickHouse scans all Iceberg table rows for counts grouped by a partition key instead of using metadata

**Describe the bug**
The following query uses Iceberg metadata to find counts quickly:
```
SELECT count()FROM ice.`default.sorting`
```
The following query does not use Iceberg metadata to find counts, even though the group by key is the partition key. The query engine planner does not recognize that each partition contains only one value of created_at. 
```
select created_at, count() from sorting group by created_at order by created_at
```
Counts on groups are therefore exceedingly slow, often hundreds of times slower than MergeTree. 

**To Reproduce**
1. Create a test table in MergeTree with relatively large parts. Dump the data to S3. 
```
DROP TABLE IF EXISTS sorting;

CREATE TABLE sorting (
	`id` UInt32, 
    `created_at` Date,
    `timestamp` UInt32,
    `c1` UInt32,
    `c2` UInt32,
    `c3` UInt32
)
ENGINE = MergeTree
PARTITION BY (created_at)
ORDER BY (id, c1, c2, c3)
;

TRUNCATE sorting
;

INSERT INTO sorting SELECT
  (number % 50) AS id,
  toDate(timestamp) AS created_at,
  toUInt32(toDateTime('2025-01-01') + Interval (number / 50) SECOND) AS timestamp,
  (number % 50) AS c1,
  (number % 1000) AS c2,
  (number % 250) AS c3
FROM system.numbers
LIMIT 100000000;

SELECT * FROM sorting ORDER BY timestamp,id LIMIT 100

INSERT INTO FUNCTION s3('s3://your-bucket/default/sorting/data/partition={_partition_id}.parquet', 'Parquet')
PARTITION BY (created_at)
SELECT * FROM default.sorting
ORDER BY id, c1, c2, c3
```

2. Load the data into Iceberg using ice. Example command. Note the use of the --partition to set the partition spec. 
```
ice insert -p default.sorting --no-copy \
's3://your-bucket/default/sorting/data/*.parquet' \
--partition='[{"column":"created_at", "transform": "day"}]' 
```

3. Run the following queries to count rows. 

```
SELECT count()FROM ice.`default.sorting`;
SELECT created_at, count() FROM sorting GROUP BY created_at ORDER BY created_at
```
**Expected behavior**
The last two queries should complete almost immediately using Iceberg table metadata. Instead, the second query scans all rows in the table and is much slower. 

**Key information**
Provide relevant runtime details. 
 - Project Antalya Build Version: 25.3.3.20186
 - Cloud provider: AWS
 - Kubernetes provider, EKS
 - Object storage, e.g., AWS S3
 - Iceberg catalog, e.g., Ice REST Catalog

**Additional context**
Add any other context about the problem here.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ClickHouse scans all Iceberg table rows for counts grouped by a partition key instead of using metadata #1000

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ClickHouse scans all Iceberg table rows for counts grouped by a partition key instead of using metadata #1000

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions