Skip to content

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Oct 19, 2025

Description

Fixes #26983

Release notes

## Iceberg
* Collect NDV stats on all columns when replacing tables. ({issue}`26983`)

Summary by Sourcery

Enable collection of distinct value statistics on all columns when replacing Iceberg tables by extending the statistics collection API with a replace flag and updating the Iceberg connector to use it.

New Features:

  • Introduce a replace flag in the statistics collection SPI and propagate it through core, planner, tracing, and connector layers to distinguish replacement writes

Enhancements:

  • Deprecate the old getStatisticsCollectionMetadataForWrite method and add a new overload accepting the replace parameter

Tests:

  • Add a TestIcebergStatistics test to verify SHOW STATS FOR after CREATE OR REPLACE TABLE includes NDV stats for all columns

@cla-bot cla-bot bot added the cla-signed label Oct 19, 2025
Copy link

sourcery-ai bot commented Oct 19, 2025

Reviewer's Guide

Propagate a replace flag in statistics collection metadata APIs to enable collecting NDV stats on all columns during table replace operations, updating core SPI, metadata propagation, planners, tracing and wrapper layers, Iceberg connector implementation, and adding a test to validate SHOW STATS after REPLACE TABLE.

Sequence diagram for statistics collection during table replace operation

sequenceDiagram
    participant Planner
    participant MetadataManager
    participant ConnectorMetadata
    participant IcebergMetadata
    Planner->>MetadataManager: getStatisticsCollectionMetadataForWrite(..., replace=true)
    MetadataManager->>ConnectorMetadata: getStatisticsCollectionMetadataForWrite(..., replace=true)
    ConnectorMetadata->>IcebergMetadata: getStatisticsCollectionMetadataForWrite(..., replace=true)
    IcebergMetadata-->>ConnectorMetadata: TableStatisticsMetadata (NDV stats on all columns)
    ConnectorMetadata-->>MetadataManager: TableStatisticsMetadata
    MetadataManager-->>Planner: TableStatisticsMetadata
Loading

Class diagram for updated statistics collection metadata APIs

classDiagram
    class Metadata {
        +getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class MetadataManager {
        +getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class TracingMetadata {
        +getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class ConnectorMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata): TableStatisticsMetadata (deprecated)
    }
    class TracingConnectorMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class ClassLoaderSafeConnectorMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class IcebergMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    class LakehouseMetadata {
        +getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
    }
    MetadataManager --> Metadata
    TracingMetadata --> MetadataManager
    TracingConnectorMetadata --> ConnectorMetadata
    ClassLoaderSafeConnectorMetadata --> ConnectorMetadata
    IcebergMetadata --> ConnectorMetadata
    LakehouseMetadata --> ConnectorMetadata
Loading

File-Level Changes

Change Details Files
Introduce a boolean 'replace' parameter in getStatisticsCollectionMetadataForWrite SPI.
  • Add new default method overload in ConnectorMetadata with (session, tableMetadata, replace)
  • Deprecate the old single-arg getStatisticsCollectionMetadataForWrite method
core/trino-spi/src/main/java/io/trino/spi/connector/ConnectorMetadata.java
Propagate the 'replace' flag through core metadata API and planner.
  • Update MetadataManager and Metadata interface to include replace parameter
  • Modify LogicalPlanner to pass create.isReplace() or false when requesting stats metadata
core/trino-main/src/main/java/io/trino/metadata/MetadataManager.java
core/trino-main/src/main/java/io/trino/sql/planner/LogicalPlanner.java
core/trino-main/src/main/java/io/trino/metadata/Metadata.java
Update tracing and classloader-safe metadata wrappers to forward the replace flag.
  • Add new overrides in TracingConnectorMetadata and TracingMetadata to accept replace
  • Update ClassLoaderSafeConnectorMetadata to delegate replace parameter
  • Extend LakehouseMetadata and AbstractMockMetadata to support replace flag
core/trino-main/src/main/java/io/trino/tracing/TracingConnectorMetadata.java
core/trino-main/src/main/java/io/trino/tracing/TracingMetadata.java
lib/trino-plugin-toolkit/src/main/java/io/trino/plugin/base/classloader/ClassLoaderSafeConnectorMetadata.java
plugin/trino-lakehouse/src/main/java/io/trino/plugin/lakehouse/LakehouseMetadata.java
core/trino-main/src/test/java/io/trino/metadata/AbstractMockMetadata.java
Enhance Iceberg connector to collect NDV stats on all columns when replacing tables.
  • Add branch in IcebergMetadata.getStatisticsCollectionMetadataForWrite to handle replace=true
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
Add integration test for SHOW STATS after CREATE OR REPLACE TABLE in Iceberg.
  • Introduce testShowStatsReplaceTable in TestIcebergStatistics verifying stats after replace
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergStatistics.java

Assessment against linked issues

Issue Objective Addressed Explanation
#26983 Ensure that extended table statistics, including NDV (number of distinct values), are collected and populated correctly when using CREATE OR REPLACE TABLE (CTAS) to replace an Iceberg table.

Possibly linked issues


Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@github-actions github-actions bot added iceberg Iceberg connector lakehouse labels Oct 19, 2025
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `core/trino-main/src/main/java/io/trino/tracing/TracingMetadata.java:633` </location>
<code_context>
     }

     @Override
-    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata)
+    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace)
     {
         CatalogMetadata catalogMetadata = getCatalogMetadataForWrite(session, catalogHandle);
</code_context>

<issue_to_address>
**suggestion:** Tracing wrapper passes 'replace' but does not log or trace its value.

If 'replace' impacts statistics collection, add its value to the tracing span to improve observability.
</issue_to_address>

### Comment 2
<location> `lib/trino-plugin-toolkit/src/main/java/io/trino/plugin/base/classloader/ClassLoaderSafeConnectorMetadata.java:158` </location>
<code_context>
     }

     @Override
-    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata)
+    public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace)
     {
         Span span = startSpan("getStatisticsCollectionMetadataForWrite", tableMetadata.getTable());
</code_context>

<issue_to_address>
**issue:** ClassLoaderSafeConnectorMetadata now passes 'replace', but does not handle backward compatibility.

If the delegate lacks the updated method, runtime errors may occur. Please add a fallback to the deprecated method to maintain compatibility.
</issue_to_address>

### Comment 3
<location> `plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergStatistics.java:976-974` </location>
<code_context>
+    @Test
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding tests for error conditions and edge cases in table replacement.

Please add tests for cases like replacing tables with no columns, columns of varying types, duplicate column names, replacement failures, and when extended statistics are disabled.

Suggested implementation:

```java
    @Test
    public void testShowStatsReplaceTable()
    {
        try (TestTable table = newTrinoTable("show_stats_after_replace_table_", "AS SELECT 1 a, 2 b")) {
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        ('a', null, 1e0, 0e0, NULL, '1', '1'),
                        ('b', null, 1e0, 0e0, NULL, '2', '2'),
                        (NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
                    """);
        }
    }

    @Test
    public void testReplaceTableWithNoColumns()
    {
        try (TestTable table = newTrinoTable("replace_table_no_columns", "(a INTEGER)")) {
            assertUpdate("REPLACE TABLE " + table.getName() + " AS SELECT"); // No columns
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        (NULL, NULL, NULL, NULL, 0e0, NULL, NULL)
                    """);
        }
    }

    @Test
    public void testReplaceTableWithVaryingColumnTypes()
    {
        try (TestTable table = newTrinoTable("replace_table_varying_types", "(a INTEGER, b VARCHAR, c DOUBLE)")) {
            assertUpdate("REPLACE TABLE " + table.getName() + " AS SELECT 1 a, 'x' b, 3.14 c");
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        ('a', null, 1e0, 0e0, NULL, '1', '1'),
                        ('b', null, 1e0, 0e0, NULL, 'x', 'x'),
                        ('c', null, 1e0, 0e0, NULL, '3.14', '3.14'),
                        (NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
                    """);
        }
    }

    @Test
    public void testReplaceTableWithDuplicateColumnNames()
    {
        try (TestTable table = newTrinoTable("replace_table_duplicate_columns", "(a INTEGER, a INTEGER)")) {
            // This should fail due to duplicate column names
            assertQueryFails("REPLACE TABLE " + table.getName() + " AS SELECT 1 a, 2 a", "Duplicate column name: a");
        }
    }

    @Test
    public void testReplaceTableFailure()
    {
        try (TestTable table = newTrinoTable("replace_table_failure", "(a INTEGER)")) {
            // Invalid SQL
            assertQueryFails("REPLACE TABLE " + table.getName() + " AS SELECT invalid_column", "Column 'invalid_column' cannot be resolved");
        }
    }

    @Test
    public void testReplaceTableWithExtendedStatisticsDisabled()
    {
        // Assuming there is a session property or config to disable extended statistics
        Session session = Session.builder(getSession())
                .setCatalogSessionProperty("iceberg", "collect_extended_statistics_enabled", "false")
                .build();
        try (TestTable table = newTrinoTable("replace_table_ext_stats_disabled", "(a INTEGER)")) {
            getQueryRunner().execute(session, "REPLACE TABLE " + table.getName() + " AS SELECT 1 a");
            assertThat(query("SHOW STATS FOR " + table.getName()))
                    .skippingTypesCheck()
                    .matches("""
                        VALUES
                        ('a', null, 1e0, 0e0, NULL, '1', '1'),
                        (NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
                    """);
        }
    }

```

- If your test infrastructure does not support `assertQueryFails`, you may need to use the appropriate method for asserting query failures.
- Adjust the session property name for disabling extended statistics if it differs in your codebase.
- Ensure that `newTrinoTable` and `getQueryRunner()` are available and correctly configured for these tests.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@ebyhr ebyhr force-pushed the ebi/iceberg-stats-replace-table branch from d5ddb2c to 29536fe Compare October 19, 2025 06:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

[Iceberg] Extended table statistics are not collected when using CTAS to replace table

1 participant