-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Collect NDV stats on all columns when replacing Iceberg tables #27014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Reviewer's GuidePropagate a replace flag in statistics collection metadata APIs to enable collecting NDV stats on all columns during table replace operations, updating core SPI, metadata propagation, planners, tracing and wrapper layers, Iceberg connector implementation, and adding a test to validate SHOW STATS after REPLACE TABLE. Sequence diagram for statistics collection during table replace operationsequenceDiagram
participant Planner
participant MetadataManager
participant ConnectorMetadata
participant IcebergMetadata
Planner->>MetadataManager: getStatisticsCollectionMetadataForWrite(..., replace=true)
MetadataManager->>ConnectorMetadata: getStatisticsCollectionMetadataForWrite(..., replace=true)
ConnectorMetadata->>IcebergMetadata: getStatisticsCollectionMetadataForWrite(..., replace=true)
IcebergMetadata-->>ConnectorMetadata: TableStatisticsMetadata (NDV stats on all columns)
ConnectorMetadata-->>MetadataManager: TableStatisticsMetadata
MetadataManager-->>Planner: TableStatisticsMetadata
Class diagram for updated statistics collection metadata APIsclassDiagram
class Metadata {
+getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
}
class MetadataManager {
+getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
}
class TracingMetadata {
+getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
}
class ConnectorMetadata {
+getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
+getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata): TableStatisticsMetadata (deprecated)
}
class TracingConnectorMetadata {
+getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
}
class ClassLoaderSafeConnectorMetadata {
+getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
}
class IcebergMetadata {
+getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
}
class LakehouseMetadata {
+getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace): TableStatisticsMetadata
}
MetadataManager --> Metadata
TracingMetadata --> MetadataManager
TracingConnectorMetadata --> ConnectorMetadata
ClassLoaderSafeConnectorMetadata --> ConnectorMetadata
IcebergMetadata --> ConnectorMetadata
LakehouseMetadata --> ConnectorMetadata
File-Level Changes
Assessment against linked issues
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `core/trino-main/src/main/java/io/trino/tracing/TracingMetadata.java:633` </location>
<code_context>
}
@Override
- public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata)
+ public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(Session session, CatalogHandle catalogHandle, ConnectorTableMetadata tableMetadata, boolean replace)
{
CatalogMetadata catalogMetadata = getCatalogMetadataForWrite(session, catalogHandle);
</code_context>
<issue_to_address>
**suggestion:** Tracing wrapper passes 'replace' but does not log or trace its value.
If 'replace' impacts statistics collection, add its value to the tracing span to improve observability.
</issue_to_address>
### Comment 2
<location> `lib/trino-plugin-toolkit/src/main/java/io/trino/plugin/base/classloader/ClassLoaderSafeConnectorMetadata.java:158` </location>
<code_context>
}
@Override
- public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata)
+ public TableStatisticsMetadata getStatisticsCollectionMetadataForWrite(ConnectorSession session, ConnectorTableMetadata tableMetadata, boolean replace)
{
Span span = startSpan("getStatisticsCollectionMetadataForWrite", tableMetadata.getTable());
</code_context>
<issue_to_address>
**issue:** ClassLoaderSafeConnectorMetadata now passes 'replace', but does not handle backward compatibility.
If the delegate lacks the updated method, runtime errors may occur. Please add a fallback to the deprecated method to maintain compatibility.
</issue_to_address>
### Comment 3
<location> `plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergStatistics.java:976-974` </location>
<code_context>
+ @Test
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding tests for error conditions and edge cases in table replacement.
Please add tests for cases like replacing tables with no columns, columns of varying types, duplicate column names, replacement failures, and when extended statistics are disabled.
Suggested implementation:
```java
@Test
public void testShowStatsReplaceTable()
{
try (TestTable table = newTrinoTable("show_stats_after_replace_table_", "AS SELECT 1 a, 2 b")) {
assertThat(query("SHOW STATS FOR " + table.getName()))
.skippingTypesCheck()
.matches("""
VALUES
('a', null, 1e0, 0e0, NULL, '1', '1'),
('b', null, 1e0, 0e0, NULL, '2', '2'),
(NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
""");
}
}
@Test
public void testReplaceTableWithNoColumns()
{
try (TestTable table = newTrinoTable("replace_table_no_columns", "(a INTEGER)")) {
assertUpdate("REPLACE TABLE " + table.getName() + " AS SELECT"); // No columns
assertThat(query("SHOW STATS FOR " + table.getName()))
.skippingTypesCheck()
.matches("""
VALUES
(NULL, NULL, NULL, NULL, 0e0, NULL, NULL)
""");
}
}
@Test
public void testReplaceTableWithVaryingColumnTypes()
{
try (TestTable table = newTrinoTable("replace_table_varying_types", "(a INTEGER, b VARCHAR, c DOUBLE)")) {
assertUpdate("REPLACE TABLE " + table.getName() + " AS SELECT 1 a, 'x' b, 3.14 c");
assertThat(query("SHOW STATS FOR " + table.getName()))
.skippingTypesCheck()
.matches("""
VALUES
('a', null, 1e0, 0e0, NULL, '1', '1'),
('b', null, 1e0, 0e0, NULL, 'x', 'x'),
('c', null, 1e0, 0e0, NULL, '3.14', '3.14'),
(NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
""");
}
}
@Test
public void testReplaceTableWithDuplicateColumnNames()
{
try (TestTable table = newTrinoTable("replace_table_duplicate_columns", "(a INTEGER, a INTEGER)")) {
// This should fail due to duplicate column names
assertQueryFails("REPLACE TABLE " + table.getName() + " AS SELECT 1 a, 2 a", "Duplicate column name: a");
}
}
@Test
public void testReplaceTableFailure()
{
try (TestTable table = newTrinoTable("replace_table_failure", "(a INTEGER)")) {
// Invalid SQL
assertQueryFails("REPLACE TABLE " + table.getName() + " AS SELECT invalid_column", "Column 'invalid_column' cannot be resolved");
}
}
@Test
public void testReplaceTableWithExtendedStatisticsDisabled()
{
// Assuming there is a session property or config to disable extended statistics
Session session = Session.builder(getSession())
.setCatalogSessionProperty("iceberg", "collect_extended_statistics_enabled", "false")
.build();
try (TestTable table = newTrinoTable("replace_table_ext_stats_disabled", "(a INTEGER)")) {
getQueryRunner().execute(session, "REPLACE TABLE " + table.getName() + " AS SELECT 1 a");
assertThat(query("SHOW STATS FOR " + table.getName()))
.skippingTypesCheck()
.matches("""
VALUES
('a', null, 1e0, 0e0, NULL, '1', '1'),
(NULL, NULL, NULL, NULL, 1e0, NULL, NULL)
""");
}
}
```
- If your test infrastructure does not support `assertQueryFails`, you may need to use the appropriate method for asserting query failures.
- Adjust the session property name for disabling extended statistics if it differs in your codebase.
- Ensure that `newTrinoTable` and `getQueryRunner()` are available and correctly configured for these tests.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
...toolkit/src/main/java/io/trino/plugin/base/classloader/ClassLoaderSafeConnectorMetadata.java
Show resolved
Hide resolved
d5ddb2c
to
29536fe
Compare
Description
Fixes #26983
Release notes
Summary by Sourcery
Enable collection of distinct value statistics on all columns when replacing Iceberg tables by extending the statistics collection API with a replace flag and updating the Iceberg connector to use it.
New Features:
replace
flag in the statistics collection SPI and propagate it through core, planner, tracing, and connector layers to distinguish replacement writesEnhancements:
replace
parameterTests: