Skip to content

[Feature]: Event-Triggered Optimization of Iceberg Tables in Amoro #3774

@Jzjsnow

Description

@Jzjsnow

Description

Currently, Amoro determines Iceberg table optimizations through periodic full-refresh evaluations at the table level. While this design ensures consistent refreshes, it introduces inefficiencies for large-scale Iceberg tables with continuously growing data volumes.

From log analysis on an Amoro system with 40,000+ tables, we observed:

  • Numerous redundant scans during refresh.
  • Low evaluation efficiency, as many scans did not lead to effective optimizations.
  • Resource wastage and system delays, limiting scalability and user experience.

To address these limitations, we propose introducing an event-triggered optimization mechanism that reduces unnecessary scans and enhances evaluation efficiency, while ensuring timely optimizations.

Use case/motivation

  • Large-scale Iceberg tables (tens of thousands) suffer from frequent full-table refresh scans, leading to high resource usage and wasted computation.
  • Users need timely optimizations without the overhead of redundant scans.
  • For real-time scenarios, latency-sensitive workloads cannot tolerate inefficient refresh mechanisms.
  • The current design struggles to balance timeliness vs. resource utilization.

By moving from periodic full-refresh scans to event-driven triggers, Amoro can:

  1. Reduce full-table scans, avoiding redundant work.
  2. Increase effective scan ratio, ensuring scans more frequently lead to actual optimizations.
  3. Improve file merge efficiency, maximizing the effect of each scan.
  4. Ensure timely optimization, even while lowering scan frequency.

Describe the solution

To address pain points in Amoro's periodic optimization refresh mechanism, we propose an ​​event-triggered optimization mechanism​​:

  • Use loaded table metadata and metrics to trigger scan evaluations, reducing redundant scans.
  • Replace periodic table loading with Iceberg commit-driven triggers.

Subtasks

Metadata Metric-Driven Evaluation​

  • [Subtask]: Add support for pluggable refresh event in TableRuntimeRefreshExecutor and DefaultRefreshEvent #3775 Add support for pluggable refresh event in TableRuntimeRefreshExecutor and DefaultRefreshEvent
  • Add support for MSE-based refresh event: calculate partition MSE from loaded metadata and filter partitions based on threshold.
  • Documentation updates.
  • (optional) Bind each computed pendingInput result with snapshotId to avoid repeated scans when in pending state.
  • (optional) Cache storage for MSE calculation results.
  • (optional) Cache storage for pendingInput queue (e.g., when >100 partitions need optimization, enable queued optimization).

Iceberg Commit-Triggered Table Runtime Refresh

  • REST endpoint for reporting external table events: define interface and JSON body format.
  • Add an ExternalEventReportService to trigger TableRuntimeRefreshExecutor for refresh & evaluation based on commit operations.
  • Catalog query service API.
  • Documentation updates.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions