Search before asking
- I have searched in the issues and found no similar issues.
What would you like to be improved?
The previous data expiration implementation performed a global scan on partitioned tables, which often caused out-of-memory (OOM) errors when handling large Iceberg tables. The main reasons are:
- High memory consumption when a table contains a large number of files;
- Loading too many column stats, especially from delete files;
- Lack of filtering on the tables that actually need to be processed.
How should we improve?
We propose a manifest-based, partition-aware data expiration approach:
- Identify candidate manifest files based on their partition boundaries and expire files that do not meet retention conditions;
- Iterate through manifest files sequentially to collect partition and file-level information;
- Perform expiration in a partition-by-partition manner, which allows submitting cleanup tasks per partition.
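The three steps above can be sketched roughly as follows. This is a minimal, library-free illustration rather than the actual implementation: the `Manifest` and `DataFile` classes, the function names, and the use of ISO date strings as partition values (so that string comparison matches date order) are all assumptions made for the example. In a real table the manifest entries would be read lazily from the manifest files instead of being held in memory.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a manifest entry."""
    path: str
    partition: str  # assumed ISO date string, e.g. "2024-01-01"


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest file with partition bounds."""
    path: str
    lower_bound: str  # smallest partition value in this manifest
    upper_bound: str  # largest partition value in this manifest
    files: Tuple[DataFile, ...]


def candidate_manifests(manifests: Iterable[Manifest], cutoff: str) -> Iterator[Manifest]:
    """Step 1: keep only manifests that can contain partitions older than the cutoff."""
    for manifest in manifests:
        if manifest.lower_bound < cutoff:
            yield manifest


def expired_files_by_partition(
    manifests: Iterable[Manifest], cutoff: str
) -> Dict[str, List[str]]:
    """Step 2: walk candidate manifests one at a time and group expired files by partition."""
    expired: Dict[str, List[str]] = defaultdict(list)
    for manifest in candidate_manifests(manifests, cutoff):
        for data_file in manifest.files:
            if data_file.partition < cutoff:
                expired[data_file.partition].append(data_file.path)
    return dict(expired)


def cleanup_tasks(
    manifests: Iterable[Manifest], cutoff: str
) -> Iterator[Tuple[str, List[str]]]:
    """Step 3: emit one cleanup task per expired partition, oldest first."""
    by_partition = expired_files_by_partition(manifests, cutoff)
    for partition in sorted(by_partition):
        yield partition, by_partition[partition]
```

In this sketch, a manifest whose lower bound is not older than the cutoff is never opened, so its entries and column stats are never loaded. Each yielded `(partition, files)` pair could then be submitted as a separate cleanup task.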
Are you willing to submit a PR?
- Yes, I am willing to submit a PR!
Subtasks
No response
Code of Conduct
- I agree to follow this project's Code of Conduct