Skip to content

refine: refine the interface of SnapshotProducer #1490

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Jul 6, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR refine the interface of SnapshotProducer:

  1. include the table in SnapshotProducer rather than pass using function param
  2. add SnapshotProducer as param for ManifestProcess, SnapshotProduceOperation

I find that it would convenient when working on merge append: ManifestProcess and SnapshotProduceOperation can be consider as custom extension for SnapshotProducer and they would reuse SnapshotProducer as for common usage. So we should include it in their function param. At the same time, include Table in SnapshotProducer make thing easier they can use SnapshotProducer diretcly.

Are these changes tested?

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Jul 6, 2025

@CTTY
Copy link
Contributor

CTTY commented Jul 6, 2025

Hi @ZENOTME , thanks for working on this! I went thru the change and have some general thoughts:

ManifestProcess and SnapshotProduceOperation can be consider as custom extension for SnapshotProducer and they would reuse SnapshotProducer as for common usage. So we should include it in their function param

  • This makes sense to me

At the same time, include Table in SnapshotProducer make thing easier they can use SnapshotProducer diretcly.

  • As a helper class, SnapshotProducer should only be used when a TransactionAction commits (where you are actually generating the new snapshot). Since TransactionAction::commit already takes in a &Table, I don't think having SnapshotProducer hold its own table reference adds much value, and my concern is that allowing it to take another Table may lead to error-prone usages. I believe this is related to how MergeAppendAction will be integrated with SnapshotProducer though, do you think we can have MergeAppendAction reuse the existing design?

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Jul 7, 2025

Hi @CTTY, Thanks for your review!

Since TransactionAction::commit already takes in a &Table, I don't think having SnapshotProducer hold its own table reference adds much value

According to our current implementation, SnapshotProducer will create using &Table from TransactionAction::commit, so I think there is no case that SnapshotProducer has different table with TransactionAction::commit?

#[async_trait]
impl TransactionAction for FastAppendAction {
    async fn commit(self: Arc<Self>, table: &Table) -> Result<ActionCommit> {
        // validate added files
        SnapshotProducer::validate_added_data_files(table, &self.added_data_files)?;

        // Checks duplicate files
        if self.check_duplicate {
            SnapshotProducer::validate_duplicate_files(table, &self.added_data_files).await?;
        }

        let snapshot_producer = SnapshotProducer::new(
            table,
            self.commit_uuid.unwrap_or_else(Uuid::now_v7),
            self.key_metadata.clone(),
            self.snapshot_properties.clone(),
            self.added_data_files.clone(),
        );

        snapshot_producer
            .commit(table, FastAppendOperation, DefaultManifestProcess)
            .await
    }
}

I believe this is related to how MergeAppendAction will be integrated with SnapshotProducer though, do you think we can have MergeAppendAction reuse the existing design?

Yes, we can change the interface like

pub(crate) trait ManifestProcess: Send + Sync {
    fn process_manifests(
        &self,
        table: &Table,
        snapshot_produce: &SnapshotProducer<'_>,
        manifests: Vec<ManifestFile>,
    ) -> Vec<ManifestFile>;
}

But I think the interface can be more clear if we store Table in snapshot_produce to avoid pass the Table to lots of function around.

liurenjie1024
liurenjie1024 previously approved these changes Jul 7, 2025
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZENOTME for this pr, I'm fine with this change to make things clearer as SnapshotProducer is used as inner class.

@liurenjie1024
Copy link
Contributor

Wait for @CTTY to take another look.

@CTTY
Copy link
Contributor

CTTY commented Jul 7, 2025

so I think there is no case that SnapshotProducer has different table with TransactionAction::commit?

This is correct. In this case, can we refactor SnapshotProducer::validate_added_data_files and SnapshotProducer::validate_duplicate_files as well? If SnapshotProducer already hold a Table reference, then those functions can also be simplified.

@liurenjie1024
Copy link
Contributor

so I think there is no case that SnapshotProducer has different table with TransactionAction::commit?

This is correct. In this case, can we refactor SnapshotProducer::validate_added_data_files and SnapshotProducer::validate_duplicate_files as well? If SnapshotProducer already hold a Table reference, then those functions can also be simplified.

cc @ZENOTME Do you want to change it in this pr or do it in following pr?

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Jul 8, 2025

so I think there is no case that SnapshotProducer has different table with TransactionAction::commit?

This is correct. In this case, can we refactor SnapshotProducer::validate_added_data_files and SnapshotProducer::validate_duplicate_files as well? If SnapshotProducer already hold a Table reference, then those functions can also be simplified.

cc @ZENOTME Do you want to change it in this pr or do it in following pr?

Yes, I will change it later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants