
Check integrity early for smaller files #164


Open
JoeyBF wants to merge 2 commits into master

Conversation

JoeyBF (Collaborator) commented Aug 11, 2024

This makes the save mechanism sturdier by catching malformed files early and overwriting them. Note that quasi-inverses are the only files that we stream in without reading completely, because they are so massive; other files fit in memory easily, so it's not a problem to load them early.

Compare with #107
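
A minimal sketch of the early-check idea, assuming an illustrative magic-byte header (the constant and the helper name below are not the actual save format used by this PR):

use std::io::{Error, ErrorKind, Result};
use std::path::Path;

// Illustrative magic bytes; the real save files use a different header.
const MAGIC: &[u8; 4] = b"SAVE";

// Read a small save file fully and validate it up front. If it is malformed,
// remove it and report an error so that the caller regenerates the data.
fn read_checked(path: &Path) -> Result<Vec<u8>> {
    let contents = std::fs::read(path)?;
    if !contents.starts_with(MAGIC) {
        std::fs::remove_file(path)?;
        return Err(Error::new(ErrorKind::InvalidData, "malformed save file"));
    }
    Ok(contents)
}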

JoeyBF (Collaborator, Author) commented Oct 23, 2024

Any feedback for this PR?

Comment on lines +311 to +312
std::fs::remove_file(&path)
.unwrap_or_else(|e| panic!("Error when deleting {path:?}: {e}"));
Contributor
Perhaps you should instead move these to a corrupted-files directory? It might be useful to be able to look at them to find out what went wrong; as it stands, this just destroys the evidence.

Collaborator Author
Sure, that sounds like a good idea. Or maybe rename them with a ".old" suffix, and silently delete the previous .old version if there is one.
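
A minimal sketch of the rename idea, assuming the ".old" convention suggested above (illustrative only, not code from this PR):

use std::io::Result;
use std::path::{Path, PathBuf};

// Set a malformed save file aside as `<name>.old` instead of deleting it,
// silently dropping any previous `.old` copy first.
fn quarantine(path: &Path) -> Result<()> {
    let mut backup = path.as_os_str().to_owned();
    backup.push(".old");
    let backup = PathBuf::from(backup);
    if backup.exists() {
        std::fs::remove_file(&backup)?;
    }
    std::fs::rename(path, &backup)
}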

Comment on lines +440 to +445
fn should_check_early(&self) -> bool {
!matches!(
self.kind,
SaveKind::AugmentationQi | SaveKind::NassauQi | SaveKind::ResQi
)
}
Contributor

This makes sense as an improvement over the status quo, but there may be a better design of the file format that would allow checking only the section of the file that is actually used. For instance, you could place a checksum every 4 kB or imitate the zip file format. Another possibility is that there is an existing container format that is appropriate and that could make the checksums transparent to our code.
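
To illustrate the per-block idea (purely a sketch; DefaultHasher stands in for a real checksum such as CRC32 or xxHash, and nothing here is from this repo):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{self, Write};

// Write `data` in 4 KiB blocks, each followed by an 8-byte hash of the block,
// so a reader can verify only the blocks it actually touches.
fn write_with_block_checksums<W: Write>(mut writer: W, data: &[u8]) -> io::Result<()> {
    for block in data.chunks(4096) {
        let mut hasher = DefaultHasher::new();
        block.hash(&mut hasher);
        writer.write_all(block)?;
        writer.write_all(&hasher.finish().to_le_bytes())?;
    }
    Ok(())
}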

Collaborator Author

I looked around and didn't see a generic container that would do that for us, but one interesting option would be to always output zstd-compressed data. Zstd has options to do frame-level and block-level checksumming, and we already support reading compressed files.

Also, I just checked with our stem 200 data and, apart from the quasi-inverses, the biggest files only take a few MB, with the vast majority taking less than 1 kB. In fact, the first thing that the code does after opening those files is reading the entire contents, so we're really just prefetching.
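
A minimal sketch of the zstd option, assuming the zstd crate's include_checksum setting (the compression level and function name are illustrative):

use std::io::{self, Write};

// Write `data` as a zstd frame with an embedded content checksum. Decoding
// then fails if the file was truncated or corrupted, so integrity checking
// comes for free when we read the file back.
fn write_compressed<W: Write>(writer: W, data: &[u8]) -> io::Result<W> {
    let mut encoder = zstd::stream::write::Encoder::new(writer, 3)?;
    encoder.include_checksum(true)?;
    encoder.write_all(data)?;
    encoder.finish()
}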

hoodmane (Contributor)
Thanks!
