Ensure buffers are flushed to avoid decompression stream underread #251

base: main
Conversation
Force-pushed 299cd5b to f2b80a6
Hi, and thanks for the PR! The comment from the zstd lib refers to internal buffers that need to be flushed: when no more input is required, but more output is available. I'm not against calling zstd again to make sure we fed it the entire frame, but I have a few concerns with the approach here:
What do you think?
Force-pushed f2b80a6 to ea02a36
Moving it to I do think this comment in the ZSTD documentation is relevant here. Specifically, in the test I added, while the compressed data is 21 bytes long, at the point where all data is yielded (i.e.

This is the only way to successfully read up to the end of a compressed object in a stream with non-compressed data following it. If you continue to read past the end of the decompressed data, the stream doesn't end, but instead it attempts to decompress the non-compressed data, and panics (
Hi @gyscos. Do you have any thoughts on my updated PR? Hopefully I've addressed your concerns, but if not, please let me know what else you'd like to see addressed.
I think I'm hitting this too: I get an "Unknown frame descriptor" with an archive that tar and file-roller can unpack. Reproducer:

```toml
[package]
name = "scratch-rust"
version = "0.1.0"
edition = "2024"

[dependencies]
flate2 = { version = "1.1.0", features = ["zlib-ng"] }
tar = "0.4.44"
zstd = "0.13.3"
```

```rust
use flate2::read::GzDecoder;
use std::fs::File;
use std::io::BufReader;
use std::path::PathBuf;
use std::{env, fs};

fn main() {
    let file = PathBuf::from(env::args().skip(1).next().unwrap());
    match file.extension().unwrap().to_str().unwrap() {
        "gz" => {
            let reader = GzDecoder::new(BufReader::new(File::open(file).unwrap()));
            fs::create_dir_all("unpacked").unwrap();
            tar::Archive::new(reader).unpack("unpacked").unwrap();
        }
        "zst" => {
            let reader = zstd::Decoder::new(File::open(file).unwrap()).unwrap();
            fs::create_dir_all("unpacked").unwrap();
            tar::Archive::new(reader).unpack("unpacked").unwrap();
        }
        unknown => panic!("Unknown file type: {}", unknown),
    }
}
```

```sh
wget "https://github.com/astral-sh/python-build-standalone/releases/download/20250311/cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz"
zstd -c -d < "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz" > "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.zst"
cargo build --release
rm -rf unpacked && time target/release/scratch-rust cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz
rm -rf unpacked && time target/release/scratch-rust cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.zst
```
I think this is a different problem.

The `zstd -c -d` command in the reproducer decompresses the given archive. It's a gzip archive, but zstd can still decompress it. It does not recompress it though.
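To make the mixup concrete, here is a small sketch (assuming a `zstd` CLI built with gzip support, as most distribution builds are; `sample.gz` and the other filenames are made up for illustration) comparing what the reproducer's command produces against an actual recompression:

```sh
printf 'hello' | gzip -c > sample.gz

# `zstd -c -d` only *decompresses*: fed gzip data, it emits the raw
# payload, so the reproducer's ".tar.zst" is really an uncompressed tar.
zstd -q -c -d < sample.gz > sample.out
cat sample.out              # plain data, not a zstd frame

# To produce a real .zst file, recompress the decompressed stream:
zstd -q -c -d < sample.gz | zstd -q -c > sample.zst
zstd -q -c -d < sample.zst  # round-trips back to the payload
```

The renamed file in the reproducer is therefore a plain tar, which would explain an "Unknown frame descriptor" from the `zstd` crate rather than the underread this PR addresses.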
When reading from a decompressor that was constructed using `Decoder::with_buffer`, the decoder may not consume all of the input by the time it returns all of the decompressed output. This means when you call `.finish()` on the decoder to get the underlying stream back, it is not pointing after the end of the compressed data. This commit adds a test that demonstrates the issue.
After the decoder stream has yielded all of the uncompressed data, it is possible for the input stream to still not be fully consumed. This means if we extract the inner stream at this point, it will not be pointing to the end of the compressed data.

From the [zstd documentation](https://facebook.github.io/zstd/zstd_manual.html#Chapter9) for `ZSTD_decompressStream`:

> But if `output.pos == output.size`, there might be some data left within internal buffers. In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer.

This is only necessary if the caller wants the stream back, so at that point we can force an additional call to `ZSTD_decompressStream` by reading to a zero-length buffer.
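To illustrate the mechanism, here is a minimal std-only sketch. It is not the real zstd bindings: `ToyDecoder` and its `!` frame terminator are invented stand-ins for a streaming decoder and the end-of-frame data that `ZSTD_decompressStream` only consumes on a final flushing call. It shows how a decoder can deliver all of its output while leaving the inner stream short of the frame's end, and how one extra zero-length read corrects the position seen by `finish()`:

```rust
use std::io::{Cursor, Read};

// Toy streaming decoder: a "frame" is payload bytes followed by a '!'
// terminator that is only consumed by an extra decode call.
struct ToyDecoder<R: Read> {
    inner: R,
    pending: Option<u8>,
    done: bool,
}

impl<R: Read> ToyDecoder<R> {
    fn new(inner: R) -> Self {
        ToyDecoder { inner, pending: None, done: false }
    }

    /// Decode into `out`, stopping as soon as `out` is full. A call with
    /// an empty `out` still consumes input, mirroring the zero-length
    /// read that lets the real decoder swallow the frame epilogue.
    fn read_some(&mut self, out: &mut [u8]) -> usize {
        let mut n = 0;
        loop {
            if self.done {
                return n;
            }
            let b = match self.pending.take() {
                Some(b) => b,
                None => {
                    let mut byte = [0u8; 1];
                    if self.inner.read(&mut byte).unwrap() == 0 {
                        self.done = true;
                        return n;
                    }
                    byte[0]
                }
            };
            if b == b'!' {
                self.done = true; // terminator consumed from the inner stream
                return n;
            }
            if n < out.len() {
                out[n] = b;
                n += 1;
            } else {
                self.pending = Some(b); // decoded byte we cannot deliver yet
                return n;
            }
            if n == out.len() {
                return n; // out is full: stop without touching the terminator
            }
        }
    }

    fn finish(self) -> R {
        self.inner
    }
}

fn main() {
    // A 5-byte frame ("hello" + '!') followed by unrelated trailing data.
    let mut dec = ToyDecoder::new(Cursor::new(b"hello!rest".to_vec()));

    let mut out = [0u8; 5];
    assert_eq!(dec.read_some(&mut out), 5);
    assert_eq!(&out, b"hello");

    // All output has been delivered, but the inner stream still sits
    // before the terminator; finish() here would return it mid-frame.
    dec.read_some(&mut []); // the extra flushing call
    let inner = dec.finish();
    assert_eq!(inner.position(), 6); // now exactly at the end of the frame
    println!("inner stream position: {}", inner.position());
}
```

In the real decoder the terminator's role is played by the frame epilogue, which `ZSTD_decompressStream` is only guaranteed to consume on that additional call.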
Force-pushed ea02a36 to 077d8f6
I've checked the test and this is still a problem, and the fix still works. We've worked around it in our code, but I've rebased the changes onto the latest `main`.
When reading from a decompressor that was constructed using `Decoder::with_buffer`, the decoder may not consume all of the input by the time it returns all of the decompressed output. This means when you call `.finish()` on the decoder to get the underlying stream back, it is not pointing after the end of the compressed data.

This is expected from the way that the decoder makes use of `ZSTD_decompressStream`. From the zstd documentation for `ZSTD_decompressStream`:

> But if `output.pos == output.size`, there might be some data left within internal buffers. In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer.

At the point the decoder stream has yielded all of the decompressed data, we have not done this additional call. It is only necessary if the caller wants the underlying stream back, so at that point we can force an additional call to `ZSTD_decompressStream` by reading to a zero-length buffer.

This PR adds a test that demonstrates the issue, then adds the flush to prevent it.