
Conversation

@markbt commented Oct 19, 2023

When reading from a decompressor that was constructed using Decoder::with_buffer, the decoder may not consume all of the input by the time it returns all of the decompressed output. This means when you call .finish() on the decoder to get the underlying stream back, it is not pointing after the end of the compressed data.

This is expected from the way that the decoder makes use of ZSTD_decompressStream. From the zstd documentation for ZSTD_decompressStream:

> But if output.pos == output.size, there might be some data left within internal buffers.
> In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer.

At the point where the decoder stream has yielded all of the decompressed data, we have not made this additional call. It is only needed if the caller wants the underlying stream back, so at that point we can force an extra call to ZSTD_decompressStream by reading into a zero-length buffer.

This PR adds a test that demonstrates the issue, then adds the flush to prevent it.
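
For illustration, here is a minimal sketch of the situation (not the PR's actual test; the payload and trailing marker bytes are made up):

use std::io::{Cursor, Read};

fn main() -> std::io::Result<()> {
    // Compress a small payload and append unrelated trailing bytes.
    let payload = b"hello world";
    let compressed = zstd::encode_all(&payload[..], 0)?;
    let frame_len = compressed.len() as u64;
    let mut stream = compressed;
    stream.extend_from_slice(b"TRAILING DATA");

    // Decompress through a BufRead-based decoder and read back exactly the payload.
    let mut decoder = zstd::stream::read::Decoder::with_buffer(Cursor::new(stream))?;
    let mut out = vec![0u8; payload.len()];
    decoder.read_exact(&mut out)?;

    // Take the inner reader back. Without the flush described above, the cursor
    // can still sit a few bytes short of the end of the compressed frame, so the
    // trailing data would not start at the current position.
    let cursor = decoder.finish();
    assert_eq!(cursor.position(), frame_len);
    Ok(())
}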

@markbt force-pushed the decompress_underread branch from 299cd5b to f2b80a6 on October 19, 2023 08:45
@gyscos (Owner) commented Oct 21, 2023

Hi, and thanks for the PR!

The comment from the zstd lib refers to internal buffers that need to be flushed: when no more input is required, but more output is available.
The situation here is the reverse: all the output has been received already, but some of the input (the frame footer) has not been consumed yet.

I'm not against calling zstd again to make sure we fed it the entire frame, but I have a few concerns with the approach here:

  • into_inner should be a thin function that just returns the inner object in its current state. It's mostly aimed at advanced users who know what they're doing, for example for testing or toying with the inner data. I would like to minimize any extra logic we do there. Decoder::finish() might be a better place for it.
  • If the frame was actually finished already, trying to read again might trigger a read from the underlying stream (which could be a TCP socket), potentially blocking until more data comes in - which could be forever. Any method that makes sure we consumed all input should also make sure we don't trigger an extra read here.
  • Unless you read the decoded stream to the end, it's essentially a partial read, and there is indeed no guarantee regarding how much data was consumed. It looks like "consume the rest of the frame" could be an explicit, separate action users could call. One way to do that today could be to make the Decoder single-frame (to make sure it won't try to start a new frame if the current one is already complete), then call read_to_end() or something similar (see the sketch after this list).
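
A rough sketch of that last suggestion (a sketch only, assuming the read-side Decoder's single_frame() builder; the function name and setup are made up):

use std::io::{Cursor, Read};

// `data` holds one zstd frame followed by unrelated bytes.
fn consume_one_frame(data: Vec<u8>) -> std::io::Result<Cursor<Vec<u8>>> {
    let mut decoder = zstd::stream::read::Decoder::with_buffer(Cursor::new(data))?
        .single_frame();            // don't try to start a new frame afterwards
    let mut out = Vec::new();
    decoder.read_to_end(&mut out)?; // drain the current frame's decompressed output
    Ok(decoder.finish())            // hand back the inner reader
}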

What do you think?

@markbt force-pushed the decompress_underread branch from f2b80a6 to ea02a36 on October 24, 2023 19:44
@markbt (Author) commented Oct 24, 2023

Moving it to Decoder::finish seems reasonable. I have updated the PR with that change.

I do think this comment in the ZSTD documentation is relevant here. Specifically, in the test I added, while the compressed data is 21 bytes long, at the point where all data is yielded (i.e. output.pos == output.size), we still have data in our input buffer (input.pos = 17, input.size = 21). The documentation isn't exactly clear, but I think "some data left within internal buffers" also means that some of the data from the input stream hasn't been consumed. The key part of the additional read is that it uses a zero-length buffer, which means we call into ZSTD_decompressStream with output.pos == output.size, as the documentation requests. Aside from consuming the input for data it has already yielded, zstd will not try to read any more data, as it has nowhere to decompress it to.
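
Roughly, at the zstd-safe level the flush amounts to something like this (a sketch only, not the PR's actual code; flush_frame_footer and its arguments are hypothetical placeholders for the decoder's context and its unconsumed input):

use zstd_safe::{DCtx, InBuffer, OutBuffer};

// A zero-capacity output means output.pos == output.size from the start, so this
// call can only consume leftover input belonging to the current frame; it cannot
// produce any data, and therefore has no reason to require further input either.
fn flush_frame_footer(dctx: &mut DCtx<'_>, remaining: &[u8]) -> Result<usize, zstd_safe::ErrorCode> {
    let mut empty = [0u8; 0];
    let mut output = OutBuffer::around(&mut empty[..]);
    let mut input = InBuffer::around(remaining);
    dctx.decompress_stream(&mut output, &mut input)?;
    Ok(input.pos) // how far into `remaining` zstd advanced
}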

This is the only way to read exactly up to the end of a compressed object in a stream that has non-compressed data following it. If you continue to read past the end of the decompressed data, the stream doesn't end; instead the decoder attempts to decompress the non-compressed data and fails with "Unknown frame descriptor". For Decoder::finish to be useful at all, you need to know exactly where the decompressed stream ends and stop reading there.

@markbt (Author) commented Dec 28, 2023

Hi @gyscos. Do you have any thoughts on my updated PR? Hopefully I've addressed your concerns, but if not, please let me know what else you'd like to see addressed.

@konstin commented Mar 14, 2025

I think I'm hitting this too: I get an "Unknown frame descriptor" error with an archive that tar and file-roller can unpack.

Reproducer:

Cargo.toml:

[package]
name = "scratch-rust"
version = "0.1.0"
edition = "2024"

[dependencies]
flate2 = { version = "1.1.0", features = ["zlib-ng"] }
tar = "0.4.44"
zstd = "0.13.3"

main.rs:

use flate2::read::GzDecoder;
use std::fs::File;
use std::io::BufReader;
use std::path::PathBuf;
use std::{env, fs};

fn main() {
    let file = PathBuf::from(env::args().skip(1).next().unwrap());
    match file.extension().unwrap().to_str().unwrap() {
        "gz" => {
            let reader = GzDecoder::new(BufReader::new(File::open(file).unwrap()));
            fs::create_dir_all("unpacked").unwrap();
            tar::Archive::new(reader).unpack("unpacked").unwrap();
        }
        "zst" => {
            let reader = zstd::Decoder::new(File::open(file).unwrap()).unwrap();
            fs::create_dir_all("unpacked").unwrap();
            tar::Archive::new(reader).unpack("unpacked").unwrap();
        }
        unknown => panic!("Unknown file type: {}", unknown),
    }
}

Shell:

wget "https://github.com/astral-sh/python-build-standalone/releases/download/20250311/cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz"
zstd -c -d < "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz" > "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.zst"
cargo build --release

rm -rf unpacked && time target/release/scratch-rust cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz
rm -rf unpacked && time target/release/scratch-rust cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.zst

@gyscos (Owner) commented Mar 20, 2025

> I think I'm hitting this too: I get an "Unknown frame descriptor" error with an archive that tar and file-roller can unpack.

I think this is a different problem.

> zstd -c -d < "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz" > "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.zst"

This decompresses the given archive. It's a gzip archive, but zstd can still decompress it. It does not recompress it though. The -c flag just tells it to output the uncompressed data to stdout. So the resulting cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.zst file is an uncompressed tar archive, which is why trying to open it in a zstd::Decoder fails (it doesn't have the right magic byte header).
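
For reference, actually producing a zstd-compressed tarball from the .tar.gz needs a recompression step, e.g. something along these lines (relying on the same gzip support in the zstd CLI that the original command uses):

zstd -c -d < "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.gz" | zstd -c > "cpython-3.12.9+20250311-aarch64-unknown-linux-gnu-install_only.tar.zst"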

markbt added 2 commits April 11, 2025 10:42
When reading from a decompressor that was constructed using `Decoder::with_buffer`, the decoder may not consume all of the input by the time it returns all of the decompressed output.  This means when you call `.finish()` on the decoder to get the underlying stream back, it is not pointing after the end of the compressed data.

This commit adds a test that demonstrates the issue.

After the decoder stream has yielded all of the uncompressed data, it is possible for the input stream to still not be fully consumed.  This means if we extract the inner stream at this point, it will not be pointing to the end of the compressed data.

From the [zstd documentation](https://facebook.github.io/zstd/zstd_manual.html#Chapter9) for `ZSTD_decompressStream`:

> But if `output.pos == output.size`, there might be some data left within internal buffers.
> In which case, call ZSTD_decompressStream() again to flush whatever remains in the buffer.

This is only necessary if the caller wants the stream back, so at that point we can force an additional call to `ZSTD_decompressStream` by reading to a zero-length buffer.
@markbt force-pushed the decompress_underread branch from ea02a36 to 077d8f6 on April 11, 2025 09:46
@markbt (Author) commented Apr 11, 2025

I've checked the test: this is still a problem, and the fix still works. We've worked around it in our code, but I've rebased the changes onto the latest main in case you want to take another look.
