Define codec for LZW compression #16

Draft: wants to merge 1 commit into base: main

maxrjones
Member

maxrjones commented May 30, 2025

I'm proposing an extension for LZW compression as a first step toward enabling Zarr format 3 support for imagecodecs (xref #15).

cc @cgohlke @d-v-b

@LDeakin
Member

LDeakin commented May 31, 2025

If this is to be called lzw, it should probably support general usage of the LZW algorithm, not just the variant that imagecodecs implements.

https://github.com/cgohlke/imagecodecs/blob/8776afc59c94b2531aaecb5dd969308388c9e2d0/imagecodecs/_imagecodecs.py#L541

> This implementation of the LZW decoding algorithm is described in TIFF v6 and is not compatible with old style LZW compressed files like quad-lzw.tif.

https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

> Since codes are added in a manner determined by the data, the decoder mimics building the table as it sees the resulting codes. It is critical that the encoder and decoder agree on the variety of LZW used: the size of the alphabet, the maximum table size (and code width), whether variable-width encoding is used, initial code size, and whether to use the clear and stop codes (and what values they have). Most formats that employ LZW build this information into the format specification or provide explicit fields for them in a compression header for the data.

> Unfortunately, some early implementations of the encoding algorithm increase the code width and then emit ω at the new width instead of the old width, so that to the decoder it looks like the width changes one code too early. This is called "early change".

> Of the graphics file formats that support LZW compression, TIFF uses early change, while GIF and most others don't.
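
To make the early-change distinction concrete, here is a minimal, unoptimized sketch of a TIFF-style LZW round trip in Python. It is not the imagecodecs implementation. The names `lzw_encode`, `lzw_decode`, and the `early_change` parameter are hypothetical, and the constants (256-symbol alphabet, clear code 256, stop code 257, 9 to 12 bit codes, MSB-first packing, table reset shortly before the 12-bit code space fills) are assumptions chosen to resemble common TIFF practice.

```python
CLEAR, EOI, FIRST = 256, 257, 258   # assumed clear/stop codes and first free code
MIN_W, MAX_W = 9, 12                # assumed initial and maximum code widths
TABLE_LIMIT = (1 << MAX_W) - 2      # assumed point at which the encoder resets


def lzw_encode(data: bytes, early_change: bool = True) -> bytes:
    out = bytearray()
    acc = nbits = 0

    def emit(code: int) -> None:
        # Append `width` bits to the output, most significant bit first.
        nonlocal acc, nbits
        acc = (acc << width) | code
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1

    def grow() -> None:
        # Called after each data code: count the new entry, then reset or widen.
        nonlocal table, next_code, width
        next_code += 1
        if next_code == TABLE_LIMIT:
            emit(CLEAR)
            table = {bytes([i]): i for i in range(256)}
            next_code, width = FIRST, MIN_W
        elif next_code + early_change > (1 << width):
            width += 1  # with early_change the width grows one code sooner

    table = {bytes([i]): i for i in range(256)}
    next_code, width = FIRST, MIN_W
    emit(CLEAR)
    w = b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            emit(table[w])
            table[wc] = next_code
            grow()
            w = bytes([byte])
    if w:
        emit(table[w])
        grow()  # keeps the width in step with the decoder before the stop code
    emit(EOI)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # pad the last byte with zeros
    return bytes(out)


def lzw_decode(blob: bytes, early_change: bool = True) -> bytes:
    out = bytearray()
    acc = nbits = pos = 0
    table = [bytes([i]) for i in range(256)] + [b"", b""]  # 256/257 are reserved
    width, prev = MIN_W, None
    while True:
        while nbits < width:  # refill the bit buffer, MSB first
            if pos == len(blob):
                return bytes(out)  # stream ended without a stop code
            acc = (acc << 8) | blob[pos]
            pos += 1
            nbits += 8
        nbits -= width
        code = acc >> nbits
        acc &= (1 << nbits) - 1
        if code == EOI:
            return bytes(out)
        if code == CLEAR:
            del table[FIRST:]
            width, prev = MIN_W, None
            continue
        if code < len(table):
            entry = table[code]
        elif code == len(table) and prev is not None:
            entry = prev + prev[:1]  # code refers to the entry being built
        else:
            raise ValueError(f"invalid LZW code {code}")
        if prev is not None:
            table.append(prev + entry[:1])
            # The decoder's table lags the encoder's by one entry, hence the +1.
            if len(table) + 1 + early_change > (1 << width) and width < MAX_W:
                width += 1
        out += entry
        prev = entry
```

With matching settings, `lzw_decode(lzw_encode(data, early_change=x), early_change=x)` should return `data`. The two settings produce different byte streams once the first width change is reached, so decoding with the wrong setting is expected to yield corrupt output or an error, which is why a codec definition would need to fix these parameters or carry them in its configuration.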

@cgohlke

cgohlke commented May 31, 2025

> https://github.com/cgohlke/imagecodecs/blob/8776afc59c94b2531aaecb5dd969308388c9e2d0/imagecodecs/_imagecodecs.py#L541

FWIW, that's not the implementation used by the LZW codec. It is actually implemented in C and Cython at https://github.com/cgohlke/imagecodecs/blob/8776afc59c94b2531aaecb5dd969308388c9e2d0/imagecodecs/imcd.c#L1617-L2341 and https://github.com/cgohlke/imagecodecs/blob/8776afc59c94b2531aaecb5dd969308388c9e2d0/imagecodecs/_imcd.pyx#L1382-L1475.
The codec is meant to handle the LZW used in TIFF, but it also decodes LZW found in CZI and some Adobe formats, and it handles both LSB and MSB order as well as early and late change.
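
The bit order mentioned here only affects how codes are packed into bytes. As a rough illustration, the sketch below uses a hypothetical `pack_codes` helper with fixed-width codes (real LZW streams vary the width); TIFF is generally described as packing MSB first and GIF as LSB first.

```python
def pack_codes(codes, width, msb_first=True):
    """Pack fixed-width integer codes into bytes in the chosen bit order."""
    out = bytearray()
    acc = nbits = 0
    for code in codes:
        if msb_first:
            # New bits enter at the low end; full bytes leave from the high end.
            acc = (acc << width) | code
            nbits += width
            while nbits >= 8:
                nbits -= 8
                out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
        else:
            # New bits enter above the pending ones; full bytes leave from the low end.
            acc |= code << nbits
            nbits += width
            while nbits >= 8:
                out.append(acc & 0xFF)
                acc >>= 8
                nbits -= 8
    if nbits:
        # Flush the remaining bits, padded with zeros.
        out.append((acc << (8 - nbits)) & 0xFF if msb_first else acc & 0xFF)
    return bytes(out)
```

For the two 9-bit codes `[256, 7]`, this gives `80 01 c0` in MSB-first order and `00 0f 00` in LSB-first order, so a decoder has to know the order before it can read even the first code.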

@normanrz
Member

normanrz commented Jun 2, 2025

Thanks for adding this. I wonder if we should prefix this codec as imagecodecs.lzw similar to #14, #12 and #2?
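
As a rough illustration of what the prefixed name could look like in array metadata, the sketch below builds a Zarr format 3 style codec entry (a `name` plus a `configuration` object). The helper `prefixed_codec` is hypothetical, and the empty configuration is an assumption; this thread does not settle which parameters, if any, the extension would define.

```python
import json


def prefixed_codec(prefix, name, configuration=None):
    """Build a codec metadata entry with a vendor-prefixed name (hypothetical helper)."""
    return {"name": f"{prefix}.{name}", "configuration": configuration or {}}


# Assumed codec chain: the core `bytes` codec followed by the prefixed LZW codec.
codecs = [
    {"name": "bytes", "configuration": {"endian": "little"}},
    prefixed_codec("imagecodecs", "lzw"),
]
print(json.dumps(codecs, indent=2))
```

The prefix makes it explicit that the codec refers to the behavior of one implementation rather than to LZW in general.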

@maxrjones
Member Author

Thanks for the comments! I wasn't aware of those complexities of the LZW algorithm, and I agree that defining an imagecodecs prefix is the most expedient solution here.

@cgohlke my understanding from https://github.com/cgohlke/imagecodecs/blob/8776afc59c94b2531aaecb5dd969308388c9e2d0/imagecodecs/numcodecs.py#L1527-L1540 was that Zarr uses the pure Python version in https://github.com/cgohlke/imagecodecs/blob/8776afc59c94b2531aaecb5dd969308388c9e2d0/imagecodecs/_imagecodecs.py#L541 rather than the C and Cython version. Can you please clarify which implementation should be referenced by the Zarr extension definition?

@cgohlke

cgohlke commented Jun 2, 2025

> Can you please clarify which implementation should be referenced by the Zarr extension definition?

The imagecodecs.numcodecs.Lzw codec uses the Cython/C implementation. The imagecodecs._imagecodecs module is not public, and its pure Python LZW implementation is too slow and limited to be useful. That module is mostly used for testing and reference.

normanrz marked this pull request as draft on June 25, 2025 09:35
@normanrz
Member

I marked this PR as a draft since there still seems to be discussion around the names and content. Feel free to mark it ready for review once you're ready.

4 participants