Implement a more general index type #11

carloshorn · 2023-12-18T15:47:27Z

carloshorn
Dec 18, 2023

While looking for other seek optimisations, I found https://github.com/pauldmccarthy/indexed_gzip and https://github.com/forrestfwilliams/zran for Deflate-like compression, and https://github.com/mxmlnkn/indexed_bzip2 for bzip2 compression.
What these indexes have in common is that they map uncompressed file locations to compressed file locations (which might fall in the middle of a byte).
So a general index would map the uncompressed position in bytes to the compressed position in bits.

The current SOZip index has the advantage of a constant chunk length. It therefore only needs to store the compressed locations, and finding the chunk index is a simple integer division. A general index needs one extra integer field per entry, but that should still be manageable, especially if we store the index in a compressed file inside the zip archive. Furthermore, I would not expect the algorithm that finds the index location (e.g. bisection) to add much overhead.
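To illustrate the difference between the two lookups, here is a minimal sketch. The names `fixed_chunk_lookup` and `GeneralIndex` and the entry layout are hypothetical, chosen only for this example; they are not taken from SOZip or from any of the libraries linked above.

```python
from bisect import bisect_right


def fixed_chunk_lookup(offset, chunk_size):
    """Constant chunk length: the chunk index is a plain integer division."""
    return offset // chunk_size


class GeneralIndex:
    """Hypothetical general index with variable spacing between seek points.

    Each entry is (uncompressed_byte_offset, compressed_bit_offset),
    and the entries are sorted by uncompressed offset.
    """

    def __init__(self, entries):
        self.uncompressed = [u for u, _ in entries]
        self.compressed_bits = [c for _, c in entries]

    def lookup(self, offset):
        """Return the last seek point at or before the uncompressed offset."""
        i = bisect_right(self.uncompressed, offset) - 1
        if i < 0:
            raise ValueError("offset lies before the first seek point")
        # Split the bit position into a byte offset and a bit within that byte.
        byte_pos, bit_pos = divmod(self.compressed_bits[i], 8)
        return self.uncompressed[i], byte_pos, bit_pos


index = GeneralIndex([(0, 0), (1000, 3203), (2500, 9001)])
seek_point = index.lookup(1700)
```

The bisection costs O(log n) comparisons per seek instead of one division, which is unlikely to matter next to the cost of decompressing from the seek point onwards.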

Optionally, the index could also contain the checksum value at the given index location and other data to initialise the de-compressor at the index location.

What do you think?

rouault · 2023-12-18T15:58:18Z

rouault
Dec 18, 2023
Maintainer

At this point, I would prefer not to change the SOZip format in a non-backward-compatible way. That would require a SOZip v2, and very good reasons.


piyushrpt · 2023-12-18T16:26:47Z

piyushrpt
Dec 18, 2023

If one is allowed to modify the original compressed archives (gzip / bzip2) by adding index files to them, one may as well recompress the whole data using SOZip. The primary use of general indices like the examples you mention is that they reside outside the archives and are used when modifying the original archives is not an option. Another reason for not modifying the archives themselves is that these indices are useful on their own when the same source data is hosted in multiple places.


carloshorn · 2023-12-22T21:13:13Z

carloshorn
Dec 22, 2023
Author

I see the point of having an external index, as intended by the libraries I mentioned. I also see the advantage of the SOZip index, which allows parallel compression of the chunks.
My thought was more about other compression algorithms. Deflate is the default zip option, but zip does allow other compression algorithms as well, which might better fit the data or use case.
However, I understand @rouault's point about backward compatibility, because such a breaking change would add complexity for a potential feature with an unknown number of users.


rouault · 2023-12-23T09:52:42Z

rouault
Dec 23, 2023
Maintainer

> Deflate is the default zip option, but zip does allow other compression algorithms as well, which might better fit with the data or use case.

SOZip is really about embedding the index into the .zip and thus making it a no-brainer for users. The aim is that users don't even have to know what SOZip is when they use SOZip-enabled software.
The SOZip concept could indeed be applied to codecs other than Deflate (and that could obviously be done in a backward-compatible way). Deflate was chosen because all ZIP implementations implement it, whereas other codecs tend to be much less widely supported, which could make the compatibility story of SOZip harder to comprehend. Another point to consider is that compression efficiency decreases a bit, because the compression state is flushed regularly to produce independently decompressible chunks. So I bet that codecs other than Deflate would suffer from that as well, potentially making their gain over Deflate thinner.
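The flushing cost can be observed with Python's `zlib` module. The following is only a rough sketch of the idea, not the exact procedure of a SOZip writer: it issues a `Z_FULL_FLUSH` after every chunk of a raw Deflate stream and records where each chunk starts. The helper names `compress_chunked` and `read_chunk` are made up for this example.

```python
import zlib


def compress_chunked(data, chunk_size, level=6):
    """Raw Deflate with a full flush after every chunk.

    Returns the compressed stream and the compressed byte offset at which
    each chunk starts. A full flush byte-aligns the output and resets the
    compression state, so each chunk can be decompressed on its own.
    """
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # -15: raw Deflate
    out = bytearray()
    offsets = []
    for start in range(0, len(data), chunk_size):
        offsets.append(len(out))
        out += comp.compress(data[start:start + chunk_size])
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets


def read_chunk(stream, offsets, chunk_index, chunk_size):
    """Decompress a single chunk without touching the preceding ones."""
    decomp = zlib.decompressobj(-15)
    return decomp.decompress(stream[offsets[chunk_index]:], chunk_size)


data = b"some repetitive payload " * 4000
stream, offsets = compress_chunked(data, 4096)
third_chunk = read_chunk(stream, offsets, 3, 4096)
```

Comparing `len(stream)` with the size of the same data compressed in one go (same level, no intermediate flushes) shows the overhead. How large it is depends on the data and on the chunk size.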

