Implement a more general index type #11
Replies: 4 comments
-
At this point, I would rather not change the SOZip format in a backward-incompatible way. That would require a SOZip v2, and very good reasons to justify it.
-
If one is allowed to modify the original compressed archives (gzip / bzip2) by adding index files to them, one may as well recompress the whole data with sozip. The primary use of general indices like the examples you mention is that they reside outside the archives and are used when modifying the original archives is not an option. Another reason for not modifying the archives themselves is that these indices are useful on their own when the same source data is hosted at multiple sources.
-
I see the point of having an external index, as intended by the mentioned libraries. I also see the advantage of the sozip index, which allows parallel compression of the chunks.
-
SOZip is really about embedding the index into the .zip and thus making it a no-brainer for users. The aim is that users don't even have to know what SOZip is when they use SOZip-enabled software.
-
While looking for other seek optimisations, I found https://github.com/pauldmccarthy/indexed_gzip and https://github.com/forrestfwilliams/zran for deflate-like compression and https://github.com/mxmlnkn/indexed_bzip2 for bzip2 compression.
What these indexes have in common is that they map uncompressed file locations to compressed file locations (which may not fall on a byte boundary).
So the general index would map the uncompressed position in bytes to the compressed position in bits.
The current sozip index has the advantage of a constant chunk length. It therefore only needs to store the compressed locations, and finding the chunk index is a simple integer division. However, the extra integer field per entry should still be manageable, especially if we store the index in a compressed file inside the zip archive. Furthermore, I would not expect the algorithm for finding the index location (e.g. bisection) to add much overhead.
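To make the comparison concrete, here is a minimal Python sketch of the two lookups. It is only an illustration, not the SOZip index layout or the API of any of the libraries linked above: the function names, the list-based index representation and the sample offsets are all made up for this example.

```python
from bisect import bisect_right


def fixed_chunk_lookup(compressed_offsets, chunk_size, uncompressed_pos):
    """Constant chunk length: the chunk number is an integer division.

    compressed_offsets[i] is the (hypothetical) compressed start of chunk i.
    Returns (compressed offset, bytes to skip after decompressing from there).
    """
    chunk = uncompressed_pos // chunk_size
    return compressed_offsets[chunk], uncompressed_pos - chunk * chunk_size


def general_lookup(uncompressed_starts, compressed_bit_offsets, uncompressed_pos):
    """Variable chunk length: bisect over the sorted uncompressed start offsets.

    uncompressed_starts[i] is in bytes, compressed_bit_offsets[i] is in bits
    (both hypothetical fields of a more general index entry).
    """
    if uncompressed_pos < 0:
        raise ValueError("position must be non-negative")
    # Index of the last entry whose uncompressed start is <= the position.
    i = bisect_right(uncompressed_starts, uncompressed_pos) - 1
    if i < 0:
        raise ValueError("position lies before the first index entry")
    return compressed_bit_offsets[i], uncompressed_pos - uncompressed_starts[i]


# Made-up sample data: three chunks in each index.
fixed = fixed_chunk_lookup([0, 900, 2100], 32768, 40000)
general = general_lookup([0, 30000, 70000], [0, 7203, 16811], 40000)
```

The general variant stores two integers per entry instead of one and needs O(log n) comparisons per seek instead of a single division.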
Optionally, the index could also contain the checksum value at the given index location and other data needed to initialise the decompressor at that location.
What do you think?