Feature request: Align compressed output of files on squashfs to device block size #331

@meeuw

I have multiple large (8GB) squashfs images which are mostly similar and I would like to use reflinks (on XFS) to deduplicate the data.

Deduplication using reflinks is done with the FICLONERANGE ioctl, whose documentation notes that "Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size".
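To make the alignment constraint concrete, here is a minimal sketch that shrinks an arbitrary byte range to the largest sub-range whose start and length are both multiples of the block size. The function name `aligned_subrange` and the default of 4096 bytes are assumptions for illustration only; the real block size depends on the filesystem.

```python
def aligned_subrange(offset, length, block_size=4096):
    """Return (start, size) of the largest block-aligned range that lies
    inside [offset, offset + length), or None if no whole block fits."""
    # Round the start up to the next block boundary.
    start = -(-offset // block_size) * block_size
    # Round the end down to the previous block boundary.
    end = (offset + length) // block_size * block_size
    if end <= start:
        return None
    return start, end - start
```

Only the range returned here could plausibly be passed as offset and length to a clone request; the unaligned head and tail would have to stay as separate copies.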

To check this, I wrote the following Python script to calculate MD5 checksums of 4 kB blocks in two SquashFS images:

import hashlib
import sys

# Print the MD5 digest of every 4096-byte block of the file given as argv[1].
with open(sys.argv[1], 'rb') as f:
    while True:
        buf = f.read(4096)
        if not buf:
            break
        print(hashlib.md5(buf).hexdigest())

I then compared the results (using sort | uniq -c | sort) to see how many blocks were identical between the two SquashFS images.
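The same comparison can also be done in one step, without the shell pipeline. The following is a rough sketch rather than the exact procedure used above: the helper names `block_digests` and `shared_block_stats` are made up for this example, and it only counts how many blocks of the first image have a digest that also occurs somewhere in the second image.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed device block size


def block_digests(path, block_size=BLOCK_SIZE):
    """Yield the MD5 hex digest of each block_size chunk of the file."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(block_size)
            if not buf:
                break
            yield hashlib.md5(buf).hexdigest()


def shared_block_stats(path_a, path_b, block_size=BLOCK_SIZE):
    """Return (shared, total): how many blocks of image A have a digest
    that also appears in image B, and the total number of blocks in A."""
    digests_b = set(block_digests(path_b, block_size))
    shared = total = 0
    for digest in block_digests(path_a, block_size):
        total += 1
        if digest in digests_b:
            shared += 1
    return shared, total
```

Calling `shared_block_stats("a.squashfs", "b.squashfs")` (placeholder file names) returns the two counts, from which the fraction of potentially deduplicable blocks follows directly.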

Unfortunately, I found that only a few blocks were duplicates, even though the contents of the images are mostly the same (as confirmed with a tool like rdiff).

My assumption is that mksquashfs does not align the compressed output of files to the device block size. As a result, duplicate data gets shifted relative to the block boundary, making the blocks slightly - or completely - different.
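The effect can be reproduced without SquashFS at all. In this small sketch, random bytes stand in for one file's compressed data, and an "image" is simply some unrelated leading data followed by that payload; all names here are illustrative assumptions, not anything taken from mksquashfs.

```python
import hashlib
import os

BLOCK_SIZE = 4096


def block_digests(data, block_size=BLOCK_SIZE):
    """Return the set of MD5 digests of consecutive blocks of data."""
    return {
        hashlib.md5(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


# Stand-in for identical compressed file data present in both images.
payload = os.urandom(16 * BLOCK_SIZE)


def image(prefix_len):
    """Hypothetical image: prefix_len bytes of other data, then the payload."""
    return os.urandom(prefix_len) + payload


aligned_a = image(2 * BLOCK_SIZE)        # payload starts on a block boundary
aligned_b = image(5 * BLOCK_SIZE)        # different offset, still aligned
shifted = image(5 * BLOCK_SIZE + 100)    # payload shifted by 100 bytes

common_aligned = len(block_digests(aligned_a) & block_digests(aligned_b))
common_shifted = len(block_digests(aligned_a) & block_digests(shifted))
print(common_aligned, common_shifted)
```

With the payload aligned in both images, all 16 payload blocks match; once it is shifted by 100 bytes, every block straddles different content and none match, even though the payload itself is byte-for-byte identical.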

In my research, I came across a 4k-align.patch (though I haven’t found an up-to-date version). I also noticed that gensquashfs supports an [align] option in its sort file.

What I wish mksquashfs had is an option to align files to a configurable block size, with the ability to set a minimum file size threshold. Something like:

-align MIN_SIZE,BLOCK_SIZE

-align 131072,4096
-align 0,4096
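One possible reading of these semantics, sketched in Python: before writing a file's data, pad the output up to the next BLOCK_SIZE boundary, but only for files of at least MIN_SIZE bytes. This is an interpretation of the proposed option, not existing mksquashfs behaviour, and `padding_before_file` is a hypothetical helper.

```python
def padding_before_file(offset, file_size, min_size, block_size):
    """Return the number of padding bytes to write at the current output
    offset so that the next file starts on a block_size boundary.
    Files smaller than min_size are left unaligned (no padding)."""
    if file_size < min_size:
        return 0
    # Distance from offset up to the next multiple of block_size
    # (zero if offset is already aligned).
    return (-offset) % block_size
```

Under this reading, `-align 131072,4096` would pad only before files of 128 KiB or more, while `-align 0,4096` would align every file, trading some wasted space for more identical blocks.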

Would it be possible to add such an option to mksquashfs?

I’d be happy to create a pull request or patch for this - any hints or guidance would be greatly appreciated!
