[Xet] Basic shard creation #1633
export function compute_range_verification_hash(chunkHashes: string[]): string;
export function compute_file_hash(chunks_array: Array<{ hash: string; length: number }>): string;
@assafvayner need those two functions from the wasm :)
(also, versions of those two, or at least the last one, with .update would be nice eventually)
what do you mean by .update for those functions?
where you can feed it data progressively before calling finalize() to get the hash.
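To make the update/finalize idea concrete, here is a minimal sketch of that shape. Everything in it is an assumption for illustration: the `IncrementalHasher` interface and the `Fnv1aHasher` class are hypothetical names, and the 32-bit FNV-1a hash is only a toy stand-in, not the hash used by xet-core or the wasm module.

```typescript
// Hypothetical interface: not an existing xet-core or wasm API.
interface IncrementalHasher {
  update(data: Uint8Array): void; // feed the next piece of data
  finalize(): string; // return the hash of everything fed so far, as hex
}

// Toy stand-in hash (32-bit FNV-1a), used only to show the interface shape.
class Fnv1aHasher implements IncrementalHasher {
  private state = 0x811c9dc5; // FNV-1a 32-bit offset basis

  update(data: Uint8Array): void {
    for (let i = 0; i < data.length; i++) {
      this.state ^= data[i];
      // Multiply by the FNV prime and keep the result as an unsigned 32-bit integer.
      this.state = Math.imul(this.state, 0x01000193) >>> 0;
    }
  }

  finalize(): string {
    return ("00000000" + this.state.toString(16)).slice(-8);
  }
}

// One-shot hashing: all the data is available up front.
function hashAllAtOnce(data: Uint8Array): string {
  const hasher = new Fnv1aHasher();
  hasher.update(data);
  return hasher.finalize();
}

// Progressive hashing: pieces arrive one at a time and are never concatenated.
function hashInPieces(pieces: Uint8Array[]): string {
  const hasher = new Fnv1aHasher();
  for (const piece of pieces) {
    hasher.update(piece);
  }
  return hasher.finalize();
}
```

With a hasher of this shape, `hashInPieces` gives the same result as `hashAllAtOnce` on the concatenated data, without the caller ever holding all the input in memory at once.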
let's keep the xorb and range hash computation simple and take just an array of items, since those have a reasonable limit of roughly 1K items
the file hash I can see the value but we don't have this feature implemented in xet-core yet, and it might be a while (it's not simple). For now there's just a compute_file_hash function that takes all the chunks at once, but we may be able to update that later
Hmm I don't see the difference between range hash & file hash, they both have all the chunk hashes for a file no? (the only diff is that file hash has chunk lengths too)
> the file hash I can see the value but we don't have this feature implemented in xet-core yet, and might be a while (it's not simple)
yes no problem
the range hash is at most 1 xorb's worth of hashes (this is a bit odd to explain, that's why we need to write the whole spec).
let's say a file has the following structure:
xorb A chunks 0-1024 (out of 1024)
xorb B chunks 0-500 (out of 1024)
xorb A chunks 1-44
Then the range hashes for the verification section of the shard containing this file info will need to have:
range_hash(xorb_A.chunks_hashes.slice(0, 1025))
range_hash(xorb_B.chunks_hashes.slice(0, 501))
range_hash(xorb_A.chunks_hashes.slice(1, 45))
notice that every input to the range_hash function holds at most one xorb's worth of chunk hashes
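As a rough sketch of the same idea in code: one range hash per file segment, each computed over a slice of a single xorb's chunk hashes. The `FileSegment` shape, the `rangeHashesForFile` helper and the exclusive `chunkEnd` convention are assumptions made for this example, not names from the spec. The `rangeHash` parameter stands in for something like `compute_range_verification_hash`.

```typescript
// Hypothetical shape: one contiguous run of chunks taken from a single xorb.
interface FileSegment {
  xorbId: string;
  chunkStart: number; // inclusive
  chunkEnd: number; // exclusive, same convention as Array.prototype.slice
}

// Stand-in for a function such as compute_range_verification_hash.
type RangeHashFn = (chunkHashes: string[]) => string;

// Computes one range hash per segment, in file order.
function rangeHashesForFile(
  segments: FileSegment[],
  xorbChunkHashes: Record<string, string[]>,
  rangeHash: RangeHashFn
): string[] {
  return segments.map((segment) => {
    const hashes = xorbChunkHashes[segment.xorbId];
    if (hashes === undefined) {
      throw new Error(`unknown xorb ${segment.xorbId}`);
    }
    if (
      segment.chunkStart < 0 ||
      segment.chunkStart >= segment.chunkEnd ||
      segment.chunkEnd > hashes.length
    ) {
      throw new Error(`invalid chunk range for xorb ${segment.xorbId}`);
    }
    // The slice never spans more than one xorb, so its length is bounded
    // by the number of chunks in that xorb.
    return rangeHash(hashes.slice(segment.chunkStart, segment.chunkEnd));
  });
}
```

A file that reuses xorb A twice, as in the example above, simply has two segments with the same `xorbId` and different chunk ranges. Each one gets its own range hash.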
So there's 1 FileVerificationEntry for each FileDataSequenceEntry?
And it's like this?
FileDataSequenceEntry A
FileDataSequenceEntry B
FileDataSequenceEntry C
FileDataSequenceEntry D
FileVerificationEntry A (for FileDataSequenceEntry A)
FileVerificationEntry B
FileVerificationEntry C
FileVerificationEntry D
FileMetadataExt
?
cc @Kakulukian @assafvayner for viz, follow up to #1616
Based on https://github.com/huggingface/xet-core/blob/7e41fb0dd7cfb276222b9668d0b97a984647721e/spec/shard.md
Need to handle: