Problem
With git-based registries any one commit from the registry represented an atomic picture of what all packages were up to at that point in time. This had a number of useful properties:
- It was possible to list all available crates.
- If a publish was visible, then all earlier publishes (in the server's timeline) would also be visible.
- It was possible to list all crates that had changed between two versions of the index.
- The commit hash was a unique identifier of that state of the world.
- It would've been possible to sign the commit to show it came from a crates.io server.
None of these are possible with the current design of the sparse HTTP-based registries.
Proposed Solution
It's possible to regain the list of all available crates by adding a names-file that lists all crates alphabetically at the root of the index. For crates.io this file would be large but would probably compress well enough to be usable.
If we add to that names-file the hash of each index file, then it uniquely identifies a snapshot of the index! Of course, being mostly pseudorandom numbers, it will compress really badly. At crates.io scale it will be too big to live in one file. We could split it up into several files, say one per folder in the current index structure. In order to get an atomic picture we would now need a meta-file recording the hashes of the names-files. This works but would be a lot of overhead for smaller registries. (An RFC for this idea will require describing if/how the number of hash files is configurable.)
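To make the layering concrete, here is a minimal sketch of how a server might derive per-folder names-files and the meta-file from its index files. Everything about the encoding is an assumption for illustration: SHA-256 as the hash, a "name hash" line format, and the helper names build_snapshot and snapshot_id are all hypothetical. Only the folder layout (1/, 2/, 3/f/, se/rd/) follows the existing index structure.

```python
import hashlib
from collections import defaultdict


def sha256_hex(data: bytes) -> str:
    # Assumption: SHA-256; the proposal does not name a hash function.
    return hashlib.sha256(data).hexdigest()


def index_folder(name: str) -> str:
    """Folder an index file lives in, following the existing index layout."""
    n = name.lower()
    if len(n) == 1:
        return "1"
    if len(n) == 2:
        return "2"
    if len(n) == 3:
        return f"3/{n[0]}"
    return f"{n[:2]}/{n[2:4]}"


def build_snapshot(index_files: dict[str, bytes]):
    """Return (names_files, meta_file) for a mapping of crate name -> index file bytes.

    names_files maps a folder to its hypothetical names-file text
    (one "name hash" line per crate, sorted alphabetically).
    meta_file lists each folder with the hash of its names-file.
    """
    per_folder = defaultdict(list)
    for name in sorted(index_files):
        line = f"{name} {sha256_hex(index_files[name])}"
        per_folder[index_folder(name)].append(line)

    names_files = {
        folder: "\n".join(lines) + "\n" for folder, lines in per_folder.items()
    }
    meta_file = "".join(
        f"{folder} {sha256_hex(names_files[folder].encode())}\n"
        for folder in sorted(names_files)
    )
    return names_files, meta_file


def snapshot_id(meta_file: str) -> str:
    # A single hash over the meta-file plays the role the commit hash
    # played for git registries.
    return sha256_hex(meta_file.encode())
```

In this sketch, publishing a new version of one crate changes only that crate's names-file and the meta-file, so a client comparing two meta-files can tell which folders to re-read in order to list what changed.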
What happens if the index file is updated after the names-file cargo was looking at? cargo will get an index file whose hash does not match the atomic picture requested. So there needs to be some kind of cache buster that makes sure the received version of the file matches the requested version. We could put this in the name: 3/f/foo can become 3/f/foo-abcdef123456. However, this would require crates.io to have the same content at both of those locations (assuming stabilization of the current design of #2789), which is just asking for things to come out of sync. We could use a query parameter: 3/f/foo gets you the latest version no matter the hash, and 3/f/foo?hash=abcdef123456 gets you the version with the hash abcdef123456. However, this requires the backing server to be able to look up versions of a file by their hash. Crates.io is currently using S3, which does not provide this functionality; you can look up old versions of a file, but only by the versionID S3 generated. So let's double the size of our names-files by recording for each crate a cache-busting string (crates.io can use S3's versionID) and the hash, with the semantics that if you ask for that crate with the cache buster appended, you should always receive a response with that hash.
Notes
If this is such a better design than the current sparse registry proposal, why should we ever stabilize the current proposal?
Fundamentally, this design is going to be more work for servers to implement. The current sparse registry design has exactly the same contents as a git registry, and is therefore very easy for registries to upgrade to. There will be many users for whom the additional complexity is not needed, and for them the current sparse registry proposal is a better fit.
I did want to open this discussion to make sure we have a path to this more atomic design that is compatible with what we are stabilizing in sparse registries. If changes need to be made to sparse registries, now is the time.