A high-performance Go utility for mirroring crates.io. Unlike older scripts that took over a week, this tool can create a complete mirror in about 6 hours on a residential connection, and much faster on data-center bandwidth.
Traditional rsync or Python-based mirroring struggles with crates.io because of millions of tiny files, high TLS overhead, and sequential transfers. This tool fixes that by combining:
- Massive Concurrency
  - Defaults to ~32 × CPU cores concurrent downloads.
  - Keeps your bandwidth saturated, even on high-latency links.
- HTTP/2 Multiplexing
  - Reuses a small pool of TLS connections.
  - Eliminates the connect/teardown cost for each tiny .crate file.
- Checksum-Aware Resume
  - Skips re-downloading files that already match known checksums.
  - Makes reruns incremental instead of starting from scratch.
- On-the-Fly Bundling (Optional)
  - Streams completed files directly into rolling .tar.zst archives.
  - Reduces filesystem churn from millions of inodes → a handful of large files.
- Manifest Logging
  - Every file is logged as JSONL (manifest.jsonl) with URL, checksum, and status.
  - Safe to resume, audit, or verify integrity after the fact.
Clone and build:
git clone https://github.com/APTlantis/Mirror-Rust-Crates
cd Mirror-Rust-Crates
go build Download-Crates.go
Run a mirror using a local crates.io-index checkout:
./Download-Crates \
-index-dir "/path/to/crates.io-index" \
-out "/path/to/crates-mirror" \
-concurrency 256 \
-log-format text \
-log-level info \
-progress-interval 5s \
-progress-every 500
Test with a subset (first 10,000 crates):
./Download-Crates -index-dir /path/to/crates.io-index -limit 10000 -out out
Include yanked versions:
./Download-Crates -index-dir /path/to/crates.io-index -include-yanked -out out
Mirror from a list of URLs instead:
./Download-Crates -list urls.txt -out out -concurrency 256
Enable bundling into 8 GB .tar.zst archives:
./Download-Crates -index-dir /path/to/crates.io-index -bundle -bundle-size-gb 8 -bundles-out bundles
High-level pipeline and data flow are described in docs/architecture.md.
The repository includes a small Python convenience wrapper to clone/update the crates.io index and invoke the Go downloader with sensible defaults.
Quick usage (non-interactive, CI-friendly):
PS> python .\Clone-Index.py `
--index-dir "D:\Rust-Crates\crates.io-index" `
--output-dir "D:\Rust-Crates\The-Crates" `
--threads 256 `
--non-interactive `
--log-level info
Notes:
- The wrapper will auto-detect a local Download-Crates binary in the repo directory or on PATH. If none is found, it will fall back to go run Download-Crates.go (requires the Go toolchain).
- You can override the downloader path explicitly with --downloader-path C:\path\to\Download-Crates.exe.
- --non-interactive skips the confirmation prompt so you can run this in automation.
- The wrapper uses Python logging rather than prints; use --log-level debug|info|warning|error.
PS C:\Projects\Mirror-Crates> go run Download-Crates.go `
-index-dir "D:\Rust-Crates\crates.io-index" `
-out "D:\Rust-Crates\The-Crates"
Starting: 1577579 urls, concurrency=256, out=D:\Rust-Crates\The-Crates
500 done (ok=500, err=0)
1000 done (ok=1000, err=0)
1500 done (ok=1500, err=0)
2000 done (ok=2000, err=0)
...
- ok = successfully mirrored files
- err = failures (e.g., HTTP error, checksum mismatch)
- Progress prints every 500 files, with a final summary when complete.
Flag | Description | Default |
---|---|---|
-index-dir | Path to local crates.io-index directory | Required |
-out | Directory to store downloaded files | out |
-concurrency | Number of concurrent downloads (32 × CPU cores) | auto-computed |
-list | Path to newline-delimited list of crate URLs | none |
-include-yanked | Include yanked versions when scanning index | false |
-limit | Limit number of crates (0 = all) | 0 |
-bundle | Stream files into rolling .tar.zst archives while downloading | false |
-bundle-size-gb | Target bundle size (GB) | 8 |
-bundles-out | Output directory for .tar.zst bundles | bundles |
-manifest | JSONL manifest log | manifest.jsonl |
-checksums | Optional JSONL file of {url, sha256} entries | none |
-log-format | Logging format: text or json | text |
-log-level | Logging level: debug, info, warn, error | info |
-progress-interval | Periodic progress logging interval (e.g., 5s; 0 = disabled) | 0 |
-progress-every | Log progress every N processed items (0 = disabled) | 0 |
-retries | Total retry attempts for transient errors | 3 |
-retry-base | Base backoff (exponential with jitter) | 250ms |
-retry-max | Max backoff per attempt | 5s |
-listen | Serve Prometheus metrics and pprof at address (e.g., :9090) | none |
Older scripts (like Python with 4 threads) often took 10+ days to complete and could not resume gracefully. This Go tool:
- Runs hundreds of concurrent HTTP/2 streams.
- Resumes cleanly if interrupted.
- Bundles efficiently to reduce filesystem stress.
- Maintains a manifest for auditing and verification.
Result: Full crates.io mirroring is now practical for home labs, universities, and hobbyists.
To generate per-crate-version metadata JSON sidecars (one JSON file per version) from a local crates.io-index clone:
PS> go build Generate-Sidecars.go
PS> .\Generate-Sidecars.exe `
-index-dir "D:\Rust-Crates\crates.io-index" `
-out "D:\Rust-Crates\The-Crates" `
-concurrency 256 `
-log-format text `
-log-level info `
-progress-interval 5s `
-progress-every 500
- The sidecar files are written alongside the normal crate layout used by Download-Crates, as:
  - D:\Rust-Crates\The-Crates<shard><shard><name>-.crate.json
- Each sidecar contains the full metadata from the index line plus helpful fields:
  - crate_file, crate_url, index_path
- Use -include-yanked to include yanked versions.
- Use -limit N to test on the first N entries.
Notes:
- This reads the JSONL files under crates.io-index and writes millions of small sidecars; running on SSD/NVMe is recommended.
- You can run the sidecar generator before or after mirroring crates; it does not require .crate files to exist.
- CI matrix: Windows/Linux with Go 1.25.x.
- Linting: staticcheck, golangci-lint, and go vet run in CI.
- Tests: go test ./... at repo root; includes unit tests for downloader helpers and bundler.
- Logging: all tools expose -log-format text|json and -log-level debug|info|warn|error.
Local lint runs:
go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
staticcheck ./...
golangci-lint run ./...
Expose Prometheus metrics and pprof by running the downloader with -listen:
./Download-Crates -index-dir /path/to/crates.io-index -out out -listen :9090
- Prometheus: http://localhost:9090/metrics
- pprof: http://localhost:9090/debug/pprof/
Metrics include download attempts by status/HTTP code, bytes downloaded, durations, retries, inflight requests, and processed totals.
Generate multi-algorithm hashes for a directory, emit a YAML inventory, and create a TAR archive (which includes a legacy TOML metadata file inside).
Quick start:
PS> go run .\Archive-Hasher\Archive-Hasher.go -dir "D:\Rust-Crates\The-Crates" -progress-interval 10s -out-dir "D:\Rust-Crates\Artifacts"
See detailed docs and options in Archive-Hasher/README.md.
- Auto-tuned concurrency for adaptive bandwidth use.
- Prometheus metrics endpoint.
- Smarter resume with HTTP Range support.
- "Dry run" mode for quick mirror sizing.
Contributions welcome!
MIT License.