Mirror-Rust-Crates

A high-performance Go utility for mirroring crates.io. Older scripts took over a week; this tool can build a complete mirror in about 6 hours on a residential connection, and much faster on data-center bandwidth.



🚀 Why This Is So Fast

Traditional rsync or Python-based mirroring struggles with crates.io because of millions of tiny files, high TLS overhead, and sequential transfers. This tool fixes that by combining:

  1. Massive Concurrency

    • Defaults to ~32 × CPU cores concurrent downloads.
    • Keeps your bandwidth saturated, even on high-latency links.
  2. HTTP/2 Multiplexing

    • Reuses a small pool of TLS connections.
    • Eliminates the connect/teardown cost for each tiny .crate file.
  3. Checksum-Aware Resume

    • Skips re-downloading files that already match known checksums.
    • Makes reruns incremental instead of starting from scratch.
  4. On-the-Fly Bundling (Optional)

    • Streams completed files directly into rolling .tar.zst archives.
    • Reduces filesystem churn from millions of inodes → a handful of large files.
  5. Manifest Logging

    • Every file is logged as JSONL (manifest.jsonl) with URL, checksum, and status.
    • Safe to resume, audit, or verify integrity after the fact.

🔧 Quick Start

Clone and build:

git clone https://github.com/APTlantis/Mirror-Rust-Crates
cd Mirror-Rust-Crates
go build Download-Crates.go

Run a mirror using a local crates.io-index checkout:

./Download-Crates \
  -index-dir "/path/to/crates.io-index" \
  -out "/path/to/crates-mirror" \
  -concurrency 256 \
  -log-format text \
  -log-level info \
  -progress-interval 5s \
  -progress-every 500

Test with a subset (first 10,000 crates):

./Download-Crates -index-dir /path/to/crates.io-index -limit 10000 -out out

Include yanked versions:

./Download-Crates -index-dir /path/to/crates.io-index -include-yanked -out out

Mirror from a list of URLs instead:

./Download-Crates -list urls.txt -out out -concurrency 256

Enable bundling into 8 GB .tar.zst archives:

./Download-Crates -index-dir /path/to/crates.io-index -bundle -bundle-size-gb 8 -bundles-out bundles

🏗️ Architecture

High-level pipeline and data flow are described in docs/architecture.md.


🧭 Clone-Index wrapper (Python)

The repository includes a small Python convenience wrapper to clone/update the crates.io index and invoke the Go downloader with sensible defaults.

Quick usage (non-interactive, CI-friendly):

PS> python .\Clone-Index.py `
  --index-dir "D:\Rust-Crates\crates.io-index" `
  --output-dir "D:\Rust-Crates\The-Crates" `
  --threads 256 `
  --non-interactive `
  --log-level info

Notes:

  • The wrapper will auto-detect a local Download-Crates binary in the repo directory or on PATH. If none is found, it will fall back to go run Download-Crates.go (requires the Go toolchain).
  • You can override the downloader path explicitly with --downloader-path C:\path\to\Download-Crates.exe.
  • --non-interactive skips the confirmation prompt so you can run this in automation.
  • The wrapper uses Python logging rather than prints; use --log-level debug|info|warning|error.

📊 Example Run

PS C:\Projects\Mirror-Crates> go run Download-Crates.go `
  -index-dir "D:\Rust-Crates\crates.io-index" `
  -out "D:\Rust-Crates\The-Crates"

Starting: 1577579 urls, concurrency=256, out=D:\Rust-Crates\The-Crates
500 done (ok=500, err=0)
1000 done (ok=1000, err=0)
1500 done (ok=1500, err=0)
2000 done (ok=2000, err=0)
...
  • ok = successfully mirrored files
  • err = failures (e.g., HTTP error, checksum mismatch)
  • Progress prints every 500 files, with a final summary when complete.

⚙️ Flags

| Flag | Description | Default |
| --- | --- | --- |
| -index-dir | Path to local crates.io-index directory | Required |
| -out | Directory to store downloaded files | out |
| -concurrency | Number of concurrent downloads | auto-computed (32 × CPU cores) |
| -list | Path to newline-delimited list of crate URLs | none |
| -include-yanked | Include yanked versions when scanning index | false |
| -limit | Limit number of crates (0 = all) | 0 |
| -bundle | Stream files into rolling .tar.zst archives while downloading | false |
| -bundle-size-gb | Target bundle size (GB) | 8 |
| -bundles-out | Output directory for .tar.zst bundles | bundles |
| -manifest | JSONL manifest log | manifest.jsonl |
| -checksums | Optional JSONL file of {url, sha256} entries | none |
| -log-format | Logging format: text or json | text |
| -log-level | Logging level: debug, info, warn, error | info |
| -progress-interval | Periodic progress logging interval (e.g., 5s; 0 = disabled) | 0 |
| -progress-every | Log progress every N processed items (0 = disabled) | 0 |
| -retries | Total retry attempts for transient errors | 3 |
| -retry-base | Base backoff (exponential with jitter) | 250ms |
| -retry-max | Max backoff per attempt | 5s |
| -listen | Serve Prometheus metrics and pprof at address (e.g., :9090) | none |

📦 Comparison

Older scripts (like Python with 4 threads) often took 10+ days to complete and could not resume gracefully. This Go tool:

  • Runs hundreds of concurrent HTTP/2 streams.
  • Resumes cleanly if interrupted.
  • Bundles efficiently to reduce filesystem stress.
  • Maintains a manifest for auditing and verification.

Result: Full crates.io mirroring is now practical for home labs, universities, and hobbyists.


📝 Sidecar Metadata Generator

To generate per-crate-version metadata JSON sidecars (one JSON file per version) from a local crates.io-index clone:

PS> go build Generate-Sidecars.go
PS> .\Generate-Sidecars.exe `
  -index-dir "D:\Rust-Crates\crates.io-index" `
  -out "D:\Rust-Crates\The-Crates" `
  -concurrency 256 `
  -log-format text `
  -log-level info `
  -progress-interval 5s `
  -progress-every 500

  • The sidecar files are written alongside the normal crate layout used by Download-Crates, as:
    • D:\Rust-Crates\The-Crates\<shard>\<shard>\<name>-.crate.json
  • Each sidecar contains the full metadata from the index line plus helpful fields:
    • crate_file, crate_url, index_path
  • Use -include-yanked to include yanked versions.
  • Use -limit N to test on the first N entries.

Notes:

  • This reads the JSONL files under crates.io-index and writes millions of small sidecars; running on SSD/NVMe is recommended.
  • You can run the sidecar generator before or after mirroring crates; it does not require .crate files to exist.

🧰 Development & CI

  • CI matrix: Windows/Linux with Go 1.25.x.
  • Linting: staticcheck, golangci-lint, and go vet run in CI.
  • Tests: go test ./... at repo root; includes unit tests for downloader helpers and bundler.
  • Logging: all tools expose -log-format text|json and -log-level debug|info|warn|error.

Local lint runs:

go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
staticcheck ./...
golangci-lint run ./...

📈 Metrics & Profiling

Expose Prometheus metrics and pprof by running the downloader with -listen:

./Download-Crates -index-dir /path/to/crates.io-index -out out -listen :9090
  • Prometheus: http://localhost:9090/metrics
  • pprof: http://localhost:9090/debug/pprof/

Metrics include download attempts by status/HTTP code, bytes downloaded, durations, retries, inflight requests, and processed totals.


📦 Archive-Hasher (Directory hasher and tar packer)

Generate multi-algorithm hashes for a directory, emit a YAML inventory, and create a TAR archive (which includes a legacy TOML metadata file inside).

Quick start:

PS> go run .\Archive-Hasher\Archive-Hasher.go -dir "D:\Rust-Crates\The-Crates" -progress-interval 10s -out-dir "D:\Rust-Crates\Artifacts"

See detailed docs and options in Archive-Hasher/README.md.


🔮 Roadmap

  • Auto-tuned concurrency for adaptive bandwidth use.
  • Prometheus metrics endpoint.
  • Smarter resume with HTTP Range support.
  • "Dry run" mode for quick mirror sizing.

Contributions welcome!


📜 License

MIT License.
