A tool for querying the Common Crawl CDX Index. Versions in both Python and Rust are included in this repository. The command-line syntax is identical in both versions.
- Clone this repository
- To run the Rust version, compile and run via:
$ cargo build --release
$ cd target/release
$ chmod +x scdx
$ scdx --sleep 2 --domain commoncrawl.org --crawls CC-MAIN-2021-04 CC-MAIN-2024-10
$ scdx -s 10 -d '*.wikipedia.org' -c CC-MAIN-2023-50
$ scdx -l -d apple.com
The program will display a progress bar and write a timestamped file (e.g. 2024-02-27_18-34-50_output.jsonl) to the working directory, unless the -o or --output option is used.
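Since the output file has a .jsonl extension, it can presumably be read back one JSON object per line. The following is a minimal sketch under that assumption; `load_captures` is a hypothetical helper and not part of this repository, and it makes no assumptions about which fields each record contains.

```python
import json
from pathlib import Path


def load_captures(path):
    """Read a JSON Lines file and return its records as a list.

    Hypothetical helper: assumes one JSON value per non-empty line.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records
```

Calling `load_captures("2024-02-27_18-34-50_output.jsonl")` would then return the saved captures for further filtering or counting.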
The default sleep time is 2 seconds. Please be polite! Polling multiple times a second will make the index server sad. See the CCF system status here.
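The idea behind the sleep option can be sketched as a fixed pause between consecutive requests. This is an illustration only, not the code used by either version of the tool; `polite_map` and its `fetch` argument are hypothetical names.

```python
import time


def polite_map(fetch, items, sleep_seconds=2.0):
    """Call fetch(item) for each item, pausing between consecutive calls.

    Hypothetical helper: no pause is made before the first call,
    so a single request is not delayed.
    """
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(sleep_seconds)
        yield fetch(item)
```

Because it is a generator, results become available as each request finishes, which also makes it easy to wrap in a progress bar.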
If no crawls are specified, all crawls will be queried. Use the -l or --latest flag to only query the latest crawl.
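The selection rule described above can be sketched as follows. The function name `select_crawls` is hypothetical, the list of available crawl IDs is assumed to be ordered newest first, and letting the latest flag take precedence over an explicit list is also an assumption, since the text does not say how the two options interact.

```python
def select_crawls(available, requested=None, latest=False):
    """Decide which crawl IDs to query.

    Hypothetical helper. `available` is assumed to be ordered newest first.
    """
    if latest:
        # Assumption: the latest flag wins over an explicit crawl list.
        return list(available[:1])
    if requested:
        unknown = [crawl for crawl in requested if crawl not in available]
        if unknown:
            raise ValueError(f"unknown crawl IDs: {unknown}")
        return list(requested)
    # No crawls specified: query every available crawl.
    return list(available)
```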
The API used supports two methods of wildcarding, like the (more advanced and mature) cdx-toolkit by Greg Lindahl.
- Prefixed asterisk: The query *.example.com, which in CDX jargon sets matchType='domain', will return captures for blog.example.com, support.example.com, etc.
- Appended asterisk: The query example.com/* will return captures for any page on example.com.
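The two wildcard forms can be told apart with a simple string check. The sketch below is illustrative only: `match_type` is a hypothetical function, and the names `prefix` and `exact` come from the pywb CDX server API rather than from this README, which only mentions `domain`.

```python
def match_type(query):
    """Return the CDX matchType implied by a query string.

    Hypothetical helper: 'domain' for a leading '*.', 'prefix' for a
    trailing '/*', and 'exact' when no wildcard is present.
    """
    if query.startswith("*."):
        return "domain"
    if query.endswith("/*"):
        return "prefix"
    return "exact"
```

Under these assumptions, `match_type("*.example.com")` gives `"domain"` and `match_type("example.com/*")` gives `"prefix"`.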
The Python version uses tqdm to display a progress bar, and the Rust version uses indicatif.