A Python tool for downloading machine learning papers from arXiv.
- Search for machine learning papers from the last 5 years
- Download PDFs and store metadata
- Support for incremental updates (run multiple times to add more papers)
- Configurable search parameters
- Make sure you have Python 3.8+ and Poetry installed
- Clone this repository
- Install dependencies:
poetry install
Run the crawler with default settings:
poetry run python arxiv_crawler.py
- -n, --num-papers: Number of papers to download (default: 100)
- --start-date: Start date for papers (format: YYYY-MM-DD)
- --end-date: End date for papers (format: YYYY-MM-DD)
- -q, --query: Custom search query (default: machine learning categories)
- --metadata-only: Only fetch metadata, don't download PDFs
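For readers who want to see how these options fit together, here is a minimal argparse sketch that mirrors the flags and defaults listed above. It is an illustration only: the function name build_parser and the help strings are assumptions, and the actual parser in arxiv_crawler.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Download machine learning papers from arXiv."
    )
    parser.add_argument("-n", "--num-papers", type=int, default=100,
                        help="Number of papers to download")
    parser.add_argument("--start-date", default=None,
                        help="Start date for papers (YYYY-MM-DD)")
    parser.add_argument("--end-date", default=None,
                        help="End date for papers (YYYY-MM-DD)")
    parser.add_argument("-q", "--query", default=None,
                        help="Custom search query")
    parser.add_argument("--metadata-only", action="store_true",
                        help="Only fetch metadata, don't download PDFs")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-n", "50", "--metadata-only"])
```

With argparse, dashes in long option names become underscores, so the values are read as args.num_papers, args.start_date, and args.metadata_only.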
Download 50 papers:
poetry run python arxiv_crawler.py -n 50
Download papers from a specific date range:
poetry run python arxiv_crawler.py --start-date 2023-01-01 --end-date 2023-12-31
Use a custom search query:
poetry run python arxiv_crawler.py -q "cat:cs.CV AND (deep learning OR neural network)"
Only fetch metadata without downloading PDFs:
poetry run python arxiv_crawler.py --metadata-only
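As a rough illustration of how a custom query and a date range could be combined into a single arXiv API request, the sketch below builds the request URL by hand. The endpoint and the submittedDate range filter come from the public arXiv API; the helper names build_search_query and build_request_url are hypothetical, and the crawler's own code in src/api.py may build its requests differently.

```python
from datetime import date
from urllib.parse import urlencode

# Public arXiv API endpoint.
ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_search_query(query: str, start: date, end: date) -> str:
    """Restrict a query to a submission window (hypothetical helper).

    arXiv expects timestamps as YYYYMMDDHHMM in GMT, so the window runs
    from 00:00 on the start date to 23:59 on the end date.
    """
    window = f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"
    return f"({query}) AND {window}"


def build_request_url(query: str, start: date, end: date,
                      max_results: int = 100) -> str:
    """Return a URL-encoded request for the first page of results."""
    params = {
        "search_query": build_search_query(query, start, end),
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"


url = build_request_url("cat:cs.CV", date(2023, 1, 1), date(2023, 12, 31),
                        max_results=50)
```

Wrapping the user's query in parentheses keeps any OR clauses inside it from escaping the date filter.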
- PDFs are saved to result/PDF/
- Metadata is saved to result/papers.json
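Because the crawler can be run repeatedly to add more papers, the metadata file has to be extended without duplicating entries. The sketch below shows one way this could work. It assumes that papers.json holds a JSON list of objects, each with an "id" field; that schema and the helper name merge_metadata are assumptions made for illustration, not a description of src/storage.py.

```python
import json
from pathlib import Path
from typing import Dict, List


def merge_metadata(path: Path, new_records: List[Dict]) -> int:
    """Append records whose "id" is not stored yet; return how many were added.

    Assumes the file contains a JSON list of objects with an "id" key.
    """
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    seen = {record["id"] for record in existing}
    added = 0
    for record in new_records:
        if record["id"] not in seen:
            existing.append(record)
            seen.add(record["id"])
            added += 1
    # Create result/ on the first run, then rewrite the whole file.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(existing, indent=2), encoding="utf-8")
    return added
```

Calling merge_metadata(Path("result/papers.json"), records) a second time with overlapping records adds only the ones not seen before.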
- src/config.py: Configuration settings
- src/api.py: arXiv API interaction
- src/downloader.py: PDF downloading functionality
- src/storage.py: Handling storage of PDFs and metadata
- src/main.py: Main entry point
- arxiv_crawler.py: Command-line script
- view_metadata.py: Script to view the metadata
You can view the metadata using the view_metadata.py script:
poetry run python view_metadata.py
If you encounter connection issues with the arXiv API, try the following:
- Check your internet connection
- Try again later (the API might be rate-limiting your requests)
The crawler handles the following common failures:
- Connection errors
- API rate limiting
- Interrupted downloads
- File system errors
Errors are logged to the console, and the crawler will attempt to continue processing other papers when possible.
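The log-and-continue behavior described above can be pictured with a small retry loop. This is a sketch under stated assumptions, not the crawler's actual code: the function name process_all, the retry count, and the exponential backoff delays are all hypothetical choices.

```python
import logging
import time
from typing import Callable, Iterable, List

logger = logging.getLogger("arxiv_crawler_sketch")


def process_all(items: Iterable[str], handler: Callable[[str], None],
                retries: int = 3, base_delay: float = 1.0,
                sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Run handler on every item and return the items that never succeeded.

    OSError covers both connection errors (ConnectionError is a subclass)
    and file system errors. Each failure is logged, retried with
    exponential backoff, and never stops the remaining items.
    """
    failed = []
    for item in items:
        for attempt in range(1, retries + 1):
            try:
                handler(item)
                break  # success: move on to the next item
            except OSError as exc:
                logger.warning("attempt %d/%d failed for %s: %s",
                               attempt, retries, item, exc)
                if attempt < retries:
                    sleep(base_delay * 2 ** (attempt - 1))
        else:
            # The loop ended without a break, so every attempt failed.
            failed.append(item)
    return failed
```

Returning the failed items rather than raising lets a later run retry only the papers that are still missing.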