NeoGendaijin/arxiv-crawler

arXiv Crawler

A Python tool for downloading machine learning papers from arXiv.

Features

  • Search for machine learning papers from the last 5 years
  • Download PDFs and store metadata
  • Support for incremental updates (run multiple times to add more papers)
  • Configurable search parameters

Installation

  1. Make sure you have Python 3.8+ and Poetry installed
  2. Clone this repository
  3. Install dependencies:
poetry install

Usage

Run the crawler with default settings:

poetry run python arxiv_crawler.py

Command-line Options

  • -n, --num-papers: Number of papers to download (default: 100)
  • --start-date: Start date for papers (format: YYYY-MM-DD)
  • --end-date: End date for papers (format: YYYY-MM-DD)
  • -q, --query: Custom search query (default: machine learning categories)
  • --metadata-only: Only fetch metadata, don't download PDFs
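As a rough illustration of how the options above could be wired up, here is a minimal argparse sketch. The flag names and the default of 100 come from the list; the helper names (`parse_date`, `build_parser`), the description text, and the choice to parse dates into `date` objects are assumptions, not necessarily what arxiv_crawler.py does.

```python
import argparse
from datetime import date, datetime


def parse_date(value: str) -> date:
    """Parse a YYYY-MM-DD string; argparse reports a clean error on bad input."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}") from None


def build_parser() -> argparse.ArgumentParser:
    # Flag names and the default of 100 follow the option list above.
    # Everything else here is assumed for illustration.
    parser = argparse.ArgumentParser(description="Download papers from arXiv")
    parser.add_argument("-n", "--num-papers", type=int, default=100)
    parser.add_argument("--start-date", type=parse_date, default=None)
    parser.add_argument("--end-date", type=parse_date, default=None)
    parser.add_argument("-q", "--query", default=None)
    parser.add_argument("--metadata-only", action="store_true")
    return parser
```

Calling `build_parser().parse_args(["-n", "50", "--metadata-only"])` would yield `num_papers=50` and `metadata_only=True`, with both dates left as `None`.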

Examples

Download 50 papers:

poetry run python arxiv_crawler.py -n 50

Download papers from a specific date range:

poetry run python arxiv_crawler.py --start-date 2023-01-01 --end-date 2023-12-31
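One way a date range like this can be expressed to the arXiv API is through its `submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM]` filter (times in GMT). The sketch below shows that translation. The function name `with_date_range` is hypothetical, and whether this crawler filters on the server side like this or filters results locally is an assumption.

```python
from datetime import date


def with_date_range(query: str, start: date, end: date) -> str:
    """Restrict an arXiv search query to papers submitted between two dates.

    Assumption: the range covers the whole of both days, so the window runs
    from 00:00 on the start date to 23:59 on the end date.
    """
    window = f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"
    return f"({query}) AND {window}"
```

For `--start-date 2023-01-01 --end-date 2023-12-31` and a base query of `cat:cs.LG`, this produces `(cat:cs.LG) AND submittedDate:[202301010000 TO 202312312359]`.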

Use a custom search query:

poetry run python arxiv_crawler.py -q "cat:cs.CV AND (deep learning OR neural network)"
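A query string like the one above has to be URL-encoded before it reaches the arXiv API. The sketch below builds a request URL for the public endpoint `http://export.arxiv.org/api/query` using its `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters. The function name `build_query_url` and the newest-first sort order are assumptions; src/api.py may call the endpoint differently or go through a wrapper library.

```python
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_query_url(query: str, start: int = 0, max_results: int = 100) -> str:
    """Return an arXiv API URL for the given search query.

    urlencode handles the escaping: spaces become '+', and ':' and
    parentheses are percent-encoded.
    """
    params = {
        "search_query": query,
        "start": start,              # offset of the first result, for paging
        "max_results": max_results,  # page size
        "sortBy": "submittedDate",   # assumed: newest papers first
        "sortOrder": "descending",
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"
```

Paging through a large result set would then mean calling this repeatedly with `start` increased by `max_results` each time.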

Only fetch metadata without downloading PDFs:

poetry run python arxiv_crawler.py --metadata-only

Output

  • PDFs are saved to result/PDF/
  • Metadata is saved to result/papers.json
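Because the crawler supports incremental updates, repeated runs need to add new entries to result/papers.json without duplicating existing ones. The following is a minimal sketch of such a merge. It assumes the file holds a JSON list of objects that each carry an `"id"` key; the actual schema is not documented here, and `merge_metadata` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Dict, List


def merge_metadata(path: Path, new_records: List[Dict]) -> int:
    """Append records with unseen ids to the JSON file; return how many were added."""
    # Assumed schema: a JSON list of dicts, each with an "id" key.
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    known_ids = {record["id"] for record in existing}

    added = 0
    for record in new_records:
        if record["id"] not in known_ids:
            existing.append(record)
            known_ids.add(record["id"])
            added += 1

    # Write to a temporary file first so an interrupted run cannot leave
    # a half-written papers.json behind.
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(existing, indent=2), encoding="utf-8")
    tmp_path.replace(path)
    return added
```

Running it twice with overlapping records adds each paper only once, which is the behavior the incremental-update feature implies.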

Project Structure

  • src/config.py: Configuration settings
  • src/api.py: arXiv API interaction
  • src/downloader.py: PDF downloading functionality
  • src/storage.py: Handling storage of PDFs and metadata
  • src/main.py: Main entry point
  • arxiv_crawler.py: Command-line script
  • view_metadata.py: Script to view the metadata

You can view the metadata using the view_metadata.py script:

poetry run python view_metadata.py

Troubleshooting

Connection Issues

If you encounter connection issues with the arXiv API, try the following:

  1. Check your internet connection
  2. Try again later (the API might be rate-limiting your requests)
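"Try again later" can also be automated. Below is a minimal sketch of retrying a request with exponential backoff, using only the standard library. The function name, the attempt count, and the 3-second base delay are assumptions chosen for illustration, not values taken from this crawler.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=4, base_delay=3.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, waiting longer after each failure (3s, 6s, 12s, ...).

    `opener` and `sleep` are injectable so the logic can be tested
    without touching the network.
    """
    for attempt in range(attempts):
        try:
            with opener(url, timeout=30) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            # HTTPError (e.g. a rate-limit response) is a subclass of URLError.
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)
```

If failures persist after the last attempt, the original exception is re-raised so the caller can log it and move on.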

Error Handling

The crawler handles the following common failure modes:

  • Connection errors
  • API rate limiting
  • Interrupted downloads
  • File system errors

Errors are logged to the console, and the crawler will attempt to continue processing other papers when possible.
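The log-and-continue behavior described above can be sketched as a loop that isolates each paper's failure from the rest. This is an illustration only: `process_all`, the logger name, and the choice of caught exception types are assumptions, not the crawler's actual code.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("arxiv_crawler")  # hypothetical logger name


def process_all(paper_ids: Iterable[str],
                handler: Callable[[str], None]) -> Tuple[List[str], List[str]]:
    """Run handler on every paper id; log failures and keep going."""
    succeeded: List[str] = []
    failed: List[str] = []
    for paper_id in paper_ids:
        try:
            handler(paper_id)
        except (OSError, ValueError) as exc:
            # OSError covers file system errors and urllib's URLError;
            # ValueError stands in for malformed responses (assumed).
            logger.error("Skipping %s: %s", paper_id, exc)
            failed.append(paper_id)
        else:
            succeeded.append(paper_id)
    return succeeded, failed
```

Returning the failed ids separately makes it easy to retry only those papers on a later run.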
