arXiv Crawler

A Python tool for downloading machine learning papers from arXiv.

Features

Search for machine learning papers from the last 5 years
Download PDFs and store metadata
Support for incremental updates (run multiple times to add more papers)
Configurable search parameters
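
The incremental-update behaviour listed above can be pictured as filtering out papers that are already stored before adding new ones. The following is a minimal sketch, not the repository's actual code: the function name select_new_papers and the "id" field are assumptions made for illustration.

```python
def select_new_papers(candidates, known_ids):
    """Return candidates whose "id" is not already known, preserving order.

    Hypothetical helper: the real crawler may track stored papers differently.
    """
    seen = set(known_ids)
    fresh = []
    for paper in candidates:
        paper_id = paper["id"]
        if paper_id in seen:
            continue  # already stored, or duplicated within this batch
        seen.add(paper_id)
        fresh.append(paper)
    return fresh


# Papers saved by an earlier run (made-up IDs for the example).
stored = [{"id": "2301.00001"}, {"id": "2301.00002"}]
# Results from a new run, including one overlap and one duplicate.
fetched = [{"id": "2301.00002"}, {"id": "2301.00003"}, {"id": "2301.00003"}]

new_papers = select_new_papers(fetched, [p["id"] for p in stored])
print([p["id"] for p in new_papers])  # ['2301.00003']
```

Keeping the check on the paper ID rather than on the PDF file name means a metadata-only run and a full run would agree on what counts as "already seen".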

Installation

Make sure you have Python 3.8+ and Poetry installed
Clone this repository
Install dependencies:

poetry install

Usage

Run the crawler with default settings:

poetry run python arxiv_crawler.py

Command-line Options

-n, --num-papers: Number of papers to download (default: 100)
--start-date: Start date for papers (format: YYYY-MM-DD)
--end-date: End date for papers (format: YYYY-MM-DD)
-q, --query: Custom search query (default: machine learning categories)
--metadata-only: Only fetch metadata, don't download PDFs
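
For readers who want to see how these options fit together, here is a sketch of an equivalent argparse definition. It mirrors the flags and defaults documented above, but the function names build_parser and parse_date are assumptions, and the real arxiv_crawler.py may be organised differently.

```python
import argparse
from datetime import datetime


def parse_date(text):
    """Convert a YYYY-MM-DD string into a date (hypothetical helper)."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def build_parser():
    parser = argparse.ArgumentParser(
        description="Download machine learning papers from arXiv."
    )
    parser.add_argument("-n", "--num-papers", type=int, default=100,
                        help="number of papers to download")
    parser.add_argument("--start-date", type=parse_date, default=None,
                        help="start date, YYYY-MM-DD")
    parser.add_argument("--end-date", type=parse_date, default=None,
                        help="end date, YYYY-MM-DD")
    # None stands in for "use the default machine learning categories".
    parser.add_argument("-q", "--query", default=None,
                        help="custom search query")
    parser.add_argument("--metadata-only", action="store_true",
                        help="fetch metadata without downloading PDFs")
    return parser


args = build_parser().parse_args(
    ["-n", "50", "--start-date", "2023-01-01", "--metadata-only"]
)
print(args.num_papers, args.start_date, args.metadata_only)
# 50 2023-01-01 True
```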

Examples

Download 50 papers:

poetry run python arxiv_crawler.py -n 50

Download papers from a specific date range:

poetry run python arxiv_crawler.py --start-date 2023-01-01 --end-date 2023-12-31

Use a custom search query:

poetry run python arxiv_crawler.py -q "cat:cs.CV AND (deep learning OR neural network)"

Only fetch metadata without downloading PDFs:

poetry run python arxiv_crawler.py --metadata-only
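
To make the date-range and custom-query examples above more concrete, the sketch below shows one way such options could be turned into a request URL for the public arXiv API. The endpoint and the search_query, start, max_results, sortBy and sortOrder parameters come from the arXiv API itself, as does the submittedDate range filter. The function name build_query_url and the way the date window is combined with the query are assumptions, not necessarily what src/api.py does.

```python
from datetime import date
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(query, start_date=None, end_date=None, max_results=100):
    """Build an arXiv API URL for a query and an optional date window."""
    if start_date and end_date:
        # arXiv expects YYYYMMDDHHMM (GMT) bounds for submittedDate.
        window = (
            f"submittedDate:[{start_date:%Y%m%d}0000 "
            f"TO {end_date:%Y%m%d}2359]"
        )
        query = f"({query}) AND {window}"
    params = {
        "search_query": query,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(
    "cat:cs.CV", date(2023, 1, 1), date(2023, 12, 31), max_results=50
)
print(url)
```

urlencode takes care of escaping spaces, brackets and colons, so the query string can be written in the same readable form used on the command line.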

Output

PDFs are saved to result/PDF/
Metadata is saved to result/papers.json
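
If you want to work with the metadata from your own scripts, something like the following may be a starting point. The layout of result/papers.json is not documented here, so this sketch assumes it is either a list of records or a dict keyed by paper ID. The helper name load_papers and the sample fields are made up for the example, which uses a temporary file rather than real crawler output.

```python
import json
import tempfile
from pathlib import Path


def load_papers(path):
    """Load paper records from a JSON file (hypothetical helper).

    Assumption: the file holds a list of records or a dict of records.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return list(data.values()) if isinstance(data, dict) else list(data)


# Demonstrate with a throwaway file instead of result/papers.json.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "papers.json"
    sample.write_text(
        json.dumps([{"id": "2301.00001", "title": "Example"}]),
        encoding="utf-8",
    )
    papers = load_papers(sample)

print(len(papers), papers[0]["title"])  # 1 Example
```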

Project Structure

src/config.py: Configuration settings
src/api.py: arXiv API interaction
src/downloader.py: PDF downloading functionality
src/storage.py: Handling storage of PDFs and metadata
src/main.py: Main entry point
arxiv_crawler.py: Command-line script
view_metadata.py: Script to view the metadata

To inspect the saved metadata, run the view_metadata.py script:

poetry run python view_metadata.py

Troubleshooting

Connection Issues

If you encounter connection issues with the arXiv API, try the following:

Check your internet connection
Try again later (the API might be rate-limiting your requests)

Error Handling

The crawler handles the following common failure types:

Connection errors
API rate limiting
Interrupted downloads
File system errors

Errors are logged to the console, and the crawler will attempt to continue processing other papers when possible.
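
The "log and continue" behaviour described above can be sketched as a retry loop with exponential backoff wrapped in a per-paper try/except. This illustrates the general pattern only. The names with_retries and process_all, the retry counts, and the choice of OSError as the catch-all for network and file system failures are assumptions rather than the crawler's actual implementation.

```python
import logging
import time

logger = logging.getLogger("arxiv_crawler_sketch")


def with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action(), retrying on OSError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:
            if attempt == attempts:
                raise  # out of retries, let the caller decide
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)


def process_all(papers, download, sleep=time.sleep):
    """Download every paper, logging failures instead of stopping."""
    failed = []
    for paper in papers:
        try:
            with_retries(lambda: download(paper), sleep=sleep)
        except OSError as exc:
            logger.error("giving up on %s: %s", paper, exc)
            failed.append(paper)
    return failed


def fake_download(paper_id):
    # Stand-in for a real download; fails for one made-up ID.
    if paper_id == "bad":
        raise ConnectionError("simulated network failure")


# A no-op sleep keeps the demonstration instant.
failed = process_all(["ok-1", "bad", "ok-2"], fake_download,
                     sleep=lambda seconds: None)
print(failed)  # ['bad']
```

Returning the list of failed papers makes it easy to retry just those on the next run, which fits the incremental-update workflow.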

NeoGendaijin/arxiv-crawler
