arXiv Scraper and Evaluator Tool

This tool scrapes arXiv papers from the Computer Science (cs), Electrical Engineering and Systems Science (eess), and Statistics (stat) categories, then evaluates them for startup potential using OpenAI's GPT-4o model.

Features

Scraper (main.py)

Automatically tracks the last scrape date in last_update.txt
Limits each scrape to at most the last 7 days to avoid excessive data retrieval
Saves results in both CSV and JSON formats
Scrapes papers from cs, eess, and stat categories
Creates a results directory if it doesn't exist

Evaluator (evaluator.py)

Analyzes paper abstracts for startup viability using OpenAI's GPT-4o
Scores each paper on a scale of 1-10 for startup potential
Provides reasoning for each score
Can process all papers or a specified number of papers
Saves results with original data plus evaluation columns

Installation

Install the required dependencies:

pip install -r requirements.txt

Usage

Scraping Papers

Run the main script to scrape papers:

python main.py

The script will:

Check if last_update.txt exists to determine the date range to scrape
Retrieve papers from the specified categories within that date range
Save the results to the results folder in both CSV and JSON formats
Update the last_update.txt file with the current date
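
The date-range logic in the steps above can be sketched as follows. This is a minimal illustration, not the code in main.py: the function names scrape_window and record_update are hypothetical, and it assumes last_update.txt holds a single ISO-formatted date.

```python
from datetime import date, timedelta
from pathlib import Path

MAX_DAYS = 7  # assumed value, matching the 7-day limit described in Features
LAST_UPDATE_FILE = Path("last_update.txt")


def scrape_window(today: date, last_update_file: Path = LAST_UPDATE_FILE,
                  max_days: int = MAX_DAYS) -> tuple[date, date]:
    """Return the (start, end) dates to scrape, capped at max_days back."""
    earliest = today - timedelta(days=max_days)
    if last_update_file.exists():
        # Assumption: the file contains one ISO date, e.g. 2024-05-01.
        last = date.fromisoformat(last_update_file.read_text().strip())
        # Never look back further than the cap, even after a long pause.
        start = max(last, earliest)
    else:
        # No previous run recorded: use the full look-back window.
        start = earliest
    return start, today


def record_update(today: date, last_update_file: Path = LAST_UPDATE_FILE) -> None:
    """Store the given date so the next run starts from it."""
    last_update_file.write_text(today.isoformat())
```

With this shape, a first run covers the full 7-day window, and a run after a long gap is still capped at 7 days.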

Evaluating Papers for Startup Potential

After scraping papers, use the evaluator script to analyze them:

python evaluator.py --rows 5  # Evaluate first 5 papers

Parameters:

--csv: Path to the CSV file (optional; defaults to the latest file in the results directory)
--rows: Number of rows to evaluate (default: 5; use '*' for all papers)
--output: Custom output path (optional)

Examples:

python evaluator.py  # Default: evaluates first 5 papers from latest CSV
python evaluator.py --rows 10  # Evaluate first 10 papers
python evaluator.py --rows '*'  # Evaluate all papers (may take time)
python evaluator.py --csv custom_path.csv  # Use specific CSV file
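
For readers who want to see how these options could fit together, here is a hedged sketch using argparse. It is not taken from evaluator.py: the helper names parse_rows, build_parser, and latest_csv are hypothetical, and "latest" is assumed to mean the most recently modified CSV file.

```python
import argparse
from pathlib import Path


def parse_rows(value: str):
    """Convert --rows to an int, or None when '*' requests every paper."""
    if value == "*":
        return None
    rows = int(value)  # argparse reports a ValueError as a usage error
    if rows < 1:
        raise argparse.ArgumentTypeError("--rows must be a positive integer or '*'")
    return rows


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluate scraped arXiv papers")
    parser.add_argument("--csv", type=Path, default=None,
                        help="CSV file to evaluate (default: latest in results)")
    parser.add_argument("--rows", type=parse_rows, default=5,
                        help="number of rows to evaluate, or '*' for all")
    parser.add_argument("--output", type=Path, default=None,
                        help="custom output path")
    return parser


def latest_csv(results_dir: Path):
    """Return the most recently modified CSV in results_dir, or None."""
    # Assumption: "latest" means newest modification time.
    candidates = sorted(results_dir.glob("*.csv"), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None
```

Quoting the asterisk, as in --rows '*', keeps the shell from expanding it into a list of filenames before the script sees it.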

The evaluator requires an OpenAI API key in a .env file:

OPENAI_API_KEY=your_api_key_here

Output Files

Scraper Output

The scraped papers will be saved in the results folder with fixed filenames:

arxiv_papers.csv
arxiv_papers.json
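
As an illustration of writing the same records to both formats, the sketch below uses only the standard library. The save_results function and the record fields in the usage note are hypothetical; the actual scraper may use a different library or column set.

```python
import csv
import json
from pathlib import Path

RESULTS_FOLDER = Path("results")


def save_results(papers: list[dict], folder: Path = RESULTS_FOLDER) -> None:
    """Write papers to arxiv_papers.json and arxiv_papers.csv in folder."""
    # Create the results directory if it doesn't exist.
    folder.mkdir(parents=True, exist_ok=True)

    with open(folder / "arxiv_papers.json", "w", encoding="utf-8") as f:
        json.dump(papers, f, ensure_ascii=False, indent=2)

    # Assumption: every record shares the keys of the first one.
    fieldnames = list(papers[0].keys()) if papers else []
    with open(folder / "arxiv_papers.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(papers)
```

A call such as save_results([{"title": "Example", "abstract": "..."}]) would overwrite both files, which is consistent with the fixed filenames listed above.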

Evaluator Output

The evaluated papers will be saved with the suffix _evaluated added to the original filename:

arxiv_papers_evaluated.csv
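
The naming rule can be expressed with pathlib. This is a small sketch under the assumption that --output, when given, simply overrides the derived name; evaluated_path is a hypothetical helper, not a function from evaluator.py.

```python
from pathlib import Path
from typing import Optional


def evaluated_path(csv_path: Path, output: Optional[Path] = None) -> Path:
    """Return the output path: the custom one, or the input name plus _evaluated."""
    if output is not None:
        return output
    # Insert the suffix between the file stem and its extension.
    return csv_path.with_name(f"{csv_path.stem}_evaluated{csv_path.suffix}")
```

For example, evaluated_path(Path("results/arxiv_papers.csv")) yields results/arxiv_papers_evaluated.csv.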

Configuration

You can modify the following constants in the script:

CATEGORIES: List of arXiv categories to scrape
MAX_DAYS: Maximum number of days to look back
RESULTS_FOLDER: Folder where results are saved
LAST_UPDATE_FILE: File that tracks the last update date
