A web crawler for downloading papers from the Network and Distributed System Security Symposium (NDSS).
- Download papers from a specified NDSS conference year
- Automatically download paper PDFs and presentation slides (some papers provide a video instead of slides; it appears as a button on the paper detail page)
- Save paper information to a CSV file
- Resume interrupted downloads (existing files are skipped)
- Identify Summer and Fall review-cycle papers
- Automatic filename sanitization and formatting
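As an illustration of the filename-sanitization feature, here is a minimal sketch; the function name, forbidden-character set, and length limit are assumptions for illustration, not the exact spider.py implementation:

```python
import re

def sanitize_filename(name: str, max_len: int = 150) -> str:
    """Sketch: make a paper title safe to use as a filename."""
    # Drop characters that are illegal on common filesystems
    name = re.sub(r'[\\/:*?"<>|]', "", name)
    # Collapse runs of whitespace into single spaces
    name = re.sub(r"\s+", " ", name).strip()
    # Truncate very long titles to keep paths manageable
    return name[:max_len]
```

Titles such as `A/B: C?` become `AB C`, which is safe on both Windows and Unix filesystems.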
1. Install dependencies:

   pip install -r requirements.txt

2. Run the program:

   python spider.py

3. Input the year you want to crawl (e.g., 2024)

4. The program will automatically:
   - Create necessary directories
   - Fetch the paper list
   - Download papers and slides
   - Generate the CSV file
ndss{year}/
├── papers/          # Paper PDFs
├── slides/          # Presentation slides
└── paper_list.csv   # Paper information list
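The directory layout can be created idempotently so that re-running the program is safe. A minimal sketch; the paths follow the tree above, while the function name and return shape are illustrative assumptions:

```python
import os

def create_dirs(year: int, base: str = ".") -> dict:
    """Sketch: build the ndss{year}/ layout described above."""
    root = os.path.join(base, f"ndss{year}")
    paths = {
        "root": root,
        "papers": os.path.join(root, "papers"),
        "slides": os.path.join(root, "slides"),
    }
    for p in paths.values():
        # exist_ok=True makes this safe to call on every run
        os.makedirs(p, exist_ok=True)
    return paths
```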
paper_list.csv contains the following columns:
- index: Sequence number (starting from 1)
- title: Paper title
- authors: Author list
- cycle: Paper cycle (Summer/Fall)
- details_url: Paper detail page URL
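Since pandas is already a dependency, writing paper_list.csv is a one-liner. A sketch using the column names listed above; the function name and the shape of the input records are assumptions:

```python
import pandas as pd

def save_paper_list_to_csv(papers: list, csv_path: str) -> None:
    """Sketch: write paper records to the CSV described above."""
    columns = ["index", "title", "authors", "cycle", "details_url"]
    df = pd.DataFrame(papers, columns=columns)
    # utf-8-sig so the file opens cleanly in Excel
    df.to_csv(csv_path, index=False, encoding="utf-8-sig")
```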
requests >= 2.31.0
beautifulsoup4 >= 4.12.2
pandas >= 2.1.0
lxml >= 4.9.3
- Python 3.8 or higher recommended
- Existing files will be skipped during download
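The skip-existing behavior is what makes downloads resumable: a file that already exists on disk is never fetched again. A sketch of this pattern with requests streaming; the function name, size check, and chunk size are assumptions:

```python
import os
import requests

def download_file(url: str, dest_path: str, chunk_size: int = 8192) -> bool:
    """Sketch: download url to dest_path, skipping existing files."""
    if os.path.exists(dest_path) and os.path.getsize(dest_path) > 0:
        return False  # already downloaded on a previous run; skip
    resp = requests.get(url, stream=True, timeout=30)
    resp.raise_for_status()
    with open(dest_path, "wb") as f:
        # Stream in chunks so large PDFs are not held in memory at once
        for chunk in resp.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    return True
```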
The program includes comprehensive error handling for:
- Network connection errors
- File download failures
- Parsing errors
- File saving errors
All errors are logged and displayed without interrupting program execution.
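One common way to implement "log and continue" is a small retry wrapper around each request, so one failing URL never aborts the crawl. A sketch; the retry count, log format, and function name are assumptions:

```python
import requests

def safe_get(url: str, retries: int = 3, timeout: int = 30):
    """Sketch: fetch url, logging failures instead of raising."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            # Covers connection errors, timeouts, and HTTP error statuses
            print(f"[warn] attempt {attempt}/{retries} failed for {url}: {exc}")
    return None  # give up on this URL only; the caller moves on
```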
- NDSSSpider class: Main spider class
- __init__: Initialize configuration and paths
- create_dirs: Create necessary directories
- sanitize_filename: Clean filenames
- get_paper_list: Get the paper list
- download_file: Download files
- get_paper_details: Get paper details
- save_paper_list_to_csv: Save the CSV file
- run: Main execution function
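To show how a method like get_paper_list might use BeautifulSoup (also in requirements), here is a parsing sketch. The CSS selectors and record keys are purely illustrative assumptions; the real NDSS page structure will differ:

```python
from bs4 import BeautifulSoup

def parse_paper_list(html: str) -> list:
    """Sketch: extract paper records from a listing page."""
    soup = BeautifulSoup(html, "html.parser")
    papers = []
    # "div.paper" is a hypothetical selector standing in for the real markup
    for i, item in enumerate(soup.select("div.paper"), start=1):
        link = item.find("a")
        papers.append({
            "index": i,                               # 1-based, as in the CSV
            "title": link.get_text(strip=True),
            "details_url": link["href"],
        })
    return papers
```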