A multi-threaded web scraper for the Cal State Fullerton course catalog. This tool fetches course information including titles, descriptions, departments, and units from the university's course catalog website.
- Multi-threaded scraping for improved performance
- Progress tracking with rich console output
- Automatic handling of course departments from section headers
- Unicode character cleaning utility
- JSON output format for easy data processing
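As an illustration of the Unicode cleaning feature, here is a minimal sketch using only the standard library. The function name `clean_text` and the replacement table are assumptions made for this example; they are not taken from `clean.py`, which may handle a different set of characters.

```python
import unicodedata

# Hypothetical replacement table: characters that commonly show up in
# scraped HTML. The real utility may target a different set.
REPLACEMENTS = {
    "\u00a0": " ",   # non-breaking space
    "\u200b": "",    # zero-width space
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def clean_text(text: str) -> str:
    """Normalize text and replace typographic characters with ASCII ones."""
    # NFKC folds compatibility forms (e.g. non-breaking space) into plain ones.
    text = unicodedata.normalize("NFKC", text)
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Collapse any run of whitespace into a single space.
    return " ".join(text.split())
```

Running it on a scraped title or description should leave ordinary text unchanged and only touch the characters listed in the table.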
- Python 3.12 or higher
- uv package manager
- Clone the repository:
git clone https://github.com/yourusername/tuffysearch-scraper.git
cd tuffysearch-scraper
- Install uv if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh
Simply run:
uv run scrape.py
This will:
- Create a virtual environment if it doesn't exist
- Install dependencies from pyproject.toml
- Fetch course data from the CSUF course catalog
- Process and organize the information
- Save the results to data/raw_2025-2026_catalog.json
To clean the data and extract relevant information, run the second script:
uv run reprocess.py
This will:
- Read the JSON data taken from the scraped web pages
- Save the processed data to data/processed_2025-2026_catalog.json
- Print any unknown description blocks to the console
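The reprocessing steps above can be sketched roughly as follows. This is not the code in `reprocess.py`: the raw record layout (a `blocks` list per course), the label table, and the function names are all assumptions chosen to show the read, classify, write pattern.

```python
import json
from pathlib import Path

# Hypothetical labels; the real script may recognise different blocks.
KNOWN_LABELS = {"Prerequisite": "prerequisites", "Corequisite": "corequisites"}


def split_blocks(blocks):
    """Return (fields, unknown) for one course's description blocks.

    Assumes the first block is the description itself and later blocks
    look like "Label: text".
    """
    fields = {"description": blocks[0].strip() if blocks else ""}
    unknown = []
    for block in blocks[1:]:
        label, sep, rest = block.partition(":")
        key = KNOWN_LABELS.get(label.strip())
        if sep and key:
            fields[key] = rest.strip()
        else:
            unknown.append(block)
    return fields, unknown


def reprocess(raw_path, out_path):
    """Read raw course records, classify their blocks, and write the result."""
    courses = json.loads(Path(raw_path).read_text(encoding="utf-8"))
    processed = []
    for course in courses:
        fields, unknown = split_blocks(course.get("blocks", []))
        for block in unknown:
            # Surface anything unclassified so the label table can be extended.
            print(f"Unknown block in {course.get('title')}: {block!r}")
        processed.append(
            {"title": course.get("title"), "units": course.get("units"), **fields}
        )
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(processed, indent=2, ensure_ascii=False), encoding="utf-8")
    return processed
```

Printing unknown blocks instead of dropping them silently makes it easy to spot catalog entries that need a new label.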
tuffysearch-scraper/
├── data/                      # Output directory for JSON files
├── modules/                   # Python modules
│   ├── course_departments.py  # Department mapping module
│   └── util.py                # Utility functions
├── models/                    # Data models
│   └── courses.py             # Course data type definitions
├── scrape.py                  # Main scraper script
├── reprocess.py               # Course data processing script
├── clean.py                   # Unicode character cleaning utility
└── pyproject.toml             # Project metadata and dependencies
The project consists of two main scripts:
- scrape.py: Scrapes course data from the CSUF course catalog using a multi-threaded approach with progress tracking
- reprocess.py: Processes the raw course data into a structured format with progress tracking
The data flow is:
- Raw course data is scraped and saved to data/raw_YYYY-YYYY_catalog.json
- The raw data is processed and saved to data/processed_YYYY-YYYY_catalog.json
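The file naming in this data flow can be expressed as a small helper. The function `catalog_paths` is hypothetical and only illustrates how the YYYY-YYYY pattern maps to concrete paths; the scripts themselves may simply hard-code the file names.

```python
from pathlib import Path


def catalog_paths(start_year: int, data_dir: str = "data") -> tuple[Path, Path]:
    """Return the (raw, processed) JSON paths for one catalog year."""
    # A catalog year spans two calendar years, e.g. 2025-2026.
    years = f"{start_year}-{start_year + 1}"
    base = Path(data_dir)
    return base / f"raw_{years}_catalog.json", base / f"processed_{years}_catalog.json"
```

For example, `catalog_paths(2025)` yields the two paths used in the usage section: `data/raw_2025-2026_catalog.json` and `data/processed_2025-2026_catalog.json`.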
- Uses requests for HTTP requests
- BeautifulSoup4 for HTML parsing
- rich for beautiful console output
- Multi-threading with ThreadPoolExecutor
- Progress tracking with custom progress bars
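To show how ThreadPoolExecutor and progress tracking fit together, here is a minimal standard-library sketch. The names `scrape_all`, `fetch`, and `on_progress` are assumptions for this example and do not come from `scrape.py`; the fetch function is passed in so the pattern can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> dict:
    """Fetch every URL on a thread pool and return {url: result or None}."""
    urls = list(urls)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be keyed correctly.
        futures = {pool.submit(fetch, url): url for url in urls}
        for done, future in enumerate(as_completed(futures), start=1):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # One failed page should not abort the whole run.
                results[url] = None
            if on_progress is not None:
                on_progress(done, len(urls))
    return results
```

In real use, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, and `on_progress` could advance a progress bar. Threads suit this workload because the time is spent waiting on HTTP responses rather than on computation.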
- The scraper is configured for the 2025-2026 course catalog
- Course departments are now extracted from section headers instead of the department mapping page
- The course_departments.py module is currently unused but kept for reference
- Uses modern Python packaging with pyproject.toml
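The note about taking departments from section headers can be illustrated with a small parser. This sketch uses the standard library's `html.parser` instead of BeautifulSoup4 so it runs without extra dependencies, and it assumes, purely for illustration, that departments appear as `<h2>` headers with course links (`<a>`) beneath them. The real catalog markup and the logic in `scrape.py` may differ.

```python
from html.parser import HTMLParser


class DepartmentParser(HTMLParser):
    """Assign each course link to the most recent section header.

    Hypothetical markup: <h2> holds a department name, <a> holds a course.
    """

    def __init__(self):
        super().__init__()
        self.courses = []        # list of (department, course title)
        self._department = None  # text of the last <h2> seen
        self._tag = None         # tag whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "a"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._text).strip()
        if tag == "h2":
            self._department = text
        elif text:
            self.courses.append((self._department, text))
        self._tag = None
```

Feeding a page to `DepartmentParser().feed(html)` fills `courses` in document order, so every course inherits the header that precedes it and no separate department mapping page is needed.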
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.