TuffySearch Course Catalog Scraper

A multi-threaded web scraper for the Cal State Fullerton course catalog. This tool fetches course information including titles, descriptions, departments, and units from the university's course catalog website.

🚀 Features

Multi-threaded scraping for improved performance
Progress tracking with rich console output
Automatic handling of course departments from section headers
Unicode character cleaning utility
JSON output format for easy data processing
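
To illustrate the Unicode cleaning feature, here is a minimal sketch built on the standard-library unicodedata module. The function name clean_text and the replacement table are assumptions made for this example; the project's own utility may handle a different set of characters.

```python
import unicodedata

# Hypothetical replacement table: typographic characters mapped to ASCII.
REPLACEMENTS = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}


def clean_text(text: str) -> str:
    """Normalize to NFKC, replace typographic characters, collapse whitespace."""
    # NFKC folds compatibility characters (e.g. non-breaking space, ligatures).
    text = unicodedata.normalize("NFKC", text)
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # split() with no arguments also removes leading and trailing whitespace.
    return " ".join(text.split())
```

Running course titles and descriptions through a function like this before writing JSON keeps the output consistent across catalog pages.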

📋 Prerequisites

Python 3.12 or higher
uv package manager

🛠️ Installation

Clone the repository:

git clone https://github.com/clxmente/tuffysearch-scraper.git
cd tuffysearch-scraper

Install uv if you haven't already:

curl -LsSf https://astral.sh/uv/install.sh | sh

🏃‍♂️ Usage

Scraping the Course Catalog

Simply run:

uv run scrape.py

This will:

Create a virtual environment if it doesn't exist
Install dependencies from pyproject.toml
Fetch course data from the CSUF course catalog
Process and organize the information
Save the results to data/raw_2025-2026_catalog.json
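
The fetch-and-save steps above can be sketched with ThreadPoolExecutor. This is an illustration, not the code in scrape.py: the names scrape_all and save_json, the fetch callable, and the default of 8 workers are all assumptions. Passing fetch in as a parameter keeps the sketch independent of any particular HTTP or parsing library.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def scrape_all(
    urls: list[str],
    fetch: Callable[[str], dict],
    max_workers: int = 8,  # assumed default, tune for the target site
) -> list[dict]:
    """Fetch every URL on a thread pool; return results in input order."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed page should not abort the whole run.
                print(f"failed: {url}: {exc}")
    return [results[url] for url in urls if url in results]


def save_json(records: list[dict], path: Path) -> None:
    """Write records as UTF-8 JSON, creating the output directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

In a real run, fetch would download one catalog page and parse it into a dictionary, and the result of scrape_all would be passed to save_json with the data/raw_2025-2026_catalog.json path.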

Processing the Raw Data

To clean the data and extract relevant information, run the second script:

uv run reprocess.py

This will:

Read the raw JSON data written by scrape.py (data/raw_2025-2026_catalog.json)
Save the processed data to data/processed_2025-2026_catalog.json
Print any unknown description blocks to the console
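
A rough sketch of this read, process, and report cycle follows. The raw record layout (a "title" plus a list of text "blocks") and the labels in KNOWN_PREFIXES are hypothetical, chosen only to show how unrecognized description blocks can be collected and printed; the schema used by reprocess.py may differ.

```python
import json
from pathlib import Path

# Hypothetical block labels mapped to output field names.
KNOWN_PREFIXES = {
    "Prerequisite": "prerequisites",
    "Corequisite": "corequisites",
}


def process_course(raw: dict) -> tuple[dict, list[str]]:
    """Split a raw record's blocks into named fields.

    Returns the processed record and the blocks that matched no known label.
    """
    processed = {"title": raw["title"], "description": ""}
    unknown: list[str] = []
    for index, block in enumerate(raw.get("blocks", [])):
        label, sep, rest = block.partition(":")
        if sep and label.strip() in KNOWN_PREFIXES:
            processed[KNOWN_PREFIXES[label.strip()]] = rest.strip()
        elif index == 0:
            # Assume the first unlabeled block is the course description.
            processed["description"] = block.strip()
        else:
            unknown.append(block)
    return processed, unknown


def reprocess(raw_path: Path, out_path: Path) -> list[str]:
    """Process every raw record, save the result, and print unknown blocks."""
    raw_records = json.loads(raw_path.read_text(encoding="utf-8"))
    processed, unknown_all = [], []
    for raw in raw_records:
        record, unknown = process_course(raw)
        processed.append(record)
        unknown_all.extend(unknown)
    out_path.write_text(
        json.dumps(processed, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    for block in unknown_all:
        print(f"unknown block: {block}")
    return unknown_all
```

Printing the unmatched blocks instead of discarding them makes it easy to spot new label types that the processing step does not yet handle.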

📁 Project Structure

tuffysearch-scraper/
├── data/                      # Output directory for JSON files
├── modules/                   # Python modules
│   ├── course_departments.py  # Department mapping module
│   └── util.py                # Utility functions
├── models/                    # Data models
│   └── courses.py             # Course data type definitions
├── scrape.py                  # Main scraper script
├── reprocess.py               # Course data processing script
├── clean.py                   # Unicode character cleaning utility
└── pyproject.toml             # Project metadata and dependencies

The project consists of two main scripts:

scrape.py: Scrapes course data from the CSUF course catalog using a multi-threaded approach with progress tracking
reprocess.py: Processes the raw course data into a structured format with progress tracking

The data flow is:

Raw course data is scraped and saved to data/raw_YYYY-YYYY_catalog.json
The raw data is processed and saved to data/processed_YYYY-YYYY_catalog.json
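
Both file names follow the same pattern, so they can be derived from the catalog year. The helper below is a hypothetical illustration (catalog_paths is not a function from the project), and it assumes a catalog year always spans two consecutive calendar years.

```python
import re
from pathlib import Path


def catalog_paths(year: str, data_dir: Path = Path("data")) -> tuple[Path, Path]:
    """Return the (raw, processed) JSON paths for a catalog year such as '2025-2026'."""
    match = re.fullmatch(r"(\d{4})-(\d{4})", year)
    # Assumption: the second year is always the first year plus one.
    if match is None or int(match.group(2)) != int(match.group(1)) + 1:
        raise ValueError(f"expected a catalog year like '2025-2026', got {year!r}")
    return (
        data_dir / f"raw_{year}_catalog.json",
        data_dir / f"processed_{year}_catalog.json",
    )
```

For example, catalog_paths("2025-2026") yields the two paths named in the usage section.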

🔧 Technical Details

requests for HTTP requests
BeautifulSoup4 for HTML parsing
rich for formatted console output
Multi-threading with ThreadPoolExecutor
Progress tracking with custom progress bars
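
When several worker threads report progress, the shared counter needs a lock. The sketch below shows that idea with a plain-text bar from the standard library; it is a stand-in for illustration and does not use rich. The class name ProgressCounter and the bar format are assumptions.

```python
import threading


class ProgressCounter:
    """Thread-safe counter that renders a plain-text progress bar."""

    def __init__(self, total: int, width: int = 20) -> None:
        self.total = total
        self.width = width
        self.done = 0
        self._lock = threading.Lock()

    def advance(self, step: int = 1) -> str:
        """Record finished work and return the updated bar."""
        # Worker threads call this concurrently, so guard the update.
        with self._lock:
            self.done = min(self.total, self.done + step)
            return self.render()

    def render(self) -> str:
        """Format the bar as '[###---] done/total'."""
        if self.total:
            filled = self.width * self.done // self.total
        else:
            filled = self.width
        bar = "#" * filled + "-" * (self.width - filled)
        return f"[{bar}] {self.done}/{self.total}"
```

Each worker would call advance() after finishing a page, and the main thread can print the returned string or hand the count to a richer display.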

📝 Notes

The scraper is configured for the 2025-2026 course catalog
Course departments are now extracted from section headers instead of the department mapping page
The course_departments.py module is currently unused but kept for reference
Uses modern Python packaging with pyproject.toml
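
The note about taking departments from section headers can be sketched as a parser that remembers the most recent header and attaches it to each course that follows. The page layout assumed here (departments in h2 tags, courses in a tags) is hypothetical, and the standard-library html.parser is used in place of BeautifulSoup4 so the example has no dependencies.

```python
from html.parser import HTMLParser


class CourseListParser(HTMLParser):
    """Tag each course link with the section header that precedes it.

    Assumes (hypothetically) departments in <h2> and courses in <a>.
    """

    def __init__(self) -> None:
        super().__init__()
        self.courses: list[dict] = []
        self._department: str | None = None
        self._tag: str | None = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "a"):
            # Start collecting the text of a header or a course link.
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            # A new header changes the department for later courses.
            self._department = text
        elif text:
            self.courses.append({"title": text, "department": self._department})
        self._tag = None
```

Feeding a page to parser.feed(html) and reading parser.courses gives one dictionary per course, each carrying its department, with no separate department mapping page required.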

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
