clxmente/tuffysearch-scraper

TuffySearch Course Catalog Scraper

A multi-threaded web scraper for the Cal State Fullerton course catalog. This tool fetches course information including titles, descriptions, departments, and units from the university's course catalog website.

🚀 Features

  • Multi-threaded scraping for improved performance
  • Progress tracking with rich console output
  • Automatic handling of course departments from section headers
  • Unicode character cleaning utility
  • JSON output format for easy data processing
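As a rough illustration of what a Unicode cleaning utility can look like, here is a minimal sketch using only the standard library. The `clean_text` function and the `REPLACEMENTS` table are hypothetical; this README does not document which characters clean.py actually handles.

```python
import unicodedata

# Hypothetical replacement table (an assumption for illustration only).
REPLACEMENTS = {
    "\u00a0": " ",   # non-breaking space
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}


def clean_text(text: str) -> str:
    """Normalize to NFC, map typographic characters to ASCII, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(text.split())
```

Applying a function like this to titles and descriptions before writing JSON keeps the output easier to search and diff.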

📋 Prerequisites

  • Python 3.12 or higher
  • uv package manager

🛠️ Installation

  1. Clone the repository:
git clone https://github.com/clxmente/tuffysearch-scraper.git
cd tuffysearch-scraper
  2. Install uv if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh

🏃‍♂️ Usage

Scraping the Course Catalog

Simply run:

uv run scrape.py

This will:

  1. Create a virtual environment if it doesn't exist
  2. Install dependencies from pyproject.toml
  3. Fetch course data from the CSUF course catalog
  4. Process and organize the information
  5. Save the results to data/raw_2025-2026_catalog.json

Processing the Raw Data

To clean the data and extract relevant information, run the second script:

uv run reprocess.py

This will:

  1. Read the raw JSON data produced by scrape.py (data/raw_2025-2026_catalog.json)
  2. Save the processed data to data/processed_2025-2026_catalog.json
  3. Print any unknown description blocks to the console
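The three steps above can be sketched roughly as follows. This is not the code in reprocess.py: the record shape (`code`, `title`, `blocks`), the `KNOWN_LABELS` table, and both function names are assumptions made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical label -> field mapping; the labels reprocess.py really
# recognizes are not listed in this README.
KNOWN_LABELS = {
    "Prerequisite:": "prerequisites",
    "Corequisite:": "corequisites",
}


def process_course(raw: dict) -> tuple[dict, list[str]]:
    """Split one raw record into structured fields plus unclassified blocks."""
    course = {"code": raw["code"], "title": raw["title"], "description": ""}
    unknown = []
    for i, block in enumerate(raw.get("blocks", [])):
        label = next((l for l in KNOWN_LABELS if block.startswith(l)), None)
        if label:
            course[KNOWN_LABELS[label]] = block[len(label):].strip()
        elif i == 0:
            # Assumption: the first unlabeled block is the course description.
            course["description"] = block.strip()
        else:
            unknown.append(block)
    return course, unknown


def reprocess(raw_path: Path, out_path: Path) -> list[str]:
    """Read raw JSON, write processed JSON, and print blocks that were not recognized."""
    raw_courses = json.loads(raw_path.read_text(encoding="utf-8"))
    processed, unknown_blocks = [], []
    for raw in raw_courses:
        course, unknown = process_course(raw)
        processed.append(course)
        unknown_blocks.extend(unknown)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(processed, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    for block in unknown_blocks:
        print(f"Unknown description block: {block}")
    return unknown_blocks
```

Printing unrecognized blocks instead of dropping them makes it easier to spot catalog formats the parser does not cover yet.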

📁 Project Structure

tuffysearch-scraper/
├── data/                  # Output directory for JSON files
├── modules/               # Python modules
│   ├── course_departments.py  # Department mapping module
│   └── util.py            # Utility functions
├── models/                # Data models
│   └── courses.py         # Course data type definitions
├── scrape.py              # Main scraper script
├── reprocess.py           # Course data processing script
├── clean.py               # Unicode character cleaning utility
└── pyproject.toml         # Project metadata and dependencies

The project consists of two main scripts:

  1. scrape.py: Scrapes course data from the CSUF course catalog using a multi-threaded approach with progress tracking
  2. reprocess.py: Processes the raw course data into a structured format with progress tracking

The data flow is:

  1. Raw course data is scraped and saved to data/raw_YYYY-YYYY_catalog.json
  2. The raw data is processed and saved to data/processed_YYYY-YYYY_catalog.json
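The file naming in this data flow can be expressed as a small helper. This is a sketch only: `catalog_paths` and `DATA_DIR` are hypothetical names, and it assumes the YYYY-YYYY label always spans two consecutive years, as 2025-2026 does.

```python
from pathlib import Path

DATA_DIR = Path("data")  # output directory from the project structure above


def catalog_paths(start_year: int) -> tuple[Path, Path]:
    """Return the (raw, processed) JSON paths for the catalog starting in start_year."""
    label = f"{start_year}-{start_year + 1}"
    return (
        DATA_DIR / f"raw_{label}_catalog.json",
        DATA_DIR / f"processed_{label}_catalog.json",
    )
```

Calling `catalog_paths(2025)` yields the two paths used for the 2025-2026 catalog.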

🔧 Technical Details

  • requests for HTTP requests
  • BeautifulSoup4 for HTML parsing
  • rich for console output
  • ThreadPoolExecutor for multi-threading
  • Custom progress bars for progress tracking
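To show how these pieces can fit together, here is a minimal sketch of fetching pages on a thread pool. It is not the code in scrape.py: `scrape_all` and its `fetch` parameter are hypothetical, and plain `print` calls stand in for the progress bars so the example needs only the standard library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def scrape_all(
    urls: list[str],
    fetch: Callable[[str], str],
    max_workers: int = 8,
) -> dict[str, str]:
    """Fetch every URL on a thread pool and return {url: body} for the successes."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labeled.
        futures = {pool.submit(fetch, url): url for url in urls}
        for done, future in enumerate(as_completed(futures), start=1):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not abort the run
                print(f"[{done}/{len(urls)}] failed {url}: {exc}")
            else:
                print(f"[{done}/{len(urls)}] fetched {url}")
    return results
```

In practice `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with each returned body handed to BeautifulSoup for parsing. Threads suit this workload because most of the time is spent waiting on the network.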

📝 Notes

  • The scraper is configured for the 2025-2026 course catalog
  • Course departments are now extracted from section headers instead of the department mapping page
  • The course_departments.py module is currently unused but kept for reference
  • Uses modern Python packaging with pyproject.toml
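Extracting departments from section headers amounts to carrying the most recent header forward while walking the page in document order. The sketch below shows that idea on an already-flattened stream of nodes. The `(kind, text)` representation, the `"header"` and `"course"` labels, and `assign_departments` are all assumptions, not the scraper's actual data model.

```python
from typing import Iterable


def assign_departments(nodes: Iterable[tuple[str, str]]) -> list[dict[str, str]]:
    """Attach the most recent section header to each course entry that follows it."""
    current = None  # no header seen yet
    courses = []
    for kind, text in nodes:
        if kind == "header":
            current = text.strip()
        elif kind == "course":
            courses.append(
                {"title": text.strip(), "department": current or "Unknown"}
            )
    return courses
```

With this approach no separate department mapping page is needed, because the department comes from the same page as the courses.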

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
