This project is a course recommendation system powered by a Retrieval-Augmented Generation (RAG) model. It uses a database of university courses to provide intelligent course suggestions based on user queries. The data pipeline includes scraping course data, preprocessing it, and storing it in a database for efficient retrieval and embedding-based similarity searches.
- Environment Setup: Install Python dependencies:

      pip install -r requirements.txt

- Run Data Collection:

      python scraper.py
      python data_transform.py

- Encode the data and store it in MongoDB:

      python db_store.py
- Objective: Convert the raw scraped data into a structured format and encode the necessary fields.
- Steps:
  - Load the raw JSON files.
  - Transform the data to extract the following fields:
    - `course_id`: Unique identifier for the course.
    - `course_code`: Course code (e.g., "CSC101").
    - `section_code`: Course section (e.g., "Y").
    - `name`: Course title.
    - `description`: Detailed course information.
    - `division`: Faculty or department offering the course.
    - `prerequisites`: Required courses or conditions.
    - `exclusions`: Courses that cannot be taken with this course.
    - `sessions`: Academic terms when the course is offered.
  - Encode the `description` field into vector embeddings using Sentence-Transformers:
    - Purpose: Capture the semantic content of course descriptions for similarity searches.
  - Save the transformed data in a database-friendly JSON format.
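
The transformation steps above could be sketched roughly as follows. This is a minimal illustration, not the actual contents of `data_transform.py`: the function names, the assumed layout of the raw JSON files, the `all-MiniLM-L6-v2` model name, and the `description_embedding` key are all assumptions made for the example. Only the nine field names come from the list above. If your schema overwrites `description` with the vector instead, adjust the key accordingly.

```python
import json
from pathlib import Path

# Field names come from the list above; everything else is illustrative.
FIELDS = (
    "course_id", "course_code", "section_code", "name", "description",
    "division", "prerequisites", "exclusions", "sessions",
)


def load_raw_courses(raw_dir):
    """Read every *.json file in raw_dir.

    Assumed layout: each file holds a list of course objects or one object.
    """
    courses = []
    for path in sorted(Path(raw_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        courses.extend(data if isinstance(data, list) else [data])
    return courses


def transform_course(raw):
    """Keep only the documented fields; absent fields become None."""
    record = {field: raw.get(field) for field in FIELDS}
    # Collapse runs of whitespace so the encoder sees clean text.
    record["description"] = " ".join((record["description"] or "").split())
    return record


def encode_descriptions(records, encode):
    """Attach an embedding to each record.

    `encode` maps a list of strings to a sequence of vectors, for example
    the `encode` method of a SentenceTransformer model.
    """
    vectors = encode([r["description"] for r in records])
    for record, vector in zip(records, vectors):
        # Plain floats keep the record JSON-serializable.
        # "description_embedding" is a hypothetical key name.
        record["description_embedding"] = [float(x) for x in vector]
    return records


def sbert_encoder(model_name="all-MiniLM-L6-v2"):
    """Return an encode function backed by sentence-transformers.

    The model name is an assumption. The import is deferred so the other
    helpers work even when the package is not installed.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode


def save_records(records, out_path):
    """Write the transformed records as database-friendly JSON."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

A typical run would chain these as `load_raw_courses(...)`, `transform_course` on each item, `encode_descriptions(records, sbert_encoder())`, then `save_records(...)`. Passing the encoder in as a function makes it easy to swap models or substitute a stub in tests.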
- Use MongoDB to store the preprocessed data.
- Key details:
  - Encoded Fields: `description`, stored as Sentence-Transformers vector embeddings.
  - Raw Fields: All other fields are stored in plain text for reference and display.
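
As a rough sketch of the storage step, the snippet below upserts course records into a `courses` collection with PyMongo. It is not the actual `db_store.py`: the connection URI, the `course_recommender` database name, and the choice to reuse `course_id` as the MongoDB `_id` are assumptions for illustration.

```python
def to_document(record):
    """Copy a course record and use its course_id as the MongoDB _id.

    Reusing course_id as _id is an assumption; it makes re-runs idempotent.
    """
    doc = dict(record)
    doc["_id"] = record["course_id"]
    return doc


def store_courses(records,
                  uri="mongodb://localhost:27017",   # assumed local server
                  db_name="course_recommender"):     # hypothetical name
    """Upsert records into the `courses` collection.

    Returns the number of documents inserted or modified.
    """
    # Deferred import so to_document() works without pymongo installed.
    from pymongo import MongoClient, ReplaceOne

    ops = [
        ReplaceOne({"_id": doc["_id"]}, doc, upsert=True)
        for doc in map(to_document, records)
    ]
    if not ops:
        return 0  # bulk_write rejects an empty request list

    client = MongoClient(uri)
    try:
        result = client[db_name]["courses"].bulk_write(ops, ordered=False)
        return result.upserted_count + result.modified_count
    finally:
        client.close()
```

Upserting with `ReplaceOne` means the script can be re-run after a fresh scrape without creating duplicate course documents.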
Here’s a list of tasks to help distribute work among team members:
- Set up the course data scraper and verify data completeness.
- Handle any website/API changes that require adjustments to the scraper.
- Write a script to clean and normalize raw JSON data.
- Develop a function to encode the `description` field using Sentence-Transformers.
- Test encoding results and validate the embeddings.
- Design MongoDB schemas for the `courses` and `meeting_sections` collections.
- Write and test scripts for inserting data into MongoDB.
- Optimize database queries for similarity searches.
- Integrate SBERT-encoded embeddings with the RAG pipeline.
- Test retrieval accuracy and adjust embeddings or preprocessing logic if necessary.
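
For the retrieval tasks, a brute-force cosine-similarity ranking can serve as a baseline to compare an optimized query against. The sketch below is only that: it runs in memory over already-loaded course documents, and the `description_embedding` key and function names are hypothetical. It is not necessarily how the project's RAG pipeline performs retrieval.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_courses(query_vec, courses, k=5, key="description_embedding"):
    """Return the k (score, course) pairs most similar to query_vec.

    Courses without an embedding under `key` are skipped.
    """
    scored = [
        (cosine_similarity(query_vec, course[key]), course)
        for course in courses
        if course.get(key)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The query vector should come from the same model that encoded the course descriptions; otherwise the similarity scores are not meaningful.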