Course Data Processing and Recommendation System

Project Overview

This project is a course recommendation system powered by a Retrieval-Augmented Generation (RAG) model. It uses a database of university courses to provide intelligent course suggestions based on user queries. The data pipeline includes scraping course data, preprocessing it, and storing it in a database for efficient retrieval and embedding-based similarity searches.

Data Collection and Preprocessing

Environment Setup: Install Python dependencies:
```
pip install -r requirements.txt
```

Run Data Collection:
```
python scraper.py
python data_transform.py
```

Encode the data and store it in MongoDB:
```
python db_store.py
```

2. Preprocessing Data

Objective: Convert the raw scraped data into a structured format and encode necessary fields.
Steps:
1. Load the raw JSON files.
2. Transform the data to extract the following fields:
  - course_id: Unique identifier for the course.
  - course_code: Course code (e.g., "CSC101").
  - section_code: Course section (e.g., "Y").
  - name: Course title.
  - description: Detailed course information.
  - division: Faculty or department offering the course.
  - prerequisites: Required courses or conditions.
  - exclusions: Courses that cannot be taken with this course.
  - sessions: Academic terms when the course is offered.
3. Encode the description field into vector embeddings using Sentence-Transformers (SBERT):
  - Purpose: Capture the semantic content of course descriptions for similarity searches.
4. Save the transformed data in a database-friendly JSON format.
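
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `data_transform.py`: the raw key names in `RAW_KEYS` (such as `id`, `code`, `section`), the one-JSON-array-per-file layout, the model name `all-MiniLM-L6-v2`, and the `description_embedding` key are all assumptions. Only the output field names come from the list above. Whether the embedding replaces `description` or sits beside it is not specified here, so the sketch keeps both. The encoder is passed in as a function so the transform can be exercised without loading a model.

```python
import json
from pathlib import Path
from typing import Callable, Iterable

# Output field -> key in the raw scraped record. The raw key names are
# hypothetical; adjust them to whatever the scraper actually emits.
RAW_KEYS = {
    "course_id": "id",
    "course_code": "code",
    "section_code": "section",
    "name": "name",
    "description": "description",
    "division": "division",
    "prerequisites": "prerequisites",
    "exclusions": "exclusions",
    "sessions": "sessions",
}


def transform_course(raw: dict) -> dict:
    """Keep only the documented fields; missing fields become ''."""
    record = {field: raw.get(key, "") for field, key in RAW_KEYS.items()}
    # Sessions is treated as a list of terms; normalize a missing or scalar value.
    sessions = record["sessions"] or []
    record["sessions"] = [sessions] if isinstance(sessions, str) else list(sessions)
    return record


def load_raw(directory: str) -> Iterable[dict]:
    """Yield raw course records from every *.json file in a directory.

    Assumes each file holds a JSON array of course objects.
    """
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield from json.load(handle)


def encode_descriptions(
    records: list[dict], encode: Callable[[list[str]], Iterable]
) -> list[dict]:
    """Attach a `description_embedding` list of floats to each record."""
    vectors = encode([r["description"] for r in records])
    return [
        {**r, "description_embedding": [float(x) for x in vec]}
        for r, vec in zip(records, vectors)
    ]


def sentence_transformer_encoder(model_name: str = "all-MiniLM-L6-v2"):
    """Return an encode function backed by the sentence-transformers package."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    return SentenceTransformer(model_name).encode
```

With the package installed, something like `encode_descriptions([transform_course(r) for r in load_raw("data/raw")], sentence_transformer_encoder())` would produce records ready to be saved, where `data/raw` is a placeholder path.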

3. Storing Data

Use MongoDB to store preprocessed data.
Key details:
- Encoded Fields:
  - description: Stored as Sentence-Transformers vector embeddings.
- Raw Fields: All other fields are stored in plain text for reference and display.
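
One possible way to write such documents with `pymongo` is sketched below. It is not taken from `db_store.py`: the connection URI, the database name `coursecraft`, the `description_embedding` key, and the choice to upsert on `course_id` are assumptions made for illustration. The collection name `courses` is borrowed from the Database Setup tasks.

```python
from typing import Iterable


def to_document(record: dict, embedding: Iterable[float]) -> dict:
    """Build the stored document: raw fields as-is, embedding as a float list."""
    doc = dict(record)
    # BSON has no array type for NumPy vectors, so store plain Python floats.
    doc["description_embedding"] = [float(x) for x in embedding]
    return doc


def store_courses(collection, documents: Iterable[dict]) -> int:
    """Upsert each document keyed on course_id; return how many were written."""
    count = 0
    for doc in documents:
        # replace_one with upsert=True inserts new courses and overwrites
        # existing ones, so re-running the script does not create duplicates.
        collection.replace_one({"course_id": doc["course_id"]}, doc, upsert=True)
        count += 1
    return count


def open_courses_collection(uri: str = "mongodb://localhost:27017"):
    """Connect and return the collection (URI and database name are assumptions)."""
    from pymongo import MongoClient  # imported lazily so the helpers work without it

    return MongoClient(uri)["coursecraft"]["courses"]
```

Keeping the connection in its own function means `to_document` and `store_courses` can be tested against a stand-in collection object before pointing them at a real database.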

Work Distribution

Here’s a list of tasks to help distribute work among team members:

Data Collection

Set up the course data scraper and verify data completeness.
Handle any website/API changes that require adjustments to the scraper.

Data Preprocessing

Write a script to clean and normalize raw JSON data.
Develop a function to encode the description field using Sentence-Transformers.
Test encoding results and validate the embeddings.
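
As a starting point for the validation task, a few cheap sanity checks can catch most encoding failures. The sketch below is an assumption about what "validate" could mean here (consistent dimension, finite values, no all-zero rows); the function name `validate_embeddings` is hypothetical.

```python
import math
from typing import Sequence


def validate_embeddings(vectors: Sequence[Sequence[float]]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    if not vectors:
        return ["no embeddings produced"]
    problems = []
    expected_dim = len(vectors[0])  # every row should match the first row
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            problems.append(f"row {i}: dimension {len(vec)} != {expected_dim}")
        elif any(not math.isfinite(x) for x in vec):
            problems.append(f"row {i}: contains NaN or infinity")
        elif not any(vec):
            problems.append(f"row {i}: all-zero vector")
    return problems
```

An all-zero or non-finite row usually points at an empty or malformed description upstream, so the reported row index is a useful pointer back into the preprocessed data.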

Database Setup

Design MongoDB schemas for courses and meeting_sections collections.
Write and test scripts for inserting data into MongoDB.
Optimize database queries for similarity searches.
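
Before optimizing anything, it can help to have a brute-force baseline to compare against. The sketch below ranks stored documents by cosine similarity in application code; it assumes each document carries a `description_embedding` list and a `course_code`, and it is not the project's actual query path. A database-side vector index would replace the linear scan once the collection grows.

```python
import heapq
import math
from typing import Iterable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_courses(query_vec: Sequence[float], documents: Iterable[dict], k: int = 5):
    """Return the k (score, course_code) pairs most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vec, doc["description_embedding"]), doc["course_code"])
        for doc in documents
    )
    # nlargest keeps only k items in memory while scanning the whole collection.
    return heapq.nlargest(k, scored)
```

Because the scan touches every document, its cost grows linearly with the number of courses, which is the figure any optimized query should beat.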

Model Integration

Integrate the Sentence-Transformers (SBERT) embeddings with the RAG pipeline.
Test retrieval accuracy and adjust embeddings or preprocessing logic if necessary.
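
One simple way to put a number on retrieval accuracy is recall@k over a small hand-labeled set of queries. The sketch below assumes such a labeled set exists (a list of retrieved course codes paired with the set of codes judged relevant); the metric choice and the function names are illustrative, not something the project prescribes.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant course codes that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(cases: Iterable[tuple[Sequence[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) pairs; 0.0 for an empty set."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in cases]
    return sum(scores) / len(scores) if scores else 0.0
```

Tracking this value before and after a change to the embeddings or preprocessing gives a concrete basis for deciding whether the change helped.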
