In this project, we will use extracted n-grams to build a database of features for a collection of programs.
Note: Due to privacy policies, I am not allowed to post the dataset publicly.
- Introduction
- Building the SQLite Database
- Filtering Frequent N-Grams
- Implemented Features
- Similarity Calculation
- Analysis and Conclusions
In this project, we will create and analyze SQLite databases that store n-grams extracted from student files. The goal is to apply methods for storage, filtering, and similarity analysis to detect patterns and relationships between programs.
- SQLite database `raw.db`: contains a `Homeworks` table with the following structure:
  - `Hash` - the file hash (MD5, SHA-1, or SHA-256)
  - `Assign` - the assignment number
  - `Student` - the student's identifier
  - `Ngrams` - a blob containing a sorted list of extracted n-grams. Each n-gram is represented as an unsigned 32-bit integer.
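
The table layout described above could be created roughly as follows. This is a minimal sketch, not the project's actual code: the column types, the little-endian byte order of the blob, and the helper names `pack_ngrams`, `unpack_ngrams`, `create_raw_db`, and `add_submission` are all assumptions made for illustration.

```python
import sqlite3
import struct


def pack_ngrams(ngrams):
    """Serialize n-grams as a sorted array of uint32 values.

    Assumption: little-endian byte order; the source does not specify one.
    """
    values = sorted(ngrams)
    return struct.pack(f"<{len(values)}I", *values)


def unpack_ngrams(blob):
    """Inverse of pack_ngrams: decode a blob into a list of integers."""
    return list(struct.unpack(f"<{len(blob) // 4}I", blob))


def create_raw_db(path):
    """Create the Homeworks table if needed and return the open connection.

    Assumption: column types are illustrative. Hash is not declared unique,
    because two students may submit byte-identical files.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Homeworks (
               Hash    TEXT    NOT NULL,
               Assign  INTEGER NOT NULL,
               Student TEXT    NOT NULL,
               Ngrams  BLOB    NOT NULL
           )"""
    )
    return conn


def add_submission(conn, file_hash, assign, student, ngrams):
    """Insert one submission; the n-grams are sorted and packed into a blob."""
    conn.execute(
        "INSERT INTO Homeworks (Hash, Assign, Student, Ngrams) VALUES (?, ?, ?, ?)",
        (file_hash, assign, student, pack_ngrams(ngrams)),
    )
    conn.commit()
```

Storing the n-grams as a packed array keeps each submission in a single row, and at 4 bytes per n-gram the blob length directly gives the n-gram count.
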
- SQLite database `features.db`: based on `raw.db`, this database is built with the same structure but excludes n-grams that appear in more than `T` files (`T = 30` is suggested).
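
One way to sketch the filtering step is shown below. It assumes that "appears in more than `T` files" is counted across the whole collection (not per assignment), that blobs are little-endian uint32 arrays, and that both databases are passed in as open `sqlite3` connections. The function name `build_features_db` and the helpers are hypothetical.

```python
import sqlite3
import struct
from collections import Counter


def unpack_ngrams(blob):
    # Assumption: little-endian uint32 values, 4 bytes each.
    return struct.unpack(f"<{len(blob) // 4}I", blob)


def pack_ngrams(ngrams):
    values = sorted(ngrams)
    return struct.pack(f"<{len(values)}I", *values)


def build_features_db(raw_conn, features_conn, threshold=30):
    """Copy Homeworks from raw_conn to features_conn, dropping frequent n-grams."""
    rows = raw_conn.execute(
        "SELECT Hash, Assign, Student, Ngrams FROM Homeworks"
    ).fetchall()

    # First pass: count in how many files each n-gram occurs (once per file).
    file_count = Counter()
    for _, _, _, blob in rows:
        file_count.update(set(unpack_ngrams(blob)))

    features_conn.execute(
        """CREATE TABLE IF NOT EXISTS Homeworks (
               Hash TEXT, Assign INTEGER, Student TEXT, Ngrams BLOB
           )"""
    )

    # Second pass: keep only n-grams that occur in at most `threshold` files.
    for file_hash, assign, student, blob in rows:
        kept = [g for g in unpack_ngrams(blob) if file_count[g] <= threshold]
        features_conn.execute(
            "INSERT INTO Homeworks (Hash, Assign, Student, Ngrams) VALUES (?, ?, ?, ?)",
            (file_hash, assign, student, pack_ngrams(kept)),
        )
    features_conn.commit()
```

The two-pass design is needed because the frequency of an n-gram is only known after every file has been read.
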
- Functions:
  - `sim1(db, h1, h2)` - calculates the Jaccard similarity based on two provided hashes.
  - `sim2(db, assign, s1, s2)` - calculates the Jaccard similarity based on an assignment number and two student identifiers.
    - Returns `0` if one of the students does not exist in the database.
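
The Jaccard similarity of two n-gram sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch of the two functions follows, under several assumptions: `db` is an open `sqlite3` connection (the source does not say whether it is a connection or a path), blobs are little-endian uint32 arrays, and `sim1` also returns `0` for an unknown hash (the source only states this for `sim2`). The helpers `_ngram_set` and `jaccard` are hypothetical names.

```python
import sqlite3
import struct


def _ngram_set(blob):
    # Assumption: little-endian uint32 values, 4 bytes each.
    return set(struct.unpack(f"<{len(blob) // 4}I", blob))


def jaccard(a, b):
    """|a & b| / |a | b|, defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def sim1(db, h1, h2):
    """Jaccard similarity of the two submissions identified by their hashes."""
    sets = []
    for file_hash in (h1, h2):
        row = db.execute(
            "SELECT Ngrams FROM Homeworks WHERE Hash = ?", (file_hash,)
        ).fetchone()
        if row is None:
            return 0.0  # assumption: mirror sim2's behaviour for missing rows
        sets.append(_ngram_set(row[0]))
    return jaccard(sets[0], sets[1])


def sim2(db, assign, s1, s2):
    """Jaccard similarity of two students' submissions for one assignment."""
    sets = []
    for student in (s1, s2):
        row = db.execute(
            "SELECT Ngrams FROM Homeworks WHERE Assign = ? AND Student = ?",
            (assign, student),
        ).fetchone()
        if row is None:
            return 0.0  # a student is missing from the database
        sets.append(_ngram_set(row[0]))
    return jaccard(sets[0], sets[1])
```

For example, the sets {1, 2, 3} and {2, 3, 4} share two of four distinct n-grams, so their similarity is 0.5.
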
- For each assignment:
  - Calculate the similarity between all pairs of submissions.
  - Create a top-500 list of the most similar pairs for each of the two databases (`raw.db` and `features.db`).
- Analyze source code:
  - Select 10 pairs of code from each top list for further analysis.
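
The all-pairs comparison for one assignment might be sketched as below. It assumes an open `sqlite3` connection, little-endian uint32 blobs, and one top list per assignment; the function name `top_pairs` is hypothetical. Each blob is decoded once, and `heapq.nlargest` keeps only the `k` best pairs instead of sorting every pair.

```python
import heapq
import sqlite3
import struct
from itertools import combinations


def top_pairs(db, assign, k=500):
    """Return up to k (similarity, student1, student2) tuples, most similar first."""
    rows = db.execute(
        "SELECT Student, Ngrams FROM Homeworks WHERE Assign = ? ORDER BY Student",
        (assign,),
    ).fetchall()

    # Decode every blob once (assumption: little-endian uint32 values).
    sets = [
        (student, set(struct.unpack(f"<{len(blob) // 4}I", blob)))
        for student, blob in rows
    ]

    def scored_pairs():
        # Yield the Jaccard similarity for every unordered pair of submissions.
        for (s1, a), (s2, b) in combinations(sets, 2):
            union = len(a | b)
            score = len(a & b) / union if union else 0.0
            yield (score, s1, s2)

    return heapq.nlargest(k, scored_pairs(), key=lambda pair: pair[0])
```

Running the same function against `raw.db` and `features.db` gives the two lists to compare. The number of pairs grows quadratically with the number of submissions per assignment.
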
- The analysis of similar pairs provides insights into potential common patterns or plagiarism among students.
- Using n-grams and the optimized database (`features.db`) helps reduce noise caused by frequently used elements.
- Build the `raw.db` database using the initial collection of files.
- Apply filtering to create the `features.db` database.
- Implement the `sim1` and `sim2` functions.
- Calculate and analyze similarities according to the requirements.