🤖Ngram-Similarity-Engine📚

In this project, we will use extracted n-grams to build a database of features for a collection of programs.

Note: Due to privacy policies, I am not allowed to post the dataset publicly.


Table of Contents📑

  1. Introduction
  2. Building the SQLite Database
  3. Filtering Frequent N-Grams
  4. Implemented Features
  5. Similarity Calculation
  6. Analysis and Conclusions
  7. Instructions for Use

Introduction📘

In this project, we will create and analyze SQLite databases that store n-grams extracted from student files. The goal is to apply methods for storage, filtering, and similarity analysis to detect patterns and relationships between programs.


Building the SQLite Database🛠️

  1. SQLite Database raw.db
    Contains a Homeworks table with the following columns (a build sketch is shown after this list):
    • Hash - the file hash (MD5, SHA-1, or SHA-256)
    • Assign - the assignment number
    • Student - the student's identifier
    • Ngrams - a blob containing a sorted list of extracted n-grams. Each n-gram is represented as an unsigned 32-bit integer.
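
The repository's actual build script is not reproduced here, so the following is only a minimal sketch in Python. It assumes byte-level 4-grams, SHA-256 hashes, and an illustrative `submissions` input of (assignment, student, file bytes) tuples; the function names are hypothetical.

```python
import hashlib
import sqlite3
import struct

def extract_ngrams(data: bytes, n: int = 4) -> list[int]:
    """Collect the distinct byte n-grams of a file; for n = 4 each one fits in an unsigned 32-bit integer."""
    return sorted({int.from_bytes(data[i:i + n], "big") for i in range(len(data) - n + 1)})

def build_raw_db(db_path: str, submissions: list[tuple[int, str, bytes]]) -> None:
    """Build raw.db from (assignment number, student id, file contents) tuples."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS Homeworks "
        "(Hash TEXT PRIMARY KEY, Assign INTEGER, Student TEXT, Ngrams BLOB)"
    )
    for assign, student, data in submissions:
        ngrams = extract_ngrams(data)
        blob = struct.pack(f"<{len(ngrams)}I", *ngrams)  # sorted uint32 values, little-endian
        con.execute(
            "INSERT OR REPLACE INTO Homeworks VALUES (?, ?, ?, ?)",
            (hashlib.sha256(data).hexdigest(), assign, student, blob),
        )
    con.commit()
    con.close()
```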

Filtering Frequent N-Grams🗂️

  1. SQLite Database features.db
    Built from raw.db with the same structure, but excluding n-grams that appear in more than T files (T = 30 is suggested); a filtering sketch is shown after this list.
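
A minimal sketch of the filtering step, again in Python and assuming the raw.db schema and blob encoding above; build_features_db is an illustrative name, not necessarily the repository's.

```python
import sqlite3
import struct
from collections import Counter

def build_features_db(raw_path: str, features_path: str, threshold: int = 30) -> None:
    """Copy raw.db into features.db, dropping n-grams that occur in more than `threshold` files."""
    raw = sqlite3.connect(raw_path)
    rows = raw.execute("SELECT Hash, Assign, Student, Ngrams FROM Homeworks").fetchall()

    # Document frequency: in how many files does each n-gram appear?
    df = Counter()
    for _, _, _, blob in rows:
        df.update(set(struct.unpack(f"<{len(blob) // 4}I", blob)))

    feat = sqlite3.connect(features_path)
    feat.execute(
        "CREATE TABLE IF NOT EXISTS Homeworks "
        "(Hash TEXT PRIMARY KEY, Assign INTEGER, Student TEXT, Ngrams BLOB)"
    )
    for h, assign, student, blob in rows:
        kept = sorted(g for g in struct.unpack(f"<{len(blob) // 4}I", blob) if df[g] <= threshold)
        feat.execute(
            "INSERT OR REPLACE INTO Homeworks VALUES (?, ?, ?, ?)",
            (h, assign, student, struct.pack(f"<{len(kept)}I", *kept)),
        )
    feat.commit()
    raw.close()
    feat.close()
```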

Implemented Features🧩

  1. Functions (a possible implementation is sketched after this list):
    • sim1(db, h1, h2)
      Calculates the Jaccard similarity between the two submissions identified by the given hashes.
    • sim2(db, assign, s1, s2)
      Calculates the Jaccard similarity between two students' submissions for the given assignment number.
      • Returns 0 if either student does not exist in the database.
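
One possible implementation of the two functions, assuming Python's built-in sqlite3 module and the uint32 blob encoding described above; the private helpers (_ngram_set, _jaccard) are illustrative, not part of the stated interface.

```python
import sqlite3
import struct

def _ngram_set(db: sqlite3.Connection, where: str, params: tuple) -> set[int] | None:
    """Load one row's n-gram blob as a set of integers, or None if no row matches."""
    row = db.execute(f"SELECT Ngrams FROM Homeworks WHERE {where}", params).fetchone()
    if row is None:
        return None
    blob = row[0]
    return set(struct.unpack(f"<{len(blob) // 4}I", blob))

def _jaccard(a: set[int], b: set[int]) -> float:
    # J(A, B) = |A intersect B| / |A union B|; defined as 0 when both sets are empty.
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def sim1(db: sqlite3.Connection, h1: str, h2: str) -> float:
    """Jaccard similarity between the files identified by the two hashes."""
    a = _ngram_set(db, "Hash = ?", (h1,))
    b = _ngram_set(db, "Hash = ?", (h2,))
    return 0.0 if a is None or b is None else _jaccard(a, b)

def sim2(db: sqlite3.Connection, assign: int, s1: str, s2: str) -> float:
    """Jaccard similarity between two students' submissions for one assignment;
    returns 0 if either student has no submission in the database."""
    a = _ngram_set(db, "Assign = ? AND Student = ?", (assign, s1))
    b = _ngram_set(db, "Assign = ? AND Student = ?", (assign, s2))
    return 0.0 if a is None or b is None else _jaccard(a, b)
```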

Similarity Calculation📊

  1. For each assignment:
    • Calculate the similarity between all pairs of submissions.
    • Create a top-500 list of the most similar pairs for each of the two databases (raw.db and features.db); a ranking sketch is shown after this list.
  2. Analyze source code:
    • Select 10 pairs of code from each top list for further analysis.
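
A sketch of the pairwise ranking step, reusing the sim2 function from the sketch above; top_pairs and the (similarity, assignment, student, student) tuple layout are illustrative choices.

```python
import sqlite3
from itertools import combinations

def top_pairs(db_path: str, limit: int = 500) -> list[tuple[float, int, str, str]]:
    """Score every pair of submissions within each assignment and return the
    `limit` most similar (similarity, assignment, student, student) tuples."""
    db = sqlite3.connect(db_path)
    scored = []
    for (assign,) in db.execute("SELECT DISTINCT Assign FROM Homeworks").fetchall():
        students = [s for (s,) in db.execute(
            "SELECT DISTINCT Student FROM Homeworks WHERE Assign = ?", (assign,))]
        for s1, s2 in combinations(students, 2):
            scored.append((sim2(db, assign, s1, s2), assign, s1, s2))
    db.close()
    return sorted(scored, reverse=True)[:limit]
```

Run once with db_path="raw.db" and once with db_path="features.db" to obtain the two top-500 lists.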

Analysis and Conclusions🔍

  • Analyzing the most similar pairs highlights shared patterns between submissions and can flag potential plagiarism among students.
  • Using n-grams together with the filtered database (features.db) reduces noise caused by boilerplate elements that appear in many submissions.

Instructions for Use💾

  1. Build the raw.db database using the initial collection of files.
  2. Apply filtering to create the features.db database.
  3. Implement the sim1 and sim2 functions.
  4. Calculate and analyze similarities according to the requirements. An end-to-end sketch is shown below.
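
As a usage illustration only, the following driver ties the sketches above together. The directory layout (root/assignment/student file) is an assumption about the dataset, which is not public, and load_submissions is a hypothetical helper.

```python
from pathlib import Path

def load_submissions(root: str) -> list[tuple[int, str, bytes]]:
    """Assumed layout: <root>/<assignment number>/<student id>.<ext>; adjust to the real dataset."""
    subs = []
    for path in sorted(Path(root).glob("*/*")):
        subs.append((int(path.parent.name), path.stem, path.read_bytes()))
    return subs

if __name__ == "__main__":
    build_raw_db("raw.db", load_submissions("homeworks"))      # step 1
    build_features_db("raw.db", "features.db", threshold=30)   # step 2
    for sim, assign, s1, s2 in top_pairs("features.db")[:10]:  # steps 3-4
        print(f"assignment {assign}: {s1} vs {s2} -> {sim:.3f}")
```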