Naveh Vaz Dias Hadas
Amit Ner-Gaon
This project is the third assignment in the Distributed System Programming: Scale Out with Cloud Computing and Map-Reduce course at Ben-Gurion University in 2025. Assignment instructions can be found in `assignment3.pdf`. This project focuses on semantic similarity classification using MapReduce and machine learning. The project is based on the paper *Comparing Measures of Semantic Similarity*. We modify the algorithm and use the Google Syntactic N-Grams as the corpus. Before processing, we use a Porter Stemmer to obtain the lexeme of each word. We define a feature as a pair consisting of a lexeme and a dependency label. For each lexeme, the system builds a representative vector where each entry represents the count of a specific feature. Then, for each pair of lexemes, the system constructs a 24-dimensional vector representing the distance between the lexemes' vectors, evaluated using four measures of association with context and six measures of vector similarity. Finally, we use WEKA to train a classifier and evaluate the system's accuracy, using `word-relatedness.txt` as the ground truth.
- Configure your AWS credentials.
- Create a bucket whose name matches `App.bucketname` and upload the steps' JAR files to `bucket/jars/`.
- In the S3 bucket, delete the `log/` and `outputs/` folders if they exist.
- Upload `word-relatedness.txt` to the S3 bucket. If an example corpus is needed, upload `s3inputtemp.txt` to S3.
- Run `App`.
To read the output file directly from an S3 bucket without downloading it to your local system, you can use the following command:

```bash
aws s3 cp s3://bucketassignment3/output_step1/part-r-00000 - | cat
```

Note: You may need to install the AWS CLI.
- Feature: a pair consisting of a lexeme and a dependency label.
- `count(F)`: The total number of feature occurrences.
- `count(F = f)`: The number of occurrences of a specific feature `f`.
- `count(L)`: The total number of lexeme occurrences.
- `count(L = l)`: The number of occurrences of a specific lexeme `l`.
- `count(F = f, L = l)`: The number of times the specific feature `f` appears with the specific lexeme `l`.
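To make these definitions concrete, here is a minimal sketch of how a single corpus token could be turned into a feature. The token format shown (`word/pos-tag/dep-label/head-index`) and the `stem()` helper are illustrative assumptions, not the project's actual parsing code:

```java
// Illustrative sketch only: assumes a corpus token shaped like
// "word/pos-tag/dep-label/head-index" (e.g. "dogs/NNS/nsubj/2")
// and a hypothetical stem() helper standing in for the Porter Stemmer.
public class FeatureExample {
    static String stem(String word) {
        // Placeholder for the Porter Stemmer used by the project.
        return word.toLowerCase();
    }

    public static void main(String[] args) {
        String token = "dogs/NNS/nsubj/2";
        String[] parts = token.split("/");
        String lexeme = stem(parts[0]);   // lexeme of the word
        String depLabel = parts[2];       // "nsubj"
        // A feature is the pair (lexeme, dependency label):
        String feature = lexeme + "-" + depLabel;
        System.out.println(feature);      // each occurrence contributes to count(F = f)
    }
}
```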
The system consists of four parts:
- Step01, Step02 – Preprocessing: Filter the relevant lexemes and features.
- Step1, Step2 – Corpus Statistics: Calculate `count(F = f)`, `count(L = l)`, and `count(F = f, L = l)`.
- Step3, Step4 – Algorithm Calculation: Measure association with context and compute vector similarity.
- Step5 – Assessment: Evaluate the model's accuracy.
- Step 01: Creates a `LexemeSet` with all the lexemes in `word-relatedness.txt`.
- Step 02: Creates a `DepLabelSet` with all the dependency labels in the `corpus`.
- Step 1: Calculates `count(F = f)` and `count(L = l)` over the `corpus`. Used for creating `lexemeFeatureToCountMap`. Output: `(Text feature/lexeme, LongWritable quantity)`.
- Step 2: For each lexeme present in both the `corpus` and `word-relatedness.txt`, calculates a vector of `count(F = f, L = l)` values. The step uses a `TreeMap` to create a lexicographically ordered map, ensuring a consistent structure for all lexeme vectors (see the sketch after this list). Output: `(Text lexeme, Text space_separated_counts)`.
- Step 3: Measures association with context and creates four vectors, one per association method (one such measure is sketched below). Output: `(Text lexeme, Text v5:v6:v7:v8)`, where each `vi` is a space-separated vector.
- Step 4: Using a fuzzy join, creates for each pair of lexemes a 24-dimensional vector that measures vector similarity (distance) using six distance-measure methods (see the cosine example below). Output: `(Text lexeme, Text space_separated_vector)`.
- Step 5: (Not part of the MapReduce pattern) Converts the results to ARFF format and uses WEKA to assess the model's accuracy.
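A minimal illustration of why Step 2 uses a `TreeMap`: iteration over its keys is lexicographic, so every lexeme vector lists its feature counts in the same order. The feature names here are made up for the example:

```java
import java.util.Map;
import java.util.TreeMap;

public class OrderedVectorExample {
    public static void main(String[] args) {
        // TreeMap keeps keys sorted lexicographically, so the emitted
        // counts line up in the same feature order for every lexeme.
        Map<String, Long> featureCounts = new TreeMap<>();
        featureCounts.put("run-dobj", 3L);
        featureCounts.put("bark-nsubj", 7L);
        featureCounts.put("chase-dobj", 1L);

        StringBuilder vector = new StringBuilder();
        for (long count : featureCounts.values()) {
            vector.append(count).append(' ');
        }
        // Prints "7 1 3": bark-nsubj, chase-dobj, run-dobj in sorted order.
        System.out.println(vector.toString().trim());
    }
}
```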
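For Step 3, pointwise mutual information (PMI) is one standard measure of association with context; whether it is among the four measures this project implements is not stated here, so treat this as a generic sketch built from the count definitions above:

```java
public class PmiSketch {
    /**
     * PMI(l, f) = log2( P(l, f) / (P(l) * P(f)) ), with probabilities
     * estimated from the corpus statistics defined earlier:
     *   P(l, f) = count(F = f, L = l) / count(L)
     *   P(l)    = count(L = l) / count(L)
     *   P(f)    = count(F = f) / count(F)
     */
    static double pmi(long countLF, long countLl, long countFf,
                      long totalL, long totalF) {
        double pLF = (double) countLF / totalL;
        double pL = (double) countLl / totalL;
        double pF = (double) countFf / totalF;
        return Math.log(pLF / (pL * pF)) / Math.log(2); // log base 2
    }
}
```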
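For Step 4, cosine similarity is a common vector-similarity measure; the project's six measures are not listed here, so this is only an example of the kind of computation performed on a pair of lexeme vectors:

```java
public class CosineSketch {
    // Cosine similarity between two lexeme vectors of equal length.
    static double cosine(double[] u, double[] v) {
        double dot = 0, normU = 0, normV = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            normU += u[i] * u[i];
            normV += v[i] * v[i];
        }
        return dot / (Math.sqrt(normU) * Math.sqrt(normV));
    }
}
```

Applying six such measures to the four association vectors of each lexeme pair yields the 24 (= 4 × 6) entries described in Step 4.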
As instructed, we assume that the word pairs in the gold-standard dataset `word-relatedness.txt` can be stored in memory. This assumption was used in steps 1 and 2 to build the lexeme set and in step 3 to perform a mapper-side join with the data from step 1's output.
Using WEKA, we trained a model on the `word-relatedness.txt` dataset and evaluated the system's results with the trained model (a minimal evaluation sketch appears below).
We chose the RandomForest classifier and applied stratified 10-fold cross-validation for training.
This combination helps stabilize the learning process on imbalanced data and produces a more reliable model. This is necessary because both the system's output and the `word-relatedness.txt` dataset are heavily skewed toward the FALSE class (a 1:10 ratio).
- We also tried the J48 classifier (see WEKA Results J48).
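A minimal sketch of the Step 5 evaluation, assuming the converted ARFF file is named `vectors.arff` (a made-up name) and the class attribute comes last; the WEKA calls themselves (`RandomForest`, `Evaluation.crossValidateModel`) are the library's standard API:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateSketch {
    public static void main(String[] args) throws Exception {
        // "vectors.arff" is a placeholder for the file produced in Step 5.
        Instances data = DataSource.read("vectors.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        Evaluation eval = new Evaluation(data);
        // WEKA stratifies the folds for nominal classes, which matters
        // here because of the ~1:10 FALSE/TRUE class skew.
        eval.crossValidateModel(rf, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}
```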
With the benefit of hindsight, we would like to suggest a few improvements to the system architecture:
- Unify Step01, Step02, and Step1 into a single step (a sketch of the tagged emission appears after this list). The updated step would:
  - `Mapper.setup()`: Create a set of all lexemes in `word-relatedness.txt`.
  - `Mapper.map()`: Process the `corpus` as input and, for each lexeme present in the set from `setup()`, calculate `count(L = l)` and `count(F = f)`. Emit each count with a tag indicating whether it belongs to a lexeme or a feature.
  - `Reducer.reduce()`: Sum all counts and, using the tag, build two dictionaries, one for lexemes and one for features, mapping each lexeme/feature to its count.
- We believe that using JSON or another standard textual representation of a dictionary would be better practice than manually implementing the parsing (see the Gson example below).
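A rough sketch of the tagged-emission idea from the first suggestion; the class name, tag strings, and elided parsing are ours, not code from the project:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the proposed unified step: one pass over the corpus emits
// both lexeme counts and feature counts, distinguished by a tag prefix
// so the reducer can split them into two dictionaries.
public class UnifiedCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private Set<String> lexemes;

    @Override
    protected void setup(Context context) {
        // In the real step this would load word-relatedness.txt
        // (e.g., from the distributed cache) into the set.
        lexemes = new HashSet<>();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parsing of the syntactic n-gram line is elided; assume it
        // yields a lexeme and a feature derived from the line.
        String lexeme = "...";   // placeholder
        String feature = "...";  // placeholder
        if (!lexemes.contains(lexeme)) {
            return; // skip lexemes absent from word-relatedness.txt
        }
        context.write(new Text("L\t" + lexeme), ONE);   // count(L = l)
        context.write(new Text("F\t" + feature), ONE);  // count(F = f)
    }
}
```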
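For the second suggestion, a serialization library such as Gson (one possible choice, not something the project uses) removes the hand-written parsing entirely:

```java
import java.lang.reflect.Type;
import java.util.Map;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

public class JsonDictExample {
    public static void main(String[] args) {
        Gson gson = new Gson();
        Map<String, Long> counts = Map.of("dog", 42L, "bark-nsubj", 7L);

        // Serialize the dictionary to a single JSON line...
        String json = gson.toJson(counts); // e.g. {"dog":42,"bark-nsubj":7}

        // ...and parse it back without any custom string splitting.
        Type type = new TypeToken<Map<String, Long>>() {}.getType();
        Map<String, Long> parsed = gson.fromJson(json, type);
        System.out.println(parsed.get("dog")); // 42
    }
}
```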