Naveh Vaz Dias Hadas
Amit Ner-Gaon
This project is the third assignment in the Distributed System Programming: Scale Out with Cloud Computing and Map-Reduce course at Ben-Gurion University in 2025. Assignment instructions can be found in `assignment3.pdf`. This project focuses on semantic similarity classification using MapReduce and machine learning. The project is based on the paper *Comparing Measures of Semantic Similarity*. We modify the algorithm and use the Google Syntactic N-Grams as the corpus. Before processing, we use a Porter Stemmer to obtain the lexeme of each word. We define a feature as a pair consisting of a lexeme and a dependency label. For each lexeme, the system builds a representative vector where each entry represents the count of a specific feature. Then, for each pair of lexemes, the system constructs a 24-dimensional vector representing the distance between the lexemes' vectors, evaluated using four measures of association with context and six measures of vector similarity. Finally, we use WEKA to train a classifier and evaluate the system's accuracy, using `word-relatedness.txt` as the ground truth.
- Configure your AWS credentials.
- Create a bucket whose name matches `App.bucketname` and upload the steps' JAR files to `bucket/jars/`.
- In the S3 bucket, delete the `log/` and `outputs/` folders if they exist.
- Upload `word-relatedness.txt` to the S3 bucket. If an example corpus is needed, upload `s3inputtemp.txt` to S3.
- Run `App`.
To read the output file directly from an S3 bucket without downloading it to your local system, you can use the following command:

```bash
aws s3 cp s3://bucketassignment3/output_step1/part-r-00000 - | cat
```

Note: You may need to install the AWS CLI.
- Feature: a pair consisting of a lexeme and a dependency label.
- `count(F)`: The total number of feature occurrences.
- `count(F = f)`: The number of occurrences of a specific feature `f`.
- `count(L)`: The total number of lexeme occurrences.
- `count(L = l)`: The number of occurrences of a specific lexeme `l`.
- `count(F = f, L = l)`: The number of times the specific feature `f` appears with the specific lexeme `l`.
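To make these definitions concrete, here is a minimal sketch of how a single corpus token could be turned into a feature. The token format shown (`word/pos-tag/dep-label/head-index`) and the `stem()` helper are illustrative assumptions, not the project's actual parsing code:

```java
// Illustrative sketch only: assumes a corpus token shaped like
// "word/pos-tag/dep-label/head-index" (e.g. "dogs/NNS/nsubj/2")
// and a hypothetical stem() helper standing in for the Porter Stemmer.
public class FeatureExample {
    static String stem(String word) {
        // Placeholder for the Porter Stemmer used by the project.
        return word.toLowerCase();
    }

    public static void main(String[] args) {
        String token = "dogs/NNS/nsubj/2";
        String[] parts = token.split("/");
        String lexeme = stem(parts[0]);   // lexeme of the word
        String depLabel = parts[2];       // "nsubj"
        // A feature is the pair (lexeme, dependency label):
        String feature = lexeme + "-" + depLabel;
        System.out.println(feature);      // each occurrence contributes to count(F = f)
    }
}
```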
The system consists of four parts:
- Step01, Step02 – Preprocessing: Filter the relevant lexemes and features.
- Step1, Step2 – Corpus Statistics: Calculate `count(F = f)`, `count(L = l)`, and `count(F = f, L = l)`.
- Step3, Step4 – Algorithm Calculation: Measure association with context and compute vector similarity.
- Step5 – Assessment: Evaluate the model's accuracy.
- Step 01: Creates a `LexemeSet` with all the lexemes in `word-relatedness.txt`.
- Step 02: Creates a `DepLabelSet` with all the dependency labels in the `corpus`.
- Step 1: Calculates `count(F = f)` and `count(L = l)` over the `corpus`. Used for creating `lexemeFeatureToCountMap`. Output: `(Text feature/lexeme, LongWritable quantity)`.
- Step 2: For each lexeme present in both the `corpus` and `word-relatedness.txt`, calculates a vector of `count(F = f, L = l)` values. The step uses a `TreeMap` to create a lexicographically ordered map, ensuring a consistent structure for all lexeme vectors (see the sketch after this list). Output: `(Text lexeme, Text space_separated_counts)`.
- Step 3: Measures association with context and creates four vectors, one per association method (one such measure is sketched below). Output: `(Text lexeme, Text v5:v6:v7:v8)`, where each `vi` is a space-separated vector.
- Step 4: Using a fuzzy join, creates for each pair of lexemes a 24-dimensional vector that measures vector similarity (distance) using six distance-measure methods (see the cosine example below). Output: `(Text lexeme, Text space_separated_vector)`.
- Step 5: (Not part of the MapReduce pattern) Converts the results to ARFF format and uses WEKA to assess the model's accuracy.
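A minimal illustration of why Step 2 uses a `TreeMap`: iteration over its keys is lexicographic, so every lexeme vector lists its feature counts in the same order. The feature names here are made up for the example:

```java
import java.util.Map;
import java.util.TreeMap;

public class OrderedVectorExample {
    public static void main(String[] args) {
        // TreeMap keeps keys sorted lexicographically, so the emitted
        // counts line up in the same feature order for every lexeme.
        Map<String, Long> featureCounts = new TreeMap<>();
        featureCounts.put("run-dobj", 3L);
        featureCounts.put("bark-nsubj", 7L);
        featureCounts.put("chase-dobj", 1L);

        StringBuilder vector = new StringBuilder();
        for (long count : featureCounts.values()) {
            vector.append(count).append(' ');
        }
        // Prints "7 1 3": bark-nsubj, chase-dobj, run-dobj in sorted order.
        System.out.println(vector.toString().trim());
    }
}
```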
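For Step 3, pointwise mutual information (PMI) is one standard measure of association with context; whether it is among the four measures this project implements is not stated here, so treat this as a generic sketch built from the count definitions above:

```java
public class PmiSketch {
    /**
     * PMI(l, f) = log2( P(l, f) / (P(l) * P(f)) ), with probabilities
     * estimated from the corpus statistics defined earlier:
     *   P(l, f) = count(F = f, L = l) / count(L)
     *   P(l)    = count(L = l) / count(L)
     *   P(f)    = count(F = f) / count(F)
     */
    static double pmi(long countLF, long countLl, long countFf,
                      long totalL, long totalF) {
        double pLF = (double) countLF / totalL;
        double pL = (double) countLl / totalL;
        double pF = (double) countFf / totalF;
        return Math.log(pLF / (pL * pF)) / Math.log(2); // log base 2
    }
}
```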
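For Step 4, cosine similarity is a common vector-similarity measure; the project's six measures are not listed here, so this is only an example of the kind of computation performed on a pair of lexeme vectors:

```java
public class CosineSketch {
    // Cosine similarity between two lexeme vectors of equal length.
    static double cosine(double[] u, double[] v) {
        double dot = 0, normU = 0, normV = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            normU += u[i] * u[i];
            normV += v[i] * v[i];
        }
        return dot / (Math.sqrt(normU) * Math.sqrt(normV));
    }
}
```

Applying six such measures to the four association vectors of each lexeme pair yields the 24 (= 4 × 6) entries described in Step 4.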
As instructed, we assume that the word pairs in the gold-standard dataset `word-relatedness.txt` can be stored in memory. This assumption was used in steps 1 and 2 to build the lexeme set and in step 3 to perform a mapper-side join with the data from step 1's output.
Using WEKA, we trained a model on the `word-relatedness.txt` dataset and evaluated the system's results with the trained model (a minimal evaluation sketch appears below).
We chose the RandomForest classifier and applied stratified 10-fold cross-validation for training.
This combination helps stabilize the learning process on imbalanced data and produces a more reliable model. This is necessary because both the system's output and the `word-relatedness.txt` dataset are heavily skewed toward the FALSE class (a 1:10 ratio).
- We also tried the J48 classifier (see WEKA Results J48).
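A minimal sketch of the Step 5 evaluation, assuming the converted ARFF file is named `vectors.arff` (a made-up name) and the class attribute comes last; the WEKA calls themselves (`RandomForest`, `Evaluation.crossValidateModel`) are the library's standard API:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateSketch {
    public static void main(String[] args) throws Exception {
        // "vectors.arff" is a placeholder for the file produced in Step 5.
        Instances data = DataSource.read("vectors.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        Evaluation eval = new Evaluation(data);
        // WEKA stratifies the folds for nominal classes, which matters
        // here because of the ~1:10 FALSE/TRUE class skew.
        eval.crossValidateModel(rf, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}
```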
With the benefit of hindsight, we would like to suggest a few improvements to the system architecture:
- Unify Step01, Step02, and Step1 into a single step (a sketch of the tagged emission appears after this list). The updated step would:
  - `Mapper.setup()`: Create a set of all lexemes in `word-relatedness.txt`.
  - `Mapper.map()`: Process the `corpus` as input and, for each lexeme present in the set from `setup()`, calculate `count(L = l)` and `count(F = f)`. Emit each count with a tag indicating whether it belongs to a lexeme or a feature.
  - `Reducer.reduce()`: Sum all counts and, using the tag, build two dictionaries, one for lexemes and one for features, mapping each lexeme/feature to its count.
- We believe that using JSON or another standard textual representation of a dictionary would be better practice than manually implementing the parsing (see the Gson example below).
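A rough sketch of the tagged-emission idea from the first suggestion; the class name, tag strings, and elided parsing are ours, not code from the project:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the proposed unified step: one pass over the corpus emits
// both lexeme counts and feature counts, distinguished by a tag prefix
// so the reducer can split them into two dictionaries.
public class UnifiedCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    private Set<String> lexemes;

    @Override
    protected void setup(Context context) {
        // In the real step this would load word-relatedness.txt
        // (e.g., from the distributed cache) into the set.
        lexemes = new HashSet<>();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parsing of the syntactic n-gram line is elided; assume it
        // yields a lexeme and a feature derived from the line.
        String lexeme = "...";   // placeholder
        String feature = "...";  // placeholder
        if (!lexemes.contains(lexeme)) {
            return; // skip lexemes absent from word-relatedness.txt
        }
        context.write(new Text("L\t" + lexeme), ONE);   // count(L = l)
        context.write(new Text("F\t" + feature), ONE);  // count(F = f)
    }
}
```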
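For the second suggestion, a serialization library such as Gson (one possible choice, not something the project uses) removes the hand-written parsing entirely:

```java
import java.lang.reflect.Type;
import java.util.Map;
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

public class JsonDictExample {
    public static void main(String[] args) {
        Gson gson = new Gson();
        Map<String, Long> counts = Map.of("dog", 42L, "bark-nsubj", 7L);

        // Serialize the dictionary to a single JSON line...
        String json = gson.toJson(counts); // e.g. {"dog":42,"bark-nsubj":7}

        // ...and parse it back without any custom string splitting.
        Type type = new TypeToken<Map<String, Long>>() {}.getType();
        Map<String, Long> parsed = gson.fromJson(json, type);
        System.out.println(parsed.get("dog")); // 42
    }
}
```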