This repository contains all code and statistical analysis results of our study.
Our study is based on the EMIP dataset. For more details about the dataset, please refer to the official EMIP website: http://www.emipws.org/dataset/. The directory `emip_dataset/` contains the original EMIP dataset, including raw eye-tracking data, metadata, and stimuli. The directory `EMIP Toolkit/` is a collection of essential tools for preprocessing the dataset (Reference: https://doi.org/10.1145/3448018.3457425, GitHub repo: https://github.com/nalmadi/EMIP-Toolkit). The directory `Corrected EMIP Dataset/` is a preprocessed version of the dataset provided by the authors of the EMIP Toolkit paper. We strongly advise using this corrected version, as the fixation correction step requires extensive manual review, which the authors have already carried out through a peer-review approach. For accuracy and efficiency, our study also uses the corrected EMIP dataset rather than performing its own preprocessing, thereby leveraging the high-quality fixation corrections of the original authors.
All preprocessing scripts for the corrected EMIP dataset are located in the `Preprocessing_And_Data_Management/` directory. Since the corrected dataset stores the eye-tracking data of all trials from all participants in a single `.csv` file, which can be inconvenient for further analysis, the provided scripts organize and refine the dataset for improved usability.
To split the data by participant and trial, run `split_corrected_data_into_participants_trials.py`. This script creates a separate directory for each participant and saves each trial as an individual `.csv` file within the corresponding participant's directory.
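For illustration, a minimal sketch of such a split with pandas; the input file name and the "participant"/"trial" column names are assumptions, not the actual schema used by the script:

```python
# Hypothetical sketch: split the single corrected-dataset CSV into one file per
# trial, grouped into per-participant directories. The input file name and the
# "participant"/"trial" column names are assumptions.
import pandas as pd
from pathlib import Path

df = pd.read_csv("corrected_emip_dataset.csv")  # assumed input file name

for (participant, trial), trial_df in df.groupby(["participant", "trial"]):
    out_dir = Path("participants") / str(participant)
    out_dir.mkdir(parents=True, exist_ok=True)
    trial_df.to_csv(out_dir / f"trial_{trial}.csv", index=False)
```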
To incorporate complexity levels, run `adding_complexity_value.py`. This script appends a new column labeled "complexity", indicating the complexity level of each trial based on its stimulus. Alternatively, run `adding_complexity_value_new_categorization.py`, which is based on another categorization of the code snippets.
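A minimal sketch of what appending such a column could look like; the stimulus-to-complexity mapping and the "stimulus" column name below are illustrative assumptions, not the study's actual categorization:

```python
# Hypothetical sketch of appending a "complexity" column per trial.
# The mapping and column names are assumptions, not the actual categorization.
import pandas as pd

COMPLEXITY_BY_STIMULUS = {"rectangle": "low", "vehicle": "high"}  # assumed labels

def add_complexity(trial_csv: str) -> None:
    df = pd.read_csv(trial_csv)
    # Assumes a "stimulus" column identifying the code snippet shown in the trial.
    df["complexity"] = df["stimulus"].map(COMPLEXITY_BY_STIMULUS)
    df.to_csv(trial_csv, index=False)
```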
To extract the correctness of each participant's answer to the comprehension question from the metadata, run `adding_comprehension_question_results.py`. This script appends a new column labeled "Comprehension_Question_Result", indicating whether the participant answered the comprehension question correctly after reading each trial (1 for correct, 0 for incorrect). It also appends the "complexity" column, following the same logic as `adding_complexity_value.py`.
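For illustration, deriving the correctness flag from the metadata might look like the following; the metadata file name and field names are assumptions about the schema:

```python
# Hypothetical sketch of deriving the comprehension correctness flag.
# File and column names are assumptions, not EMIP's actual metadata schema.
import pandas as pd

metadata = pd.read_csv("emip_metadata.csv")  # assumed file name
# 1 if the given answer matches the expected answer, else 0 (assumed fields)
metadata["Comprehension_Question_Result"] = (
    metadata["given_answer"] == metadata["correct_answer"]
).astype(int)
```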
To analyze different stages of the comprehension process, trials can be segmented into timeframes. Run `separating_trial_into_two_parts.py` to split each trial into two equally long timeframes; the two segments are saved as separate `.csv` files. Similarly, `separating_trial_into_three_parts.py` divides each trial into three equally long timeframes, saving the three segments as distinct `.csv` files.
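A minimal sketch of splitting one trial into two equal-duration timeframes, assuming a "timestamp" column; the actual scripts may use a different schema:

```python
# Hypothetical sketch: split one trial into two equal-duration timeframes.
# Assumes a "timestamp" column; the actual scripts may differ.
import pandas as pd

def split_into_two(trial_csv: str) -> None:
    df = pd.read_csv(trial_csv)
    midpoint = (df["timestamp"].min() + df["timestamp"].max()) / 2
    df[df["timestamp"] <= midpoint].to_csv(trial_csv.replace(".csv", "_part1.csv"), index=False)
    df[df["timestamp"] > midpoint].to_csv(trial_csv.replace(".csv", "_part2.csv"), index=False)
```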
The directories `Computing Metrics/`, `Computing Metrics Complexity/`, `Computing Metrics Complexity New Categorization/`, `Computing Metrics Behavioural Data/`, `Computing Metrics Complexity_2stages/`, and `Computing Metrics Complexity_3stages/` contain scripts for computing local and global metrics. Each directory provides similar functionality with slight adjustments tailored to different research questions; please select the directory that aligns with the research question you are investigating. The `Computing Metrics/` directory serves as the base version.
To compute local metrics, run `Computing_local_metrics.py`. A summary of the resulting local metrics is stored in a `.csv` file.
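As an example of a local metric, mean saccade length (one of the linearity measures analyzed below) is the average Euclidean distance between consecutive fixations. A hedged sketch, assuming "x"/"y" fixation coordinate columns; the actual metric definitions live in the script itself:

```python
# Hypothetical sketch of one local metric: mean saccade length, the average
# Euclidean distance between consecutive fixations. Column names are assumptions.
import numpy as np
import pandas as pd

def mean_saccade_length(fixations: pd.DataFrame) -> float:
    dx = fixations["x"].diff()
    dy = fixations["y"].diff()
    return float(np.sqrt(dx**2 + dy**2).mean())
```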
To compute global metrics, run `Computing_global_metrics.py`. A summary of the resulting global metrics is stored in a `.csv` file.
To merge the summaries of local and global metrics, run `Combining_Summaries.py`. This script consolidates the computed results into a single `.csv` file for further analysis.
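As an illustration, merging the two summary files with pandas might look like the following; the file names and key columns are assumptions:

```python
# Hypothetical sketch of merging the two metric summaries on shared keys.
# File names and key columns are assumptions; the actual script may differ.
import pandas as pd

local_df = pd.read_csv("local_metrics_summary.csv")
global_df = pd.read_csv("global_metrics_summary.csv")
combined = local_df.merge(global_df, on=["participant", "trial"], how="inner")
combined.to_csv("combined_summary.csv", index=False)
```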
The script `NW_Algorithm.py` contains the implementation of the Needleman-Wunsch algorithm, which is essential for computing global metrics. Ensure that this script is available and executed as part of the global metric computation process.
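For reference, a minimal sketch of the Needleman-Wunsch global alignment score, assuming unit match/mismatch/gap scores; `NW_Algorithm.py` may use different scoring parameters and may also recover the alignment itself:

```python
# Minimal sketch of the Needleman-Wunsch global alignment algorithm with
# assumed unit scores; the repository's implementation may differ.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# Example: comparing a recorded reading order against an expected one.
print(needleman_wunsch("ABCD", "ABDD"))  # 2
```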
The directories `Data_Analysis_LMM_Code/`, `Data_Analysis_Complexity_LMM_Code/`, `Data_Analysis_Complexity_New_Categorization_LMM_Code/`, `Data_Analysis_Behavioural_Data_LMM_Code/`, `Data_Analysis_Complexity_2stages_LMM_Code/`, and `Data_Analysis_Complexity_3stages_LMM_Code/` contain statistical analysis scripts written in R. Each directory corresponds to a different research question, so please select the directory that aligns with your investigation. The `Data_Analysis_LMM_Code/` directory serves as the base version. Each R file under these directories contains a single regression model for one linearity measure, including Element Coverage, Execution Order Dynamic Score, Execution Order Dynamic Repetitions, Execution Order Naive Score, Horizontal Later Rate, Line Regression Rate, Regression Rate, Saccade Length, Story Order Dynamic Score, Story Order Dynamic Repetitions, Story Order Naive Score, Vertical Later Rate, and Vertical Next Rate. The outputs are saved in `Analysis_Results_Expertise_Programming_Language/`, `Analysis_Results_Expertise_Programming_Language_Complexity/`, `Analysis_Results_Behavioural_Data/`, `Analysis_Results_Expertise_Programming_Language_Complexity_2stages/`, and `Analysis_Results_Expertise_Programming_Language_Complexity_3stages/`.
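The analysis itself is implemented in R; purely for illustration, here is a hedged Python sketch (via `statsmodels`) of the general shape of such a linear mixed model, with a random intercept per participant. All column names and model terms are assumptions, not the repository's actual specification:

```python
# Illustrative sketch only: the repository's models are implemented in R.
# Column names and model terms are assumptions, not the actual R scripts'.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("combined_summary.csv")  # assumed merged metrics file
model = smf.mixedlm(
    "saccade_length ~ expertise * programming_language",  # assumed fixed effects
    data,
    groups=data["participant"],  # random intercept per participant
)
result = model.fit()
print(result.summary())
```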
We propose two different approaches to categorizing the code snippets in the EMIP dataset by complexity level. For the first approach, follow the pipeline of `adding_complexity_value.py`, `Computing Metrics Complexity/`, and `Data_Analysis_Complexity_LMM_Code/`. For the second approach, follow the pipeline of `adding_complexity_value_new_categorization.py`, `Computing Metrics Complexity New Categorization/`, and `Data_Analysis_Complexity_New_Categorization_LMM_Code/`. Please refer to `Measuring the Complexity of the Source Code.txt` for more details about these two categorization methods.
The directory `semantic category analysis/` contains all scripts required for conducting the abstract scan path analysis. To assign tokens to semantic categories, run `assigning_semantic_category.py`. To generate the full abstract scan path for all trials, run `generating_abstract_scan_path.py`, which produces a single `.csv` file where each row corresponds to one trial and includes relevant metadata: participant, code file, programming language, comprehension question result, source code complexity, expertise level, and the generated scan path. The scan path itself is represented as a sequence of semantic categories connected by arrows.
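For illustration, a minimal sketch of how such an arrow-joined abstract scan path could be built from a sequence of fixated token categories; the category names, separator, and the collapsing of consecutive repeats are assumptions about the stored format:

```python
# Hypothetical sketch of collapsing a token-level fixation sequence into an
# abstract scan path of semantic categories joined by arrows. The separator
# and the dropping of consecutive repeats are assumed conventions.
def abstract_scan_path(categories: list[str]) -> str:
    collapsed = [categories[0]]
    for cat in categories[1:]:
        if cat != collapsed[-1]:  # drop consecutive duplicates
            collapsed.append(cat)
    return " -> ".join(collapsed)

print(abstract_scan_path(["identifier", "identifier", "keyword", "operator"]))
# identifier -> keyword -> operator
```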
To perform the bigram and trigram analysis, run `bigram_trigram_analysis.py`. To generate plots of the bigram and trigram results, run `bigram_plot.py` and `trigram_plot.py`, respectively. To train the logistic regression model that predicts the comprehension result from features of the abstract scan path, run `predict_comprehension.py`.
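For illustration, bigram and trigram counts can be extracted from an arrow-joined scan path as follows; the `" -> "` separator is an assumption about the stored format, and such counts are one plausible kind of feature for the comprehension prediction:

```python
# Hypothetical sketch of n-gram (bigram/trigram) extraction from a scan path.
# The " -> " separator is an assumed convention of the stored format.
from collections import Counter

def ngrams(path: str, n: int) -> Counter:
    cats = path.split(" -> ")
    return Counter(tuple(cats[i:i + n]) for i in range(len(cats) - n + 1))

bigrams = ngrams("identifier -> keyword -> operator -> keyword", 2)
print(bigrams.most_common(3))
```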