This repository contains all code and statistical analysis results of our study.
Our study is based on the EMIP dataset. For more details about the dataset, please refer to the official EMIP website: http://www.emipws.org/dataset/. The directory `emip_dataset/` contains the original EMIP dataset, including raw eye-tracking data, metadata, and stimuli. The directory `EMIP Toolkit/` is a collection of essential tools for preprocessing the dataset (Reference: https://doi.org/10.1145/3448018.3457425, GitHub repo: https://github.com/nalmadi/EMIP-Toolkit). The directory `Corrected EMIP Dataset/` is a preprocessed version of the dataset provided by the authors of the EMIP Toolkit paper. We strongly advise using this corrected version, as the fixation correction step requires extensive manual review, which the authors have already carried out through a peer-review approach. For accuracy and efficiency, our study also uses the corrected EMIP dataset rather than performing its own preprocessing, thereby leveraging the high-quality fixation corrections of the original authors.
All preprocessing scripts for the corrected EMIP dataset are located in the `Preprocessing_And_Data_Management/` directory. Since the corrected dataset stores the eye-tracking data of all trials from all participants in a single `.csv` file, which can be inconvenient for further analysis, the provided scripts organize and refine the dataset for improved usability.
To split the data by participant and trial, run `split_corrected_data_into_participants_trials.py`. This script creates a separate directory for each participant and saves each trial as an individual `.csv` file within the corresponding participant's directory.
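For illustration, a minimal sketch of such a split with pandas; the input file name and the "participant"/"trial" column names are assumptions, not the actual schema used by the script:

```python
# Hypothetical sketch: split the single corrected-dataset CSV into one file per
# trial, grouped into per-participant directories. The input file name and the
# "participant"/"trial" column names are assumptions.
import pandas as pd
from pathlib import Path

df = pd.read_csv("corrected_emip_dataset.csv")  # assumed input file name

for (participant, trial), trial_df in df.groupby(["participant", "trial"]):
    out_dir = Path("participants") / str(participant)
    out_dir.mkdir(parents=True, exist_ok=True)
    trial_df.to_csv(out_dir / f"trial_{trial}.csv", index=False)
```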
To incorporate complexity levels, run `adding_complexity_value.py`. This script appends a new column labeled "complexity", indicating the complexity level of each trial based on its stimulus. Alternatively, run `adding_complexity_value_new_categorization.py`, which is based on another categorization of the code snippets.
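A minimal sketch of what appending such a column could look like; the stimulus-to-complexity mapping and the "stimulus" column name below are illustrative assumptions, not the study's actual categorization:

```python
# Hypothetical sketch of appending a "complexity" column per trial.
# The mapping and column names are assumptions, not the actual categorization.
import pandas as pd

COMPLEXITY_BY_STIMULUS = {"rectangle": "low", "vehicle": "high"}  # assumed labels

def add_complexity(trial_csv: str) -> None:
    df = pd.read_csv(trial_csv)
    # Assumes a "stimulus" column identifying the code snippet shown in the trial.
    df["complexity"] = df["stimulus"].map(COMPLEXITY_BY_STIMULUS)
    df.to_csv(trial_csv, index=False)
```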
To extract the correctness of each participant's answer to the comprehension question from the metadata, run `adding_comprehension_question_results.py`. This script appends a new column labeled "Comprehension_Question_Result", indicating whether the participant answered the comprehension question correctly after reading each trial (1 for correct, 0 for incorrect). It also appends the "complexity" column, following the same logic as `adding_complexity_value.py`.
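For illustration, deriving the correctness flag from the metadata might look like the following; the metadata file name and field names are assumptions about the schema:

```python
# Hypothetical sketch of deriving the comprehension correctness flag.
# File and column names are assumptions, not EMIP's actual metadata schema.
import pandas as pd

metadata = pd.read_csv("emip_metadata.csv")  # assumed file name
# 1 if the given answer matches the expected answer, else 0 (assumed fields)
metadata["Comprehension_Question_Result"] = (
    metadata["given_answer"] == metadata["correct_answer"]
).astype(int)
```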
To analyze different stages of the comprehension process, trials can be segmented into timeframes. Run `separating_trial_into_two_parts.py` to split each trial into two equally long timeframes; the two segments are saved as separate `.csv` files. Similarly, `separating_trial_into_three_parts.py` divides each trial into three equally long timeframes, saving the three segments as distinct `.csv` files.
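A minimal sketch of splitting one trial into two equal-duration timeframes, assuming a "timestamp" column; the actual scripts may use a different schema:

```python
# Hypothetical sketch: split one trial into two equal-duration timeframes.
# Assumes a "timestamp" column; the actual scripts may differ.
import pandas as pd

def split_into_two(trial_csv: str) -> None:
    df = pd.read_csv(trial_csv)
    midpoint = (df["timestamp"].min() + df["timestamp"].max()) / 2
    df[df["timestamp"] <= midpoint].to_csv(trial_csv.replace(".csv", "_part1.csv"), index=False)
    df[df["timestamp"] > midpoint].to_csv(trial_csv.replace(".csv", "_part2.csv"), index=False)
```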
The directories `Computing Metrics/`, `Computing Metrics Complexity/`, `Computing Metrics Complexity New Categorization/`, `Computing Metrics Behavioural Data/`, `Computing Metrics Complexity_2stages/`, and `Computing Metrics Complexity_3stages/` contain scripts for computing local and global metrics. Each directory provides similar functionality with slight adjustments tailored to different research questions; please select the directory that aligns with the research question you are investigating. The `Computing Metrics/` directory serves as the base version.
To compute local metrics, run `Computing_local_metrics.py`. A summary of the resulting local metrics is stored in a `.csv` file.
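As an example of a local metric, mean saccade length (one of the linearity measures analyzed below) is the average Euclidean distance between consecutive fixations. A hedged sketch, assuming "x"/"y" fixation coordinate columns; the actual metric definitions live in the script itself:

```python
# Hypothetical sketch of one local metric: mean saccade length, the average
# Euclidean distance between consecutive fixations. Column names are assumptions.
import numpy as np
import pandas as pd

def mean_saccade_length(fixations: pd.DataFrame) -> float:
    dx = fixations["x"].diff()
    dy = fixations["y"].diff()
    return float(np.sqrt(dx**2 + dy**2).mean())
```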
To compute global metrics, run `Computing_global_metrics.py`. A summary of the resulting global metrics is stored in a `.csv` file.
To merge the summaries of local and global metrics, run `Combining_Summaries.py`. This script consolidates the computed results into a single `.csv` file for further analysis.
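As an illustration, merging the two summary files with pandas might look like the following; the file names and key columns are assumptions:

```python
# Hypothetical sketch of merging the two metric summaries on shared keys.
# File names and key columns are assumptions; the actual script may differ.
import pandas as pd

local_df = pd.read_csv("local_metrics_summary.csv")
global_df = pd.read_csv("global_metrics_summary.csv")
combined = local_df.merge(global_df, on=["participant", "trial"], how="inner")
combined.to_csv("combined_summary.csv", index=False)
```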
The script `NW_Algorithm.py` contains the implementation of the Needleman-Wunsch algorithm, which is essential for computing global metrics. Ensure that this script is available and executed as part of the global metric computation process.
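For reference, a minimal sketch of the Needleman-Wunsch global alignment score, assuming unit match/mismatch/gap scores; `NW_Algorithm.py` may use different scoring parameters and may also recover the alignment itself:

```python
# Minimal sketch of the Needleman-Wunsch global alignment algorithm with
# assumed unit scores; the repository's implementation may differ.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# Example: comparing a recorded reading order against an expected one.
print(needleman_wunsch("ABCD", "ABDD"))  # 2
```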
The directories `Data_Analysis_LMM_Code/`, `Data_Analysis_Complexity_LMM_Code/`, `Data_Analysis_Complexity_New_Categorization_LMM_Code/`, `Data_Analysis_Behavioural_Data_LMM_Code/`, `Data_Analysis_Complexity_2stages_LMM_Code/`, and `Data_Analysis_Complexity_3stages_LMM_Code/` contain statistical analysis scripts written in R. Each directory corresponds to a different research question, so please select the directory that aligns with your investigation. The `Data_Analysis_LMM_Code/` directory serves as the base version. Each R file under these directories contains a single regression model for one linearity measure, including Element Coverage, Execution Order Dynamic Score, Execution Order Dynamic Repetitions, Execution Order Naive Score, Horizontal Later Rate, Line Regression Rate, Regression Rate, Saccade Length, Story Order Dynamic Score, Story Order Dynamic Repetitions, Story Order Naive Score, Vertical Later Rate, and Vertical Next Rate. The outputs are saved in `Analysis_Results_Expertise_Programming_Language/`, `Analysis_Results_Expertise_Programming_Language_Complexity/`, `Analysis_Results_Behavioural_Data/`, `Analysis_Results_Expertise_Programming_Language_Complexity_2stages/`, and `Analysis_Results_Expertise_Programming_Language_Complexity_3stages/`.
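The analysis itself is implemented in R; purely for illustration, here is a hedged Python sketch (via `statsmodels`) of the general shape of such a linear mixed model, with a random intercept per participant. All column names and model terms are assumptions, not the repository's actual specification:

```python
# Illustrative sketch only: the repository's models are implemented in R.
# Column names and model terms are assumptions, not the actual R scripts'.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("combined_summary.csv")  # assumed merged metrics file
model = smf.mixedlm(
    "saccade_length ~ expertise * programming_language",  # assumed fixed effects
    data,
    groups=data["participant"],  # random intercept per participant
)
result = model.fit()
print(result.summary())
```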
We propose two different approaches to categorizing the code snippets in the EMIP dataset by complexity level. For the first approach, follow the pipeline of `adding_complexity_value.py`, `Computing Metrics Complexity/`, and `Data_Analysis_Complexity_LMM_Code/`. For the second approach, follow the pipeline of `adding_complexity_value_new_categorization.py`, `Computing Metrics Complexity New Categorization/`, and `Data_Analysis_Complexity_New_Categorization_LMM_Code/`. Please refer to `Measuring the Complexity of the Source Code.txt` for more details about these two categorization methods.
The directory `semantic category analysis/` contains all scripts required for conducting the abstract scan path analysis. To assign tokens to semantic categories, run `assigning_semantic_category.py`. To generate the full abstract scan path for all trials, run `generating_abstract_scan_path.py`, which produces a single `.csv` file where each row corresponds to one trial and includes relevant metadata: participant, code file, programming language, comprehension question result, source code complexity, expertise level, and the generated scan path. The scan path itself is represented as a sequence of semantic categories connected by arrows.
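For illustration, a minimal sketch of how such an arrow-joined abstract scan path could be built from a sequence of fixated token categories; the category names, separator, and the collapsing of consecutive repeats are assumptions about the stored format:

```python
# Hypothetical sketch of collapsing a token-level fixation sequence into an
# abstract scan path of semantic categories joined by arrows. The separator
# and the dropping of consecutive repeats are assumed conventions.
def abstract_scan_path(categories: list[str]) -> str:
    collapsed = [categories[0]]
    for cat in categories[1:]:
        if cat != collapsed[-1]:  # drop consecutive duplicates
            collapsed.append(cat)
    return " -> ".join(collapsed)

print(abstract_scan_path(["identifier", "identifier", "keyword", "operator"]))
# identifier -> keyword -> operator
```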
To perform the bigram and trigram analysis, run `bigram_trigram_analysis.py`. To generate plots of the bigram and trigram results, run `bigram_plot.py` and `trigram_plot.py`, respectively. To train the logistic regression model that predicts the comprehension result from features of the abstract scan path, run `predict_comprehension.py`.
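For illustration, bigram and trigram counts can be extracted from an arrow-joined scan path as follows; the `" -> "` separator is an assumption about the stored format, and such counts are one plausible kind of feature for the comprehension prediction:

```python
# Hypothetical sketch of n-gram (bigram/trigram) extraction from a scan path.
# The " -> " separator is an assumed convention of the stored format.
from collections import Counter

def ngrams(path: str, n: int) -> Counter:
    cats = path.split(" -> ")
    return Counter(tuple(cats[i:i + n]) for i in range(len(cats) - n + 1))

bigrams = ngrams("identifier -> keyword -> operator -> keyword", 2)
print(bigrams.most_common(3))
```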