NLPdisparity

Code and data for auditing NLP models for performance disparity.

To access data

The data is available in the output_data folder. This folder includes the synthetic dyslexia-injected English text and the translated output from the services mentioned in the paper.

Each file in the dataset is a “.txt” or “.docx” file containing the translated sentences from AWS, Google, Azure and OpenAI. Each line represents one translated sentence. The “default” directory contains the English versions that were submitted to the translation services. The “v1” and “v2” folder names can be ignored. The file and folder names indicate the type of synthetic injection applied to the English version and the associated injection probability. All files contain the same sentences and differ only in the type and level of injection. For example, the file name “wmt14_en_p_homophone_0.2_p_letter_0.0_p_confusing_word_0.0” means a 20% probability of injecting a homophone in a sentence, a 0% probability of injecting a confusing letter and a 0% probability of injecting a confusing word. The injection process is explained in our paper.
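As an illustration of the naming convention above, here is a minimal sketch that reads the injection probabilities back out of a file name. The helper `parse_injection_probabilities` is hypothetical and not part of this repository; it assumes every file name follows the `_p_<type>_<probability>` pattern shown in the example.

```python
import re

# Hypothetical helper, not part of the repository.
# Assumes each injection setting appears as "_p_<type>_<probability>".
_SETTING = re.compile(r"_p_([a-z_]+?)_(\d+(?:\.\d+)?)(?=_p_|$)")


def parse_injection_probabilities(file_name: str) -> dict:
    """Return a mapping from injection type to probability for one file name."""
    # Drop any directory part and a trailing .txt / .docx extension.
    stem = file_name.rsplit("/", 1)[-1]
    for ext in (".txt", ".docx"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
    return {name: float(value) for name, value in _SETTING.findall(stem)}


if __name__ == "__main__":
    name = "wmt14_en_p_homophone_0.2_p_letter_0.0_p_confusing_word_0.0"
    print(parse_injection_probabilities(name))
```

A file name that does not follow the pattern yields an empty dictionary, so callers can detect files (such as those in “default”) that carry no injection settings.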

Notable classes

Injecting_Dyslexia.ipynb was used to inject synthetic dyslexia-style errors into the text data
baseline_results.ipynb is the notebook that looks over preliminary results (BLEU and WER)
edit_distance.ipynb is the notebook where edit distance was calculated and includes some analysis
bert_score.ipynb was used to calculate BERT score (bert_scores folder contains the saved scores for quicker access) and has some analysis
BLEURT.ipynb was used to calculate BLEURT score (BLEURT_scores folder contains the saved scores for quicker access) and has some analysis
COMET.ipynb was used to calculate COMET score (COMET_scores folder contains the saved scores for quicker access) and has some analysis
LaBSE_data.ipynb was used to calculate LaBSE embeddings and has some analysis (LaBSE folder contains the saved results for quicker access)
pos_tags_analysis.ipynb is an analysis of the POS tags of the translated text
diff_lib_analysis.ipynb is the notebook that uses DiffLib to analyze the outputs from the translation services
swap_results_combined.csv contains the injection statistics for the injected text files that were translated and investigated
Datasheet for datasets contains all important information regarding the dataset
investigating_translations.ipynb is the notebook where the translations were analyzed at a sentence level
All Python files (DataLoader, DyslexiaInjector and TestInjector) can be used to inject dyslexia into text and test the translation services
output_data folder contains all the data outputted from the translation services
dict folder contains the dictionaries used for the injection process (mentioned in the paper)
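
For readers unfamiliar with the edit distance and WER metrics mentioned above, the following is a minimal plain-Python sketch of both. It is an illustration only and is not the code used in the notebooks; the function names `edit_distance` and `word_error_rate` are assumptions introduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(word_error_rate("the cat sat", "the cat sit"))
```

Whitespace tokenization is a simplification; the notebooks may tokenize differently or rely on a library implementation.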

