Lightweight Language Agnostic Data Sanitization Pipeline for Dealing with Homoglyphs in Code-Mixed Languages

With the rise in hate speech on social media, numerous Natural Language Processing (NLP) techniques, such as text classification, have been employed to detect hate speech and make social media less toxic. However, users who post hate speech have started employing homoglyphs, characters that look identical or nearly identical to one another but have different encodings or structures, to evade detection, since most NLP models are trained on commonly recognized Unicode characters. In this paper we propose a novel, lightweight, language-agnostic data sanitization pipeline consisting of a CNN for character-level OCR, followed by the SymSpell algorithm for candidate word generation and n-grams for word retrieval, with the aim of recovering dehomoglyphed sentences from homoglyphed sentences. We also introduce HEMNIST, an extended version of EMNIST that includes images of homoglyphs. We achieve cosine similarities of 0.922, 0.845, 0.671, 0.508 and 0.231 between the original and retrieved text at 5%, 10%, 20%, 30% and 50% masking, respectively.
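To make the homoglyph problem and the reported metric concrete, here is a minimal sketch. It shows that a Latin "a" and a Cyrillic "а" are distinct code points, and it computes a cosine similarity between an original and a homoglyphed sentence. The abstract does not say how the text is vectorized for the cosine similarity, so the word-count vectors below are an assumption for illustration only, and the helper names `describe` and `cosine_similarity` are hypothetical rather than taken from the repository.

```python
import math
import unicodedata
from collections import Counter

# Latin "a" and Cyrillic "а" render almost identically but are different code points.
LATIN_A = "\u0061"
CYRILLIC_A = "\u0430"


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors.

    Assumption: whitespace tokenization and raw counts; the paper's
    actual vectorization is not stated in the abstract.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Counter returns 0 for words missing from the other sentence.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = "you are bad"
# Swap every Latin "a" for its Cyrillic look-alike.
homoglyphed = original.replace(LATIN_A, CYRILLIC_A)

print(describe(LATIN_A))
print(describe(CYRILLIC_A))
print(original == homoglyphed)  # False: the strings differ at the byte level
print(cosine_similarity(original, homoglyphed))
print(cosine_similarity(original, original))
```

Two of the three words no longer match after substitution, so a word-level model sees them as unknown tokens even though a human reader sees the same sentence. That gap is what the OCR-based pipeline described above aims to close.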

Homoglyphed-EMNIST (HEMNIST) Download

Authors: Mohammad Yusuf Jamal Aziz Azmi, Subalalitha Chinnaudayar Navaneethakrishnan

Published in: Speech and Language Technologies for Low-Resource Languages

Publisher: Springer Nature Switzerland

Folders and files:
data_n_stuff
.gitattributes
EMNIST_CNN.ipynb
Homoglyph_exploration_and_mapping_creation.ipynb
LICENSE
MAIN-homoglyphed_code_mixed_word_retrieval_experiment.ipynb
README.md
Symspell_spellling_corrector.ipynb
Train_n_grams_language_model_refined.ipynb
text-to-image.ipynb

ya0002/SPELLL_2023_Light_weight_data_sanitization_homoglyphs
