
Implementation of a simple, fast Unicode homoglyph normalization method that sanitizes text by replacing visually similar characters (e.g., Cyrillic 'а' → Latin 'a'). Useful for spam filtering, input validation, and improving LLM robustness. Developed for SPELLL 2023.


ya0002/SPELLL_2023_Light_weight_data_sanitization_homoglyphs


With the rise in hate speech on social media, numerous Natural Language Processing (NLP) techniques, such as text classification, have been employed to detect hate speech and make social media less toxic. However, authors of hate speech have started using homoglyphs, which are characters that look identical to each other but have a different encoding or structure, to evade detection, since most NLP models are trained on commonly recognized Unicode characters. In this paper we propose a novel, lightweight, language-agnostic data sanitization pipeline that consists of a CNN for character-level OCR, followed by the SymSpell algorithm for candidate word generation and n-grams for word retrieval, with the aim of recovering dehomoglyphed sentences from homoglyphed sentences. We also introduce HEMNIST, an extended version of EMNIST that includes images of homoglyphs. We achieve cosine similarities of 0.922, 0.845, 0.671, 0.508 and 0.231 between the original and retrieved text at 5%, 10%, 20%, 30% and 50% masking, respectively.
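The three stages described above can be illustrated with a toy sketch. This is not the repository's code: the CNN-based OCR stage is replaced by a small hypothetical lookup table (`HOMOGLYPHS`), the candidate step is a simplified delete-based matcher in the spirit of SymSpell rather than the SymSpell library itself, and the vocabulary and bigram counts (`VOCAB`, `BIGRAMS`) are made up for illustration.

```python
from collections import Counter

# Assumption: a tiny lookup table stands in for the CNN character-level OCR stage.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic e
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
}

# Assumption: toy vocabulary and bigram counts; a real system would learn these from a corpus.
VOCAB = {"you", "are", "a", "bad", "bed", "person"}
BIGRAMS = Counter({("a", "bad"): 5, ("a", "bed"): 1, ("bad", "person"): 4})


def normalize_chars(word):
    """Stage 1 stand-in: map each known homoglyph to its Latin look-alike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in word)


def deletes(word):
    """Return the word plus every variant with exactly one character removed."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def candidates(word):
    """Stage 2: vocabulary words whose delete variants overlap with the input's."""
    if word in VOCAB:
        return [word]
    variants = deletes(word)
    found = sorted(v for v in VOCAB if variants & deletes(v))
    return found or [word]  # fall back to the input if nothing matches


def dehomoglyph(sentence):
    """Stage 3: pick the candidate with the highest bigram count given the previous word."""
    out = []
    for raw in sentence.split():
        word = normalize_chars(raw.lower())
        prev = out[-1] if out else None
        best = max(candidates(word), key=lambda c: BIGRAMS[(prev, c)])
        out.append(best)
    return " ".join(out)


# Cyrillic letters are written as escapes; "@" is a character the table does not cover.
print(dehomoglyph("y\u043eu \u0430re \u0430 b@d p\u0435rson"))
```

Here the word `b@d` survives the character stage unchanged, so the candidate step proposes both `bad` and `bed`, and the bigram counts following `a` select `bad`. Matching on shared delete variants keeps candidate generation cheap because no substitutions or insertions have to be enumerated.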


Homoglyphed-EMNIST (HEMNIST) Download


Authors: Mohammad Yusuf Jamal Aziz Azmi, Subalalitha Chinnaudayar Navaneethakrishnan

Published in: Speech and Language Technologies for Low-Resource Languages

Publisher: Springer Nature Switzerland
