This project was assigned as a final project by my Data Structures and Algorithms course lecturer, Pak Imam Cholissodin, S.Si., M.Kom. The main task is to implement a searching algorithm and determine whether an input word is present in KBBI (Kamus Besar Bahasa Indonesia). If the input word is not present in KBBI, it is treated as a typo.
I think that classifying a word as a typo this way is too broad. There are several possibilities:
- The input word is in fact a typo
- The input word is valid, just not part of formal Bahasa Indonesia
To solve this problem, I tried several methods: cosine similarity, Euclidean distance, and the Levenshtein ratio (see the experiment). It turns out that the Levenshtein ratio is the best method for this problem. From the ratio score, we can also extract correction suggestions by ranking dictionary words by their similarity to the input.
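For illustration, here is a minimal plain-Python sketch of the Levenshtein ratio. This is not the project's actual code; I define the ratio as 1 − distance / max(len), which is one common convention:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein_distance(a, b) / max(len(a), len(b))

# Example: comparing the typo suspect "biyu" against a dictionary entry
print(levenshtein_ratio("biyu", "biru"))  # 0.75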
The KBBI dataset has at least ±72k rows, so calculating the Levenshtein ratio against every word linearly would be very slow. To optimize this, I eliminate many impossible keys with two methods: word_ngram and tolerated_word_ngram.
The first method, word_ngram, works just like a normal n-gram (in my case, the best configuration was trigrams only), but at the word level: trigrams of characters within a single word.
Example:

```
"biyu" -> ["biy", "iyu"]
"ubed" -> ["ube", "bed"]
```
This list is then used to eliminate every key that doesn't contain any of the list's elements. This optimization decreases the number of possible keys from ±72k to ±13k.
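A minimal sketch of this elimination step, assuming a toy dictionary; the names char_trigrams, word_ngram_filter, and kbbi_words are illustrative, not the project's API:

```python
def char_trigrams(word: str) -> list[str]:
    """Character-level trigrams, e.g. "biyu" -> ["biy", "iyu"]."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def word_ngram_filter(suspect: str, kbbi_words: list[str]) -> list[str]:
    """Keep only dictionary words sharing at least one trigram with the suspect."""
    trigrams = char_trigrams(suspect)
    return [w for w in kbbi_words if any(t in w for t in trigrams)]

# Toy dictionary standing in for the ±72k KBBI entries
kbbi_words = ["biru", "biya", "ubi", "baru"]
print(char_trigrams("biyu"))                  # ['biy', 'iyu']
print(word_ngram_filter("biyu", kbbi_words))  # ['biya']
```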
The second method, tolerated_word_ngram, is an improvement on word_ngram. The main difference is that, when eliminating keys, it also eliminates keys that contain one of the list's elements but are not within the tolerated word length. The lower the threshold, the stricter the elimination.
Example:

```
typo_suspect = "biyu"
possible_keys = ["biy", "iyu"]
tolerated_length_threshold = 2
upper_bound_tolerance = typo_suspect.length + tolerated_length_threshold
lower_bound_tolerance = typo_suspect.length - tolerated_length_threshold
keys = [words in KBBI that contain an element of possible_keys
        && whose length is between the lower and upper bounds]
```
With this optimization, the number of possible keys decreases from ±72k to ±11k.
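A sketch of the tolerated variant, under the same assumptions as the sketch above (illustrative names; the threshold of 2 is taken from the example):

```python
def tolerated_word_ngram_filter(
    suspect: str,
    kbbi_words: list[str],
    tolerated_length_threshold: int = 2,
) -> list[str]:
    """word_ngram filtering plus a length window around the suspect word."""
    trigrams = [suspect[i:i + 3] for i in range(len(suspect) - 2)]
    lower = len(suspect) - tolerated_length_threshold
    upper = len(suspect) + tolerated_length_threshold
    return [
        w for w in kbbi_words
        if lower <= len(w) <= upper and any(t in w for t in trigrams)
    ]

# "berbiya" (length 7) shares a trigram with "biyu" but falls outside
# the tolerated length window [2, 6], so it is eliminated.
kbbi_words = ["biya", "berbiya", "biru"]
print(tolerated_word_ngram_filter("biyu", kbbi_words))  # ['biya']
```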