This repository contains an implementation of a statistical language model for Turkish. The goal is to model Turkish statistically at the syllable level, rather than at the word or character level, which is especially relevant for agglutinative languages like Turkish: a single word such as *evlerimizde* ("in our houses") decomposes into the syllables ev-le-ri-miz-de. The project uses 1-gram (unigram), 2-gram (bigram), and 3-gram (trigram) models and applies Good-Turing smoothing to handle unseen sequences. The models are evaluated with perplexity, and random sentences are generated to inspect the quality of predictions.
Please see the project report (`report.pdf`) for a detailed explanation.
```
.
├── data/        # Raw and preprocessed data
├── ngram/       # Source code and modules
├── demo.ipynb   # Main notebook for training and evaluation
├── report.pdf   # Project report
└── README.md    # This documentation
```
1. Clone this repository:

   ```bash
   git clone https://github.com/yourusername/turkish-syllable-ngram-model.git
   cd turkish-syllable-ngram-model
   ```

2. Install the required dependencies:

   ```bash
   pip install -e .
   ```

3. Download the Turkish Wikipedia dump from Kaggle and place it in the `data/` folder.

4. Run the Jupyter notebook:

   ```bash
   jupyter notebook demo.ipynb
   ```

5. Follow the steps inside the notebook to:
   - Preprocess the data
   - Train the N-gram models
   - Evaluate perplexity
   - Generate sentences
Preprocessing consisted of the following steps (a sketch follows the list):
- All characters were converted to lowercase.
- Turkish-specific characters (ç, ğ, ı, ö, ş, ü) were replaced with their ASCII equivalents for consistency.
- Non-alphabetic characters (punctuation, numbers, etc.) were removed.
- Each word was segmented into syllables using the open-source ftkurt/python-syllable library.
- Special tokens were added to mark sentence boundaries and spaces.
- The resulting tokenized data was written to a new file, preserving the original structure.
- The tokenized data was split into training (95%) and test (5%) sets.
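A minimal sketch of this pipeline, assuming a `syllabify(word)` callable that wraps ftkurt/python-syllable (its exact API may differ) and illustrative token names `<s>`, `</s>`, `<sp>`:

```python
import re

# Map Turkish-specific letters to their ASCII equivalents.
TR_TO_ASCII = str.maketrans("çğıöşü", "cgiosu")

def normalize(text):
    """Lowercase, ASCII-fold, and drop non-alphabetic characters."""
    text = text.lower().translate(TR_TO_ASCII)
    return re.sub(r"[^a-z ]+", " ", text)

def tokenize(sentence, syllabify):
    """Turn a raw sentence into a sequence of syllable tokens."""
    tokens = ["<s>"]                      # illustrative sentence-start token
    for word in normalize(sentence).split():
        tokens.extend(syllabify(word))    # e.g. "kediye" -> ["ke", "di", "ye"]
        tokens.append("<sp>")             # illustrative word-boundary token
    if tokens[-1] == "<sp>":
        tokens.pop()                      # no boundary after the last word
    tokens.append("</s>")                 # illustrative sentence-end token
    return tokens
```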
- Unigram, bigram, and trigram models were implemented.
- Frequency counts are stored in nested dictionaries for space and time efficiency (see the sketch below).
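A sketch of the nested-dictionary layout this describes (names are illustrative); the same structure serves all three orders, with an empty tuple as the unigram context:

```python
from collections import defaultdict

def count_ngrams(tokens, n):
    """Count n-grams as counts[context][next_syllable] = frequency."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])   # () for the unigram model
        counts[context][tokens[i + n - 1]] += 1
    return counts
```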
- Good-Turing smoothing was applied to all N-gram models to avoid zero probabilities for unseen sequences.
- It improves generalization and robustness by redistributing probability mass from observed events toward unseen ones (a sketch follows).
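A sketch of simple Good-Turing count adjustment; this variant falls back to the raw count where the frequency-of-frequency table runs out, and the project report may use a refined version:

```python
from collections import Counter

def good_turing(counts):
    """Return smoothed probabilities plus the mass reserved for unseen n-grams.

    Good-Turing replaces a raw count c with c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of distinct n-grams seen exactly c times.
    """
    n_c = Counter(counts.values())           # frequency-of-frequency table
    total = sum(counts.values())

    def adjusted(c):
        if n_c.get(c + 1, 0) == 0:           # no N_{c+1} data: keep raw count
            return c
        return (c + 1) * n_c[c + 1] / n_c[c]

    probs = {gram: adjusted(c) / total for gram, c in counts.items()}
    p_unseen = n_c.get(1, 0) / total         # mass redistributed to unseen events
    return probs, p_unseen
```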
Perplexity measures how well the language model predicts unseen data:
- Lower perplexity means better performance.
- Probabilities are computed with the chain rule under an (N-1)-order Markov assumption.
- Probabilities are summed in log space to avoid numerical underflow (see the sketch below).
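Concretely, with log probabilities the computation is PP(W) = exp(-(1/T) · Σ log P(wᵢ | context)). A sketch, assuming a `log_prob(context, token)` callable (illustrative name) backed by the smoothed counts:

```python
import math

def perplexity(tokens, n, log_prob):
    """Perplexity under an (n-1)-order Markov assumption.

    Assumes `tokens` already carries sentence-boundary padding.
    """
    log_sum, count = 0.0, 0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        log_sum += log_prob(context, tokens[i])  # sum in log space: no underflow
        count += 1
    return math.exp(-log_sum / count)
```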
Per-sentence perplexity (lower is better):

| Sentence | Unigram | Bigram | Trigram |
|---|---|---|---|
| Kaplumbağalar uzun yaşar. ("Turtles live long.") | 280.03 | 40.33 | 4.33 |
| Cengiz Han dünyaya hükmetti. ("Genghis Khan ruled the world.") | 152.35 | 28.49 | 4.45 |
| Soğuktan üşüyen kediye süt ısıtıp verdi. ("She warmed up milk for the cat shivering from the cold.") | 159.40 | 35.70 | 7.81 |
| Ormanda yürüyüş yaparken... ("While hiking in the forest...") | 136.28 | 29.64 | 11.17 |
| Dağların zirvesine tırmanırken... ("While climbing to the summit of the mountains...") | 98.61 | 34.60 | 12.76 |
| Model | Perplexity |
|---|---|
| Unigram | 126.94 |
| Bigram | 26.63 |
| Trigram | 8.58 |
Sentences were generated by choosing, at each step, among the five most probable next syllables (a sketch of this strategy follows the discussion below). Sample outputs:
- la..la le.la lalalela.lala la.le.la la..lelelala.lelala...
- . ...lela lalale.le lalalalala .le le. le..lelela lele le. la.lalalalela . ..le lale
- birle olamasinadogu ikisinindadir olusmaktaydi olanmislarlarina anayaziya olustusureket oluslaraktigini verini i alanma verenlerindenlemelerindaginindekiyeti ala isecim onemlerden icindenlerlerinda
- verengibilimcileresinayilindigibiligin olanmalarinayinetirilmek ve ise ve icin onem alamalarinden ola birlerece ozellidir olustur.yinedegininmisti ilereceleri olusmaktaydi bir birlinemindandirmesi
- rostan anafilenin yapildiktan alan verilerekta bulu ana gorecelinince isein verenlererasinaviniminininluteryenler.adalarin yasasi verileri bulunabilgisa dayanirlar tarihli veya sahipliginagini a
- tepe veri olusturmasinabilimlererasinabilimin bir yapilan verenle ikilestirilendirmeye verilir.ocak.yiginin.yilinininluteryen takma olusmayacaksa verildikle illeriyken birlikin ozelligiyla alandaki
Trigram sentences are noticeably more coherent and grammatically sound, highlighting the benefit of larger context windows.
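For reference, a sketch of the top-5 strategy described above, reusing the illustrative `counts` structure and token names from the earlier sketches (and assuming the training text was padded with n-1 start tokens):

```python
import random

def generate(counts, n, max_tokens=40):
    """Grow a syllable sequence by sampling uniformly among the five
    most frequent continuations of the current (n-1)-syllable context."""
    tokens = ["<s>"] * max(n - 1, 1)
    while len(tokens) < max_tokens:
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        followers = counts.get(context)
        if not followers:
            break                              # unseen context: stop early
        top5 = sorted(followers, key=followers.get, reverse=True)[:5]
        nxt = random.choice(top5)
        if nxt == "</s>":
            break
        tokens.append(nxt)
    # Render: map the word-boundary token to a space, drop start tokens.
    return "".join(" " if t == "<sp>" else t for t in tokens if t != "<s>")
```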
- The trigram model achieved the lowest perplexity and the most coherent sentences.
- Performance improves as the N-gram order increases.
- Good-Turing smoothing was effective in mitigating data sparsity.
- Turkish Wikipedia Dump (Kaggle)
- ftkurt/python-syllable (syllabification tool)
- Manning, C. D., & Schütze, H. (1999). *Foundations of Statistical Natural Language Processing*. MIT Press.
Feel free to ⭐ star this repo or fork it for your own research or academic use!