This repo includes basic implementations of unigram language modeling with smoothing, document ranking using probabilistic retrieval, and evaluation using Zipf’s law and cross-entropy.

Unigram Language Modeling:
- Tokenization and preprocessing
- Unigram probability estimation
- Evaluation using Zipf’s law and cross-entropy
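
The steps in this list can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function names, the regex tokenizer, the add-k smoothing, and the `<unk>` slot for unseen words are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed preprocessing scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_probs(tokens, k=1.0):
    """Add-k smoothed unigram estimates (assumed smoothing method).

    One extra vocabulary slot, "<unk>", is reserved for unseen words.
    """
    counts = Counter(tokens)
    vocab_size = len(counts) + 1  # +1 for the shared <unk> type
    total = len(tokens) + k * vocab_size
    probs = {w: (c + k) / total for w, c in counts.items()}
    probs["<unk>"] = k / total
    return probs


def cross_entropy(probs, tokens):
    """Average negative log2 probability per token, in bits."""
    unk = probs["<unk>"]
    return -sum(math.log2(probs.get(w, unk)) for w in tokens) / len(tokens)


def rank_frequency(tokens):
    """(rank, frequency) pairs in descending frequency order, for a Zipf plot."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


train = tokenize("The cat sat on the mat.")
model = unigram_probs(train)
print(cross_entropy(model, tokenize("the dog sat")))
print(rank_frequency(train))
```

Under Zipf’s law, plotting the log of the rank against the log of the frequency should give an approximately straight line; the toy sentence here is far too small to show that and only demonstrates the data shape.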

Probabilistic Information Retrieval:
- Ranking documents based on unigram probabilities (Jelinek-Mercer smoothing)
- Evaluation of ranking effectiveness across λ values and query types
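
As a rough sketch of the ranking step, the example below scores each document by the log-probability of the query under a Jelinek-Mercer smoothed unigram model. The names `jm_score` and `rank_documents` are hypothetical, and the interpolation convention used here, P(w|d) = (1 - λ) · P_ml(w|d) + λ · P(w|C), is an assumption; some references swap the roles of λ and 1 - λ.

```python
import math
from collections import Counter


def jm_score(query, doc, coll_counts, coll_len, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing.

    Assumed convention: lam weights the collection model,
    (1 - lam) weights the document's maximum-likelihood estimate.
    """
    doc_counts = Counter(doc)
    score = 0.0
    for w in query:
        p_doc = doc_counts[w] / len(doc) if doc else 0.0
        p_coll = coll_counts[w] / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Query term absent from the whole collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank_documents(query, docs, lam=0.5):
    """Return (doc_index, score) pairs, best score first."""
    coll_counts = Counter(w for d in docs for w in d)
    coll_len = sum(coll_counts.values())
    scored = [
        (i, jm_score(query, d, coll_counts, coll_len, lam))
        for i, d in enumerate(docs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [["cat", "sat"], ["dog", "ran"], ["cat", "cat"]]
for lam in (0.1, 0.5, 0.9):
    print(lam, rank_documents(["cat"], docs, lam=lam))
```

Looping over several λ values, as in the usage lines above, is one simple way to see how the smoothing weight changes the scores; a larger λ pulls every document toward the collection model and narrows the gaps between them.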