Language Modelling

Open In Colab

Tokenizer implemented from scratch using regular expressions

  • Apostrophes are treated as separate tokens
  • Currency expressions of the form Rs. and $ are handled
  • Standard email IDs, URLs, hashtags (#), and mentions (@) are also handled (see the sketch below)
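
As a rough illustration of how such a tokenizer can be assembled, the sketch below orders a handful of patterns from most to least specific. The patterns, their ordering, and the function name are assumptions for this sketch, not the repository's actual regexes.

```python
import re

# Illustrative token patterns, ordered from most to least specific.
# These regexes are assumptions for the sketch, not the repository's own.
TOKEN_PATTERNS = [
    r"https?://\S+",              # URLs
    r"[\w.+-]+@[\w-]+\.[\w.]+",   # standard email ids
    r"[@#]\w+",                   # mentions and hashtags
    r"Rs\.|\$",                   # currency markers of the form Rs. and $
    r"\d+(?:\.\d+)?",             # integers and decimals
    r"'\w+",                      # apostrophe clitics as separate tokens ('s, 'm, ...)
    r"\w+",                       # ordinary words
    r"[^\w\s]",                   # any leftover punctuation, one symbol at a time
]
TOKENIZER = re.compile("|".join(TOKEN_PATTERNS))

def tokenize(text):
    """Return the list of tokens found in a single line of text."""
    return TOKENIZER.findall(text)

print(tokenize("I'm paying Rs. 50 to @user, see https://example.com #nlp"))
```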

Implementation of the language modelling algorithm

  • Kneser-Ney smoothing
  • Interpolation
  • N-grams up to order 6 are considered
  • corpus_EN.txt contains sentences in standard English
  • corpus_TW.txt contains assorted tweets
  • The language model is stored in a file named "LM" (a simplified training sketch follows this list)
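
The sketch below shows interpolated Kneser-Ney smoothing for the bigram case only, with an assumed discount of 0.75; the repository's model extends the same idea to n-grams up to order 6, so this is a minimal illustration rather than the actual implementation.

```python
import pickle
from collections import Counter, defaultdict

def train_kneser_ney(sentences, d=0.75):
    """Train a bigram Kneser-Ney model; returns a probability function and counts."""
    unigrams, bigrams = Counter(), Counter()
    followers, histories = defaultdict(set), defaultdict(set)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            followers[prev].add(cur)   # distinct words seen after `prev`
            histories[cur].add(prev)   # distinct contexts `cur` has completed

    bigram_types = len(bigrams)

    def prob(cur, prev):
        # Lower-order (continuation) probability used by Kneser-Ney.
        p_continuation = len(histories[cur]) / bigram_types
        if unigrams[prev] == 0:
            return p_continuation
        discounted = max(bigrams[(prev, cur)] - d, 0) / unigrams[prev]
        interpolation_weight = d * len(followers[prev]) / unigrams[prev]
        return discounted + interpolation_weight * p_continuation

    return prob, {"unigrams": unigrams, "bigrams": bigrams}

prob, counts = train_kneser_ney([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(prob("sat", "the"), prob("sat", "cat"))

# The trained statistics could then be serialized, e.g. to a file named "LM".
with open("LM", "wb") as handle:
    pickle.dump(counts, handle)
```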

Visualization of Word Frequency vs. Word Occurrence Rank

  • Resembles a Zipfian distribution for most analytic languages
  • A graph for a selected corpus can be constructed, as sketched below
  • In the present setting:
    1. The first graph considers the top-1000 ranked tokens
    2. The second graph considers the words ranked 10001 to 11000 in the corpus
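
A minimal plotting sketch, assuming `tokens` is the flat token list produced by the tokenizer for the chosen corpus and that matplotlib is available; the function name and rank-window interface are hypothetical.

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_rank_frequency(tokens, start_rank=1, end_rank=1000):
    """Plot word frequency against occurrence rank for a window of ranks."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    window = frequencies[start_rank - 1:end_rank]
    ranks = range(start_rank, start_rank + len(window))
    plt.loglog(ranks, window, marker=".")
    plt.xlabel("Word occurrence rank")
    plt.ylabel("Word frequency")
    plt.title(f"Ranks {start_rank} to {start_rank + len(window) - 1}")
    plt.show()

# First graph: the top-1000 ranked tokens; second graph: ranks 10001 to 11000.
# plot_rank_frequency(tokens, 1, 1000)
# plot_rank_frequency(tokens, 10001, 11000)
```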

Computation of Model Perplexity Scores

  • Provide a test_corpus to generate a perplexity score for each sentence
  • To compare language models, the average perplexity score across all sentences in the test_corpus is used (see the sketch below)
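
A minimal sketch of per-sentence and averaged perplexity, assuming a conditional probability function such as the bigram `prob` from the Kneser-Ney sketch above; the small floor on probabilities is an assumption to avoid log(0) on unseen events.

```python
import math

def sentence_perplexity(sentence, prob):
    """Perplexity of one tokenized sentence under a conditional model prob(cur, prev)."""
    tokens = ["<s>"] + sentence + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = max(prob(cur, prev), 1e-12)   # floor to avoid log(0) on unseen events
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

def average_perplexity(test_corpus, prob):
    # Average of per-sentence perplexities, used to compare language models.
    scores = [sentence_perplexity(sentence, prob) for sentence in test_corpus]
    return sum(scores) / len(scores)

# Assuming `prob` from the Kneser-Ney sketch above is in scope:
# print(average_perplexity([["the", "cat", "sat"], ["a", "dog", "ran"]], prob))
```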

Sentence Generation

  • The maximum order N of the N-gram models used for generation can be varied (see the sketch below)
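
A minimal generation sketch using n-gram counts indexed by history; `max_n` stands in for the maximum N parameter mentioned above, and the backoff-to-shorter-history sampling strategy is an illustrative choice, not necessarily the repository's.

```python
import random
from collections import defaultdict, Counter

def build_ngram_counts(sentences, max_n=3):
    """Count n-grams of orders 2..max_n, keyed by their history tuple."""
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] * (max_n - 1) + sent + ["</s>"]
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                history, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
                counts[history][word] += 1
    return counts

def generate(counts, max_n=3, max_len=20):
    """Sample a sentence, backing off to shorter histories when needed."""
    tokens = ["<s>"] * (max_n - 1)
    while len(tokens) < max_len and tokens[-1] != "</s>":
        for order in range(max_n - 1, 0, -1):
            history = tuple(tokens[-order:])
            if counts[history]:
                candidates = counts[history]
                words = list(candidates)
                weights = list(candidates.values())
                tokens.append(random.choices(words, weights=weights)[0])
                break
        else:
            break   # no observed history of any order: stop early
    return [t for t in tokens if t not in ("<s>", "</s>")]

counts = build_ngram_counts([["the", "cat", "sat"], ["the", "dog", "ran"]], max_n=3)
print(generate(counts, max_n=3))
```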

About

Language Modelling for various corpora, Natural Language Generation using LMs, Corpus Statistics Visualization
