ItsLastDay/Twitter-language-identification

Twitter-language-identification

Academic project: coursework from my third year of study.

Researched language identification algorithms, with a focus on short informal messages.
Gathered a data set of 227k messages (most of them in Russian) from various sources (the Twitter API and other works).
Implemented two approaches to the language identification task and modified one of them.
Compared the performance of 6 approaches on the gathered data set.

As a result, the approach I modified outperforms the others. However, it is rather memory-intensive.

This repository contains all files that were gathered or produced during the research. My implementations of the LID (language identification) algorithms are in /progs/logr and /progs/liga. The /scripts directory holds a number of programs that helped with tweet processing.
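For readers new to the task, here is a minimal sketch of language identification on short messages. It is an illustration only and is not the code in /progs/logr or /progs/liga. It assumes a simple character-trigram profile per language compared by cosine similarity. The function names (`trigrams`, `train`, `identify`) and the two toy training sentences are hypothetical and do not come from the repository or its data set.

```python
import math
from collections import Counter


def trigrams(text):
    """Count character trigrams of a lowercased, space-padded message."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one aggregated trigram profile per language label."""
    profiles = {}
    for lang, text in samples:
        profiles.setdefault(lang, Counter()).update(trigrams(text))
    return profiles


def identify(profiles, text):
    """Return the label whose profile is most similar to the message."""
    query = trigrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


# Toy training data (hypothetical, far smaller than a real data set).
profiles = train([
    ("ru", "привет как дела сегодня"),
    ("en", "hello how are you today"),
])

print(identify(profiles, "как ты"))
print(identify(profiles, "how are things"))
```

A profile-based method like this keeps one count table per language, so its memory use grows with the number of distinct n-grams seen in training.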

About

Bachelor thesis - language identification of short texts (2014)
