STRATEGIES FOR THE PROGRAMMATIC GENERATION OF LABELLED CORPUS FOR TEXT CLASSIFICATION

Master’s Thesis presented to the Telecommunications Engineering School of the Universidade de Vigo

Master’s Degree in Telecommunications Engineering

Pedro Alonso Doval

Abstract

This master thesis explores weak supervision technologies, focusing on the more innovative data programming strategies for automatically labelling large datasets of texts for complex problems in which human labelling is impractical. The objective of these techniques is to obtain a training corpus that can be used later for training machine learning models in text classification problems achieving comparable results to fully supervised models. To test this approach we selected a real problem: categorizing journalistic news into multiple predefined categories. This document presents the state of the art about data programming techniques, and describes the process of designing a solution using the Snorkel framework. The tasks covered by this master thesis include, first of all, the design of multiple label functions that serve as weak supervision sources, using an heterogeneous set of techniques such as heuristics or keyword detection, linear regression classifiers, clustering methods or deep learning models. Afterwards, the labels of the different sources are aggregated using a generative model, obtaining probabilistic labels for the news. Lastly, the final data set is obtained from the probabilistic labels. There is a performance analysis for different configurations using a crowdsourced dataset as reference. These results are evaluated to find out the potential applications of data programming techniques and how to take full advantage of them in practical scenarios.

Keywords: Natural language processing Weak Supervision Data programming Text classification Label function

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
Stategies-for-programmatic-labelling-Pedro_Alonso_Doval.pdf		Stategies-for-programmatic-labelling-Pedro_Alonso_Doval.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

STRATEGIES FOR THE PROGRAMMATIC GENERATION OF LABELLED CORPUS FOR TEXT CLASSIFICATION

Abstract

About

Uh oh!

Releases

Packages

PedroDoval/STRATEGIES-FOR-THE-PROGRAMMATIC-GENERATION-OF-LABELLED-CORPUS-FOR-TEXT-CLASSIFICATION

Folders and files

Latest commit

History

Repository files navigation

STRATEGIES FOR THE PROGRAMMATIC GENERATION OF LABELLED CORPUS FOR TEXT CLASSIFICATION

Abstract

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages