Conditional Random Fields model for a punctuation restoration task.
This repo contains utilities for training a Conditional Random Fields (CRF) model that restores punctuation to unpunctuated streams of text.
For example, the input
this is my input sentence
becomes
This is my input sentence.
The model is based on the work of Lui, M. and Wang, L. (2013), 'Recovering Casing and Punctuation using Conditional Random Fields'.
The task is framed as multi-class token classification: each word in a sequence is assigned a label.
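To make the framing concrete, here is a minimal sketch of how punctuated text could be turned into (word, label) training pairs. The label names (PERIOD, COMMA, QUESTION, O) and the helper `to_labelled_tokens` are hypothetical and are not taken from this repo; casing labels are ignored for brevity.

```python
# Hypothetical label scheme: the punctuation mark that follows a word,
# or "O" when no punctuation follows it.
PUNCT = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}


def to_labelled_tokens(text):
    """Turn punctuated text into (lowercased word, label) pairs."""
    pairs = []
    for raw in text.split():
        # Strip any trailing punctuation marks we know about.
        word = raw.rstrip("".join(PUNCT))
        trailing = raw[len(word):]
        if not word:
            continue  # skip tokens that consist only of punctuation
        label = PUNCT[trailing[0]] if trailing else "O"
        pairs.append((word.lower(), label))
    return pairs


if __name__ == "__main__":
    for word, label in to_labelled_tokens("This is my input sentence."):
        print(word, label)
```

At inference time the model sees only the lowercased words and predicts one such label per word.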
The CRF model takes into account the word, POS tag, chunk tag, and named-entity (NE) tag of the current word and of the two words on either side (i.e. a 5-word window).
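The windowed features described above might be built roughly as follows. This is a sketch only: the function names, the feature keys, and the boundary markers (BOS/EOS) are assumptions for illustration and not this repo's actual code. One dict per token is the input format accepted by CRF libraries such as sklearn-crfsuite; whether this repo uses that library is also an assumption.

```python
def token_features(sent, i, window=2):
    """Build features for token i of sent.

    sent is a list of (word, pos, chunk, ne) tuples. Features are drawn
    from the current token and up to `window` tokens on either side.
    """
    feats = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        prefix = f"{offset:+d}"  # "-2", "-1", "+0", "+1", "+2"
        if j < 0:
            feats[f"{prefix}:BOS"] = True  # position before sentence start
        elif j >= len(sent):
            feats[f"{prefix}:EOS"] = True  # position after sentence end
        else:
            word, pos, chunk, ne = sent[j]
            feats[f"{prefix}:word"] = word.lower()
            feats[f"{prefix}:pos"] = pos
            feats[f"{prefix}:chunk"] = chunk
            feats[f"{prefix}:ne"] = ne
    return feats


def sent_features(sent):
    """Return one feature dict per token in the sentence."""
    return [token_features(sent, i) for i in range(len(sent))]


if __name__ == "__main__":
    example = [
        ("this", "DT", "B-NP", "O"),
        ("is", "VBZ", "B-VP", "O"),
        ("my", "PRP$", "B-NP", "O"),
    ]
    print(sent_features(example)[0])
```

Marking out-of-range positions explicitly lets the model learn that a word sits near the start or end of a sequence, which is informative for sentence-final punctuation.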
git clone https://github.com/anthonyyhughes/naive-bayes-space-restorer.git
virtualenv env
pip install -r requirements
Recommended method for Google Colab notebooks:
!git clone https://github.com/anthonyyhughes/naive-bayes-space-restorer.git
!pip install -r requirements
Example usage for the operations covered below is also included in the example notebook: crf_punc_restorer_example.ipynb.
Example usage:
python train.py
python inference.py
Lui, M. and Wang, L., "Recovering Casing and Punctuation using Conditional Random Fields", 2013. Available: https://aclanthology.org/U13-1020.pdf.