BalCompress

The code and dataset for "BalCompress: Model Compression with Open-domain Unlabeled Data for Named Entity Recognition".

The repo is based on the Flair framework, with many modifications.

Requirements

This repository is tested with PyTorch 1.10.2, CUDA 10.2, and cuDNN 7.6. Please run

pip3 install -r requirements.txt

to install all dependencies.

Usage

  • Train a teacher (optional)

You can fine-tune a teacher with the following command, or use the fine-tuned model provided by Flair (replace teacher_path with 'flair/ner-english-large').

python train_teacher.py --save_path resources/taggers/teacher \
                        --root_path data/conll-2003 \
                        --train train.txt \
                        --dev dev.txt \
                        --test test.txt
  • Rank and select unlabeled data (the unlabeled data must be placed in the same folder as your training data)
python select_data.py --teacher_path flair/ner-english-large \
                      --unlabeled_path data/wikitext/wiki_split_30.txt \
                      --save_path data/conll-2003/unlabeled.txt \
                      --number 40000
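The exact ranking criterion is implemented in select_data.py; as a minimal illustrative sketch (not the repo's actual API), selecting the top-N sentences by a teacher confidence score could look like this:

```python
# Hypothetical sketch of confidence-based selection of unlabeled sentences.
# The real ranking logic lives in select_data.py; all names here are illustrative.

def select_top_n(scored_sentences, n):
    """Keep the n sentences the teacher scores highest.

    scored_sentences: list of (sentence, confidence) pairs, where confidence
    might be e.g. the teacher's mean token-level tag probability.
    """
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:n]]

corpus = [
    ("EU rejects German call .", 0.98),
    ("Peter Blackburn", 0.91),
    ("BRUSSELS 1996-08-22", 0.73),
    ("rare OOV gibberish", 0.40),
]
print(select_top_n(corpus, 2))  # the two highest-confidence sentences
```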
  • Distill a student model
python train_student.py --save_path resources/taggers/student \
                        --root_path data/conll-2003 \
                        --train train.txt \
                        --dev dev.txt \
                        --test test.txt \
                        --unlabeled unlabeled.txt \
                        --teacher_path flair/ner-english-large
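The distillation itself is implemented in train_student.py. Assuming it follows the common soft-label distillation recipe (temperature-softened cross-entropy between the teacher's and student's output distributions), a dependency-free sketch of that objective is:

```python
# Sketch of the soft-label knowledge-distillation objective, assuming
# standard temperature-based distillation; not the repo's actual loss code.
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# A student that matches the teacher incurs a lower loss than one that disagrees.
loss_same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
loss_diff = distillation_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0])
print(loss_same < loss_diff)  # True
```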
  • Train on OntoNotes 5.0

Download the OntoNotes 5.0 data from the LDC, put it in the data folder, and run the commands above.
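The train/dev/test files are assumed here to follow the standard CoNLL-2003 column format (one token per line with POS, chunk, and NER tags; a blank line between sentences), which Flair's column corpus reader can consume:

```
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O
```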
