Code repository for the EMNLP 2020 proceedings paper Cold-start Active Learning through Self-supervised Language Modeling. The main contribution of the paper is an active learning algorithm called ALPS (Active Learning by Processing Surprisal) that is based on the language modeling objective.
- Create a virtual environment with Python 3.7+
- Run the following commands:
git clone https://github.com/forest-snow/alps.git
cd alps
pip install -r requirements.txt
The repository is organized into the following subfolders:
- src: source code
- scripts: scripts for running experiments
- data: folder for datasets
- models: saved models from running experiments
- analysis: analysis of active learning experiments
All commands below should be run in the top-level directory alps.
To simply fine-tune a model on the full training dataset, run
bash scripts/train.sh
After fine-tuning, the model will be saved under a subdirectory called base in the models directory. Results on the dev set will be saved in eval_results.txt.
You may modify the parameters (model type, task, seed, etc.) in scripts/train.sh by configuring the variables at the top of the script.
To simulate active learning, run
bash scripts/active_train.sh
This script will sample data for a fixed number of iterations and, on each iteration, fine-tune the model on the data sampled so far. Each fine-tuned model will be saved under a subdirectory called {strategy}_{size}, where strategy is the active learning strategy used to sample data and size is the number of examples used to fine-tune the model. Results on the dev set will be saved in eval_results.txt.
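The {strategy}_{size} naming convention can be sketched in Python as follows. This is only an illustration: the helper model_dir is hypothetical and not part of the repository, and the assumption that the subdirectory sits directly under models may not match the actual layout, which could include additional levels.

```python
from pathlib import Path


def model_dir(strategy: str, size: int, root: str = "models") -> Path:
    """Hypothetical helper: build the {strategy}_{size} subdirectory path.

    Assumes the subdirectory sits directly under `root`; the real
    repository may nest it more deeply.
    """
    return Path(root) / f"{strategy}_{size}"


# A model fine-tuned on 100 examples sampled with ALPS.
alps_dir = model_dir("alps", 100)
print(alps_dir.name)
```

Under these assumptions, the dev results for that model would be looked up in an eval_results.txt file inside the directory returned above.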
To modify parameters in scripts/active_train.sh, you can configure the variables at the top of the script. Please read the instructions below for more information.
Here are the naming conventions of the strategies from the paper:
- Random sampling: rand
- Max. entropy sampling: entropy
- ALPS: alps
- BADGE: badge
- BERT-KM: bertKM
- FT-BERT-KM: FTbertKM
So, whenever you want to use ALPS, you would pass in alps as input to the commands presented below.
For active learning strategies that DO NOT require a model already fine-tuned on the downstream task (rand, alps, and bertKM), set the variable SAMPLING to the strategy's name and the variable COLDSTART to none. This will use the method specified in SAMPLING to sample data on each iteration.
For active learning strategies that DO require a model already fine-tuned on the downstream task (badge, entropy, and FTbertKM), set the variable SAMPLING to the strategy's name and the variable COLDSTART to the method used for sampling data in the first iteration. For instance, max. entropy sampling would have SAMPLING set to entropy and COLDSTART set to rand.
NOTE: you must run the simulation for the method specified in COLDSTART for at least one iteration. For example, run rand for 1 iteration before running simulations for entropy.
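The two configuration cases above can be summarized in a short Python sketch. The function check_config is hypothetical and not part of the repository. It also assumes that COLDSTART should name one of the strategies that do not need a fine-tuned model, which is implied by the text but not stated outright.

```python
# Strategy names follow the naming conventions listed in this README.
NO_FINETUNED_MODEL = {"rand", "alps", "bertKM"}
NEEDS_FINETUNED_MODEL = {"badge", "entropy", "FTbertKM"}


def check_config(sampling: str, coldstart: str = "none") -> str:
    """Hypothetical helper: validate a SAMPLING/COLDSTART pair.

    Returns a short description of the resulting sampling behavior,
    or raises ValueError if the pair is inconsistent.
    """
    if sampling in NO_FINETUNED_MODEL:
        if coldstart != "none":
            raise ValueError(f"{sampling} expects COLDSTART=none, got {coldstart!r}")
        return f"sample with {sampling} on every iteration"
    if sampling in NEEDS_FINETUNED_MODEL:
        # Assumption: the first-iteration method must not itself
        # require a fine-tuned model.
        if coldstart not in NO_FINETUNED_MODEL:
            raise ValueError(f"{sampling} needs a cold-start method, got {coldstart!r}")
        return f"sample with {coldstart} on the first iteration, then {sampling}"
    raise ValueError(f"unknown strategy: {sampling!r}")


print(check_config("alps"))
print(check_config("entropy", "rand"))
```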
To set the number of examples sampled on each iteration, configure the variable INCREMENT. To set the maximum total number of examples sampled, configure the variable MAX_SIZE. The number of iterations is MAX_SIZE/INCREMENT.
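As a rough illustration of how INCREMENT and MAX_SIZE determine the iterations, the sketch below lists the cumulative number of sampled examples after each iteration. The function sampling_schedule is hypothetical, and the use of floor division when MAX_SIZE is not a multiple of INCREMENT is an assumption, not confirmed behavior of the script.

```python
from typing import List


def sampling_schedule(increment: int, max_size: int) -> List[int]:
    """Hypothetical helper: cumulative sample sizes per iteration.

    Assumes floor division when max_size is not a multiple of increment.
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    if max_size < increment:
        raise ValueError("max_size must be at least increment")
    n_iterations = max_size // increment
    return [increment * (i + 1) for i in range(n_iterations)]


# Illustrative values: INCREMENT=100 and MAX_SIZE=1000.
schedule = sampling_schedule(100, 1000)
print(len(schedule), schedule[0], schedule[-1])
```

With these illustrative values the schedule has 10 iterations, starting at 100 examples and ending at 1000.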
To test models that have been fine-tuned, run
python -m src.test --models models
This will iterate through every model located in subdirectories of the models folder and evaluate it on the test dataset. However, it will skip any models that are just checkpoints or were not evaluated on a dev set (models trained with the scripts are automatically evaluated on the dev set). The script will output results in test_results.txt.
To analyze the uncertainty and diversity of batches sampled with active learning, run
bash scripts/analyze.sh
This will output a CSV file in the analysis folder containing uncertainty and diversity scores for each sampled batch. The header of the CSV file will be sampling,iteration,task,diversity,uncertainty. Each row gives the diversity and uncertainty scores for the data sampled with a given strategy at a given iteration for a given task.
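A CSV file with this header could be summarized with the Python standard library roughly as follows. Only the header comes from this README. The rows in EXAMPLE_CSV are placeholder values and not real results, example_task is a made-up task name, and mean_scores is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows, NOT real results; only the header is from the README.
EXAMPLE_CSV = """sampling,iteration,task,diversity,uncertainty
alps,1,example_task,0.50,0.25
alps,2,example_task,0.70,0.35
rand,1,example_task,0.40,0.20
"""


def mean_scores(csv_file):
    """Hypothetical helper: mean (diversity, uncertainty) per strategy."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for row in csv.DictReader(csv_file):
        entry = totals[row["sampling"]]
        entry[0] += float(row["diversity"])
        entry[1] += float(row["uncertainty"])
        entry[2] += 1
    return {s: (d / n, u / n) for s, (d, u, n) in totals.items()}


scores = mean_scores(io.StringIO(EXAMPLE_CSV))
for strategy, (diversity, uncertainty) in sorted(scores.items()):
    print(strategy, round(diversity, 3), round(uncertainty, 3))
```

To use it on actual output, pass an open file handle for the CSV written to the analysis folder in place of the io.StringIO object.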
@inproceedings{yuan2020alps,
title={Cold-start Active Learning through Self-supervised Language Modeling},
author={Yuan, Michelle and Lin, Hsuan-Tien and Boyd-Graber, Jordan},
booktitle={Empirical Methods in Natural Language Processing},
year={2020}
}