This repository contains the PyTorch implementation and pre-trained models for ASP, described in [Autoregressive Structured Prediction with Language Models](https://arxiv.org/abs/2210.14698).
Links: ETH-NLPED lab, Rycolab
## Getting started

```bash
git clone https://github.com/lyutyuh/ASP.git
cd ASP
export ASP=$PWD    # set environment variable
```

Create a new environment (asp) with pip:

```bash
python -m venv <path_to_venv>/asp
source <path_to_venv>/asp/bin/activate
pip install -r requirements.txt
```

or with conda:

```bash
conda env create -f environment.yml
```

## Data preprocessing

### Named entity recognition
```bash
wget https://polybox.ethz.ch/index.php/s/bFf8vJBonIT7sr8/download -O ./data/conll03_ner.zip
unzip ./data/conll03_ner.zip -d ./data
rm ./data/conll03_ner.zip
python ./data/conll03_ner/conll03_to_json.py
python ./data/t5minimize_ner.py ./data/conll03_ner ./data/conll03_ner
```

Coming soon!
### End-to-end relation extraction
```bash
wget https://polybox.ethz.ch/index.php/s/Lk44AwhOeDSeZTh/download -O ./data/conll04_ere.zip
unzip ./data/conll04_ere.zip -d ./data
rm ./data/conll04_ere.zip
python ./data/t5minimize_ere.py ./data/conll04_ere/ ./data/conll04_ere
```

ACE-05 is not a publicly available dataset. Please follow https://github.com/luanyi/DyGIE/tree/master/preprocessing to obtain
the dataset json files `{train,dev,test}.json` and copy them to `./data/ace05_ere/`.
Then:
```bash
python ./data/ace05_ere/ace05_to_json.py
python ./data/t5minimize_ere.py ./data/ace05_ere ./data/ace05_ere
```

### Coreference resolution
OntoNotes is not a publicly available dataset. Please follow http://conll.cemantix.org/2012/data.html and https://catalog.ldc.upenn.edu/LDC2013T19 to obtain
the files `{train,dev,test}.english.v4_gold_conll` and copy them to `./data/ontonotes_coref/`.
Then:
```bash
python ./data/t5minimize_coref.py ./data/ontonotes_coref/ ./data/ontonotes_coref/
```

## Training

For task in {ner, ere, coref}:

```bash
python run_{task}.py <config_name> 0
```

Please find the `<config_name>` in each `{ner,ere,coref}.conf` file under `configs`.
## Running on new datasets

- For named entity recognition and relation extraction, convert the new dataset to `<newdataset>_{train,dev,test}.json` in the following format:
```json
[{
    "tokens": ["John", "Wilkes", "Booth", ",", "who", "assassinated", "President", "Lincoln", ",", "was", "an", "actor", "."],
    "entities": [{"type": "Peop", "start": 0, "end": 3}, {"type": "Peop", "start": 6, "end": 8}],
    "relations": [{"type": "Kill", "head": 0, "tail": 1}] # Not necessary for NER
}, ...]
```

and `<newdataset>_types.json`:
```json
{
    "entities": {
        "Loc": {"short": "Loc", "verbose": "Location"},
        "Org": {"short": "Org", "verbose": "Organization"},
        "Peop": {"short": "Peop", "verbose": "People"},
        "Other": {"short": "Other", "verbose": "Other"}
    },
    "relations": { # Not necessary for NER
        "Work_For": {"short": "Work", "verbose": "Work for", "symmetric": false},
        "Kill": {"short": "Kill", "verbose": "Kill", "symmetric": false},
        "OrgBased_In": {"short": "OrgBI", "verbose": "Organization based in", "symmetric": false},
        "Live_In": {"short": "Live", "verbose": "Live in", "symmetric": false},
        "Located_In": {"short": "LocIn", "verbose": "Located in", "symmetric": false}
    }
}
```

and run
```bash
python ./data/t5minimize_ere.py ./data/<newdataset>/ ./data/<newdataset>/
```

- For coreference resolution, convert the new dataset to CoNLL-12 format. Then run
```bash
python ./data/t5minimize_coref.py ./data/<newdataset>/ ./data/<newdataset>/
```

Add a new entry in the corresponding `.conf` file under `configs` with the directory of the new dataset: `data_dir = ${ASP}/data/<newdataset>/`.
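Before running the conversion scripts, it can help to sanity-check a converted file against the format described above. The following is a minimal sketch (the `validate_example` helper and the in-memory sample are illustrative, not part of the repository); it checks that entity spans are valid token ranges with an exclusive `end`, and that relation `head`/`tail` index into the `entities` list:

```python
def validate_example(example, types):
    """Check one dataset entry against the ASP json format."""
    n = len(example["tokens"])
    for ent in example["entities"]:
        # spans are token indices; "end" is exclusive
        assert ent["type"] in types["entities"], f"unknown entity type {ent['type']}"
        assert 0 <= ent["start"] < ent["end"] <= n, f"bad span {ent}"
    for rel in example.get("relations", []):  # optional for NER
        # head/tail are indices into the "entities" list
        assert rel["type"] in types["relations"], f"unknown relation type {rel['type']}"
        assert 0 <= rel["head"] < len(example["entities"])
        assert 0 <= rel["tail"] < len(example["entities"])
    return True

# Sample matching the documented format
types = {
    "entities": {"Peop": {"short": "Peop", "verbose": "People"}},
    "relations": {"Kill": {"short": "Kill", "verbose": "Kill", "symmetric": False}},
}
example = {
    "tokens": ["John", "Wilkes", "Booth", ",", "who", "assassinated",
               "President", "Lincoln", ",", "was", "an", "actor", "."],
    "entities": [{"type": "Peop", "start": 0, "end": 3},
                 {"type": "Peop", "start": 6, "end": 8}],
    "relations": [{"type": "Kill", "head": 0, "tail": 1}],
}
print(validate_example(example, types))  # True
```

Note that `tokens[0:3]` recovers "John Wilkes Booth", which is why `end` must be treated as exclusive.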
## Evaluation

Use the following command to load a pre-trained model and evaluate it on the corresponding task.
`<config_name>` refers to the experiment name given in the `.conf` file under `configs`.
```bash
python evaluate_<task>.py <config_name> <checkpoint_name> <gpu_id>
```

### Coreference resolution

| config_name | checkpoint_name | dataset | link | params |
|---|---|---|---|---|
| flant5_base | tliu/asp-coref-flan-t5-base | CoNLL-2012 (OntoNotes) | link | 220 M | 
| flant5_large | tliu/asp-coref-flan-t5-large | CoNLL-2012 (OntoNotes) | link | 770 M | 
| flant5_xl | tliu/asp-coref-flan-t5-xl | CoNLL-2012 (OntoNotes) | link | 3 B | 
| t0_3b | tliu/asp-coref-t0-3b | CoNLL-2012 (OntoNotes) | link | 3 B | 
### Named entity recognition

| config_name | checkpoint_name | dataset | link | params |
|---|---|---|---|---|
| flant5_base | tliu/asp-ner-flan-t5-base | CoNLL-03 NER | link | 220 M | 
| flant5_large | tliu/asp-ner-flan-t5-large | CoNLL-03 NER | link | 770 M | 
### Relation extraction

| config_name | checkpoint_name | dataset | link | params |
|---|---|---|---|---|
| flant5_base_conll04 | tliu/asp-re-flan-t5-base | CoNLL-04 RE | link | 220 M | 
| flant5_large_conll04 | tliu/asp-re-flan-t5-large | CoNLL-04 RE | link | 770 M | 
| flant5_xl_conll04 | tliu/asp-re-flan-t5-xl | CoNLL-04 RE | link | 3 B | 
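The `config_name`/`checkpoint_name` pairs in the tables above plug directly into `evaluate_<task>.py`. A small launcher sketch (the `CHECKPOINTS` table simply restates the rows above, and the helper name is illustrative; the `ere` task key for the relation-extraction table follows the `{ner,ere,coref}` naming used elsewhere in this README):

```python
# (task, config_name) -> checkpoint_name, copied from the tables above.
# config_name alone is ambiguous: e.g. "flant5_base" appears for both
# coreference and NER, so the task is part of the key.
CHECKPOINTS = {
    ("coref", "flant5_base"):        "tliu/asp-coref-flan-t5-base",
    ("coref", "flant5_large"):       "tliu/asp-coref-flan-t5-large",
    ("coref", "flant5_xl"):          "tliu/asp-coref-flan-t5-xl",
    ("coref", "t0_3b"):              "tliu/asp-coref-t0-3b",
    ("ner",   "flant5_base"):        "tliu/asp-ner-flan-t5-base",
    ("ner",   "flant5_large"):       "tliu/asp-ner-flan-t5-large",
    ("ere",   "flant5_base_conll04"):  "tliu/asp-re-flan-t5-base",
    ("ere",   "flant5_large_conll04"): "tliu/asp-re-flan-t5-large",
    ("ere",   "flant5_xl_conll04"):    "tliu/asp-re-flan-t5-xl",
}

def eval_command(task, config_name, gpu_id=0):
    """Build the evaluation command line for a given task and config."""
    checkpoint = CHECKPOINTS[(task, config_name)]
    return f"python evaluate_{task}.py {config_name} {checkpoint} {gpu_id}"

print(eval_command("ere", "flant5_base_conll04"))
# python evaluate_ere.py flant5_base_conll04 tliu/asp-re-flan-t5-base 0
```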
## Citation

```bibtex
@inproceedings{liu-etal-2022-autoregressive,
    title={Autoregressive Structured Prediction with Language Models},
    author={Tianyu Liu and Yuchen Jiang and Nicholas Monath and Ryan Cotterell and Mrinmaya Sachan},
    year={2022},
    url={https://arxiv.org/abs/2210.14698},
    eprint={2210.14698},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```