# Summary Loop

This repository contains the code to apply the Summary Loop procedure to train a Summarizer in an unsupervised way, without example summaries.

<p align="center">
  <img width="460" height="300" src="https://people.eecs.berkeley.edu/~phillab/images/summary_loop.png">
</p>

## Training Procedure

We provide pre-trained models for each component needed in the Summary Loop release:

- `keyword_extractor.joblib`: An sklearn pipeline that can be used to compute tf-idf scores of words according to the BERT vocabulary, which is used by the Masking Procedure (see the loading sketch below),
- `bert_coverage.bin`: A bert-base-uncased model finetuned on the task of Coverage for the news domain,
- `fluency_news_bs32.bin`: A GPT2 (base) model finetuned on a large corpus of news articles, used as the Fluency model,
- `gpt2_copier23.bin`: A GPT2 (base) model that can be used as an initial point for the Summarizer model.
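
For illustration, the keyword extractor loads like any pickled sklearn object. The sketch below assumes the released pipeline exposes the standard `transform` interface; the actual object in the release may differ:

```
import joblib

# Load the released tf-idf pipeline (assumed here to behave like a
# fitted TfidfVectorizer over the BERT wordpiece vocabulary).
extractor = joblib.load("keyword_extractor.joblib")

doc = "The mayor announced a new budget for public transit on Tuesday."
tfidf_scores = extractor.transform([doc])  # one sparse row of tf-idf scores
print(tfidf_scores.shape)
```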

We also provide:
- `pretrain_coverage.py`, a script to train a Coverage model from scratch,
- `train_generator.py`, a script to train a Fluency model from scratch (we recommend finetuning the Fluency model on the domain of the summaries, such as news, legal, etc.); an example invocation is sketched below.
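
For instance, the initial Summarizer (`gpt2_copier23.bin`) is a GPT2 model finetuned to copy its input, which teaches it to use the `<END>` token; that finetuning uses the copy task of `train_generator.py`. Any dataset or output flags the script may additionally need are omitted here:

```
python train_generator.py --task copy
```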

Once all the pretrained models are ready, training a summarizer can be done using the `train_summary_loop.py` script:
```
python train_summary_loop.py --experiment wikinews_test --dataset_file data/wikinews.db
```

## Scorer Models

The Coverage and Fluency model scores can be used separately for analysis, evaluation, etc.
They are implemented in `coverage.py` and `fluency.py`, respectively; each model is a class with a `score(document, summary)` function.
Examples of how to run each model are included in the class files, at the bottom of the files.
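
As a complement, usage looks roughly like the sketch below. The class name and its `model_file` argument are hypothetical stand-ins (the exact interfaces are in the class files); only the `score(document, summary)` signature is the one described above:

```
# Hedged sketch: CoverageScorer and its model_file argument are
# hypothetical stand-ins for the actual class in coverage.py.
from coverage import CoverageScorer

scorer = CoverageScorer(model_file="bert_coverage.bin")
document = "Full text of a news article goes here ..."
summary = "A short candidate summary."
print(scorer.score(document, summary))
```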

## Further Questions

Feel free to contact me at phillab@berkeley.edu to discuss the results, the code, or future steps.