The license was chosen based on the Kaggle rules. Winner License Type: Open Source - MIT
What to do to train a model
Models are trained by modifying `val_fold` and `config` at the top of `train_script.py` in the corresponding folder and then running:

```bash
# GPUN being a gpu number
python train_script.py GPUN &
```

The script assumes that there is a `checkpoints` directory in the same location.
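The exact variables differ between models; as a purely hypothetical illustration of the kind of block that gets edited at the top of a `train_script.py` (names and values below are placeholders, not the repo's actual config):

```python
# Hypothetical illustration only -- the real variables sit at the top of each
# model's train_script.py and differ between models.
val_fold = 0                           # which fold to hold out for validation
config = {
    "max_len": 2048,                   # all models here are trained with max_len 2048
    "epochs": 8,                       # placeholder value
    "lr": 1e-5,                        # placeholder value
    "checkpoints_dir": "checkpoints",  # must exist next to the script
}
```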
Notes to help in understanding the codebase and starting from it easily.

```
feedback
├── kaggle_inference_notebooks    # inference notebooks of each model for kaggle
│   ├── deberta
│   ├── longformer
│   ├── xlnet
│   └── ...                       # TODO: Add more models
│
├── models_training               # training code of each model
│   ├── deberta
│   ├── longformer
│   │   ├── longformer            # original Longformer code
│   │   │
│   │   ├── submission            # code for the Longformer model submission
│   │   │   ├── codes             # modified Longformer & Huggingface code
│   │   │   ├── pretrained_checkpoints
│   │   │   ├── tokenizer
│   │   │   └── weights
│   │   │
│   │   ├── submission_large      # same as `submission` above
│   │   └── ...
│   ├── xlnet
│   ├── ...                       # TODO: Add more models
│   │
│   ├── oof                       # out-of-fold predictions
│   └── post processing
│
├── train.csv
└── check_and_split_data.ipynb
```

`check_and_split_data.ipynb` was used to make the splits.
- It is not deterministic due to RAPIDS UMAP, so the produced splits are also included in that folder.
- The RAPIDS UMAP code is mostly taken from the Kaggle notebook cdeotte/rapids-umap-tfidf-kmeans-discovers-15-topics.
`train.csv` is a slightly cleaner version of the public train file.
- It was made semi-manually after searching for entities where the symbol before the first letter of `discourse_text` was alphanumeric (see the sketch below).
- It has several columns related to the gt label: the hosts-provided target is `discourse_text`, and what is scored is the overlap with `predictionstring`.
- Those columns are all noisy targets; `discourse_text` worked best in preliminary tests.
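A minimal sketch of that search, assuming the competition's `train/` folder of essay `.txt` files and the public columns `id`, `discourse_start`, and `discourse_text`:

```python
# Sketch only: assumes the competition's train/ folder of essay .txt files.
from pathlib import Path
import pandas as pd

df = pd.read_csv("train.csv")
essays = {p.stem: p.read_text() for p in Path("train").glob("*.txt")}

def starts_mid_word(row):
    """True when the character right before discourse_text is alphanumeric,
    i.e. the annotated entity starts in the middle of a word."""
    start = int(row["discourse_start"])
    return start > 0 and essays[row["id"]][start - 1].isalnum()

suspicious = df[df.apply(starts_mid_word, axis=1)]
```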
`data_rev1.csv` was made in a similar process, looking for starts/ends of `discourse_text` split in `train.csv`:
- It covers samples where `discourse_text` starts 1 word before a punctuation mark or ends 1 word after a punctuation mark.
- `data_rev1.csv` was made with a script in the `longformer` directory, and the new `train.csv` in the same way as for debertav3, except for character replacement.
- `deberta` - not deterministic, yet better results, faster training, and faster submission as well.
- `longformer` - the training scripts in the longformer directory are deterministic, but slow.
- `xlnet` - ...
- TODO: Add more models.
- Other models with relative positional encoding are the ERNIE series from Baidu.
- Longformer, BigBird, and ETC are based on `roberta` checkpoints.
- Training scripts are in `models_training`.
  - Includes some modified import code in the `./models_training/longformer/submission` folder.
  - Training data for `longformer` and for `debertav1` is made by the script in the longformer folder, as it was assumed that the tokenizers are identical.
  - Also, when making that particular data, the original `train.csv` was used.
- The `deberta` folder has a notebook to make data for debertav3.
- In the `longformer` folder, `./models_training/longformer/submission/codes/new_transformers_branch/transformers` is from mingboiz/transformer.
- The `xlnet` folder contains `check_labels.ipynb`, which is used to sanity-check the produced data files, and also has a notebook to prepare the training data.
- Submission notebooks are in `code/kaggle_inference_notebooks`.
- Submission time:
  - `longformer` - 40 minutes for 1 fold
  - `debertav1` - 22 minutes for 1 fold
- Make sure `entities` start from an alphanumeric character.
- Class weights.
- Label smoothing.
- Global attention to the `sep`/`cls` token and to `[.?!]` tokens for longformer.
- SWA (a sliding-window version of it).
- Reverse cross entropy (see the sketch after this list).
  - Reverse cross entropy appears to have sped up convergence; it may allow reducing the number of epochs to 7 or fewer.
- Making sure that the tokenization of `xlnet` and `debertav3` preserves newlines; otherwise there is a severe drop in performance.
- Mixup - briefly tried; looks like the same results.
- Cleaning unicode artefacts in the data with ftfy and regex substitutions.
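For reference, a minimal sketch of reverse cross entropy in its usual symmetric cross entropy combination (Wang et al., 2019). This illustrates the shape of the loss, not the exact code in the training scripts:

```python
import torch
import torch.nn.functional as F

def reverse_cross_entropy(logits, targets, num_classes, log_zero=-4.0):
    """RCE swaps the roles of prediction and label in cross entropy.
    log(one_hot) is -inf off the true class, so it is clamped to a finite
    value (the symmetric-CE paper uses A = -4)."""
    pred = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes).float()
    log_label = torch.clamp(torch.log(one_hot + 1e-12), min=log_zero)
    return -(pred * log_label).sum(dim=-1).mean()

def symmetric_ce(logits, targets, num_classes, alpha=1.0, beta=1.0):
    """The combination RCE is usually used in: alpha * CE + beta * RCE."""
    return (alpha * F.cross_entropy(logits, targets)
            + beta * reverse_cross_entropy(logits, targets, num_classes))
```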
| Model | Folds | Epochs | Training Time | Val | CV | LB | Special note |
|---|---|---|---|---|---|---|---|
| Xlnet | 5 | - | rtx3090 x 1 19h | - | - | - | |
| Longformer | 5 | - | rtx3090 x 1 19h30 | - | - | 0.670 | with bug entity |
| Debertav1 | 5 | - | rtx3090 x 1 13h | - | - | 0.678 | with bug entity |
| Debertav1 | 5 | - | rtx3090 x 1 13h | - | - | 0.681 | partially fixed entity extraction |
| Debertav1 | 5 | - | rtx3090 x 1 13h | - | 0.69724 | 0.699 | fixed entity extraction + adding filtering based on minimal number of words in predicted entity and some confidence thresholds |
| Longformer + Debertav1 | 5 | - | - | - | 0.69945 | 0.700 | fixed entity extraction + adding filtering based on minimal number of words in predicted entity and some confidence thresholds |
- The code used to find the thresholds was ad hoc and does not optimize the correct metric (a sketch of the filtering idea follows below).
- The above models were validated using the bugged entity-extraction code, so the models may be suboptimal.
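A hedged sketch of the kind of filtering described above; the per-class numbers and names here are placeholders, not the thresholds actually used:

```python
# Placeholder thresholds -- the real ones were found with an ad-hoc search.
MIN_WORDS = {"Lead": 9, "Position": 5, "Evidence": 14}
MIN_CONF = {"Lead": 0.70, "Position": 0.55, "Evidence": 0.65}

def keep_entity(cls, word_ids, mean_prob):
    """Drop predicted entities that are too short or too low-confidence."""
    return (len(word_ids) >= MIN_WORDS.get(cls, 1)
            and mean_prob >= MIN_CONF.get(cls, 0.5))

# usage: filtered = [e for e in entities if keep_entity(*e)]
```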
- Training of xlnet looks deterministic.
- RAM:
  - 4 xlnets training in parallel take 220 GB of RAM.
  - 4 debertav1 barely fit in 256 GB.
  - 4 debertav3 will likely not fit.
- Wandb Logs
- Finish training `xlnet` and train a `debertav3`.
- Train one more transformer as a stacking model, adding probability-weighted embeddings of the predicted token types to the word embeddings (see the sketch below).
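A rough sketch of what that stacking idea could look like, assuming a HuggingFace-style encoder and per-token class probabilities from a first-stage model; everything here is an assumption about the TODO, not code from this repo:

```python
import torch.nn as nn

class StackedTokenClassifier(nn.Module):
    """Second stage: add probability-weighted label embeddings from a
    first-stage model to the word embeddings before the encoder body."""
    def __init__(self, backbone, num_labels):
        super().__init__()
        self.backbone = backbone  # any HF encoder accepting inputs_embeds
        hidden = backbone.config.hidden_size
        self.label_emb = nn.Embedding(num_labels, hidden)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask, first_stage_probs):
        # first_stage_probs: (batch, seq_len, num_labels) from stage one
        word_emb = self.backbone.get_input_embeddings()(input_ids)
        soft_label_emb = first_stage_probs @ self.label_emb.weight
        out = self.backbone(inputs_embeds=word_emb + soft_label_emb,
                            attention_mask=attention_mask)
        return self.head(out.last_hidden_state)
```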
Q: Is the `../../data_rev1.csv` file used in `prepare_data_for_longformer_bkpv1.ipynb` (which makes the train data for longformer and debertav1) the same file as `train.csv`?

A: Almost the same; use `train.csv`.
The labels format used was:
0 - outside
1 - b-lead
2 - i-lead
3 - b-position
4 - i-position, etc.
When scanning the argmaxed predictions, a new entity is started when an odd prediction is encountered, and the current entity is closed when the prediction is 0 or when prediction != current category + 1.

The bugged version only had the check for an odd number; you can see it in the train scripts of longformer and debertav1, in the function `extract_entities`. The fixed version is in the train script of xlnet: it closes an entity on an odd prediction, when the prediction is 0, or when prediction != current category + 1.
For example, if the prediction was `1 2 2 4 6 8 10 0 0 0 3 4 4 ...`:

- the old code would extract entities: `1: [0 - 9, ...], 3: [10 - ...]`
- the new code would extract entities: `1: [0 - 2, ...], 3: [10 - ...]`

A compact sketch of the fixed logic follows.
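This is a sketch of the fixed scanning logic described above; the actual `extract_entities` in the xlnet train script may differ in details:

```python
def extract_entities(preds):
    """preds: argmaxed label ids, where 0 = outside, odd = B-label of a
    class, even = that class's I-label (B + 1).
    Returns {b_label: [(start, end), ...]} with inclusive indices."""
    entities, start, cat = {}, None, None
    for i, p in enumerate(preds):
        # fixed logic: close the open entity on 0, on any odd label,
        # or when the prediction is not this class's I-label (cat + 1)
        if start is not None and (p == 0 or p % 2 == 1 or p != cat + 1):
            entities.setdefault(cat, []).append((start, i - 1))
            start, cat = None, None
        if p % 2 == 1:  # an odd label opens a new entity
            start, cat = i, p
    if start is not None:
        entities.setdefault(cat, []).append((start, len(preds) - 1))
    return entities

print(extract_entities([1, 2, 2, 4, 6, 8, 10, 0, 0, 0, 3, 4, 4]))
# {1: [(0, 2)], 3: [(10, 12)]}
```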
Q: Why is the performance similar or better when the newline (`\n`) is recognized in deberta compared to longformer?

A: In the longformer, the same tokenizer as in roberta is used; that one is also used for debertav1, and it preserves newlines. When using the xlnet tokenizer or the debertav3 tokenizer, the newlines are gone.
Summary:
- `longformer` - `\n` token as newline
- `roberta` - `\n` token as newline
- `debertav1` - `\n` token as newline
- `xlnet` - `<eop>` token as a newline
- `debertav3` - `[MASK]` token as a newline
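A sketch of the kind of replacement this implies, assuming the text is preprocessed before tokenization; the token choices follow the summary above, and the function itself is illustrative:

```python
# Map "\n" to a token each tokenizer keeps, so newline information survives.
NEWLINE_TOKEN = {
    "xlnet": "<eop>",       # xlnet's vocab already contains <eop>
    "debertav3": "[MASK]",  # repurposed here as a newline marker
}

def encode_with_newlines(tokenizer, text, model_family):
    marker = NEWLINE_TOKEN.get(model_family)
    if marker is not None:  # roberta-style tokenizers keep "\n" as-is
        text = text.replace("\n", marker)
    return tokenizer(text, truncation=True, max_length=2048)
```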
Overall, deberta produces better results. All models are trained with `max_len` 2048.
The submission with the 0.700 score includes the longformer model as well.
Note that `from tvm import te` is different from `import tvm as te`. The library namespace has changed: a few years ago a tvm variable was made with `tvm.var`, while in the latest release it is `tvm.te.var`, but the current longformer library still uses `tvm.var`.

In the end `tvm.var` turned out to be useless:
- The custom gpu kernel turned out to be useless: while it takes less gpu RAM for training, it is also slower and not deterministic.
- That file is needed to build and compile the custom gpu kernel.

So to use `tvm.te.var`, the following change was made:
```python
# before
import tvm
b = tvm.var('b')

# after
from tvm import te
b = te.var('b')
```

- ./models_training/longformer/longformer/longformer/longformer.py#L187-L188
- ./models_training/longformer/longformer/longformer/longformer.py#L263-L264
Other changes to that code (some indexing modifications and attention-mask broadcasts) were made so that the code works with `torch.use_deterministic_algorithms(True)`, to make training deterministic when using global attention. Also, there is a crucial semicolon on line 264.
