Models: Hugging Face implementations of BERT/DistilBERT, fine-tuned to classify text. BERT Docs
Article that explains how BERT-style models work.
Structure:
```mermaid
flowchart TD
A[Load in and normalize text] --> B[Split into transformers datasets]
B --> X[Pick and instantiate model]
X --> C[Tokenize text]
C --> D[Train base model]
D --> E[Evaluate on unseen data]
E --> F[Optimize hyperparameters]
A --datasets used--- newLines(["Amazon reviews classified to product category
Movie plots classified to genre
News headlines classified to news category"])
D --hyperparams--- S([Learning rate, batch size, num epochs, weight decay, evaluation metric, etc.])
C --hyperparams--- Y([padding, max_length, truncation])
X --options--- Z([Bert, Distilbert, Roberta, GPT, etc.])
```
From the Hugging Face website:

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
- Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences followed each other or not.

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
Model size: 110M params
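As a quick illustration of the MLM objective, the fill-mask pipeline shows BERT predicting a masked token; this is a minimal sketch along the lines of the model card's usage example:

```python
from transformers import pipeline

# BERT proposes the most likely tokens for the [MASK] position.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("Hello I'm a [MASK] model."):
    print(prediction["token_str"], round(prediction["score"], 3))
```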
From the DistilBERT model card:

DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a self-supervised fashion using the BERT base model as a teacher. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts using the BERT base model. More precisely, it was pretrained with three objectives:

- Distillation loss: the model was trained to return the same probabilities as the BERT base model.
- Masked language modeling (MLM): same as above.
- Cosine embedding loss: the model was also trained to generate hidden states as close as possible to those of the BERT base model.

This way, the model learns the same inner representation of the English language as its teacher model, while being faster for inference and downstream tasks.
Model size: 67M params
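To make the three objectives concrete, here is a rough, illustrative sketch of how such a combined loss could be written in PyTorch. The temperature value, the equal weighting of the terms, and all tensor names are assumptions for illustration, not the actual DistilBERT training code:

```python
import torch
import torch.nn.functional as F

def distillation_style_loss(student_logits, teacher_logits, student_hidden, teacher_hidden,
                            mlm_labels, temperature=2.0):
    """student/teacher_logits: (batch, seq, vocab); *_hidden: (batch, seq, dim); mlm_labels: (batch, seq)."""
    # 1) Distillation loss: match the teacher's softened token probabilities.
    distill = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # 2) Masked language modeling loss on the student's own predictions (-100 = unmasked positions).
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    # 3) Cosine embedding loss: pull student hidden states toward the teacher's.
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1), device=student_hidden.device)
    cosine = F.cosine_embedding_loss(student_hidden.view(-1, student_hidden.size(-1)),
                                     teacher_hidden.view(-1, teacher_hidden.size(-1)), target)
    # Equal weights here purely for simplicity.
    return distill + mlm + cosine
```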
Datasets used to pretrain BERT and DistilBERT:
- English Wikipedia
- BookCorpus (11,038 unpublished books)
For the review text data, I combined the title and review body for simplicity and sent the result through a text normalization function.
Current functionality (a sketch follows the list below):
- Expanding contractions
- Removing punctuation and any formatting characters
- Lowercasing
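A minimal sketch of such a normalization function; the contraction map here is deliberately tiny and illustrative, and the exact implementation in the notebook may differ:

```python
import re

# Small, illustrative contraction map -- the real function likely covers many more cases.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "'re": " are",
    "'s": " is",
    "'ll": " will",
    "'ve": " have",
}

def normalize_text(text: str) -> str:
    """Lowercase, expand contractions, and strip punctuation/formatting characters."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS.items():
        text = text.replace(pattern, replacement)
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation and other symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace / formatting characters
    return text

print(normalize_text("Perfect! Just what I needed. It's great for a western-theme party."))
# -> "perfect just what i needed it is great for a western theme party"
```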
I then one-hot encoded the classes and converted the pandas DataFrame into a dataset from the datasets library with train, test, and validation splits. These are easier to use than pandas with the transformers library for training.
Next, I made a dict mapping each class to an integer.
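A minimal sketch of that conversion, assuming a DataFrame `df` with hypothetical `text` and `category` columns (the actual column names, split sizes, and helper code in the notebook may differ):

```python
from datasets import Dataset, DatasetDict

# Class-to-int mappings.
classes = sorted(df["category"].unique())
id2label = {i: label for i, label in enumerate(classes)}
label2id = {label: i for i, label in enumerate(classes)}

# Simple 80/10/10 train/validation/test split.
train_df = df.sample(frac=0.8, random_state=42)
rest_df = df.drop(train_df.index)
val_df = rest_df.sample(frac=0.5, random_state=42)
test_df = rest_df.drop(val_df.index)

dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
    "test": Dataset.from_pandas(test_df),
})

def one_hot(example):
    # One-hot encode the class as a float vector (floats are what the multi-label loss expects).
    labels = [0.0] * len(classes)
    labels[label2id[example["category"]]] = 1.0
    example["labels"] = labels
    return example

dataset = dataset.map(one_hot)
```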
For each model I used the corresponding tokenizer from Hugging Face, with the same parameters for each:

```python
tokenizer(text, padding="max_length", truncation=True, max_length=256)
```
The max_length of 256 and the resulting truncation does cut off some data, but my GPU (RTX 3070, 8 GB VRAM) was not able to handle anything larger with a training batch size of 8. The padding is on the right, as recommended.
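Applied to the dataset splits, the tokenization might look like the following sketch, assuming `bert-base-uncased` and the `dataset` object from the split sketch above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Right-padding (the default) up to a fixed length of 256 tokens; longer reviews are truncated.
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)
```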
The training loop I used was based on this implementation, which I found linked in the transformers documentation. It might be slightly outdated and sub-optimal based on the deprecation warnings thrown, but it gets the job done.
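For reference, a roughly equivalent setup using the current Trainer API might look like the sketch below; this is not the notebook's actual loop, and the hyperparameters and output directory are illustrative:

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(classes),
    problem_type="multi_label_classification",  # sigmoid output per class with a BCE loss
    id2label=id2label,
    label2id=label2id,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1 / (1 + np.exp(-logits))    # sigmoid over each class
    preds = (probs >= 0.5).astype(int)   # 0.5 confidence threshold, as described below
    return {"f1": f1_score(labels.astype(int), preds, average="micro")}

args = TrainingArguments(
    output_dir="bert-product-category",  # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
```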
These models are set up to return the probability of each text example belonging to each class. To convert those probabilities to binary outcomes, a threshold is set: everything above it counts as a prediction, everything below it does not. A very low threshold might return 3 or more predictions, indicating that the model is moderately confident the correct classification is included in the set of predictions; how useful that is depends on the specifics of each use case. If classification into a single class is necessary, the example will have to be manually classified anyway, and having a short list could prove useful. A high threshold returns a lower proportion of wrong predictions, but also a much higher proportion of non-predictions. In the case of single-class classification, these non-predictions would still have to be manually classified. The graphs below highlight this, and the right balance will have to be determined depending on the circumstances of each case.
The confidence threshold used in the training loop was 0.5 for all of the models. Finding an optimal value for each use case is a chance to improve the fine-tuning, but this has not been explored yet.
The accuracy of each model is calculated by giving correct predictions a 1 and incorrect predictions a 0. At low confidence thresholds the model can return many predictions per example. I wanted a way to include this in the evaluation, since a model that predicts every class for every example is not useful and doing so inflates the accuracy. The formula is 0.9^(number of classes guessed): 2 guesses score 0.81 instead of 1, 3 guesses 0.73, 4 guesses 0.66, and so on. The 0.9 base is arbitrary; constant values from 0.8 to 0.9 gave a decent representation of performance. The discounted score quickly converges with the plain correct score at thresholds above roughly 0.3, because the model starts returning only one prediction. The quantities tracked are listed below (a sketch implementing them follows the list).
- threshold: confidence threshold.
- correct: the correct class was included in the classes returned by the model.
- correct_discount: the multi-guess penalty, explained above.
- correct_non_preds: the correct class was returned OR no prediction was made. No prediction is better than a wrong one in some cases, so this is accounted for.
- non_preds: no prediction was made.
- wrong_preds: a prediction was made and the correct class was NOT returned by the model.
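A minimal sketch of how these quantities might be computed from sigmoid probabilities, assuming a NumPy array of per-class probabilities and integer true-class ids (the names and structure are illustrative, not taken from the notebooks):

```python
import numpy as np

def evaluate_threshold(probs: np.ndarray, true_ids: np.ndarray, threshold: float) -> dict:
    """probs: (n_examples, n_classes) sigmoid outputs; true_ids: (n_examples,) correct class ids."""
    n = len(true_ids)
    correct = correct_discount = correct_non_preds = non_preds = wrong_preds = 0.0
    for prob_row, true_id in zip(probs, true_ids):
        predicted = np.where(prob_row >= threshold)[0]  # every class above the threshold
        if len(predicted) == 0:
            non_preds += 1
            correct_non_preds += 1                       # no prediction is counted as "not wrong"
        elif true_id in predicted:
            correct += 1
            # Per the write-up: a single correct guess scores 1, multiple guesses are
            # discounted by 0.9 ** (number of classes guessed).
            correct_discount += 1.0 if len(predicted) == 1 else 0.9 ** len(predicted)
            correct_non_preds += 1
        else:
            wrong_preds += 1
    totals = {
        "correct": correct,
        "correct_discount": correct_discount,
        "correct_non_preds": correct_non_preds,
        "non_preds": non_preds,
        "wrong_preds": wrong_preds,
    }
    return {name: value / n for name, value in totals.items()}

# Example: sweep thresholds to see the accuracy / coverage trade-off.
# for t in np.arange(0.05, 0.95, 0.05):
#     print(round(t, 2), evaluate_threshold(probs, true_ids, t))
```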
Corresponding notebooks:
Initial fine-tuning: train_bert_model.ipynb
Loading and evaluation: loaded_fine_tuned_bert.ipynb
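Loading a fine-tuned checkpoint for evaluation presumably looks something like the sketch below; the checkpoint directory and example text are placeholders, not the notebook's actual paths:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "bert-product-category" is a hypothetical local checkpoint directory.
model = AutoModelForSequenceClassification.from_pretrained("bert-product-category")
tokenizer = AutoTokenizer.from_pretrained("bert-product-category")

text = "perfect just what i needed perfect for western theme party"
inputs = tokenizer(text, padding="max_length", truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]

# Every class whose probability clears the 0.5 confidence threshold counts as a prediction.
predicted = [model.config.id2label[i] for i, p in enumerate(probs) if p >= 0.5]
print(predicted)
```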
Data Example:
Review | Product category |
---|---|
perfect just what i needed perfect for western theme party | apparel |
Model: bert-base-uncased
Training size: 100k
Training time: 2 hrs
Number of classes: 30
Evaluation dataset results:
- eval loss: 0.083
- eval f1: 0.523
- eval ROC AUC: 0.711
- eval accuracy: 0.428
Results on unseen test data (%):
Model: bert-base-uncased (same as above with less training data)
Training size: 10k
Training time: 12 mins
Number of classes: 30
Evaluation dataset results:
- eval loss: 0.105
- eval f1: 0.227
- eval ROC AUC: 0.567
- eval accuracy: 0.136
Results on unseen test data (%):
Corresponding notebooks:
Data Example:
Text | Genre |
---|---|
three schoolgirls are infatuated with a yakuza... | crime |
Model: distilbert-base-uncased
Training size: 18244
Training time: 12 mins
Number of classes: 100
Evaluation dataset results:
- eval loss: 0.031
- eval f1: 0.398
- eval ROC AUC: 0.648
- eval accuracy: 0.299
Results on unseen test data (%):
Corresponding notebooks:
Data Example:
Text | Genre |
---|---|
tips for your child s first summer sleep away ... . | PARENTS |
Model: bert-base-uncased
Training size: 18750
Training time: 24 mins
Number of classes: 41
Evaluation dataset results:
- eval loss: 0.0537
- eval f1: 0.595
- eval ROC AUC: 0.743
- eval accuracy: 0.490
Results on unseen test data (%):
- Get RoBERTa to work. notebook
- Training confidence threshold
- In evaluation, test confidence thresholds from 0.01 to 0.1. The optimal value might lie in this range.
- DeBERTa notebook working; needs to be trained on more data. It takes much longer than BERT to train because the batch size had to be cut in half to avoid memory errors.
- DeBERTa-v2 is too big to fit on this machine, even with batch size = 1.
- Pre-train on an engineering corpus? notebook