Commit c8fcd5b: "added contribution"
1 parent 962e9bb


README.md

Lines changed: 20 additions & 73 deletions
@@ -221,8 +221,8 @@ The distribution of the different classes in the SST dataset is not equal (class
To balance the QQP and SST training sets we add weights to our Cross-Entropy loss function such that a training sample from a small class is assigned a higher weight. This resulted in the following performance:
| Model name | SST accuracy | QQP accuracy | STS correlation |
| ------------------ |---------------- | -------------- | -------------- |
-| sBERT-Sophia_base | .. % | .. % | .. % |
-| sBERT-Sophia_dropout | .. % | ..% | ..% |
+| Sophia Baseline (Finn) | 45% | 77.8% | 32% |
+| Sophia balanced data | 47.8% | 81.8% | 45.5% |

Use the same command as in the previous section and add the argument ```--para_sep True --weights True``` to reproduce the results.
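
For illustration, a minimal sketch of a class-weighted cross-entropy loss in PyTorch; the label tensor and class count below are made-up examples, not the project's actual data pipeline:

```python
import torch
import torch.nn as nn

# Made-up labels for a 5-class SST-style setup (class indices 0..4).
train_labels = torch.tensor([0, 1, 1, 2, 2, 2, 2, 3, 4, 4])

# Inverse-frequency weights: samples from small classes get a higher weight.
num_classes = 5
counts = torch.bincount(train_labels, minlength=num_classes).float()
weights = counts.sum() / (num_classes * counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(train_labels.size(0), num_classes)  # stand-in for model outputs
loss = criterion(logits, train_labels)
```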

@@ -232,13 +232,12 @@ With this approach we could improve the performance on the SST dataset compared
...

#### Additional layers
-Another problem we observed earlier was that the tasks contradict each other, i.e. when separating the QQP training, the paraphrasing accuracy increased but the other two accuracies decreased. We tried to solve these conflicts by adding a simple neural network with one hidden layer as classifier for each task instead of only a linear classifier. The idea is that each task gets more parameters to adjust which are not influenced by the other tasks. As activation function in the neural network we tested ReLU and tanh activation layers between the hidden layer and the output, but both options performed equally poorly.
-| Model name | SST train_accuracy | QQP train_accuracy | STS train_correlation |
-| ------------------ |---------------- | -------------- | -------------- |
-| sBERT-Sophia_base | .. % | .. % | .. % |
-| sBERT-Sophia_dropout | .. % | ..% | ..% |
-| Adam base | .. % | .. % | .. % |
-| Adam additional layers | .. % | ..% | ..% |
+Another problem we observed earlier was that the tasks contradict each other, i.e. when separating the QQP training, the paraphrasing accuracy increased but the other two accuracies decreased. We tried to solve these conflicts by adding a simple neural network with one hidden layer as classifier for each task instead of only a linear classifier. The idea is that each task gets more parameters to adjust which are not influenced by the other tasks. As activation function in the neural network we tested ReLU and tanh activation layers between the hidden layer and the output. The ReLU activation function performed better. Furthermore, we tried to freeze the BERT parameters in the last training epochs and only train the classifier parameters. This improved the performance, especially on the SST dataset.
+| Model name | SST accuracy | QQP accuracy | STS correlation |
+| ------------------ |---------------- | -------------- | -------------- |
+| Adam new base | 50.3% | 86.4% | 84.7% |
+| Adam additional layer | 50% | 88.4% | 84.4% |
+| Adam extra classifier training | 51.6% | 88.5% | 84.3% |

Use the same command as in the previous section and add the argument ```--para_sep True --weights True --add_layers True``` to reproduce the results.
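
As a rough sketch of the per-task heads and the classifier-only phase described above, assuming a BERT encoder that returns pooled embeddings of size `hidden_size`; the class name, layer sizes and output dimensions are illustrative, not the repository's actual implementation:

```python
import torch.nn as nn

class PerTaskHeads(nn.Module):
    """One small MLP per task instead of a single linear classifier."""
    def __init__(self, hidden_size: int = 768, mlp_size: int = 256):
        super().__init__()
        def make_head(out_dim: int) -> nn.Sequential:
            # One hidden layer with a ReLU between hidden layer and output.
            return nn.Sequential(
                nn.Linear(hidden_size, mlp_size),
                nn.ReLU(),
                nn.Linear(mlp_size, out_dim),
            )
        self.sst_head = make_head(5)   # sentiment classes
        self.para_head = make_head(1)  # paraphrase logit
        self.sts_head = make_head(1)   # similarity score

def freeze_encoder(bert: nn.Module) -> None:
    """For the last epochs: freeze BERT and train only the classifier heads."""
    for param in bert.parameters():
        param.requires_grad = False
```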

@@ -313,66 +312,14 @@ Tensorboard: Aug25_09-53-27_ggpu137shared

### Custom Attention
[Generalisations on Custom Attention](https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/milestones/11#tab-issues)
-
-We tried changing the normal custom attention formula:
-
-1) Generalize $QK^T$ with a symmetric linear combination of both $Q, K$ and learn the combination:
-
-$$attention(Q, K, V) = softmax\left(\frac{(\alpha_1 Q + \alpha_2 K + \alpha_3 I)(\beta_1 Q + \beta_2 K + \beta_3 I)^T}{\sqrt{d_k}}\right)V$$
-
-2) Replace softmax with sparsemax (see https://arxiv.org/abs/1602.02068v2):
-
-$$attention(Q, K, V) = sparsemax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
-
-3) Add an additional learnable center matrix in between:
-
-$$attention(Q, K, V) = softmax\left(\frac{QWK^T}{\sqrt{d_k}}\right)V$$
-
-For ideas 1 and 3 we recover the original self-attention with specific parameter choices. We also found a paper that proposed the second idea. The goal was for the model to keep the original parameters but gain more freedom in manipulating them by adding a few extra parameters inside all the BERT layers. We later realized that all 3 ideas could be combined, resulting in 8 different models (1 baseline + 7 extra):
-
-| Model name | SST accuracy | QQP accuracy | STS correlation |
-| -------------------------- | ------------ | ------------ | --------------- |
-| sBERT-BertSelfAttention (baseline) | 44.6% | 77.2% | 48.3% |
-| sBERT-LinearSelfAttention | 40.5% | 75.6% | 37.8% |
-| sBERT-NoBiasLinearSelfAttention | 40.5% | 75.6% | 37.8% |
-| sBERT-SparsemaxSelfAttention | 39.0% | 70.7% | 56.8% |
-| sBERT-CenterMatrixSelfAttention | 39.1% | 76.4% | 43.4% |
-| sBERT-LinearSelfAttentionWithSparsemax | 40.1% | 75.3% | 40.8% |
-| sBERT-CenterMatrixSelfAttentionWithSparsemax | 39.1% | 75.6% | 40.4% |
-| sBERT-CenterMatrixLinearSelfAttention | 42.4% | 76.2% | 42.4% |
-| sBERT-CenterMatrixLinearSelfAttentionWithSparsemax | 39.7% | 76.4% | 39.2% |
-
-Our baseline was different because we used other starting parameters (larger batch size, fewer parameters). We did this to reduce the training time for this experiment, see also ``submit_custom_attention.sh``:
-
-```
-python -B multitask_classifier.py --use_gpu --epochs=10 --lr=1e-5 --custom_attention=$CUSTOM_ATTENTION
-```
-
-Except for the SparsemaxSelfAttention STS correlation, all values declined. This is most likely due to overfitting: making the model even more complex makes the overfitting, and thus the performance, worse.
+- At this stage we consider three ideas for generalising the BERT self-attention with additional hyperparameters (see https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/issues/54).
+- Although involving more hyperparameters should in principle improve the results, we get slightly lower accuracy because of overfitting.
+- Sparsemax (paper): https://arxiv.org/abs/1602.02068v2.
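
As an illustration, a simplified single-head version of idea 3 (the learnable center matrix); the class below is a sketch, not the attention module used inside the project's BERT layers:

```python
import math
import torch
import torch.nn as nn

class CenterMatrixSelfAttention(nn.Module):
    """Self-attention with an extra learnable matrix W between Q and K^T."""
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_k)
        self.key = nn.Linear(d_model, d_k)
        self.value = nn.Linear(d_model, d_k)
        # Initialized to the identity, so training starts at the original attention.
        self.center = nn.Parameter(torch.eye(d_k))
        self.d_k = d_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ self.center @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        return torch.softmax(scores, dim=-1) @ v
```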

### Splitted and reordered batches
[Splitted and reordererd batches](https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/milestones/12#tab-issues)
-
-The para dataset is much larger than the other two. Originally, we trained para last and then evaluated all 3 independently of each other. This has the effect that the model is optimized towards para, but forgets information from sst and sts. We moved para first and trained the other two afterwards.
-
-Furthermore, all 3 datasets are learned one after another. This means that the gradients may point in 3 different directions which we follow one after another. However, our goal is to move in the general direction for all 3 tasks together. We tried splitting the datasets into 6 different chunks: (large para), (tiny sst, tiny para), (sts_size sts, sts_size para, sts_size sst). The important point is that the last 3 chunks have the same size. Thus we can train all tasks without having para dominate the others.
-
-Lastly, we tried training the batches for the last 3 steps in a round robin way (sts, para, sst, sts, para, sst, ...).
-
-| Model name | SST accuracy | QQP accuracy | STS correlation |
-| -------------------------- | ------------ | ------------ | --------------- |
-| sBERT-BertSelfAttention (baseline) | 44.6% | 77.2% | 48.3% |
-| sBERT-ReorderedTraining (BertSelfAttention) | 45.9% | 79.3% | 49.8% |
-| sBERT-RoundRobinTraining (BertSelfAttention) | 45.5% | 77.5% | 50.3% |
-
-We used the same script as for the custom attention, but only used the original self-attention. The reordered training is enabled by default because it gave the best performance. The round robin training can be enabled using the ``--cyclic_finetuning`` flag.
-
-```
-python -B multitask_classifier.py --use_gpu --epochs=10 --lr=1e-5 --cyclic_finetuning=True
-```
-
-The reordering improved the performance, most likely just because para comes first. The round robin did not improve it further; maybe switching tasks after every batch is too frequent.
-
+- At this step we consider a specific order of batches by splitting the datasets and arranging them in a specific order (see https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/issues/59).
+- The idea works: we gain at least 1% more accuracy on each task.
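
For illustration, a minimal sketch of the round-robin schedule, assuming three dataloaders of comparable length; the helper and the commented usage are hypothetical, not the repository's training loop:

```python
from typing import Any, Iterable, Iterator, Tuple

def round_robin(sts_loader: Iterable,
                para_loader: Iterable,
                sst_loader: Iterable) -> Iterator[Tuple[str, Any]]:
    """Yield (task, batch) pairs alternating sts, para, sst, sts, para, sst, ...
    zip() stops at the shortest loader, which keeps the three tasks balanced."""
    for sts_batch, para_batch, sst_batch in zip(sts_loader, para_loader, sst_loader):
        yield "sts", sts_batch
        yield "para", para_batch
        yield "sst", sst_batch

# Usage sketch: switch the task after every single batch.
# for task, batch in round_robin(sts_loader, para_loader, sst_loader):
#     loss = compute_loss(task, batch)  # hypothetical per-task loss function
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```
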
### Combined Loss

This could work as a kind of regularization, because the model does not train on a single task and overfit to it, but uses all losses together for optimization.
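
A minimal sketch of such a combined objective, assuming the three per-task losses have already been computed; the function and the equal default weights are illustrative, not the project's actual setup:

```python
import torch

def combined_loss(sst_loss: torch.Tensor,
                  para_loss: torch.Tensor,
                  sts_loss: torch.Tensor,
                  weights: tuple = (1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the task losses; a single backward pass then updates
    the shared encoder with gradient information from all three tasks."""
    w_sst, w_para, w_sts = weights
    return w_sst * sst_loss + w_para * para_loss + w_sts * sts_loss
```
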
@@ -572,28 +519,28 @@ This could be achieved be generating more (true) data from the datasets sst and
- Dropout and weight decay tuning for BERT (AdamW and Sophia)

## Member Contributions
-Dawor, Moataz: Generalisations on Custom Attention, Splitted and reordered batches, analysis_dataset
+Dawor, Moataz: Generalisations on Custom Attention, Splitted and reordered batches, analysis_dataset

Lübbers, Christopher L.: Part 1 complete; Part 2: sBERT, Tensorboard (metrics + profiler), sBERT-Baseline, SOPHIA, SMART, Optuna, sBERT-Optuna for Optimizer, Optuna for sBERT and BERT-SMART, Optuna for sBERT-regularization, sBERT with combined losses, sBERT with gradient surgery, README-Experiments for those tasks, README-Methodology, final model, AI usage card

-Niegsch, Lukas*: Generalisations on Custom Attention, Splitted and reordered batches, repository maintenance (merging, lfs, some code refactoring)
+Niegsch, Lukas*: Generalisations on Custom Attention, Splitted and reordered batches

-Schmidt, Finn Paul:
+Schmidt, Finn Paul: sBERT multitask training, Sophia dropout layers, Sophia separated paraphrasing training, Sophia weighted loss, Optuna study on the dropout and hyperparameters, BERT baseline Adam, BERT additional layers, error_analysis


## Submit commands

-Für sophia base mit optimierten parametern zu trainieren:
+To train the Sophia base model with optimized parameters run:
```
python -u multitask_classifier.py --use_gpu --option finetune --epochs 10 --comment "_sophia-chris_opt2" --batch_size 64 --optimizer "sophiag" --weight_decay_para 0.1267 --weight_decay_sst 0.2302 --weight_decay_sts 0.1384 --rho_para 0.0417 --rho_sst 0.0449 --rho_sts 0.0315 --lr_para 1e-5 --lr_sst 1e-5 --lr_sts 1e-5
```

-Für sophia mit optimierten parametern und dropout layern:
+To train the Sophia model with dropout layers and optimized hyperparameters run:
```
python -u multitask_classifier.py --use_gpu --option finetune --optimizer "sophiag" --epochs 10 --hidden_dropout_prob_para 0.15 --hidden_dropout_prob_sst 0.052 --hidden_dropout_prob_sts 0.22 --lr_para 1.8e-05 --lr_sst 5.6e-06 --lr_sts 1.1e-05 --weight_decay_para 0.038 --weight_decay_sst 0.17 --weight_decay_sts 0.22 --comment individual_dropout
```

-Für sophia mit gewichtetem loss und para datenset als erstes ein paar epochen zu trainieren (ohne dropout):
+To train the Sophia model with weighted loss and separate paraphrasing training run:
```
python -u multitask_classifier.py --use_gpu --option finetune --optimizer "sophiag" --epochs 5 --weights True --para_sep True --hidden_dropout_prob_para 0 --hidden_dropout_prob_sst 0 --hidden_dropout_prob_sts 0 --lr_para 1e-05 --lr_sst 1e-05 --lr_sts 1e-05 --batch_size 64 --optimizer "sophiag" --weight_decay_para 0.1267 --weight_decay_sst 0.2302 --weight_decay_sts 0.1384 --rho_para 0.0417 --rho_sst 0.0449 --rho_sts 0.0315 --comment weighted_loss_without_dropout
```
