
Commit c14265a: Update README.md
1 parent 773bce1

README.md: 60 additions & 10 deletions

@@ -298,15 +298,66 @@ Tensorboard: Aug25_09-53-27_ggpu137shared
| sBERT-Shared similarity | 50.14 % | 71.08 % | 47.68 % |

### Custom Attention
[Generalisations on Custom Attention](https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/milestones/11#tab-issues)

We tried three generalisations of the standard BERT self-attention formula, each controlled by a few extra hyperparameters (see https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/issues/54):

1) Generalize $QK^T$ into a symmetric linear combination of $Q$ and $K$ and learn the combination:

$$attention(Q, K, V) = softmax\left(\frac{(\alpha_1 Q + \alpha_2 K + \alpha_3 I)(\beta_1 Q + \beta_2 K + \beta_3 I)^T}{\sqrt{d_k}}\right)V$$

2) Replace softmax with sparsemax (see https://arxiv.org/abs/1602.02068v2):

$$attention(Q, K, V) = sparsemax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

3) Add an additional learnable center matrix in between:

$$attention(Q, K, V) = softmax\left(\frac{QWK^T}{\sqrt{d_k}}\right)V$$

For ideas 1 and 3, specific parameter values recover the original self-attention; idea 2 was proposed in the sparsemax paper linked above. The goal was to let the model keep using the original parameters while gaining more freedom to manipulate them through a few extra parameters inside every BERT layer. We later realized that all three ideas can be combined, resulting in 8 different models (1 baseline + 7 extra); a code sketch of the combined block follows below.
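
The actual variants live in the repository's own attention modules (the model names in the table below); purely as an illustration, here is a minimal PyTorch sketch of how the three generalisations could be combined in a single self-attention block. The class name, the parameter initialisation, and the reading of the $I$ term as a rectangular identity are our assumptions, not the project's code.

```python
import math
import torch
import torch.nn as nn


def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparsemax along the last dimension (Martins & Astudillo, 2016)."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    z_cumsum = z_sorted.cumsum(dim=-1)
    support = 1 + k * z_sorted > z_cumsum          # which sorted entries stay non-zero
    k_z = support.sum(dim=-1, keepdim=True)        # size of the support
    tau = (z_cumsum.gather(-1, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)


class GeneralizedSelfAttention(nn.Module):
    """Sketch combining ideas 1-3; initialised so that it starts out identical
    to ordinary scaled dot-product self-attention."""

    def __init__(self, hidden_size: int, num_heads: int,
                 use_center_matrix: bool = False, use_sparsemax: bool = False):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads, self.head_dim = num_heads, hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        # Idea 1: learnable coefficients; (1, 0, 0) and (0, 1, 0) recover plain QK^T.
        self.alpha = nn.Parameter(torch.tensor([1.0, 0.0, 0.0]))
        self.beta = nn.Parameter(torch.tensor([0.0, 1.0, 0.0]))
        # Idea 3: learnable center matrix W, initialised to the identity.
        self.center = nn.Parameter(torch.eye(self.head_dim)) if use_center_matrix else None
        self.use_sparsemax = use_sparsemax  # idea 2

    def _split(self, x):  # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
        b, t, _ = x.shape
        return x.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        q, k, v = (self._split(f(hidden_states))
                   for f in (self.query, self.key, self.value))
        # The I term is read here as a (seq_len x head_dim) rectangular identity.
        eye = torch.eye(q.size(-2), q.size(-1), device=q.device, dtype=q.dtype)
        left = self.alpha[0] * q + self.alpha[1] * k + self.alpha[2] * eye
        right = self.beta[0] * q + self.beta[1] * k + self.beta[2] * eye
        if self.center is not None:
            left = left @ self.center                  # gives the QWK^T variant
        scores = left @ right.transpose(-2, -1) / math.sqrt(self.head_dim)
        probs = sparsemax(scores) if self.use_sparsemax else torch.softmax(scores, dim=-1)
        context = probs @ v                            # (batch, heads, seq, head_dim)
        return context.transpose(1, 2).flatten(2)      # back to (batch, seq, hidden)
```

With `use_center_matrix=False`, `use_sparsemax=False` and the default $\alpha, \beta$, the block reduces to the usual $softmax(QK^T / \sqrt{d_k})V$.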

| Model name | SST accuracy | QQP accuracy | STS correlation |
| -------------------------- | ------------ | ------------ | --------------- |
| sBERT-BertSelfAttention (baseline) | 44.6% | 77.2% | 48.3% |
| sBERT-LinearSelfAttention | 40.5% | 75.6% | 37.8% |
| sBERT-NoBiasLinearSelfAttention | 40.5% | 75.6% | 37.8% |
| sBERT-SparsemaxSelfAttention | 39.0% | 70.7% | 56.8% |
| sBERT-CenterMatrixSelfAttention | 39.1% | 76.4% | 43.4% |
| sBERT-LinearSelfAttentionWithSparsemax | 40.1% | 75.3% | 40.8% |
| sBERT-CenterMatrixSelfAttentionWithSparsemax | 39.1% | 75.6% | 40.4% |
| sBERT-CenterMatrixLinearSelfAttention | 42.4% | 76.2% | 42.4% |
| sBERT-CenterMatrixLinearSelfAttentionWithSparsemax | 39.7% | 76.4% | 39.2% |

Our baseline here differs from the results reported above because we used different starting parameters (larger batch size, fewer parameters) to reduce the training time for this experiment; see also ``submit_custom_attention.sh``:

```
python -B multitask_classifier.py --use_gpu --epochs=10 --lr=1e-5 --custom_attention=$CUSTOM_ATTENTION
```

Except for the STS correlation of sBERT-SparsemaxSelfAttention, all values declined. This is most likely due to overfitting: adding even more parameters makes the model more complex, which makes the overfitting worse and therefore lowers the scores.

### Split and reordered batches
[Split and reordered batches](https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/milestones/12#tab-issues)

We train on the batches in a specific order by splitting the datasets into chunks (see https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/issues/59). The para dataset is much larger than the other two. Originally, we trained on para last and then evaluated all three tasks independently of each other. As a result, the model is optimized towards para but forgets information from sst and sts. We therefore moved para to the front and trained on the other two datasets last.

Furthermore, all three datasets are learned one after another, so the gradients may point in three different directions that we follow one after another, whereas our goal is to move in a direction that helps all three tasks together. We therefore tried splitting the datasets into six chunks: (large para), (tiny sst, tiny para), (sts_size sts, sts_size para, sts_size sst). The important point is that the last three chunks have the same size, so all tasks can be trained without para dominating the others; a sketch of such a schedule follows below.
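
The actual chunking lives in the training code; the helper below is only an illustrative sketch (the chunk boundaries, the `tiny` size, and all names are our assumptions):

```python
# Illustrative only: build the reordered chunk schedule described above.
from torch.utils.data import Subset

def reordered_chunks(sst_ds, para_ds, sts_ds):
    """Yield (task, dataset_chunk) pairs: the big para chunk first, two tiny
    warm-up chunks next, and three equally sized chunks (sts_size each) last."""
    sts_size = len(sts_ds)          # the smallest dataset sets the final chunk size
    tiny = sts_size // 4            # assumed size of the tiny warm-up chunks

    def piece(ds, start, length):   # contiguous slice of a dataset
        stop = min(start + length, len(ds))
        return Subset(ds, range(start, stop))

    yield "para", piece(para_ds, 0, len(para_ds) - tiny - sts_size)      # (large para)
    yield "sst", piece(sst_ds, 0, tiny)                                  # (tiny sst, tiny para)
    yield "para", piece(para_ds, len(para_ds) - tiny - sts_size, tiny)
    yield "sts", piece(sts_ds, 0, sts_size)                              # (sts, para, sst),
    yield "para", piece(para_ds, len(para_ds) - sts_size, sts_size)      #  all of size sts_size
    yield "sst", piece(sst_ds, tiny, sts_size)
```

A training loop would wrap each chunk in a `DataLoader` and run the usual task-specific loss over the chunks in this order.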

Lastly, we tried training on the batches of the three final chunks in a round-robin fashion (sts, para, sst, sts, para, sst, ...).
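
Interleaving the three equally sized chunks can be expressed with a plain `zip`; this helper is a sketch of the idea, not the repository's ``--cyclic_finetuning`` implementation:

```python
def round_robin(sts_loader, para_loader, sst_loader):
    """Yield (task, batch) pairs in the order sts, para, sst, sts, para, sst, ...
    zip stops at the shortest loader, which is acceptable here because the
    three final chunks have the same size by construction."""
    for sts_batch, para_batch, sst_batch in zip(sts_loader, para_loader, sst_loader):
        yield "sts", sts_batch
        yield "para", para_batch
        yield "sst", sst_batch
```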

| Model name | SST accuracy | QQP accuracy | STS correlation |
| -------------------------- | ------------ | ------------ | --------------- |
| sBERT-BertSelfAttention (baseline) | 44.6% | 77.2% | 48.3% |
| sBERT-ReorderedTraining (BertSelfAttention) | 45.9% | 79.3% | 49.8% |
| sBERT-RoundRobinTraining (BertSelfAttention) | 45.5% | 77.5% | 50.3% |

We used the same script as for the custom attention experiments, but only with the original self-attention. The reordered training is enabled by default because it gave the best performance; the round-robin training can be enabled with the ``--cyclic_finetuning`` flag:

```
python -B multitask_classifier.py --use_gpu --epochs=10 --lr=1e-5 --cyclic_finetuning=True
```

The reordering improved every metric by at least one percentage point, most likely just because para now comes first. The round-robin training did not improve the results further; perhaps switching tasks after every batch is too frequent.

### Combined Loss

This could work as a kind of regularization: instead of training on a single task and overfitting to it, the model uses all losses together for optimization.
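
As a sketch only (the weights and names below are placeholders, not the repository's implementation), the combined loss is a weighted sum of the three task losses that is backpropagated once:

```python
import torch

def combined_loss(sst_loss: torch.Tensor, para_loss: torch.Tensor,
                  sts_loss: torch.Tensor, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the task losses; a single backward() call then updates
    the shared BERT encoder with gradients from all three tasks at once."""
    w_sst, w_para, w_sts = weights
    return w_sst * sst_loss + w_para * para_loss + w_sts * sts_loss

# Inside a training step (sketch):
#   loss = combined_loss(sst_loss, para_loss, sts_loss)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```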
@@ -497,20 +548,19 @@ Our model achieves the following performance:

>📋 Include a table of results from your paper, and link back to the leaderboard for clarity and context. If your main result is a figure, include that figure and link to the command or notebook to reproduce it.

## Future work
- Since the para dataset is huge compared to the sst and sts datasets, which leads to overfitting, enlarging the sst and sts datasets should reduce the chance of overfitting. This could be achieved by generating more (true) data from the sst and sts datasets, which is possible by adding another auxiliary task; see issue #60 for more information.
- Give the individual task losses different weights:
  - with or without combined losses,
  - maybe based on the dev accuracy from the previous epoch (see the sketch after this list).
- Implement SMART for BERT-STS.
- Tune dropout and weight decay for BERT (AdamW and Sophia).
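
For the dev-accuracy-based weighting idea, one possible scheme (hypothetical, not implemented in the repository) is to weight each task by its remaining error from the previous epoch:

```python
def loss_weights_from_dev_acc(dev_acc: dict) -> dict:
    """Map each task's previous-epoch dev accuracy (in [0, 1]) to a loss weight.
    Lower accuracy -> larger weight; weights are normalised to sum to the number
    of tasks so that the overall loss scale stays roughly constant."""
    raw = {task: 1.0 - acc for task, acc in dev_acc.items()}
    total = sum(raw.values()) or 1.0
    return {task: len(raw) * value / total for task, value in raw.items()}

# Example: dev accuracies sst=0.45, para=0.77, sts=0.50 from the previous epoch
# give weights of roughly sst=1.29, para=0.54, sts=1.17 for the next epoch.
```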

## Member Contributions
Dawor, Moataz: Generalisations on Custom Attention, Split and reordered batches, analysis_dataset

Lübbers, Christopher L.: Part 1 complete; Part 2: sBERT, Tensorboard (metrics + profiler), sBERT-Baseline, SOPHIA, SMART, Optuna, sBERT-Optuna for Optimizer, Optuna for sBERT and BERT-SMART, Optuna for sBERT-regularization, sBERT with combined losses, sBERT with gradient surgery, README-Experiments for those tasks, README-Methodology, final model, AI usage card

Niegsch, Lukas*: Generalisations on Custom Attention, Split and reordered batches, repository maintenance (merging, lfs, some code refactoring)

Schmidt, Finn Paul: sBERT multi_task training, Sophia dropout layers, Sophia separated paraphrasing training, Sophia weighted loss, Optuna study on the dropout and hyperparameters, BERT baseline Adam, BERT additional layers, error_analysis
