| Sophia Tuned standard lr | 78,8 % | 47,6 % | 36,7 % |
| Sophia balanced data | 81,8 % | 47,8 % | 45,5 % |
Use the same command as in the Tuning Sophia section (with standard learning rate and no dropout) and add the arguments ```--para_sep True --weights True``` to reproduce the results.
This approach improved the performance on the paraphrasing task by ..., but we lost a few percentage points on the other tasks. We conclude that, on the one hand, training on the QQP dataset first helps to extract more information from this huge dataset, but on the other hand the three tasks seem to conflict with each other.
#### Tackle imbalanced data
The distribution of the classes in the SST dataset is not equal (class one contains more than twice as many samples as class zero). As the confusion matrix of our model, trained as in the previous section, shows, many datapoints from class zero are falsely predicted as class one (the same problem occurs with classes five and four).
To balance the QQP and SST training sets, we add weights to our cross-entropy loss function such that a training sample from a small class is assigned a higher weight. This resulted in the following performance:
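The exact weighting scheme is not spelled out above; a common choice, sketched here under that assumption, is inverse class frequency (the helper name is ours, not the repository's):

```python
from collections import Counter

def class_weights(labels, num_classes):
    """Inverse-frequency weights: samples from rare classes count more.

    Normalised so that a perfectly balanced dataset yields weight 1.0
    for every class. Hypothetical helper, not taken from this repository.
    """
    counts = Counter(labels)
    return [len(labels) / (num_classes * counts[c]) for c in range(num_classes)]

# Toy labels: class 1 has twice as many samples as class 0,
# so class 0 gets twice the weight (1.5 vs 0.75).
weights = class_weights([0, 1, 1, 0, 1, 1], num_classes=2)
```

Such a weight vector can then be passed to the loss, e.g. `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))`.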
| Model name | SST accuracy | QQP accuracy | STS correlation |
| --- | --- | --- | --- |
| Sophia balanced data | 81,8 % | 47,8 % | 45,5 % |

Use the same command as in the previous section and add the arguments ```--para_sep True --weights True``` for reproducing the results.
With this approach we could improve the performance on the SST dataset by ... compared to the last section.
### AdamW
...
#### Additional layers
Another problem we observed earlier was that the tasks conflict with each other, i.e. with separated QQP training the paraphrasing accuracy increased, but the other two accuracies decreased. We try to resolve these conflicts by adding a small neural network with one hidden layer as the classifier for each task instead of only a linear classifier. The idea is that each task gets more parameters to adjust which are not influenced by the other tasks. As activation function between the hidden layer and the output we tested ReLU and tanh layers; the ReLU activation function performed better. Furthermore, we tried to freeze the BERT parameters in the last training epochs and only train the classifier parameters. This improved the performance, especially on the SST dataset.
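To illustrate the idea, here is a minimal sketch of such per-task heads and of the freezing step, assuming PyTorch, a 768-dimensional BERT embedding, and five SST classes (all class and helper names here are ours, not the repository's):

```python
import torch
import torch.nn as nn

HIDDEN = 256  # hypothetical hidden size of the task-specific head

class MultitaskHeads(nn.Module):
    """One small MLP head per task on top of the shared encoder output."""

    def __init__(self, bert_dim=768, n_sst_classes=5):
        super().__init__()

        def head(out_dim):
            # hidden layer + ReLU instead of a single linear classifier
            return nn.Sequential(nn.Linear(bert_dim, HIDDEN),
                                 nn.ReLU(),
                                 nn.Linear(HIDDEN, out_dim))

        self.sst = head(n_sst_classes)  # sentiment classes
        self.qqp = head(1)              # paraphrase logit
        self.sts = head(1)              # similarity score

def freeze_encoder(encoder):
    # last epochs: keep BERT fixed, train only the classifier heads
    for p in encoder.parameters():
        p.requires_grad = False

heads = MultitaskHeads()
logits = heads.sst(torch.zeros(4, 768))  # batch of 4 pooled embeddings
```

Each head's extra parameters belong to exactly one task, so gradient updates from the other tasks cannot interfere with them.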
| Model name | SST accuracy | QQP accuracy | STS correlation |
| --- | --- | --- | --- |
| Adam additional layer | 50 % | 88,4 % | 84,4 % |
| Adam extra classifier training | 51,6 % | 88,5 % | 84,3 % |
Use the same command as in the previous section and add the arguments ```--para_sep True --weights True``` for reproducing the results. For the non-linear classifier with ReLU activation add the argument ```--add_layers```, and for freezing the BERT parameters in the last epochs add the argument ```--freeze_bert```.
We also tested some dropout and weight decay values, but they could not improve the performance. Furthermore, the weighted loss function, which improved the model's performance with the Sophia optimizer, did not help here.
### SMART
#### Implementation
Our model achieves the following performance:
## Future work
- Since the huge size of the para dataset compared to the sizes of the SST and STS datasets leads to overfitting, enlarging the SST and STS datasets should reduce the possibility of overfitting. This could be achieved by generating more (true) data from the SST and STS datasets, which is possible by adding another additional task (see issue #60 for more details).
- Give the other losses different weights:
  - with or without combined losses,
  - maybe based on the dev_acc performance in the previous epoch.
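One way that last idea could look (a sketch with hypothetical names, not an implemented feature): weight each task's loss by its accuracy gap on the dev set from the previous epoch, so the currently weakest task gets pushed hardest.

```python
def loss_weights_from_dev_acc(dev_accs):
    """Map task -> loss weight from the previous epoch's dev accuracies.

    Tasks with lower dev accuracy get a proportionally larger weight;
    weights are normalised to sum to the number of tasks, so equal
    accuracies give every task weight 1.0.
    """
    gaps = {task: 1.0 - acc for task, acc in dev_accs.items()}
    total = sum(gaps.values())
    if total == 0:  # every task already perfect: fall back to equal weights
        return {task: 1.0 for task in dev_accs}
    n = len(dev_accs)
    return {task: n * gap / total for task, gap in gaps.items()}

# The weakest task (sst here) receives the largest weight.
w = loss_weights_from_dev_acc({"sst": 0.5, "qqp": 0.9, "sts": 0.8})
```

The combined loss for the next epoch would then be `sum(w[t] * loss[t] for t in tasks)`.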
Lübbers, Christopher L.: Part 1 complete; Part 2: sBERT, Tensorboard (metrics +
Niegsch, Lukas*: Generalisations on Custom Attention, split and reordered batches,
Schmidt, Finn Paul: sBERT multitask training, Sophia dropout layers, Sophia separated paraphrasing training, Sophia weighted loss, Optuna study on the dropout and hyperparameters, BERT baseline Adam, BERT additional layers, error_analysis
## Submit commands
To train the Sophia base model with optimised parameters, run: