The 124M GPT-2 model used in chapter 6, starting with the pretrained weights and finetuning all weights:
```
Test accuracy: 91.88%
```
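For illustration, "finetuning all weights" simply means loading the pretrained checkpoint and leaving every parameter trainable. The snippet below is a minimal sketch using the Hugging Face `transformers` API rather than the chapter 6 code; the `gpt2` checkpoint name, the 2-label head, and the learning rate are assumptions for this example.

```python
# Minimal sketch (not the chapter 6 code): full finetuning of GPT-2 for binary
# classification with Hugging Face transformers. The "gpt2" checkpoint, the
# 2-label head, and the learning rate are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
model.train()

# "Finetuning all weights" = every parameter stays trainable (the default)
assert all(p.requires_grad for p in model.parameters())

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def train_step(texts, labels):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

print(train_step(["a great movie", "terribly boring"], [1, 0]))  # toy step
```

Because nothing is frozen, every parameter receives gradient updates, which is what "finetuning all weights" refers to above.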
<br>

---

<br>

### 2) 340M BERT
A 340M parameter encoder-style [BERT](https://arxiv.org/abs/1810.04805) model:
```
Test accuracy: 90.89%
```

<br>

---

<br>

### 3) 66M DistilBERT
A 66M parameter encoder-style [DistilBERT](https://arxiv.org/abs/1910.01108) model (distilled down from a 340M parameter BERT model), starting from the pretrained weights and training only the last transformer block plus output layers:
```
Test accuracy: 91.40%
```
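The "last transformer block plus output layers" recipe amounts to freezing the model and then re-enabling gradients for just the final encoder block and the classification head. The snippet below sketches this with the Hugging Face `distilbert-base-uncased` checkpoint and its attribute names; it is an illustration, not the exact training script behind the numbers above.

```python
# Illustrative sketch: freeze DistilBERT, then unfreeze only the last
# transformer block and the output (classification) layers. The checkpoint
# name and attribute paths follow Hugging Face's DistilBERT classes.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# 1) Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# 2) Unfreeze the last of the transformer blocks ...
for param in model.distilbert.transformer.layer[-1].parameters():
    param.requires_grad = True

# 3) ... plus the output layers (pre-classifier projection and classifier head)
for param in model.pre_classifier.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```

Only the unfrozen parameters are then passed through the optimizer updates, which keeps training cheap compared to full finetuning.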
<br>

---

<br>

### 4) 355M RoBERTa
A 355M parameter encoder-style [RoBERTa](https://arxiv.org/abs/1907.11692) model, starting from the pretrained weights and training only the last transformer block plus output layers:
```
Ep 1 (Step 000000): Train loss 0.695, Val loss 0.698
Ep 1 (Step 000050): Train loss 0.670, Val loss 0.690
...
Ep 1 (Step 004300): Train loss 0.083, Val loss 0.098
Ep 1 (Step 004350): Train loss 0.170, Val loss 0.086
Training accuracy: 98.12% | Validation accuracy: 96.88%
Training completed in 11.22 minutes.

Evaluating on the full datasets ...

Training accuracy: 96.23%
Validation accuracy: 94.52%
Test accuracy: 94.69%
```
<br>

---

<br>

### 5) 304M DeBERTa-v3
A 304M parameter encoder-style [DeBERTa-v3](https://arxiv.org/abs/2111.09543) model. DeBERTa-v3 improves upon earlier versions with disentangled attention and improved position encoding.
```
Ep 1 (Step 000000): Train loss 0.689, Val loss 0.694
Ep 1 (Step 000050): Train loss 0.673, Val loss 0.683
...
Ep 1 (Step 004300): Train loss 0.126, Val loss 0.149
Ep 1 (Step 004350): Train loss 0.211, Val loss 0.138
Training accuracy: 92.50% | Validation accuracy: 94.38%
...
Test accuracy: 92.95%
```

<br>

---

<br>

### 6) 149M ModernBERT Base
[ModernBERT (2024)](https://arxiv.org/abs/2412.13663) is an optimized reimplementation of BERT that incorporates architectural improvements like parallel residual connections and gated linear units (GLUs) to boost efficiency and performance. It maintains BERT’s original pretraining objectives while achieving faster inference and better scalability on modern hardware.
```
Test accuracy: 93.79%
```
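As a side note on the gated linear units mentioned above: a GLU-style feed-forward layer splits one projection into a "value" and a "gate" and multiplies them. The following is a generic GeGLU sketch to illustrate the idea, with placeholder dimensions; it is not ModernBERT's actual implementation.

```python
# Generic GLU-style (GeGLU) feed-forward sketch to illustrate the "gated linear
# units" idea mentioned above; not ModernBERT's actual implementation, and the
# dimensions below are placeholder values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLUFeedForward(nn.Module):
    def __init__(self, d_model=768, d_hidden=2048):
        super().__init__()
        self.wi = nn.Linear(d_model, 2 * d_hidden)  # produces value and gate halves
        self.wo = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        value, gate = self.wi(x).chunk(2, dim=-1)
        return self.wo(value * F.gelu(gate))  # GELU-activated gate modulates the value path

x = torch.randn(1, 4, 768)           # (batch, sequence, hidden)
print(GEGLUFeedForward()(x).shape)   # torch.Size([1, 4, 768])
```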
<br>

---

<br>

### 7) 395M ModernBERT Large
Same as above but using the larger ModernBERT variant.
```
Test accuracy: 95.07%
```

<br>

---

<br>

### 8) Logistic Regression Baseline
A scikit-learn [logistic regression](https://sebastianraschka.com/blog/2022/losses-learned-part1.html) classifier as a baseline:
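As a rough sketch of such a baseline (the feature extraction and toy data below are assumptions, not necessarily the exact setup behind the reported numbers), one can combine a bag-of-words vectorizer with scikit-learn's `LogisticRegression`:

```python
# Sketch of a bag-of-words + logistic regression baseline with scikit-learn.
# The vectorizer and the toy data are illustrative assumptions, not necessarily
# the exact setup behind the baseline numbers reported here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["a great movie", "terribly boring film"]  # toy stand-in data
train_labels = [1, 0]                                     # 1 = positive, 0 = negative

baseline = make_pipeline(
    CountVectorizer(),                  # token counts as features
    LogisticRegression(max_iter=1000),  # linear classifier on top
)
baseline.fit(train_texts, train_labels)

print(baseline.predict(["what a great film"]))  # class prediction for a new review
```

Such a linear model over word counts trains in seconds and provides a useful reference point for the transformer results above.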