
Commit c278745

DeBERTa-v3 baseline (#630)

* Llama3 from scratch improvements
* deberta-baseline
* restore

1 parent 4ff7430 commit c278745

File tree

2 files changed (+55, -20 lines)


ch06/03_bonus_imdb-classification/README.md

Lines changed: 48 additions & 14 deletions
@@ -10,13 +10,14 @@ This folder contains additional experiments to compare the (decoder-style) GPT-2
 
 | | Model | Test accuracy |
 | ----- | ---------------------------- | ------------- |
-| **1** | 124 M GPT-2 Baseline | 91.88% |
-| **2** | 340 M BERT | 90.89% |
-| **3** | 66 M DistilBERT | 91.40% |
-| **4** | 355 M RoBERTa | 92.95% |
-| **5** | 149 M ModernBERT Base | 93.79% |
-| **6** | 395 M ModernBERT Large | 95.07% |
-| **7** | Logistic Regression Baseline | 88.85% |
+| **1** | 124M GPT-2 Baseline | 91.88% |
+| **2** | 340M BERT | 90.89% |
+| **3** | 66M DistilBERT | 91.40% |
+| **4** | 355M RoBERTa | 92.95% |
+| **5** | 304M DeBERTa-v3 | 94.69% |
+| **6** | 149M ModernBERT Base | 93.79% |
+| **7** | 395M ModernBERT Large | 95.07% |
+| **8** | Logistic Regression Baseline | 88.85% |
 
@@ -48,7 +49,7 @@ python download_prepare_dataset.py
 ## Step 3: Run Models
 
 &nbsp;
-### 1) 124 M GPT-2 Baseline
+### 1) 124M GPT-2 Baseline
 
 The 124M GPT-2 model used in chapter 6, starting with pretrained weights, and finetuning all weights:
 
@@ -80,7 +81,7 @@ Test accuracy: 91.88%
 <br>
 
 &nbsp;
-### 2) 340 M BERT
+### 2) 340M BERT
 
 
 A 340M parameter encoder-style [BERT](https://arxiv.org/abs/1810.04805) model:
@@ -112,7 +113,7 @@ Test accuracy: 90.89%
 <br>
 
 &nbsp;
-### 3) 66 M DistilBERT
+### 3) 66M DistilBERT
 
 A 66M parameter encoder-style [DistilBERT](https://arxiv.org/abs/1910.01108) model (distilled down from a 340M parameter BERT model), starting from the pretrained weights and only training the last transformer block plus output layers:
 
@@ -144,7 +145,7 @@ Test accuracy: 91.40%
 <br>
 
 &nbsp;
-### 4) 355 M RoBERTa
+### 4) 355M RoBERTa
 
 A 355M parameter encoder-style [RoBERTa](https://arxiv.org/abs/1907.11692) model, starting from the pretrained weights and only training the last transformer block plus output layers:
 
@@ -157,6 +158,38 @@ python train_bert_hf.py --trainable_layers "last_block" --num_epochs 1 --model "
 Ep 1 (Step 000000): Train loss 0.695, Val loss 0.698
 Ep 1 (Step 000050): Train loss 0.670, Val loss 0.690
 ...
+Ep 1 (Step 004300): Train loss 0.083, Val loss 0.098
+Ep 1 (Step 004350): Train loss 0.170, Val loss 0.086
+Training accuracy: 98.12% | Validation accuracy: 96.88%
+Training completed in 11.22 minutes.
+
+Evaluating on the full datasets ...
+
+Training accuracy: 96.23%
+Validation accuracy: 94.52%
+Test accuracy: 94.69%
+```
+
+<br>
+
+---
+
+<br>
+
+&nbsp;
+### 5) 304M DeBERTa-v3
+
+A 304M parameter encoder-style [DeBERTa-v3](https://arxiv.org/abs/2111.09543) model. DeBERTa-v3 improves upon earlier versions with disentangled attention and improved position encoding.
+
+
+```bash
+python train_bert_hf.py --trainable_layers "all" --num_epochs 1 --model "deberta-v3-base"
+```
+
+```
+Ep 1 (Step 000000): Train loss 0.689, Val loss 0.694
+Ep 1 (Step 000050): Train loss 0.673, Val loss 0.683
+...
 Ep 1 (Step 004300): Train loss 0.126, Val loss 0.149
 Ep 1 (Step 004350): Train loss 0.211, Val loss 0.138
 Training accuracy: 92.50% | Validation accuracy: 94.38%
@@ -176,8 +209,9 @@ Test accuracy: 92.95%
 <br>
 
 
+
 &nbsp;
-### 5) 149 M ModernBERT Base
+### 6) 149M ModernBERT Base
 
 [ModernBERT (2024)](https://arxiv.org/abs/2412.13663) is an optimized reimplementation of BERT that incorporates architectural improvements like parallel residual connections and gated linear units (GLUs) to boost efficiency and performance. It maintains BERT’s original pretraining objectives while achieving faster inference and better scalability on modern hardware.
 
@@ -211,7 +245,7 @@ Test accuracy: 93.79%
 
 
 &nbsp;
-### 6) 395 M ModernBERT Large
+### 7) 395M ModernBERT Large
 
 Same as above but using the larger ModernBERT variant.
 
@@ -248,7 +282,7 @@ Test accuracy: 95.07%
 <br>
 
 &nbsp;
-### 7) Logistic Regression Baseline
+### 8) Logistic Regression Baseline
 
 A scikit-learn [logistic regression](https://sebastianraschka.com/blog/2022/losses-learned-part1.html) classifier as a baseline:
 
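The new README section above finetunes a stock Hugging Face checkpoint for binary sentiment classification. For orientation, here is a minimal, self-contained sketch (not part of the commit) that loads the same `microsoft/deberta-v3-base` checkpoint with a two-class head and runs a single review through it. Only the model/tokenizer names come from the diff; the review text, `max_length`, and everything else are illustrative, and the prediction is meaningless until the head is finetuned (e.g., via the `train_bert_hf.py` command shown above).

```python
# Sketch only: load the DeBERTa-v3 checkpoint referenced in the diff with a
# binary classification head, as the updated train_bert_hf.py does before finetuning.
# Requires: torch, transformers, sentencepiece.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2  # 2 classes: negative/positive
)
model.eval()

# Illustrative review text; any string works here
inputs = tokenizer(
    "This movie was a complete waste of time.",
    return_tensors="pt",
    truncation=True,
    max_length=256,  # illustrative; the training script sets its own context length
)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 2)

# Before finetuning, the classification head is randomly initialized,
# so this output only demonstrates the API, not a real sentiment prediction.
print("Predicted class index:", logits.argmax(dim=-1).item())
```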

ch06/03_bonus_imdb-classification/train_bert_hf.py

Lines changed: 7 additions & 6 deletions
@@ -197,7 +197,7 @@ def train_classifier_simple(model, train_loader, val_loader, optimizer, device,
     type=str,
     default="distilbert",
     help=(
-        "Which model to train. Options: 'distilbert', 'bert', 'roberta', 'modernbert-base/-large'."
+        "Which model to train. Options: 'distilbert', 'bert', 'roberta', 'modernbert-base/-large', 'deberta-v3-base'."
     )
 )
 parser.add_argument(
@@ -330,11 +330,10 @@ def train_classifier_simple(model, train_loader, val_loader, optimizer, device,
 
         tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
 
-    elif args.model == "modernbert-base":
+    elif args.model == "deberta-v3-base":
         model = AutoModelForSequenceClassification.from_pretrained(
-            "answerdotai/ModernBERT-base", num_labels=2
+            "microsoft/deberta-v3-base", num_labels=2
         )
-        print(model)
         model.classifier = torch.nn.Linear(in_features=768, out_features=2)
         for param in model.parameters():
             param.requires_grad = False
@@ -344,15 +343,17 @@ def train_classifier_simple(model, train_loader, val_loader, optimizer, device,
         elif args.trainable_layers == "last_block":
             for param in model.classifier.parameters():
                 param.requires_grad = True
-            for param in model.layers.layer[-1].parameters():
+            for param in model.pooler.parameters():
+                param.requires_grad = True
+            for param in model.deberta.encoder.layer[-1].parameters():
                 param.requires_grad = True
         elif args.trainable_layers == "all":
             for param in model.parameters():
                 param.requires_grad = True
         else:
             raise ValueError("Invalid --trainable_layers argument.")
 
-        tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
+        tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
 
     else:
         raise ValueError(f"Selected --model {args.model} not supported.")
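Read together, the three hunks above add a `deberta-v3-base` branch to the script's model-selection chain: load the pretrained checkpoint, swap in a fresh 2-class head, freeze all parameters, and then unfreeze either the head plus pooler plus last encoder block (`last_block`) or everything (`all`). Below is a minimal sketch of that branch assembled from the hunks; the wrapping helper function is hypothetical (the script itself uses a flat `if/elif` chain driven by command-line arguments), and any `--trainable_layers` options other than `"last_block"` and `"all"` are omitted because the diff does not show them.

```python
# Sketch only: the deberta-v3-base branch of train_bert_hf.py assembled from the
# hunks above, wrapped in a hypothetical helper for readability.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def build_deberta_v3(trainable_layers: str):
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=2
    )
    # Fresh 2-class output head (DeBERTa-v3-base uses a 768-dim hidden size)
    model.classifier = torch.nn.Linear(in_features=768, out_features=2)

    # Freeze everything first, then selectively unfreeze below
    for param in model.parameters():
        param.requires_grad = False

    if trainable_layers == "last_block":
        # Unfreeze the new head, the context pooler, and the last encoder block
        for param in model.classifier.parameters():
            param.requires_grad = True
        for param in model.pooler.parameters():
            param.requires_grad = True
        for param in model.deberta.encoder.layer[-1].parameters():
            param.requires_grad = True
    elif trainable_layers == "all":
        for param in model.parameters():
            param.requires_grad = True
    else:
        raise ValueError("Invalid trainable_layers argument.")

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    return model, tokenizer


# The README command above uses --trainable_layers "all"
model, tokenizer = build_deberta_v3("all")
```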
