
Improve unit3 for rerelease #970

Merged
merged 10 commits into from
Jun 17, 2025
7 changes: 4 additions & 3 deletions chapters/en/_toctree.yml
Expand Up @@ -58,13 +58,14 @@
- local: chapter3/2
  title: Processing the data
- local: chapter3/3
  title: Fine-tuning a model with the Trainer API or Keras
  local_fw: { pt: chapter3/3, tf: chapter3/3_tf }
  title: Fine-tuning a model with the Trainer API
- local: chapter3/4
  title: A full training
  title: A full training loop
- local: chapter3/5
  title: Fine-tuning, Check!
- local: chapter3/6
  title: Understanding Learning Curves
- local: chapter3/7
  title: End-of-chapter quiz
  quiz: 3

41 changes: 28 additions & 13 deletions chapters/en/chapter3/1.mdx
Expand Up @@ -7,20 +7,35 @@
classNames="absolute z-10 right-0 top-0"
/>

In [Chapter 2](/course/chapter2) we explored how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pretrained model for your own dataset? That's the topic of this chapter! You will learn:
In [Chapter 2](/course/chapter2) we explored how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pretrained model to solve a specific task? That's the topic of this chapter! You will learn:

{#if fw === 'pt'}
* How to prepare a large dataset from the Hub
* How to use the high-level `Trainer` API to fine-tune a model
* How to use a custom training loop
* How to leverage the 🤗 Accelerate library to easily run that custom training loop on any distributed setup
* How to prepare a large dataset from the Hub using the latest 🤗 Datasets features
* How to use the high-level `Trainer` API to fine-tune a model with modern best practices
* How to implement a custom training loop with optimization techniques
* How to leverage the 🤗 Accelerate library to easily run distributed training on any setup
* How to apply current fine-tuning best practices for maximum performance

{:else}
* How to prepare a large dataset from the Hub
* How to use Keras to fine-tune a model
* How to use Keras to get predictions
* How to use a custom metric
<Tip>

{/if}
📚 **Essential Resources**: Before starting, you might want to review the [🤗 Datasets documentation](https://huggingface.co/docs/datasets/) for data processing.

In order to upload your trained checkpoints to the Hugging Face Hub, you will need a huggingface.co account: [create an account](https://huggingface.co/join)
</Tip>

This chapter will also serve as an introduction to some Hugging Face libraries beyond the 🤗 Transformers library! We'll see how libraries like 🤗 Datasets, 🤗 Tokenizers, 🤗 Accelerate, and 🤗 Evaluate can help you train models more efficiently and effectively.

Each of the main sections in this chapter will teach you something different:
- **Section 2**: Learn modern data preprocessing techniques and efficient dataset handling
- **Section 3**: Master the powerful Trainer API with all its latest features
- **Section 4**: Implement training loops from scratch and understand distributed training with Accelerate

By the end of this chapter, you'll be able to fine-tune models on your own datasets using both high-level APIs and custom training loops, applying the latest best practices in the field.

<Tip>

🎯 **What You'll Build**: By the end of this chapter, you'll have fine-tuned a BERT model for text classification and understand how to adapt the techniques to your own datasets and tasks.

</Tip>

This chapter focuses exclusively on **PyTorch**, as it has become the standard framework for modern deep learning research and production. We'll use the latest APIs and best practices from the Hugging Face ecosystem.

To upload your trained models to the Hugging Face Hub, you will need a Hugging Face account: [create an account](https://huggingface.co/join).
262 changes: 161 additions & 101 deletions chapters/en/chapter3/2.mdx

Large diffs are not rendered by default.

235 changes: 228 additions & 7 deletions chapters/en/chapter3/3.mdx
Expand Up @@ -11,7 +11,13 @@

<Youtube id="nvBXf7s7vTI"/>

🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset. Once you've done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/).
🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset with modern best practices. Once you've done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/).

<Tip>

📚 **Training Resources**: Before diving into training, familiarize yourself with the comprehensive [🤗 Transformers training guide](https://huggingface.co/docs/transformers/main/en/training) and explore practical examples in the [fine-tuning cookbook](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu).

</Tip>

The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need:

Expand Down Expand Up @@ -42,9 +48,11 @@ from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")
```

If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`. We will learn more about this in [Chapter 4](/course/chapter4/3).

<Tip>

💡 If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`. We will learn more about this in [Chapter 4](/course/chapter4/3)
🚀 **Advanced Configuration**: For detailed information on all available training arguments and optimization strategies, check out the [TrainingArguments documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) and the [training configuration cookbook](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu).

</Tip>

Expand All @@ -58,7 +66,7 @@ model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_label

You will notice that unlike in [Chapter 2](/course/chapter2), you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.

Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class` (e.g., a tokenizer, feature extractor, or processor):
Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class`. The `processing_class` parameter is a newer addition that tells the Trainer which tokenizer to use for processing:

```py
from transformers import Trainer
Expand All @@ -73,7 +81,13 @@ trainer = Trainer(
)
```

Note that when you pass a tokenizer as the `processing_class`, as we did here, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding` if the `processing_class` is a tokenizer or feature extractor, so you can skip the line `data_collator=data_collator` in this call. It was still important to show you this part of the processing in section 2!
When you pass a tokenizer as the `processing_class`, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding`. You can skip the `data_collator=data_collator` line in this case, but we included it here to show you this important part of the processing pipeline.
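
To make the role of the collator concrete, here is a minimal pure-Python sketch of dynamic padding, the idea behind `DataCollatorWithPadding`. The helper `pad_batch` and the token IDs are hypothetical and exist only for illustration. The real collator also handles tensors, labels, and other keys, so treat this as a conceptual model rather than the library's implementation.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]], pad_token_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Pad every sequence in a batch to the length of the longest one.

    Hypothetical helper: pad_token_id=0 is an assumption made for this sketch.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


# Two made-up tokenized examples of different lengths.
features = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
]
batch = pad_batch(features)
```

Because padding is computed per batch, each batch is only as long as its longest example, instead of every example being padded to a global maximum length.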

<Tip>

📖 **Learn More**: For comprehensive details on the Trainer class and its parameters, visit the [Trainer API documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer) and explore advanced usage patterns in the [training cookbook recipes](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu).

</Tip>

To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`:

Expand Down Expand Up @@ -123,6 +137,12 @@ metric.compute(predictions=preds, references=predictions.label_ids)
{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
```

<Tip>

Learn about different evaluation metrics and strategies in the [🤗 Evaluate documentation](https://huggingface.co/docs/evaluate/).

</Tip>

The exact results you get may vary, as the random initialization of the model head might change the metrics it achieved. Here, we can see our model has an accuracy of 85.78% on the validation set and an F1 score of 89.97. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. The table in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) reported an F1 score of 88.9 for the base model. That was the `uncased` model while we are currently using the `cased` model, which explains the better result.
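
As a rough illustration of what these two numbers measure, the sketch below computes accuracy and F1 by hand for a binary task. The function `accuracy_and_f1` and the toy logits and labels are assumptions made up for this example. In practice the 🤗 Evaluate metric does this work for you.

```python
from typing import Dict, List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit, i.e. the predicted class."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(
    logits: List[Sequence[float]], labels: List[int]
) -> Dict[str, float]:
    """Accuracy and F1 (for the positive class, label 1) from raw logits."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Toy values chosen for illustration only.
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
toy_labels = [1, 0, 0, 1]
metrics = accuracy_and_f1(toy_logits, toy_labels)
```

On this toy data three of four predictions are correct, while F1 combines a precision of 2/3 with a recall of 1, which is why the two metrics can differ noticeably on the same predictions.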

Wrapping everything together, we get our `compute_metrics()` function:
Expand Down Expand Up @@ -160,13 +180,214 @@ trainer.train()

This time, it will report the validation loss and metrics at the end of each epoch on top of the training loss. Again, the exact accuracy/F1 score you reach might be a bit different from what we found, because of the random head initialization of the model, but it should be in the same ballpark.

The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use `fp16 = True` in your training arguments). We will go over everything it supports in Chapter 10.
### Advanced Training Features[[advanced-training-features]]

The `Trainer` comes with many built-in features that make modern deep learning best practices accessible:

**Mixed Precision Training**: Use `fp16=True` in your training arguments for faster training and reduced memory usage:

```py
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    fp16=True,  # Enable mixed precision
)
```
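
To see why 16-bit training needs some care, the following standard-library sketch shows a small value underflowing to zero in half precision, and how scaling it first preserves it. The gradient value and scale factor are hypothetical, picked only to make the effect visible. Mixed-precision implementations typically pair fp16 with this kind of loss scaling, but this sketch is not how the `Trainer` is implemented internally.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8        # hypothetical gradient value, chosen for illustration
loss_scale = 2.0 ** 16  # hypothetical scale factor

# Stored directly in fp16, the value is too small and becomes zero.
underflowed = to_fp16(tiny_grad)

# Scaling first keeps it representable in fp16...
scaled = to_fp16(tiny_grad * loss_scale)

# ...and unscaling in higher precision recovers a close approximation.
recovered = scaled / loss_scale
```

The recovered value is not exact, since fp16 only keeps about three decimal digits, but it is no longer lost entirely.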

**Gradient Accumulation**: For effective larger batch sizes when GPU memory is limited:

```py
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
)
```
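
The arithmetic behind that comment can be checked with a small pure-Python sketch. It assumes a toy one-parameter model `y = w * x` with a mean-squared-error loss and equally sized micro-batches. All names and data here are hypothetical. Under those assumptions, averaging the gradients of four micro-batches of 4 gives the same gradient as one batch of 16.

```python
import math
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def mse_grad(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Toy dataset of 16 examples, made up for illustration.
data: List[Example] = [(float(i), 2.0 * i + 0.5) for i in range(1, 17)]
w = 0.0
micro_batch_size = 4
accumulation_steps = 4
effective_batch_size = micro_batch_size * accumulation_steps

# Accumulate: each micro-batch gradient is divided by the number of steps,
# so the sum is the average over all micro-batches.
accumulated = 0.0
for step in range(accumulation_steps):
    micro = data[step * micro_batch_size : (step + 1) * micro_batch_size]
    accumulated += mse_grad(w, micro) / accumulation_steps

# Reference: the gradient computed on the full batch at once.
full_batch = mse_grad(w, data)
```

Only one micro-batch needs to be in memory at a time, which is where the memory saving comes from. The trade-off is more forward and backward passes per optimizer update.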

**Learning Rate Scheduling**: The Trainer uses linear decay by default, but you can customize this:

```py
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",  # Try different schedulers
)
```
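
For intuition about how these two schedules differ, here is a minimal sketch of the learning-rate curves, assuming no warmup and a single half-cosine cycle. The functions `linear_lr` and `cosine_lr` are hypothetical helpers written for this example. The library's schedulers support warmup and other options not modeled here.

```python
import math


def linear_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Linear decay from base_lr at step 0 down to 0 at total_steps."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Half-cosine decay from base_lr at step 0 down to 0 at total_steps."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 2e-5
total_steps = 100  # hypothetical training length

# Sample both schedules at a few points to compare their shapes.
for step in (0, 25, 50, 75, 100):
    print(step, linear_lr(step, total_steps, base_lr), cosine_lr(step, total_steps, base_lr))
```

Both curves start at the base learning rate and end at zero, but the cosine schedule stays higher early in training and drops faster near the end.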

<Tip>

🎯 **Performance Optimization**: For more advanced training techniques including distributed training, memory optimization, and hardware-specific optimizations, explore the [🤗 Transformers performance guide](https://huggingface.co/docs/transformers/main/en/performance).

</Tip>

The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options for distributed training. We will go over everything it supports in Chapter 10.

This concludes the introduction to fine-tuning using the `Trainer` API. An example of doing this for most common NLP tasks will be given in [Chapter 7](/course/chapter7), but for now let's look at how to do the same thing with a pure PyTorch training loop.

<Tip>

📝 **More Examples**: Check out the comprehensive collection of [🤗 Transformers notebooks](https://huggingface.co/docs/transformers/main/en/notebooks).

</Tip>

This concludes the introduction to fine-tuning using the `Trainer` API. An example of doing this for most common NLP tasks will be given in [Chapter 7](/course/chapter7), but for now let's look at how to do the same thing in pure PyTorch.
## Section Quiz[[section-quiz]]

Test your understanding of the Trainer API and fine-tuning concepts:

### 1. What is the purpose of the <code>processing_class</code> parameter in the Trainer?

<Question
choices={[
{
text: "It specifies which model architecture to use.",
explain: "Model architecture is specified when loading the model, not in the Trainer."
},
{
text: "It tells the Trainer which tokenizer to use for processing data.",
explain: "The processing_class parameter is a modern addition that helps the Trainer know which tokenizer to use.",
correct: true
},
{
text: "It determines the batch size for training.",
explain: "Batch size is set in TrainingArguments, not through processing_class."
},
{
text: "It controls the evaluation frequency.",
explain: "Evaluation frequency is controlled by eval_strategy in TrainingArguments."
}
]}
/>

### 2. Which TrainingArguments parameter controls how often evaluation occurs during training?

<Question
choices={[
{
text: "eval_frequency",
explain: "There's no eval_frequency parameter in TrainingArguments."
},
{
text: "eval_strategy",
explain: "eval_strategy can be set to 'epoch', 'steps', or 'no' to control evaluation timing.",
correct: true
},
{
text: "evaluation_steps",
explain: "There's no evaluation_steps parameter. The related eval_steps sets the number of steps between evaluations when eval_strategy is 'steps', but eval_strategy determines if and when evaluation happens."
},
{
text: "do_eval",
explain: "do_eval is a flag for whether evaluation runs at all; it doesn't control how often evaluation occurs during training."
}
]}
/>

### 3. What does <code>fp16=True</code> in TrainingArguments enable?

<Question
choices={[
{
text: "16-bit integer precision for faster training.",
explain: "fp16 refers to floating-point precision, not integer precision."
},
{
text: "Mixed precision training with 16-bit floating-point numbers for faster training and reduced memory usage.",
explain: "Mixed precision training runs most computations in 16-bit floats while keeping the model weights in 32-bit, improving speed and reducing memory usage.",
correct: true
},
{
text: "Training for exactly 16 epochs.",
explain: "fp16 has nothing to do with the number of epochs."
},
{
text: "Using 16 GPUs for distributed training.",
explain: "The number of GPUs is not controlled by the fp16 parameter."
}
]}
/>

### 4. What is the role of the <code>compute_metrics</code> function in the Trainer?

<Question
choices={[
{
text: "It calculates the loss during training.",
explain: "Loss calculation is handled automatically by the model, not by compute_metrics."
},
{
text: "It converts logits to predictions and calculates evaluation metrics like accuracy and F1.",
explain: "compute_metrics takes predictions and labels, then returns metrics for evaluation.",
correct: true
},
{
text: "It determines which optimizer to use.",
explain: "Optimizer selection is not handled by compute_metrics."
},
{
text: "It preprocesses the training data.",
explain: "Data preprocessing is done before training, not by compute_metrics during evaluation."
}
]}
/>

### 5. What happens when you don't provide an <code>eval_dataset</code> to the Trainer?

<Question
choices={[
{
text: "Training will fail with an error.",
explain: "Training can proceed without an eval_dataset, though you won't get evaluation metrics."
},
{
text: "The Trainer will automatically split the training data for evaluation.",
explain: "The Trainer doesn't automatically create validation splits."
},
{
text: "You won't get evaluation metrics during training, but training will still work.",
explain: "Evaluation is optional - you can train without it, but you won't see validation metrics.",
correct: true
},
{
text: "The model will use the training data for evaluation.",
explain: "The Trainer won't automatically use training data for evaluation - it simply won't evaluate."
}
]}
/>

### 6. What is gradient accumulation and how do you enable it?

<Question
choices={[
{
text: "It saves gradients to disk, enabled with save_gradients=True.",
explain: "Gradient accumulation doesn't involve saving gradients to disk."
},
{
text: "It accumulates gradients over multiple batches before updating, enabled with gradient_accumulation_steps.",
explain: "This allows you to simulate larger batch sizes by accumulating gradients over multiple forward passes.",
correct: true
},
{
text: "It speeds up gradient computation, enabled automatically with fp16.",
explain: "While fp16 can speed up training, gradient accumulation is a separate technique."
},
{
text: "It prevents gradient overflow, enabled with gradient_clipping=True.",
explain: "That describes gradient clipping, not gradient accumulation."
}
]}
/>

<Tip>

✏️ **Try it out!** Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2.
💡 **Key Takeaways:**
- The `Trainer` API provides a high-level interface that handles most training complexity
- Use `processing_class` to specify your tokenizer for proper data handling
- `TrainingArguments` controls all aspects of training: learning rate, batch size, evaluation strategy, and optimizations
- `compute_metrics` enables custom evaluation metrics beyond just training loss
- Modern features like mixed precision (`fp16=True`) and gradient accumulation can significantly improve training efficiency

</Tip>
