diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml index b300b245f..a230a1e58 100644 --- a/chapters/en/_toctree.yml +++ b/chapters/en/_toctree.yml @@ -58,13 +58,14 @@ - local: chapter3/2 title: Processing the data - local: chapter3/3 - title: Fine-tuning a model with the Trainer API or Keras - local_fw: { pt: chapter3/3, tf: chapter3/3_tf } + title: Fine-tuning a model with the Trainer API - local: chapter3/4 - title: A full training + title: A full training loop - local: chapter3/5 title: Fine-tuning, Check! - local: chapter3/6 + title: Understanding Learning Curves + - local: chapter3/7 title: End-of-chapter quiz quiz: 3 diff --git a/chapters/en/chapter3/1.mdx b/chapters/en/chapter3/1.mdx index 884be198b..743b13c6c 100644 --- a/chapters/en/chapter3/1.mdx +++ b/chapters/en/chapter3/1.mdx @@ -7,20 +7,35 @@ classNames="absolute z-10 right-0 top-0" /> -In [Chapter 2](/course/chapter2) we explored how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pretrained model for your own dataset? That's the topic of this chapter! You will learn: +In [Chapter 2](/course/chapter2) we explored how to use tokenizers and pretrained models to make predictions. But what if you want to fine-tune a pretrained model to solve a specific task? That's the topic of this chapter! You will learn: -{#if fw === 'pt'} -* How to prepare a large dataset from the Hub -* How to use the high-level `Trainer` API to fine-tune a model -* How to use a custom training loop -* How to leverage the 🤗 Accelerate library to easily run that custom training loop on any distributed setup +* How to prepare a large dataset from the Hub using the latest 🤗 Datasets features +* How to use the high-level `Trainer` API to fine-tune a model with modern best practices +* How to implement a custom training loop with optimization techniques +* How to leverage the 🤗 Accelerate library to easily run distributed training on any setup +* How to apply current fine-tuning best practices for maximum performance -{:else} -* How to prepare a large dataset from the Hub -* How to use Keras to fine-tune a model -* How to use Keras to get predictions -* How to use a custom metric + -{/if} +📚 **Essential Resources**: Before starting, you might want to review the [🤗 Datasets documentation](https://huggingface.co/docs/datasets/) for data processing. -In order to upload your trained checkpoints to the Hugging Face Hub, you will need a huggingface.co account: [create an account](https://huggingface.co/join) \ No newline at end of file + + +This chapter will also serve as an introduction to some Hugging Face libraries beyond the 🤗 Transformers library! We'll see how libraries like 🤗 Datasets, 🤗 Tokenizers, 🤗 Accelerate, and 🤗 Evaluate can help you train models more efficiently and effectively. + +Each of the main sections in this chapter will teach you something different: +- **Section 2**: Learn modern data preprocessing techniques and efficient dataset handling +- **Section 3**: Master the powerful Trainer API with all its latest features +- **Section 4**: Implement training loops from scratch and understand distributed training with Accelerate + +By the end of this chapter, you'll be able to fine-tune models on your own datasets using both high-level APIs and custom training loops, applying the latest best practices in the field. 
+ + + +🎯 **What You'll Build**: By the end of this chapter, you'll have fine-tuned a BERT model for text classification and understand how to adapt the techniques to your own datasets and tasks. + + + +This chapter focuses exclusively on **PyTorch**, as it has become the standard framework for modern deep learning research and production. We'll use the latest APIs and best practices from the Hugging Face ecosystem. + +To upload your trained models to the Hugging Face Hub, you will need a Hugging Face account: [create an account](https://huggingface.co/join) \ No newline at end of file diff --git a/chapters/en/chapter3/2.mdx b/chapters/en/chapter3/2.mdx index e94b46d75..bc1b00179 100644 --- a/chapters/en/chapter3/2.mdx +++ b/chapters/en/chapter3/2.mdx @@ -1,29 +1,13 @@ - - # Processing the data[[processing-the-data]] -{#if fw === 'pt'} - - - -{:else} - -{/if} - -{#if fw === 'pt'} -Continuing with the example from the [previous chapter](/course/chapter2), here is how we would train a sequence classifier on one batch in PyTorch: +Continuing with the example from the [previous chapter](/course/chapter2), here is how we would train a sequence classifier on one batch: ```python import torch @@ -48,30 +32,6 @@ loss = model(**batch).loss loss.backward() optimizer.step() ``` -{:else} -Continuing with the example from the [previous chapter](/course/chapter2), here is how we would train a sequence classifier on one batch in TensorFlow: - -```python -import tensorflow as tf -import numpy as np -from transformers import AutoTokenizer, TFAutoModelForSequenceClassification - -# Same as before -checkpoint = "bert-base-uncased" -tokenizer = AutoTokenizer.from_pretrained(checkpoint) -model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint) -sequences = [ - "I've been waiting for a HuggingFace course my whole life.", - "This course is amazing!", -] -batch = dict(tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")) - -# This is new -model.compile(optimizer="adam", loss="sparse_categorical_crossentropy") -labels = tf.convert_to_tensor([1, 1]) -model.train_on_batch(batch, labels) -``` -{/if} Of course, just training the model on two sentences is not going to yield very good results. To get better results, you will need to prepare a bigger dataset. @@ -79,18 +39,16 @@ In this section we will use as an example the MRPC (Microsoft Research Paraphras ### Loading a dataset from the Hub[[loading-a-dataset-from-the-hub]] -{#if fw === 'pt'} -{:else} - -{/if} The Hub doesn't just contain models; it also has multiple datasets in lots of different languages. You can browse the datasets [here](https://huggingface.co/datasets), and we recommend you try to load and process a new dataset once you have gone through this section (see the general documentation [here](https://huggingface.co/docs/datasets/loading)). But for now, let's focus on the MRPC dataset! This is one of the 10 datasets composing the [GLUE benchmark](https://gluebenchmark.com/), which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks. The 🤗 Datasets library provides a very simple command to download and cache a dataset on the Hub. We can download the MRPC dataset like this: -⚠️ **Warning** Make sure that `datasets` is installed by running `pip install datasets`. Then, load the MRPC dataset and print it to see what it contains. 
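Loading and inspecting the dataset only takes a couple of lines. Here is a minimal sketch; `"glue"` and `"mrpc"` are the benchmark and task names discussed in this section:

```py
from datasets import load_dataset

# Download (and cache) the MRPC task of the GLUE benchmark, then inspect the splits
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
```

Printing the result shows the splits and their columns, as in the output below.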
+ +💡 **Additional Resources**: For more dataset loading techniques and examples, check out the [🤗 Datasets documentation](https://huggingface.co/docs/datasets/). + ```py @@ -119,8 +77,12 @@ DatasetDict({ As you can see, we get a `DatasetDict` object which contains the training set, the validation set, and the test set. Each of those contains several columns (`sentence1`, `sentence2`, `label`, and `idx`) and a variable number of rows, which are the number of elements in each set (so, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set). + + This command downloads and caches the dataset, by default in *~/.cache/huggingface/datasets*. Recall from Chapter 2 that you can customize your cache folder by setting the `HF_HOME` environment variable. + + We can access each pair of sentences in our `raw_datasets` object by indexing, like with a dictionary: ```py @@ -158,11 +120,7 @@ Behind the scenes, `label` is of type `ClassLabel`, and the mapping of integers ### Preprocessing a dataset[[preprocessing-a-dataset]] -{#if fw === 'pt'} -{:else} - -{/if} To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in the [previous chapter](/course/chapter2), this is done with a tokenizer. We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this: @@ -175,6 +133,12 @@ tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"]) tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"]) ``` + + +💡 **Deep Dive**: For more advanced tokenization techniques and understanding how different tokenizers work, explore the [🤗 Tokenizers documentation](https://huggingface.co/docs/transformers/main/en/tokenizer_summary) and the [tokenization guide in the cookbook](https://huggingface.co/learn/cookbook/en/advanced_rag#tokenization-strategies). + + + However, we can't just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. We need to handle the two sequences as a pair, and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects: ```py @@ -249,7 +213,13 @@ def tokenize_function(example): This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. Note that it also works if the `example` dictionary contains several samples (each key as a list of sentences) since the `tokenizer` works on lists of pairs of sentences, as seen before. This will allow us to use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization. The `tokenizer` is backed by a tokenizer written in Rust from the [🤗 Tokenizers](https://github.com/huggingface/tokenizers) library. This tokenizer can be very fast, but only if we give it lots of inputs at once. -Note that we've left the `padding` argument out in our tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it's better to pad the samples when we're building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths! 
+Note that we've left the `padding` argument out in our tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it's better to pad the samples when we're building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths! + + + +📚 **Performance Tips**: Learn more about efficient data processing techniques in the [🤗 Datasets performance guide](https://huggingface.co/docs/datasets/about_arrow). + + Here is how we apply the tokenization function on all our datasets at once. We're using `batched=True` in our call to `map` so the function is applied to multiple elements of our dataset at once, and not on each element separately. This allows for faster preprocessing. @@ -283,34 +253,25 @@ Our `tokenize_function` returns a dictionary with the keys `input_ids`, `attenti The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together — a technique we refer to as *dynamic padding*. ### Dynamic padding[[dynamic-padding]] -{#if fw === 'pt'} The function that is responsible for putting together samples inside a batch is called a *collate function*. It's an argument you can pass when you build a `DataLoader`, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). This won't be possible in our case since the inputs we have won't all be of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training by quite a bit, but note that if you're training on a TPU it can cause problems — TPUs prefer fixed shapes, even when that requires extra padding. -{:else} + -The function that is responsible for putting together samples inside a batch is called a *collate function*. The default collator is a function that will just convert your samples to tf.Tensor and concatenate them (recursively if your elements are lists, tuples, or dictionaries). This won't be possible in our case since the inputs we have won't all be of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding. This will speed up training by quite a bit, but note that if you're training on a TPU it can cause problems — TPUs prefer fixed shapes, even when that requires extra padding. +🚀 **Optimization Guide**: For more details on optimizing training performance, including padding strategies and TPU considerations, see the [🤗 Transformers performance documentation](https://huggingface.co/docs/transformers/main/en/performance). -{/if} + To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the 🤗 Transformers library provides us with such a function via `DataCollatorWithPadding`.
It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need: -{#if fw === 'pt'} ```py from transformers import DataCollatorWithPadding data_collator = DataCollatorWithPadding(tokenizer=tokenizer) ``` -{:else} -```py -from transformers import DataCollatorWithPadding - -data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf") -``` -{/if} To test this new toy, let's grab a few samples from our training set that we would like to batch together. Here, we remove the columns `idx`, `sentence1`, and `sentence2` as they won't be needed and contain strings (and we can't create tensors with strings) and have a look at the lengths of each entry in the batch: @@ -331,17 +292,6 @@ batch = data_collator(samples) {k: v.shape for k, v in batch.items()} ``` -{#if fw === 'tf'} - -```python out -{'attention_mask': TensorShape([8, 67]), - 'input_ids': TensorShape([8, 67]), - 'token_type_ids': TensorShape([8, 67]), - 'labels': TensorShape([8])} -``` - -{:else} - ```python out {'attention_mask': torch.Size([8, 67]), 'input_ids': torch.Size([8, 67]), @@ -351,36 +301,146 @@ batch = data_collator(samples) Looking good! Now that we've gone from raw text to batches our model can deal with, we're ready to fine-tune it! -{/if} - ✏️ **Try it out!** Replicate the preprocessing on the GLUE SST-2 dataset. It's a little bit different since it's composed of single sentences instead of pairs, but the rest of what we did should look the same. For a harder challenge, try to write a preprocessing function that works on any of the GLUE tasks. - - -{#if fw === 'tf'} +📖 **Additional Practice**: Check out these hands-on examples from the [🤗 Transformers examples](https://huggingface.co/docs/transformers/main/en/notebooks). -Now that we have our dataset and a data collator, we need to put them together. We could manually load batches and collate them, but that's a lot of work, and probably not very performant either. Instead, there's a simple method that offers a performant solution to this problem: `to_tf_dataset()`. This will wrap a `tf.data.Dataset` around your dataset, with an optional collation function. `tf.data.Dataset` is a native TensorFlow format that Keras can use for `model.fit()`, so this one method immediately converts a 🤗 Dataset to a format that's ready for training. Let's see it in action with our dataset! + -```py -tf_train_dataset = tokenized_datasets["train"].to_tf_dataset( - columns=["attention_mask", "input_ids", "token_type_ids"], - label_cols=["labels"], - shuffle=True, - collate_fn=data_collator, - batch_size=8, -) +Perfect! Now that we have preprocessed our data with the latest best practices from the 🤗 Datasets library, we're ready to move on to training our model using the modern Trainer API. The next section will show you how to fine-tune your model effectively using the latest features and optimizations available in the Hugging Face ecosystem. + +## Section Quiz[[section-quiz]] + +Test your understanding of data processing concepts: + +### 1. What is the main advantage of using `Dataset.map()` with `batched=True`? + + + +### 2. Why do we use dynamic padding instead of padding all sequences to the maximum length in the dataset? + + + +### 3. What does the `token_type_ids` field represent in BERT tokenization? + + + +### 4. When loading a dataset with `load_dataset('glue', 'mrpc')`, what does the second argument specify? + + + +### 5. 
What is the purpose of removing columns like 'sentence1' and 'sentence2' before training? + + -tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset( - columns=["attention_mask", "input_ids", "token_type_ids"], - label_cols=["labels"], - shuffle=False, - collate_fn=data_collator, - batch_size=8, -) -``` + -And that's it! We can take those datasets forward into the next lecture, where training will be pleasantly straightforward after all the hard work of data preprocessing. +💡 **Key Takeaways:** +- Use `batched=True` with `Dataset.map()` for significantly faster preprocessing +- Dynamic padding with `DataCollatorWithPadding` is more efficient than fixed-length padding +- Always preprocess your data to match what your model expects (numerical tensors, correct column names) +- The 🤗 Datasets library provides powerful tools for efficient data processing at scale -{/if} + diff --git a/chapters/en/chapter3/3.mdx b/chapters/en/chapter3/3.mdx index a7f2662bc..12705fca7 100644 --- a/chapters/en/chapter3/3.mdx +++ b/chapters/en/chapter3/3.mdx @@ -11,7 +11,13 @@ -🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset. Once you've done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/). +🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset with modern best practices. Once you've done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/). + + + +📚 **Training Resources**: Before diving into training, familiarize yourself with the comprehensive [🤗 Transformers training guide](https://huggingface.co/docs/transformers/main/en/training) and explore practical examples in the [fine-tuning cookbook](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu). + + The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need: @@ -42,9 +48,11 @@ from transformers import TrainingArguments training_args = TrainingArguments("test-trainer") ``` +If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`. We will learn more about this in [Chapter 4](/course/chapter4/3) + -💡 If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`. We will learn more about this in [Chapter 4](/course/chapter4/3) +🚀 **Advanced Configuration**: For detailed information on all available training arguments and optimization strategies, check out the [TrainingArguments documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) and the [training configuration cookbook](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu). 
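To make these options concrete, here is a minimal sketch of a slightly fuller configuration. The hyperparameter values are illustrative assumptions rather than recommendations from this chapter, and `push_to_hub=True` assumes you are logged in to your Hugging Face account:

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",           # evaluate at the end of every epoch
    learning_rate=2e-5,              # a common starting point for BERT-style models
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    push_to_hub=True,                # upload checkpoints to the Hub during training
)
```

Any argument you don't set keeps its default value, so you can start with just the output directory and add options as you need them.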
@@ -58,7 +66,7 @@ model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_label You will notice that unlike in [Chapter 2](/course/chapter2), you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now. -Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class` (e.g., a tokenizer, feature extractor, or processor): +Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class`. The `processing_class` parameter is a newer addition that tells the Trainer which tokenizer to use for processing: ```py from transformers import Trainer @@ -73,7 +81,13 @@ trainer = Trainer( ) ``` -Note that when you pass a tokenizer as the `processing_class`, as we did here, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding` if the `processing_class` is a tokenizer or feature extractor, so you can skip the line `data_collator=data_collator` in this call. It was still important to show you this part of the processing in section 2! +When you pass a tokenizer as the `processing_class`, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding`. You can skip the `data_collator=data_collator` line in this case, but we included it here to show you this important part of the processing pipeline. + + + +📖 **Learn More**: For comprehensive details on the Trainer class and its parameters, visit the [Trainer API documentation](https://huggingface.co/docs/transformers/main/en/main_classes/trainer) and explore advanced usage patterns in the [training cookbook recipes](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu). + + To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`: @@ -123,6 +137,12 @@ metric.compute(predictions=preds, references=predictions.label_ids) {'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542} ``` + + +Learn about different evaluation metrics and strategies in the [🤗 Evaluate documentation](https://huggingface.co/docs/evaluate/). + + + The exact results you get may vary, as the random initialization of the model head might change the metrics it achieved. Here, we can see our model has an accuracy of 85.78% on the validation set and an F1 score of 89.97. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. The table in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) reported an F1 score of 88.9 for the base model. That was the `uncased` model while we are currently using the `cased` model, which explains the better result. 
Wrapping everything together, we get our `compute_metrics()` function: @@ -160,13 +180,214 @@ trainer.train() This time, it will report the validation loss and metrics at the end of each epoch on top of the training loss. Again, the exact accuracy/F1 score you reach might be a bit different from what we found, because of the random head initialization of the model, but it should be in the same ballpark. -The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use `fp16 = True` in your training arguments). We will go over everything it supports in Chapter 10. +### Advanced Training Features[[advanced-training-features]] + +The `Trainer` comes with many built-in features that make modern deep learning best practices accessible: + +**Mixed Precision Training**: Use `fp16=True` in your training arguments for faster training and reduced memory usage: + +```py +training_args = TrainingArguments( + "test-trainer", + eval_strategy="epoch", + fp16=True, # Enable mixed precision +) +``` + +**Gradient Accumulation**: For effective larger batch sizes when GPU memory is limited: + +```py +training_args = TrainingArguments( + "test-trainer", + eval_strategy="epoch", + per_device_train_batch_size=4, + gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16 +) +``` + +**Learning Rate Scheduling**: The Trainer uses linear decay by default, but you can customize this: + +```py +training_args = TrainingArguments( + "test-trainer", + eval_strategy="epoch", + learning_rate=2e-5, + lr_scheduler_type="cosine", # Try different schedulers +) +``` + + + +🎯 **Performance Optimization**: For more advanced training techniques including distributed training, memory optimization, and hardware-specific optimizations, explore the [🤗 Transformers performance guide](https://huggingface.co/docs/transformers/main/en/performance). + + + +The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options for distributed training. We will go over everything it supports in Chapter 10. + +This concludes the introduction to fine-tuning using the `Trainer` API. An example of doing this for most common NLP tasks will be given in [Chapter 7](/course/chapter7), but for now let's look at how to do the same thing with a pure PyTorch training loop. + + + +📝 **More Examples**: Check out the comprehensive collection of [🤗 Transformers notebooks](https://huggingface.co/docs/transformers/main/en/notebooks). + + -This concludes the introduction to fine-tuning using the `Trainer` API. An example of doing this for most common NLP tasks will be given in [Chapter 7](/course/chapter7), but for now let's look at how to do the same thing in pure PyTorch. +## Section Quiz[[section-quiz]] + +Test your understanding of the Trainer API and fine-tuning concepts: + +### 1. What is the purpose of the processing_class parameter in the Trainer? + + + +### 2. Which TrainingArguments parameter controls how often evaluation occurs during training? + + + +### 3. What does fp16=True in TrainingArguments enable? + + + +### 4. What is the role of the compute_metrics function in the Trainer? + + + +### 5. What happens when you don't provide an eval_dataset to the Trainer? + + + +### 6. What is gradient accumulation and how do you enable it? + + -✏️ **Try it out!** Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2. 
+💡 **Key Takeaways:** +- The `Trainer` API provides a high-level interface that handles most training complexity +- Use `processing_class` to specify your tokenizer for proper data handling +- `TrainingArguments` controls all aspects of training: learning rate, batch size, evaluation strategy, and optimizations +- `compute_metrics` enables custom evaluation metrics beyond just training loss +- Modern features like mixed precision (`fp16=True`) and gradient accumulation can significantly improve training efficiency diff --git a/chapters/en/chapter3/3_tf.mdx b/chapters/en/chapter3/3_tf.mdx deleted file mode 100644 index 9df89e356..000000000 --- a/chapters/en/chapter3/3_tf.mdx +++ /dev/null @@ -1,199 +0,0 @@ - - -# Fine-tuning a model with Keras[[fine-tuning-a-model-with-keras]] - - - -Once you've done all the data preprocessing work in the last section, you have just a few steps left to train the model. Note, however, that the `model.fit()` command will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/). - -The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need: - -```py -from datasets import load_dataset -from transformers import AutoTokenizer, DataCollatorWithPadding -import numpy as np - -raw_datasets = load_dataset("glue", "mrpc") -checkpoint = "bert-base-uncased" -tokenizer = AutoTokenizer.from_pretrained(checkpoint) - - -def tokenize_function(example): - return tokenizer(example["sentence1"], example["sentence2"], truncation=True) - - -tokenized_datasets = raw_datasets.map(tokenize_function, batched=True) - -data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf") - -tf_train_dataset = tokenized_datasets["train"].to_tf_dataset( - columns=["attention_mask", "input_ids", "token_type_ids"], - label_cols=["labels"], - shuffle=True, - collate_fn=data_collator, - batch_size=8, -) - -tf_validation_dataset = tokenized_datasets["validation"].to_tf_dataset( - columns=["attention_mask", "input_ids", "token_type_ids"], - label_cols=["labels"], - shuffle=False, - collate_fn=data_collator, - batch_size=8, -) -``` - -### Training[[training]] - -TensorFlow models imported from 🤗 Transformers are already Keras models. Here is a short introduction to Keras. - - - -That means that once we have our data, very little work is required to begin training on it. - - - -As in the [previous chapter](/course/chapter2), we will use the `TFAutoModelForSequenceClassification` class, with two labels: - -```py -from transformers import TFAutoModelForSequenceClassification - -model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) -``` - -You will notice that unlike in [Chapter 2](/course/chapter2), you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been inserted instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now. - -To fine-tune the model on our dataset, we just have to `compile()` our model and then pass our data to the `fit()` method. 
This will start the fine-tuning process (which should take a couple of minutes on a GPU) and report training loss as it goes, plus the validation loss at the end of each epoch. - - - -Note that 🤗 Transformers models have a special ability that most Keras models don't - they can automatically use an appropriate loss which they compute internally. They will use this loss by default if you don't set a loss argument in `compile()`. Note that to use the internal loss you'll need to pass your labels as part of the input, not as a separate label, which is the normal way to use labels with Keras models. You'll see examples of this in Part 2 of the course, where defining the correct loss function can be tricky. For sequence classification, however, a standard Keras loss function works fine, so that's what we'll use here. - - - -```py -from tensorflow.keras.losses import SparseCategoricalCrossentropy - -model.compile( - optimizer="adam", - loss=SparseCategoricalCrossentropy(from_logits=True), - metrics=["accuracy"], -) -model.fit( - tf_train_dataset, - validation_data=tf_validation_dataset, -) -``` - - - -Note a very common pitfall here — you *can* just pass the name of the loss as a string to Keras, but by default Keras will assume that you have already applied a softmax to your outputs. Many models, however, output the values right before the softmax is applied, which are also known as the *logits*. We need to tell the loss function that that's what our model does, and the only way to do that is to call it directly, rather than by name with a string. - - - - -### Improving training performance[[improving-training-performance]] - - - -If you try the above code, it certainly runs, but you'll find that the loss declines only slowly or sporadically. The primary cause -is the *learning rate*. As with the loss, when we pass Keras the name of an optimizer as a string, Keras initializes -that optimizer with default values for all parameters, including learning rate. From long experience, though, we know -that transformer models benefit from a much lower learning rate than the default for Adam, which is 1e-3, also written -as 10 to the power of -3, or 0.001. 5e-5 (0.00005), which is some twenty times lower, is a much better starting point. - -In addition to lowering the learning rate, we have a second trick up our sleeve: We can slowly reduce the learning rate -over the course of training. In the literature, you will sometimes see this referred to as *decaying* or *annealing* -the learning rate. In Keras, the best way to do this is to use a *learning rate scheduler*. A good one to use is -`PolynomialDecay` — despite the name, with default settings it simply linearly decays the learning rate from the initial -value to the final value over the course of training, which is exactly what we want. In order to use a scheduler correctly, -though, we need to tell it how long training is going to be. We compute that as `num_train_steps` below. - -```py -from tensorflow.keras.optimizers.schedules import PolynomialDecay - -batch_size = 8 -num_epochs = 3 -# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied -# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset, -# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size. 
-num_train_steps = len(tf_train_dataset) * num_epochs -lr_scheduler = PolynomialDecay( - initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps -) -from tensorflow.keras.optimizers import Adam - -opt = Adam(learning_rate=lr_scheduler) -``` - - - -The 🤗 Transformers library also has a `create_optimizer()` function that will create an `AdamW` optimizer with learning rate decay. This is a convenient shortcut that you'll see in detail in future sections of the course. - - - -Now we have our all-new optimizer, and we can try training with it. First, let's reload the model, to reset the changes to the weights from the training run we just did, and then we can compile it with the new optimizer: - -```py -import tensorflow as tf - -model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) -loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True) -model.compile(optimizer=opt, loss=loss, metrics=["accuracy"]) -``` - -Now, we fit again: - -```py -model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3) -``` - - - -💡 If you want to automatically upload your model to the Hub during training, you can pass along a `PushToHubCallback` in the `model.fit()` method. We will learn more about this in [Chapter 4](/course/chapter4/3) - - - -### Model predictions[[model-predictions]] - - - - -Training and watching the loss go down is all very nice, but what if we want to actually get outputs from the trained model, either to compute some metrics, or to use the model in production? To do that, we can just use the `predict()` method. This will return the *logits* from the output head of the model, one per class. - -```py -preds = model.predict(tf_validation_dataset)["logits"] -``` - -We can convert these logits into the model's class predictions by using `argmax` to find the highest logit, which corresponds to the most likely class: - -```py -class_preds = np.argmax(preds, axis=1) -print(preds.shape, class_preds.shape) -``` - -```python out -(408, 2) (408,) -``` - -Now, let's use those `preds` to compute some metrics! We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation: - -```py -import evaluate - -metric = evaluate.load("glue", "mrpc") -metric.compute(predictions=class_preds, references=raw_datasets["validation"]["label"]) -``` - -```python out -{'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542} -``` - -The exact results you get may vary, as the random initialization of the model head might change the metrics it achieved. Here, we can see our model has an accuracy of 85.78% on the validation set and an F1 score of 89.97. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. The table in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) reported an F1 score of 88.9 for the base model. That was the `uncased` model while we are currently using the `cased` model, which explains the better result. - -This concludes the introduction to fine-tuning using the Keras API. An example of doing this for most common NLP tasks will be given in [Chapter 7](/course/chapter7). If you would like to hone your skills on the Keras API, try to fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2. 
diff --git a/chapters/en/chapter3/4.mdx b/chapters/en/chapter3/4.mdx index 77e84c75d..e69c4e750 100644 --- a/chapters/en/chapter3/4.mdx +++ b/chapters/en/chapter3/4.mdx @@ -1,4 +1,4 @@ -# A full training[[a-full-training]] +# A full training loop[[a-full-training]] -Now we'll see how to achieve the same results as we did in the last section without using the `Trainer` class. Again, we assume you have done the data processing in section 2. Here is a short summary covering everything you will need: +Now we'll see how to achieve the same results as we did in the last section without using the `Trainer` class, implementing a training loop from scratch with modern PyTorch best practices. Again, we assume you have done the data processing in section 2. Here is a short summary covering everything you will need: + + + +🏗️ **Training from Scratch**: This section builds on the previous content. For comprehensive guidance on PyTorch training loops and best practices, check out the [🤗 Transformers training documentation](https://huggingface.co/docs/transformers/main/en/training#train-in-native-pytorch) and the [custom training cookbook](https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu#model). + + ```py from datasets import load_dataset @@ -110,6 +116,17 @@ from torch.optim import AdamW optimizer = AdamW(model.parameters(), lr=5e-5) ``` + + +💡 **Modern Optimization Tips**: For even better performance, you can try: +- **AdamW with weight decay**: `AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)` +- **8-bit Adam**: Use `bitsandbytes` for memory-efficient optimization +- **Different learning rates**: Lower learning rates (1e-5 to 3e-5) often work better for large models + +🚀 **Optimization Resources**: Learn more about optimizers and training strategies in the [🤗 Transformers optimization guide](https://huggingface.co/docs/transformers/main/en/performance#optimizer). + + + Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader). The `Trainer` uses three epochs by default, so we will follow that: ```py @@ -167,6 +184,19 @@ for epoch in range(num_epochs): progress_bar.update(1) ``` + + +💡 **Modern Training Optimizations**: To make your training loop even more efficient, consider: + +- **Gradient Clipping**: Add `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` before `optimizer.step()` +- **Mixed Precision**: Use `torch.cuda.amp.autocast()` and `GradScaler` for faster training +- **Gradient Accumulation**: Accumulate gradients over multiple batches to simulate larger batch sizes +- **Checkpointing**: Save model checkpoints periodically to resume training if interrupted + +🔧 **Implementation Guide**: For detailed examples of these optimizations, see the [🤗 Transformers efficient training guide](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one) and the [range of optimizers](https://huggingface.co/docs/transformers/main/en/optimizers). + + + You can see that the core of the training loop looks a lot like the one in the introduction. We didn't ask for any reporting, so this training loop will not tell us anything about how the model fares. We need to add an evaluation loop for that. 
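Before moving on to evaluation, here is a rough sketch of how two of the optimizations mentioned above, gradient clipping and mixed precision, could be slotted into this loop. It is only one possible arrangement: it assumes a CUDA GPU and reuses the `model`, `optimizer`, `lr_scheduler`, `train_dataloader`, `device`, `num_epochs`, and `num_training_steps` objects defined earlier in this section.

```py
import torch
from tqdm.auto import tqdm

# A sketch only: assumes a CUDA GPU and the objects defined earlier in this section
scaler = torch.cuda.amp.GradScaler()
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
            outputs = model(**batch)
            loss = outputs.loss

        scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
        scaler.unscale_(optimizer)  # unscale before clipping so the norm is correct
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        scaler.step(optimizer)  # the step is skipped if gradients overflowed
        scaler.update()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```

Note that when the scaler skips an optimizer step because of an overflow, the scheduler still advances; that is usually acceptable, but you may want to guard against it in production code.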
@@ -174,6 +204,12 @@ You can see that the core of the training loop looks a lot like the one in the i As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We've already seen the `metric.compute()` method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method `add_batch()`. Once we have accumulated all the batches, we can get the final result with `metric.compute()`. Here's how to implement all of this in an evaluation loop: + + +📊 **Evaluation Best Practices**: For more sophisticated evaluation strategies and metrics, explore the [🤗 Evaluate documentation](https://huggingface.co/docs/evaluate/) and the [comprehensive evaluation cookbook](https://github.com/huggingface/evaluation-guidebook). + + + ```py import evaluate @@ -207,20 +243,30 @@ Again, your results will be slightly different because of the randomness in the -The training loop we defined earlier works fine on a single CPU or GPU. But using the [🤗 Accelerate](https://github.com/huggingface/accelerate) library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like: +The training loop we defined earlier works fine on a single CPU or GPU. But using the [🤗 Accelerate](https://github.com/huggingface/accelerate) library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs. 🤗 Accelerate handles the complexity of distributed training, mixed precision, and device placement automatically. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like: + + + +⚡ **Accelerate Deep Dive**: Learn everything about distributed training, mixed precision, and hardware optimization in the [🤗 Accelerate documentation](https://huggingface.co/docs/accelerate/) and explore practical examples in the [transformers documentation](https://huggingface.co/docs/transformers/main/en/accelerate). 
+ + ```py +from accelerate import Accelerator from torch.optim import AdamW from transformers import AutoModelForSequenceClassification, get_scheduler +accelerator = Accelerator() + model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) optimizer = AdamW(model.parameters(), lr=3e-5) -device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") -model.to(device) +train_dl, eval_dl, model, optimizer = accelerator.prepare( + train_dataloader, eval_dataloader, model, optimizer +) num_epochs = 3 -num_training_steps = num_epochs * len(train_dataloader) +num_training_steps = num_epochs * len(train_dl) lr_scheduler = get_scheduler( "linear", optimizer=optimizer, @@ -232,11 +278,10 @@ progress_bar = tqdm(range(num_training_steps)) model.train() for epoch in range(num_epochs): - for batch in train_dataloader: - batch = {k: v.to(device) for k, v in batch.items()} + for batch in train_dl: outputs = model(**batch) loss = outputs.loss - loss.backward() + accelerator.backward(loss) optimizer.step() lr_scheduler.step() @@ -244,51 +289,6 @@ for epoch in range(num_epochs): progress_bar.update(1) ``` -And here are the changes: - -```diff -+ from accelerate import Accelerator - from torch.optim import AdamW - from transformers import AutoModelForSequenceClassification, get_scheduler - -+ accelerator = Accelerator() - - model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2) - optimizer = AdamW(model.parameters(), lr=3e-5) - -- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu") -- model.to(device) - -+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare( -+ train_dataloader, eval_dataloader, model, optimizer -+ ) - - num_epochs = 3 - num_training_steps = num_epochs * len(train_dataloader) - lr_scheduler = get_scheduler( - "linear", - optimizer=optimizer, - num_warmup_steps=0, - num_training_steps=num_training_steps - ) - - progress_bar = tqdm(range(num_training_steps)) - - model.train() - for epoch in range(num_epochs): - for batch in train_dataloader: -- batch = {k: v.to(device) for k, v in batch.items()} - outputs = model(**batch) - loss = outputs.loss -- loss.backward() -+ accelerator.backward(loss) - - optimizer.step() - lr_scheduler.step() - optimizer.zero_grad() - progress_bar.update(1) -``` - The first line to add is the import line. The second line instantiates an `Accelerator` object that will look at the environment and initialize the proper distributed setup. 🤗 Accelerate handles the device placement for you, so you can remove the lines that put the model on the device (or, if you prefer, change them to use `accelerator.device` instead of `device`). Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to `accelerator.prepare()`. This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the `device` (again, if you want to keep this you can just change it to use `accelerator.device`) and replacing `loss.backward()` with `accelerator.backward(loss)`. @@ -360,3 +360,209 @@ notebook_launcher(training_function) ``` You can find more examples in the [🤗 Accelerate repo](https://github.com/huggingface/accelerate/tree/main/examples). 
+ + + +🌐 **Distributed Training**: For comprehensive coverage of multi-GPU and multi-node training, check out the [🤗 Transformers distributed training guide](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_many) and the [scaling training cookbook](https://huggingface.co/docs/transformers/main/en/accelerate). + + + +### Next Steps and Best Practices[[next-steps-and-best-practices]] + +Now that you've learned how to implement training from scratch, here are some additional considerations for production use: + +**Model Evaluation**: Always evaluate your model on multiple metrics, not just accuracy. Use the 🤗 Evaluate library for comprehensive evaluation. + +**Hyperparameter Tuning**: Consider using libraries like Optuna or Ray Tune for systematic hyperparameter optimization. + +**Model Monitoring**: Track training metrics, learning curves, and validation performance throughout training. + +**Model Sharing**: Once trained, share your model on the Hugging Face Hub to make it available to the community. + +**Efficiency**: For large models, consider techniques like gradient checkpointing, parameter-efficient fine-tuning (LoRA, AdaLoRA), or quantization methods. + +This concludes our deep dive into fine-tuning with custom training loops. The skills you've learned here will serve you well when you need full control over the training process or want to implement custom training logic that goes beyond what the `Trainer` API offers. + +## Section Quiz[[section-quiz]] + +Test your understanding of custom training loops and advanced training techniques: + +### 1. What is the main difference between Adam and AdamW optimizers? + + + +### 2. In a training loop, what is the correct order of operations? + + + +### 3. What does the 🤗 Accelerate library primarily help with? + + + +### 4. Why do we move batches to the device in a training loop? + + + +### 5. What does `model.eval()` do before evaluation? + + + +### 6. What is the purpose of `torch.no_grad()` during evaluation? + + + +### 7. What changes when you use 🤗 Accelerate in your training loop? + + + + + +💡 **Key Takeaways:** +- Manual training loops give you complete control but require understanding of the proper sequence: forward → backward → optimizer step → scheduler step → zero gradients +- AdamW with weight decay is the recommended optimizer for transformer models +- Always use `model.eval()` and `torch.no_grad()` during evaluation for correct behavior and efficiency +- 🤗 Accelerate makes distributed training accessible with minimal code changes +- Device management (moving tensors to GPU/CPU) is crucial for PyTorch operations +- Modern techniques like mixed precision, gradient accumulation, and gradient clipping can significantly improve training efficiency + + diff --git a/chapters/en/chapter3/5.mdx b/chapters/en/chapter3/5.mdx index 5aa6b002d..e24553dfd 100644 --- a/chapters/en/chapter3/5.mdx +++ b/chapters/en/chapter3/5.mdx @@ -7,19 +7,40 @@ classNames="absolute z-10 right-0 top-0" /> -That was fun! In the first two chapters you learned about models and tokenizers, and now you know how to fine-tune them for your own data. 
To recap, in this chapter you: - -{#if fw === 'pt'} -* Learned about datasets in the [Hub](https://huggingface.co/datasets) -* Learned how to load and preprocess datasets, including using dynamic padding and collators -* Implemented your own fine-tuning and evaluation of a model -* Implemented a lower-level training loop -* Used 🤗 Accelerate to easily adapt your training loop so it works for multiple GPUs or TPUs - -{:else} -* Learned about datasets in the [Hub](https://huggingface.co/datasets) -* Learned how to load and preprocess datasets -* Learned how to fine-tune and evaluate a model with Keras -* Implemented a custom metric - -{/if} +That was comprehensive! In the first two chapters you learned about models and tokenizers, and now you know how to fine-tune them for your own data using modern best practices. To recap, in this chapter you: + +* Learned about datasets on the [Hub](https://huggingface.co/datasets) and modern data processing techniques +* Learned how to load and preprocess datasets efficiently, including using dynamic padding and data collators +* Implemented fine-tuning and evaluation using the high-level `Trainer` API with the latest features +* Implemented a complete custom training loop from scratch with PyTorch +* Used 🤗 Accelerate to make your training code work seamlessly on multiple GPUs or TPUs +* Applied modern optimization techniques like mixed precision training and gradient accumulation + + + +🎉 **Congratulations!** You've mastered the fundamentals of fine-tuning transformer models. You're now ready to tackle real-world ML projects! + +📖 **Continue Learning**: Explore these resources to deepen your knowledge: +- [🤗 Transformers task guides](https://huggingface.co/docs/transformers/main/en/tasks/sequence_classification) for specific NLP tasks +- [🤗 Transformers examples](https://huggingface.co/docs/transformers/main/en/notebooks) for comprehensive notebooks + +🚀 **Next Steps**: +- Try fine-tuning on your own dataset using the techniques you've learned +- Experiment with different model architectures available on the [Hugging Face Hub](https://huggingface.co/models) +- Join the [Hugging Face community](https://discuss.huggingface.co/) to share your projects and get help + + + +This is just the beginning of your journey with 🤗 Transformers. In the next chapter, we'll explore how to share your models and tokenizers with the community and contribute to the ever-growing ecosystem of pretrained models. + +The skills you've developed here - data preprocessing, training configuration, evaluation, and optimization - are fundamental to any machine learning project. Whether you're working on text classification, named entity recognition, question answering, or any other NLP task, these techniques will serve you well. 
+ + + +💡 **Pro Tips for Success**: +- Always start with a strong baseline using the `Trainer` API before implementing custom training loops +- Use the 🤗 Hub to find pretrained models that are close to your task for better starting points +- Monitor your training with proper evaluation metrics and don't forget to save checkpoints +- Leverage the community - share your models and datasets to help others and get feedback on your work + + diff --git a/chapters/en/chapter3/6.mdx b/chapters/en/chapter3/6.mdx index 89d131b58..ee107eaf8 100644 --- a/chapters/en/chapter3/6.mdx +++ b/chapters/en/chapter3/6.mdx @@ -1,301 +1,419 @@ - +# Understanding Learning Curves[[understanding-learning-curves]] - + -# End-of-chapter quiz[[end-of-chapter-quiz]] +Now that you've learned how to implement fine-tuning using both the `Trainer` API and custom training loops, it's crucial to understand how to interpret the results. Learning curves are invaluable tools that help you evaluate your model's performance during training and identify potential issues before they reduce performance. - +In this section, we'll explore how to read and interpret accuracy and loss curves, understand what different curve shapes tell us about our model's behavior, and learn how to address common training issues. -Test what you learned in this chapter! +## What are Learning Curves?[[what-are-learning-curves]] -### 1. The `emotion` dataset contains Twitter messages labeled with emotions. Search for it in the [Hub](https://huggingface.co/datasets), and read the dataset card. Which of these is not one of its basic emotions? +Learning curves are visual representations of your model's performance metrics over time during training. The two most important curves to monitor are: - +- **Loss curves**: Show how the model's error (loss) changes over training steps or epochs +- **Accuracy curves**: Show the percentage of correct predictions over training steps or epochs -### 2. Search for the `ar_sarcasm` dataset in the [Hub](https://huggingface.co/datasets). Which task does it support? +These curves help us understand whether our model is learning effectively and can guide us in making adjustments to improve performance. In Transformers, these metrics are individually computed for each batch and then logged to the disk. We can then use libraries like [Weights & Biases](https://wandb.ai/) to visualize these curves and track our model's performance over time. -dataset card!" - }, - { - text: "Named entity recognition", - explain: "That's not it — take another look at the dataset card!" - }, - { - text: "Question answering", - explain: "Alas, this question was not answered correctly. Try again!" - } - ]} -/> +### Loss Curves[[loss-curves]] -### 3. How does the BERT model expect a pair of sentences to be processed? +The loss curve shows how the model's error decreases over time. In a typical successful training run, you'll see a curve similar to the one below: -[SEP] special token is needed to separate the two sentences, but that's not the only thing!" - }, - { - text: "[CLS] Tokens_of_sentence_1 Tokens_of_sentence_2", - explain: "A [CLS] special token is required at the beginning, but that's not the only thing!" - }, - { - text: "[CLS] Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2 [SEP]", - explain: "That's correct!", - correct: true - }, - { - text: "[CLS] Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2", - explain: "A [CLS] special token is needed at the beginning as well as a [SEP] special token to separate the two sentences, but that's not all!" 
- } - ]} -/> +![Loss Curve](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/1.png) -{#if fw === 'pt'} -### 4. What are the benefits of the `Dataset.map()` method? +- **High initial loss**: The model starts without optimization, so predictions are initially poor +- **Decreasing loss**: As training progresses, the loss should generally decrease +- **Convergence**: Eventually, the loss stabilizes at a low value, indicating that the model has learned the patterns in the data - +As in previous chapters, we can use the `Trainer` API to track these metrics and visualize them in a dashboard. Below is an example of how to do this with Weights & Biases. -### 5. What does dynamic padding mean? +```python +# Example of tracking loss during training with the Trainer +from transformers import Trainer, TrainingArguments +import wandb - +# Initialize Weights & Biases for experiment tracking +wandb.init(project="transformer-fine-tuning", name="bert-mrpc-analysis") -### 6. What is the purpose of a collate function? +training_args = TrainingArguments( + output_dir="./results", + eval_strategy="steps", + eval_steps=50, + save_steps=100, + logging_steps=10, # Log metrics every 10 steps + num_train_epochs=3, + per_device_train_batch_size=16, + per_device_eval_batch_size=16, + report_to="wandb", # Send logs to Weights & Biases +) -DataCollatorWithPadding specifically." - }, - { - text: "It puts together all the samples in a batch.", - explain: "Correct! You can pass the collate function as an argument of a DataLoader. We used the DataCollatorWithPadding function, which pads all items in a batch so they have the same length.", - correct: true - }, - { - text: "It preprocesses the whole dataset.", - explain: "That would be a preprocessing function, not a collate function." - }, - { - text: "It truncates the sequences in the dataset.", - explain: "A collate function is involved in handling individual batches, not the whole dataset. If you're interested in truncating, you can use the truncate argument of tokenizer." - } - ]} -/> +trainer = Trainer( + model=model, + args=training_args, + train_dataset=tokenized_datasets["train"], + eval_dataset=tokenized_datasets["validation"], + data_collator=data_collator, + processing_class=tokenizer, + compute_metrics=compute_metrics, +) -### 7. What happens when you instantiate one of the `AutoModelForXxx` classes with a pretrained language model (such as `bert-base-uncased`) that corresponds to a different task than the one for which it was trained? +# Train and automatically log metrics +trainer.train() +``` -AutoModelForSequenceClassification with bert-base-uncased, we got warnings when instantiating the model. The pretrained head is not used for the sequence classification task, so it's discarded and a new head is instantiated with random weights.", - correct: true - }, - { - text: "The head of the pretrained model is discarded.", - explain: "Something else needs to happen. Try again!" - }, - { - text: "Nothing, since the model can still be fine-tuned for the different task.", - explain: "The head of the pretrained model was not trained to solve this task, so we should discard the head!" - } - ]} -/> +### Accuracy Curves[[accuracy-curves]] + +The accuracy curve shows the percentage of correct predictions over time. Unlike loss curves, accuracy curves should generally increase as the model learns and can typically include more steps than the loss curve. 
+ +![Accuracy Curve](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/2.png) + +- **Start low**: Initial accuracy should be low, as the model has not yet learned the patterns in the data +- **Increase with training**: Accuracy should generally improve as the model learns if it is able to learn the patterns in the data +- **May show plateaus**: Accuracy often increases in discrete jumps rather than smoothly, as the model makes predictions that are close to the true labels + + + +💡 **Why Accuracy Curves Are "Steppy"**: Unlike loss, which is continuous, accuracy is calculated by comparing discrete predictions to true labels. Small improvements in model confidence might not change the final prediction, causing accuracy to remain flat until a threshold is crossed. + + + +### Convergence[[convergence]] + +Convergence occurs when the model's performance stabilizes and the loss and accuracy curves level off. This is a sign that the model has learned the patterns in the data and is ready to be used. In simple terms, we are aiming for the model to converge to a stable performance every time we train it. + +![Convergence](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/4.png) + +Once models have converged, we can use them to make predictions on new data and refer to evaluation metrics to understand how well the model is performing. + +## Interpreting Learning Curve Patterns[[interpreting-learning-curve-patterns]] + +Different curve shapes reveal different aspects of your model's training. Let's examine the most common patterns and what they mean. + +### Healthy Learning Curves[[healthy-learning-curves]] + +A well-behaved training run typically shows curve shapes similar to the one below: + +![Healthy Loss Curve](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/5.png) + +Let's look at the illustration above. It displays both the loss curve (on the left) and the corresponding accuracy curve (on the right). These curves have distinct characteristics. + +The loss curve shows the value of the model's loss over time. Initially, the loss is high and then it gradually decreases, indicating that the model is improving. A decrease in the loss value suggests that the model is making better predictions, as the loss represents the error between the predicted output and the true output. + +Now let's shift our focus to the accuracy curve. It represents the model's accuracy over time. The accuracy curve begins at a low value and increases as training progresses. Accuracy measures the proportion of correctly classified instances. So, as the accuracy curve rises, it signifies that the model is making more correct predictions. + +One notable difference between the curves is the smoothness and the presence of "plateaus" on the accuracy curve. While the loss decreases smoothly, the plateaus on the accuracy curve indicate discrete jumps in accuracy instead of a continuous increase. This behavior is attributed to how accuracy is measured. The loss can improve if the model's output gets closer to the target, even if the final prediction is still incorrect. Accuracy, however, only improves when the prediction crosses the threshold to be correct. + +For example, in a binary classifier distinguishing cats (0) from dogs (1), if the model predicts 0.3 for an image of a dog (true value 1), this is rounded to 0 and is an incorrect classification. 
If in the next step it predicts 0.4, it's still incorrect. The loss will have decreased because 0.4 is closer to 1 than 0.3, but the accuracy remains unchanged, creating a plateau. The accuracy will only jump up when the model predicts a value greater than 0.5 that gets rounded to 1. + + + +**Characteristics of healthy curves:** +- **Smooth decline in loss**: Both training and validation loss decrease steadily +- **Close training/validation performance**: Small gap between training and validation metrics +- **Convergence**: Curves level off, indicating the model has learned the patterns + + + +### Practical Examples[[practical-examples]] + +Let's work through some practical examples of learning curves. First, we will highlight some approaches to monitor the learning curves during training. Below, we will break down the different patterns that can be observed in the learning curves. + +#### During Training[[during-training]] + +During the training process (after you've hit `trainer.train()`), you can monitor these key indicators: + +1. **Loss convergence**: Is the loss still decreasing or has it plateaued? +2. **Overfitting signs**: Is validation loss starting to increase while training loss decreases? +3. **Learning rate**: Are the curves too erratic (LR too high) or too flat (LR too low)? +4. **Stability**: Are there sudden spikes or drops that indicate problems? + +#### After Training[[after-training]] + +After the training process is complete, you can analyze the complete curves to understand the model's performance. + +1. **Final performance**: Did the model reach acceptable performance levels? +2. **Efficiency**: Could the same performance be achieved with fewer epochs? +3. **Generalization**: How close are training and validation performance? +4. **Trends**: Would additional training likely improve performance? + + + +🔍 **W&B Dashboard Features**: Weights & Biases automatically creates beautiful, interactive plots of your learning curves. You can: +- Compare multiple runs side by side +- Add custom metrics and visualizations +- Set up alerts for anomalous behavior +- Share results with your team + +Learn more in the [Weights & Biases documentation](https://docs.wandb.ai/). + + +#### Overfitting[[overfitting]] + +Overfitting occurs when the model learns too much from the training data and is unable to generalize to different data (represented by the validation set). -### 8. What's the purpose of `TrainingArguments`? +![Overfitting](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/10.png) + +**Symptoms:** + +- Training loss continues to decrease while validation loss increases or plateaus +- Large gap between training and validation accuracy +- Training accuracy much higher than validation accuracy + +**Solutions for overfitting:** +- **Regularization**: Add dropout, weight decay, or other regularization techniques +- **Early stopping**: Stop training when validation performance stops improving +- **Data augmentation**: Increase training data diversity +- **Reduce model complexity**: Use a smaller model or fewer parameters + +In the sample below, we use early stopping to prevent overfitting. We set the `early_stopping_patience` to 3, which means that if the validation loss does not improve for 3 consecutive epochs, the training will be stopped. 
+ +```python +# Example of detecting overfitting with early stopping +from transformers import EarlyStoppingCallback + +training_args = TrainingArguments( + output_dir="./results", + eval_strategy="steps", + eval_steps=100, + save_strategy="steps", + save_steps=100, + load_best_model_at_end=True, + metric_for_best_model="eval_loss", + greater_is_better=False, + num_train_epochs=10, # Set high, but we'll stop early +) + +# Add early stopping to prevent overfitting +trainer = Trainer( + model=model, + args=training_args, + train_dataset=tokenized_datasets["train"], + eval_dataset=tokenized_datasets["validation"], + data_collator=data_collator, + processing_class=tokenizer, + compute_metrics=compute_metrics, + callbacks=[EarlyStoppingCallback(early_stopping_patience=3)], +) +``` + +#### 2. Underfitting[[underfitting]] + +Underfitting occurs when the model is too simple to capture the underlying patterns in the data. This can happen for several reasons: + +- The model is too small or lacks capacity to learn the patterns +- The learning rate is too low, causing slow learning +- The dataset is too small or not representative of the problem +- The model is not properly regularized + +![Underfitting](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/7.png) + +**Symptoms:** +- Both training and validation loss remain high +- Model performance plateaus early in training +- Training accuracy is lower than expected + +**Solutions for underfitting:** +- **Increase model capacity**: Use a larger model or more parameters +- **Train longer**: Increase the number of epochs +- **Adjust learning rate**: Try different learning rates +- **Check data quality**: Ensure your data is properly preprocessed + +In the sample below, we train for more epochs to see if the model can learn the patterns in the data. + +```python +from transformers import TrainingArguments + +training_args = TrainingArguments( + output_dir="./results", + -num_train_epochs=5, + +num_train_epochs=10, +) +``` + +#### 3. Erratic Learning Curves[[erratic-learning-curves]] + +Erratic learning curves occur when the model is not learning effectively. This can happen for several reasons: + +- The learning rate is too high, causing the model to overshoot the optimal parameters +- The batch size is too small, causing the model to learn slowly +- The model is not properly regularized, causing it to overfit to the training data +- The dataset is not properly preprocessed, causing the model to learn from noise + +![Erratic Learning Curves](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/3.png) + +**Symptoms:** +- Frequent fluctuations in loss or accuracy +- Curves show high variance or instability +- Performance oscillates without clear trend + +Both training and validation curves show erratic behavior. + +![Erratic Learning Curves](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter3/9.png) + +**Solutions for erratic curves:** +- **Lower learning rate**: Reduce step size for more stable training +- **Increase batch size**: Larger batches provide more stable gradients +- **Gradient clipping**: Prevent exploding gradients +- **Better data preprocessing**: Ensure consistent data quality + +In the sample below, we lower the learning rate and increase the batch size. 
+ +```python +from transformers import TrainingArguments + +training_args = TrainingArguments( + output_dir="./results", + -learning_rate=1e-5, + +learning_rate=1e-4, + -per_device_train_batch_size=16, + +per_device_train_batch_size=32, +) +``` + +## Key Takeaways[[key-takeaways]] + +Understanding learning curves is crucial for becoming an effective machine learning practitioner. These visual tools provide immediate feedback about your model's training progress and help you make informed decisions about when to stop training, adjust hyperparameters, or try different approaches. With practice, you'll develop an intuitive understanding of what healthy learning curves look like and how to address issues when they arise. + + + +💡 **Key Takeaways:** +- Learning curves are essential tools for understanding model training progress +- Monitor both loss and accuracy curves, but remember they have different characteristics +- Overfitting shows as diverging training/validation performance +- Underfitting shows as poor performance on both training and validation data +- Tools like Weights & Biases make it easy to track and analyze learning curves +- Early stopping and proper regularization can address most common training issues + +🔬 **Next Steps**: Practice analyzing learning curves on your own fine-tuning experiments. Try different hyperparameters and observe how they affect the curve shapes. This hands-on experience is the best way to develop intuition for reading training progress. + + + +## Section Quiz[[section-quiz]] + +Test your understanding of learning curves and training analysis: + +### 1. What does it typically mean when training loss decreases but validation loss starts increasing? Trainer.", - explain: "Correct!", - correct: true + text: "The model is learning successfully and will continue to improve.", + explain: "If validation loss is increasing while training loss decreases, this indicates a problem, not success." }, { - text: "It specifies the size of the model.", - explain: "The model size is defined by the model configuration, not the class TrainingArguments." + text: "The model is overfitting to the training data.", + explain: "Correct! This is a classic sign of overfitting - the model performs well on training data but poorly on unseen validation data.", + correct: true }, { - text: "It just contains the hyperparameters used for evaluation.", - explain: "In the example, we specified where the model and its checkpoints will be saved. Try again!" + text: "The learning rate is too low.", + explain: "A low learning rate would cause slow learning, not the divergence between training and validation performance." }, { - text: "It just contains the hyperparameters used for training.", - explain: "In the example, we used an evaluation_strategy as well, so this impacts evaluation. Try again!" + text: "The dataset is too small.", + explain: "While small datasets can contribute to overfitting, this specific pattern is the definition of overfitting regardless of dataset size." } ]} /> -### 9. Why should you use the 🤗 Accelerate library? +### 2. Why do accuracy curves often show a "steppy" or plateau-like pattern rather than smooth increases? Trainer, not the 🤗 Accelerate library. Try again!" + text: "Accuracy is a discrete metric that only changes when predictions cross decision boundaries.", + explain: "Correct! 
Unlike loss, accuracy depends on discrete prediction decisions, so small improvements in confidence may not change the final accuracy until a threshold is crossed.", + correct: true }, { - text: "It makes our training loops work on distributed strategies.", - explain: "Correct! With 🤗 Accelerate, your training loops will work for multiple GPUs and TPUs.", - correct: true + text: "The model is not learning effectively.", + explain: "Steppy accuracy curves are normal even when the model is learning well." }, { - text: "It provides more optimization functions.", - explain: "No, the 🤗 Accelerate library does not provide any optimization functions." + text: "The batch size is too small.", + explain: "Batch size affects training stability but doesn't explain the inherently discrete nature of accuracy metrics." } ]} /> -{:else} -### 4. What happens when you instantiate one of the `TFAutoModelForXxx` classes with a pretrained language model (such as `bert-base-uncased`) that corresponds to a different task than the one for which it was trained? +### 3. What is the best approach when you observe erratic, highly fluctuating learning curves? TFAutoModelForSequenceClassification with bert-base-uncased, we got warnings when instantiating the model. The pretrained head is not used for the sequence classification task, so it's discarded and a new head is instantiated with random weights.", + text: "Reduce the learning rate and possibly increase the batch size.", + explain: "Correct! Lower learning rates and larger batch sizes typically lead to more stable training.", correct: true }, { - text: "The head of the pretrained model is discarded.", - explain: "Something else needs to happen. Try again!" + text: "Stop training immediately as the model won't improve.", + explain: "Erratic curves can often be fixed with hyperparameter adjustments." }, { - text: "Nothing, since the model can still be fine-tuned for the different task.", - explain: "The head of the pretrained model was not trained to solve this task, so we should discard the head!" + text: "Switch to a completely different model architecture.", + explain: "This is premature - erratic curves are usually fixable with hyperparameter tuning." } ]} /> -### 5. The TensorFlow models from `transformers` are already Keras models. What benefit does this offer? +### 4. When should you consider using early stopping? TPUStrategy scope, including the initialization of the model." + text: "Always, as it prevents any form of overfitting.", + explain: "Early stopping is useful but not always necessary, especially if other regularization methods are working." }, { - text: "You can leverage existing methods such as compile(), fit(), and predict().", - explain: "Correct! Once you have the data, training on it requires very little work.", + text: "When validation performance stops improving or starts degrading.", + explain: "Correct! Early stopping helps prevent overfitting by stopping training when the model no longer generalizes better.", correct: true }, { - text: "You get to learn Keras as well as transformers.", - explain: "Correct, but we're looking for something else :)", - correct: true + text: "Only when training loss is still decreasing rapidly.", + explain: "If training loss is decreasing rapidly and validation performance is good, you might want to continue training." }, { - text: "You can easily compute metrics related to the dataset.", - explain: "Keras helps us with training and evaluating the model, not computing dataset-related metrics." 
+ text: "Never, as it prevents the model from reaching its full potential.", + explain: "Early stopping is a valuable technique that often improves final model performance by preventing overfitting." } ]} /> -### 6. How can you define your own custom metric? +### 5. What indicates that your model might be underfitting? tf.keras.metrics.Metric.", - explain: "Great!", - correct: true + text: "Training accuracy is much higher than validation accuracy.", + explain: "This describes overfitting, not underfitting." }, { - text: "Using the Keras functional API.", - explain: "Try again!" + text: "Both training and validation performance are poor and plateau early.", + explain: "Correct! Underfitting occurs when the model lacks capacity to learn the patterns, resulting in poor performance on both training and validation data.", + correct: true }, { - text: "By using a callable with signature metric_fn(y_true, y_pred).", - explain: "Correct!", - correct: true + text: "The learning curves are very smooth with no fluctuations.", + explain: "Smooth curves are generally good and don't indicate underfitting." }, { - text: "By Googling it.", - explain: "That's not the answer we're looking for, but it should help you find it.", - correct: true + text: "Validation loss is decreasing faster than training loss.", + explain: "This would actually be a positive sign, not a problem." } ]} /> -{/if} diff --git a/chapters/en/chapter3/7.mdx b/chapters/en/chapter3/7.mdx new file mode 100644 index 000000000..f6bbaaba1 --- /dev/null +++ b/chapters/en/chapter3/7.mdx @@ -0,0 +1,268 @@ + + +# End-of-chapter quiz[[end-of-chapter-quiz]] + + + +Test what you learned in this chapter! + +### 1. The emotion dataset contains Twitter messages labeled with emotions. Search for it in the [Hub](https://huggingface.co/datasets), and read the dataset card. Which of these is not one of its basic emotions? + + + +### 2. Search for the ar_sarcasm dataset in the [Hub](https://huggingface.co/datasets). Which task does it support? + +dataset card!" + }, + { + text: "Named entity recognition", + explain: "That's not it — take another look at the dataset card!" + }, + { + text: "Question answering", + explain: "Alas, this question was not answered correctly. Try again!" + } + ]} +/> + +### 3. How does the BERT model expect a pair of sentences to be processed? + +[SEP] special token is needed to separate the two sentences, but that's not the only thing!" + }, + { + text: "[CLS] Tokens_of_sentence_1 Tokens_of_sentence_2", + explain: "A [CLS] special token is required at the beginning, but that's not the only thing!" + }, + { + text: "[CLS] Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2 [SEP]", + explain: "That's correct!", + correct: true + }, + { + text: "[CLS] Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2", + explain: "A [CLS] special token is needed at the beginning as well as a [SEP] special token to separate the two sentences, but that's not all!" + } + ]} +/> + +### 4. What are the benefits of the Dataset.map() method? + + + +### 5. What does dynamic padding mean? + + + +### 6. What is the purpose of a collate function? + +DataCollatorWithPadding specifically." + }, + { + text: "It puts together all the samples in a batch.", + explain: "You can pass the collate function as an argument of a DataLoader. 
We used the DataCollatorWithPadding function, which pads all items in a batch so they have the same length.", + correct: true + }, + { + text: "It preprocesses the whole dataset.", + explain: "That would be a preprocessing function, not a collate function." + }, + { + text: "It truncates the sequences in the dataset.", + explain: "A collate function is involved in handling individual batches, not the whole dataset. If you're interested in truncating, you can use the truncate argument of tokenizer." + } + ]} +/> + +### 7. What happens when you instantiate one of the AutoModelForXxx classes with a pretrained language model (such as bert-base-uncased) that corresponds to a different task than the one for which it was trained? + +AutoModelForSequenceClassification with bert-base-uncased, we got warnings when instantiating the model. The pretrained head is not used for the sequence classification task, so it's discarded and a new head is instantiated with random weights.", + correct: true + }, + { + text: "The head of the pretrained model is discarded.", + explain: "Something else needs to happen. Try again!" + }, + { + text: "Nothing, since the model can still be fine-tuned for the different task.", + explain: "The head of the pretrained model was not trained to solve this task, so we should discard the head!" + } + ]} +/> + +### 8. What's the purpose of TrainingArguments? + +Trainer.", + explain: "Nice one!", + correct: true + }, + { + text: "It specifies the size of the model.", + explain: "The model size is defined by the model configuration, not the class TrainingArguments." + }, + { + text: "It just contains the hyperparameters used for evaluation.", + explain: "In the example, we specified where the model and its checkpoints will be saved. Try again!" + }, + { + text: "It just contains the hyperparameters used for training.", + explain: "In the example, we used an evaluation_strategy as well, so this impacts evaluation. Try again!" + } + ]} +/> + +### 9. Why should you use the 🤗 Accelerate library? + +Trainer, not the 🤗 Accelerate library. Try again!" + }, + { + text: "It makes our training loops work on distributed strategies.", + explain: "With 🤗 Accelerate, your training loops will work for multiple GPUs and TPUs.", + correct: true + }, + { + text: "It provides more optimization functions.", + explain: "No, the 🤗 Accelerate library does not provide any optimization functions." + } + ]} +/> + +### 10. What is the purpose of the processing_class parameter in the Trainer? + + + +### 11. Which modern optimization technique can help with memory efficiency during training? + +