|
51 | 51 | "id": "gSHmDKNFoqjC"
|
52 | 52 | },
|
53 | 53 | "source": [
|
54 | | - "# 1. Install Dependencies\n",
| 54 | + "## 1. Install Dependencies\n",
55 | 55 | "\n",
|
56 | 56 | "Let's start by installing the essential libraries we'll need for fine-tuning!\n"
|
57 | 57 | ]
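For context, a minimal sketch of what an install cell for this tutorial typically contains, assuming the usual Hugging Face fine-tuning stack (`transformers`, `datasets`, `trl`, `peft`, `bitsandbytes`, `accelerate`, plus `qwen-vl-utils` for Qwen2-VL image handling); the exact packages and version pins used by the notebook are not shown in this hunk:

```python
# Hypothetical install cell: package list assumed, not taken from this diff.
%pip install -U transformers datasets trl peft bitsandbytes accelerate qwen-vl-utils
```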
|
|
180 | 180 | "id": "g9QXwbJ7ovM5"
|
181 | 181 | },
|
182 | 182 | "source": [
|
183 | | - "# 2. Load Dataset\n",
| 183 | + "## 2. Load Dataset\n",
184 | 184 | "\n",
|
185 | 185 | "In this section, we'll load the [HuggingFaceM4/ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) dataset. This dataset contains chart images paired with related questions and answers, making it ideal for training on visual question answering tasks.\n",
|
186 | 186 | "\n",
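A minimal loading sketch with the `datasets` library; the split and column names (`query`, `label`, `image`) are assumed from the dataset card rather than read from this diff:

```python
from datasets import load_dataset

# Download ChartQA from the Hugging Face Hub (splits assumed: train / val / test).
dataset = load_dataset("HuggingFaceM4/ChartQA")
print(dataset)

# Each example pairs a chart image with a question ("query") and reference answer(s) ("label").
sample = dataset["train"][0]
print(sample["query"], sample["label"])
```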
|
|
388 | 388 | "id": "YY1Y_KDtoycB"
|
389 | 389 | },
|
390 | 390 | "source": [
|
391 | | - "# 3. Load Model and Check Performance!\n",
| 391 | + "## 3. Load Model and Check Performance!\n",
392 | 392 | "\n",
|
393 | 393 | "Now that we've loaded the dataset, let's start by loading the model and evaluating its performance using a sample from the dataset. We'll be using [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), a Vision Language Model (VLM) capable of understanding both visual data and text.\n",
|
394 | 394 | "\n",
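A rough sketch of loading the base model and running one untuned prediction, assuming the `Qwen2VLForConditionalGeneration` class from `transformers` and the ChartQA column names mentioned above; the split slice and generation settings are illustrative, not the notebook's exact code:

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"

# Load the base model in half precision and its processor (handles images + text).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Grab one chart/question pair to sanity-check the untuned model (split name assumed).
sample = load_dataset("HuggingFaceM4/ChartQA", split="val[:1]")[0]

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": sample["query"]},
    ]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[sample["image"]], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```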
|
|
1165 | 1165 | "id": "YIZOIVEzQqNg"
|
1166 | 1166 | },
|
1167 | 1167 | "source": [
|
1168 | | - "# 4. Fine-Tune the Model using TRL\n"
| 1168 | + "## 4. Fine-Tune the Model using TRL\n"
1169 | 1169 | ]
|
1170 | 1170 | },
|
1171 | 1171 | {
|
|
1174 | 1174 | "id": "yIrR9gP2z90z"
|
1175 | 1175 | },
|
1176 | 1176 | "source": [
|
1177 | | - "## 4.1 Load the Quantized Model for Training\n",
| 1177 | + "### 4.1 Load the Quantized Model for Training\n",
1178 | 1178 | "\n",
|
1179 | 1179 | "Next, we'll load the quantized model using [bitsandbytes](https://huggingface.co/docs/bitsandbytes/main/en/index). If you want to learn more about quantization, check out [this blog post](https://huggingface.co/blog/merve/quantization) or [this one](https://www.maartengrootendorst.com/blog/quantization/).\n"
|
1180 | 1180 | ]
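A sketch of the 4-bit load with `bitsandbytes`, using a typical NF4 configuration; the specific settings are assumptions, not values read from the notebook:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Typical QLoRA-style 4-bit quantization settings (illustrative values).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```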
|
|
1246 | 1246 | "id": "65wfO29isQlX"
|
1247 | 1247 | },
|
1248 | 1248 | "source": [
|
1249 | | - "## 4.2 Set Up QLoRA and SFTConfig\n",
| 1249 | + "### 4.2 Set Up QLoRA and SFTConfig\n",
1250 | 1250 | "\n",
|
1251 | 1251 | "Next, we will configure [QLoRA](https://github.com/artidoro/qlora) for our training setup. QLoRA enables efficient fine-tuning of large language models while significantly reducing the memory footprint compared to traditional methods. Unlike standard LoRA, which saves memory by training small low-rank adapter matrices on top of a frozen model, QLoRA takes it a step further by also quantizing the frozen base model weights to 4-bit precision while the adapters are trained on top. This leads to even lower memory requirements and improved training efficiency, making it an excellent choice for optimizing our model's performance without sacrificing quality.\n",
|
1252 | 1252 | "\n",
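A sketch of the LoRA adapter and `SFTConfig` setup; every hyperparameter below (rank, target modules, batch size, learning rate, output path) is illustrative rather than taken from this diff, and the two dataset-related flags anticipate the custom image/text collator sketched after section 4.3:

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter configuration (rank and target modules are illustrative).
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Training hyperparameters consumed by TRL's SFTTrainer (values illustrative).
training_args = SFTConfig(
    output_dir="qwen2-vl-7b-chartqa-sft",           # hypothetical output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    gradient_checkpointing=True,
    remove_unused_columns=False,                    # keep the image column for the collator
    dataset_kwargs={"skip_prepare_dataset": True},  # batching is handled by a custom collator
)
```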
|
|
1361 | 1361 | "id": "pOUrD9P-y-Kf"
|
1362 | 1362 | },
|
1363 | 1363 | "source": [
|
1364 | | - "## 4.3 Training the Model"
| 1364 | + "### 4.3 Training the Model"
1365 | 1365 | ]
|
1366 | 1366 | },
|
1367 | 1367 | {
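Since this hunk only renames the heading, here is a hedged sketch of what the training step usually looks like with TRL: a custom collator that turns ChartQA rows into chat-formatted, processed batches, plus an `SFTTrainer` run. It reuses `processor`, `model`, `dataset`, `peft_config`, and `training_args` from the sketches above, and the label masking is simplified (the actual notebook may also mask image placeholder tokens):

```python
from trl import SFTTrainer

def collate_fn(examples):
    """Build chat-formatted text/image batches for Qwen2-VL (simplified sketch)."""
    texts, images = [], []
    for ex in examples:
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": ex["query"]},
            ]},
            # ChartQA's "label" column is assumed to be a list of reference answers.
            {"role": "assistant", "content": [{"type": "text", "text": ex["label"][0]}]},
        ]
        texts.append(processor.apply_chat_template(messages, tokenize=False))
        images.append(ex["image"])

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    # Use the input ids as labels, ignoring padding positions in the loss.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["val"],   # split name assumed
    data_collator=collate_fn,
    peft_config=peft_config,
)

trainer.train()
trainer.save_model(training_args.output_dir)
```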
|
|
1556 | 1556 | "id": "6yx_sGW42dN3"
|
1557 | 1557 | },
|
1558 | 1558 | "source": [
|
1559 | | - "# 5. Testing the Fine-Tuned Model\n",
| 1559 | + "## 5. Testing the Fine-Tuned Model\n",
1560 | 1560 | "\n",
|
1561 | 1561 | "Now that we've successfully fine-tuned our Vision Language Model (VLM), it's time to evaluate its performance! In this section, we will test the model using examples from the ChartQA dataset to see how well it answers questions based on chart images. Let's dive in and explore the results!\n",
|
1562 | 1562 | "\n"
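A small inference sketch for this evaluation step, reusing `model`, `processor`, and `dataset` from the earlier sketches; the sample index and generation length are arbitrary:

```python
import torch

example = dataset["test"][0]  # one held-out chart/question pair

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": example["query"]},
    ]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[example["image"]], return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens and compare against the reference answer.
prediction = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print("Predicted:", prediction)
print("Reference:", example["label"])
```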
|
|
1993 | 1993 | "id": "daUMWw5xxhSc"
|
1994 | 1994 | },
|
1995 | 1995 | "source": [
|
1996 | | - "# 6. Compare Fine-Tuned Model vs. Base Model + Prompting\n",
| 1996 | + "## 6. Compare Fine-Tuned Model vs. Base Model + Prompting\n",
1997 | 1997 | "\n",
|
1998 | 1998 | "We have explored how fine-tuning the VLM can be a valuable option for adapting it to our specific needs. Another approach to consider is directly using prompting or implementing a RAG system, which is covered in another [recipe](https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_vlms).\n",
|
1999 | 1999 | "\n",
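To make the comparison concrete, a hedged sketch of the prompting-only baseline: steer a freshly loaded, non-fine-tuned copy of the model (here called `base_model`, assumed to be loaded as in Section 3) with a hypothetical system prompt instead of training it. The prompt text is invented for illustration, and `example` and `processor` come from the earlier sketches:

```python
SYSTEM_PROMPT = (
    "You are a vision-language assistant specialized in reading charts. "
    "Answer with the shortest value or label that answers the question."
)  # hypothetical instruction, not the notebook's prompt

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": example["query"]},
    ]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[prompt], images=[example["image"]], return_tensors="pt"
).to(base_model.device)

# Generate with the untuned base model and decode only the new tokens.
output = base_model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```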
|
|
2205 | 2205 | "id": "Wgv0-sy8TLPE"
|
2206 | 2206 | },
|
2207 | 2207 | "source": [
|
2208 | | - "# 7. Continuing the Learning Journey\n",
| 2208 | + "## 7. Continuing the Learning Journey\n",
2209 | 2209 | "\n",
|
2210 | 2210 | "To further enhance your understanding and skills in working with multimodal models, check out the following resources:\n",
|
2211 | 2211 | "\n",
|
|