|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Structured Generation from Documents Using Vision Language Models\n", |
| 8 | + "\n", |
| 9 | + "We will be using the SmolVLM-500M-Instruct model from HuggingFaceTB to extract structured information from documents. We will do so using the HuggingFace Transformers library and the Outlines library, which facilitates structured generation based on limiting token sampling probabilities. We will also use the Gradio library to create a simple UI for uploading and extracting structured information from documents.\n", |
| 10 | + "\n", |
| 11 | + "## Dependencies and imports\n", |
| 12 | + "\n", |
| 13 | + "First, let's install the necessary libraries." |
| 14 | + ] |
| 15 | + }, |
| 16 | + { |
| 17 | + "cell_type": "code", |
| 18 | + "execution_count": null, |
| 19 | + "metadata": {}, |
| 20 | + "outputs": [], |
| 21 | + "source": [ |
| 22 | + "!pip install outlines transformers torch flash-attn outlines datasets sentencepiece gradio" |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "markdown", |
| 27 | + "metadata": {}, |
| 28 | + "source": [ |
| 29 | + "Let's continue with importing the necessary libraries." |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "code", |
| 34 | + "execution_count": 1, |
| 35 | + "metadata": {}, |
| 36 | + "outputs": [], |
| 37 | + "source": [ |
| 38 | + "import outlines\n", |
| 39 | + "import torch\n", |
| 40 | + "\n", |
| 41 | + "from io import BytesIO\n", |
| 42 | + "from urllib.request import urlopen\n", |
| 43 | + "from PIL import Image\n", |
| 44 | + "from outlines.models.transformers_vision import transformers_vision\n", |
| 45 | + "from transformers import AutoModelForImageTextToText, AutoProcessor\n", |
| 46 | + "from pydantic import BaseModel, Field\n", |
| 47 | + "from typing import List\n", |
| 48 | + "from enum import StrEnum" |
| 49 | + ] |
| 50 | + }, |
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "metadata": {}, |
| 54 | + "source": [ |
| 55 | + "## Initialising our model\n", |
| 56 | + "\n", |
| 57 | + "We will start by initialising our model from [HuggingFaceTB/SmolVLM-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct). Outlines expects us to pass in a model class and processor class, so we will make this example a bit more generic by creating a function that returns those. Alternatively, you could look at the model and tokenizer config within the [Hub repo files](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/tree/main), and import those classes directly." |
| 58 | + ] |
| 59 | + }, |
| 60 | + { |
| 61 | + "cell_type": "code", |
| 62 | + "execution_count": null, |
| 63 | + "metadata": {}, |
| 64 | + "outputs": [], |
| 65 | + "source": [ |
| 66 | + "model_name = \"HuggingFaceTB/SmolVLM-Instruct\" # original magnet model is able to be loaded without issue\n", |
| 67 | + "\n", |
| 68 | + "\n", |
| 69 | + "def get_model_and_processor_class(model_name: str):\n", |
| 70 | + " model = AutoModelForImageTextToText.from_pretrained(model_name)\n", |
| 71 | + " processor = AutoProcessor.from_pretrained(model_name)\n", |
| 72 | + " classes = model.__class__, processor.__class__\n", |
| 73 | + " del model, processor\n", |
| 74 | + " return classes\n", |
| 75 | + "\n", |
| 76 | + "\n", |
| 77 | + "model_class, processor_class = get_model_and_processor_class(model_name)\n", |
| 78 | + "\n", |
| 79 | + "if torch.cuda.is_available():\n", |
| 80 | + " device = \"cuda\"\n", |
| 81 | + "elif torch.backends.mps.is_available():\n", |
| 82 | + " device = \"mps\"\n", |
| 83 | + "else:\n", |
| 84 | + " device = \"cpu\"\n", |
| 85 | + "\n", |
| 86 | + "model = transformers_vision(\n", |
| 87 | + " model_name,\n", |
| 88 | + " model_class=model_class,\n", |
| 89 | + " device=device,\n", |
| 90 | + " model_kwargs={\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"},\n", |
| 91 | + " processor_kwargs={\"device\": device},\n", |
| 92 | + " processor_class=processor_class,\n", |
| 93 | + ")\n", |
| 94 | + "model" |
| 95 | + ] |
| 96 | + }, |
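|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "As mentioned above, you could instead look up the classes in the Hub repo files and import them directly. A minimal sketch of that alternative is shown below; it assumes the SmolVLM-Instruct config lists an Idefics3-based architecture, so check the repo files and adjust the class names if they differ:\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "from transformers import AutoProcessor, Idefics3ForConditionalGeneration\n", |
|  | + "\n", |
|  | + "# Same setup as above, but with explicitly imported classes.\n", |
|  | + "model = transformers_vision(\n", |
|  | + "    model_name,\n", |
|  | + "    model_class=Idefics3ForConditionalGeneration,\n", |
|  | + "    device=device,\n", |
|  | + "    model_kwargs={\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"},\n", |
|  | + "    processor_kwargs={\"device\": device},\n", |
|  | + "    processor_class=AutoProcessor,\n", |
|  | + ")\n", |
|  | + "```" |
|  | + ] |
|  | + }, |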
| 97 | + { |
| 98 | + "cell_type": "markdown", |
| 99 | + "metadata": {}, |
| 100 | + "source": [ |
| 101 | + "Now, we are going to define a function that will define how the output of our model will be structured. We will want to extract Tags for object in the image along with a string and a confidence score." |
| 102 | + ] |
| 103 | + }, |
| 104 | + { |
| 105 | + "cell_type": "code", |
| 106 | + "execution_count": 93, |
| 107 | + "metadata": {}, |
| 108 | + "outputs": [], |
| 109 | + "source": [ |
| 110 | + "class TagType(StrEnum):\n", |
| 111 | + " ENTITY = \"Entity\"\n", |
| 112 | + " RELATIONSHIP = \"Relationship\"\n", |
| 113 | + " STYLE = \"Style\"\n", |
| 114 | + " ATTRIBUTE = \"Attribute\"\n", |
| 115 | + " COMPOSITION = \"Composition\"\n", |
| 116 | + " CONTEXTUAL = \"Contextual\"\n", |
| 117 | + " TECHNICAL = \"Technical\"\n", |
| 118 | + " SEMANTIC = \"Semantic\"\n", |
| 119 | + "\n", |
| 120 | + "class ImageTag(BaseModel):\n", |
| 121 | + " tag_name: str\n", |
| 122 | + " tag_description: str\n", |
| 123 | + " tag_type: TagType\n", |
| 124 | + " confidence_score: float\n", |
| 125 | + "\n", |
| 126 | + "\n", |
| 127 | + "class ImageData(BaseModel):\n", |
| 128 | + " tags_list: List[ImageTag] = Field(min_items=1)\n", |
| 129 | + " short_caption: str\n", |
| 130 | + "\n", |
| 131 | + "\n", |
| 132 | + "image_objects_generator = outlines.generate.json(model, ImageData)" |
| 133 | + ] |
| 134 | + }, |
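|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "To see exactly which structure the generator will be constrained to, you can inspect the JSON schema derived from the Pydantic model. A quick check, assuming the Pydantic v2 API (`model_json_schema`):\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "import json\n", |
|  | + "\n", |
|  | + "# Print the JSON schema that Outlines will enforce during generation.\n", |
|  | + "print(json.dumps(ImageData.model_json_schema(), indent=2))\n", |
|  | + "```" |
|  | + ] |
|  | + }, |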
| 135 | + { |
| 136 | + "cell_type": "markdown", |
| 137 | + "metadata": {}, |
| 138 | + "source": [ |
| 139 | + "Now, let's come up with an extraction prompt. We will want to extract Tags for object in the image along with a string and a confidence score and provide some guidance to the model about the different tags and structrue." |
| 140 | + ] |
| 141 | + }, |
| 142 | + { |
| 143 | + "cell_type": "code", |
| 144 | + "execution_count": 96, |
| 145 | + "metadata": {}, |
| 146 | + "outputs": [], |
| 147 | + "source": [ |
| 148 | + "prompt = \"\"\"\n", |
| 149 | + "You are a structured image analysis assitant. Generate comprehensive tag list for an image classification system. Use at least 1 tag per type. Return the results as a valid JSON object.\n", |
| 150 | + "\"\"\".strip()" |
| 151 | + ] |
| 152 | + }, |
| 153 | + { |
| 154 | + "cell_type": "code", |
| 155 | + "execution_count": 95, |
| 156 | + "metadata": {}, |
| 157 | + "outputs": [ |
| 158 | + { |
| 159 | + "data": { |
| 160 | + "text/plain": [ |
| 161 | + "ImageData(tags_list=[ImageTag(tag_name='spacecraft', tag_description='You are an EVA astronaut standing on the moon', tag_type=<TagType.STYLE: 'Style'>, confidence_score=0.9471130702150571), ImageTag(tag_name='tire track', tag_description='You think tike this used to lead your way here', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=1.0), ImageTag(tag_name='space helmet', tag_description='Ozone spacesuit with white metal visor', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=0.9737292349276361), ImageTag(tag_name='space suit', tag_description='White Astronaut', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=0.9749979480665247), ImageTag(tag_name='astronaut', tag_description='Astronaut', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=0.8412833526756263)], short_caption=\"An astronaut from space sits on the lunar surface at around 200 feet below him, over a tan lunar ground with bays leading to his original path and some rocks oncrete having a shiny armor. Both left and right have a sphere that is used for eyes and protection. Left is wearing a baseball with playing field across, and other articles, the heavy one having a shiny metal visor drum on top. The astronaut's grin can be seen over the helmet as he comes out with his right arm out of the sat gadget and leaves it as leaving the shining metal bars as he is from the center of the image.\")" |
| 162 | + ] |
| 163 | + }, |
| 164 | + "execution_count": 95, |
| 165 | + "metadata": {}, |
| 166 | + "output_type": "execute_result" |
| 167 | + } |
| 168 | + ], |
| 169 | + "source": [ |
| 170 | + "def img_from_url(url):\n", |
| 171 | + " img_byte_stream = BytesIO(urlopen(url).read())\n", |
| 172 | + " return Image.open(img_byte_stream).convert(\"RGB\")\n", |
| 173 | + "\n", |
| 174 | + "\n", |
| 175 | + "image_url = (\n", |
| 176 | + " \"https://upload.wikimedia.org/wikipedia/commons/9/98/Aldrin_Apollo_11_original.jpg\"\n", |
| 177 | + ")\n", |
| 178 | + "image = img_from_url(image_url)\n", |
| 179 | + "\n", |
| 180 | + "\n", |
| 181 | + "def extract_objects(image, prompt):\n", |
| 182 | + " messages = [\n", |
| 183 | + " {\n", |
| 184 | + " \"role\": \"user\",\n", |
| 185 | + " \"content\": [{\"type\": \"image\"}, {\"type\": \"text\", \"text\": prompt}],\n", |
| 186 | + " },\n", |
| 187 | + " ]\n", |
| 188 | + "\n", |
| 189 | + " formatted_prompt = model.processor.apply_chat_template(\n", |
| 190 | + " messages, add_generation_prompt=True\n", |
| 191 | + " )\n", |
| 192 | + "\n", |
| 193 | + " result = image_objects_generator(formatted_prompt, [image])\n", |
| 194 | + " return result\n", |
| 195 | + "\n", |
| 196 | + "\n", |
| 197 | + "extract_objects(image, prompt)" |
| 198 | + ] |
| 199 | + }, |
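|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "## Building a simple UI with Gradio\n", |
|  | + "\n", |
|  | + "The introduction mentioned wrapping the extraction in a simple Gradio UI. Below is a minimal sketch of what that could look like; it reuses the `extract_objects` function and `prompt` defined above, and assumes Pydantic v2 for `model_dump`:\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "import gradio as gr\n", |
|  | + "\n", |
|  | + "\n", |
|  | + "def gradio_extract(image):\n", |
|  | + "    # Run the schema-constrained extraction and return a dict for display.\n", |
|  | + "    return extract_objects(image, prompt).model_dump()\n", |
|  | + "\n", |
|  | + "\n", |
|  | + "demo = gr.Interface(\n", |
|  | + "    fn=gradio_extract,\n", |
|  | + "    inputs=gr.Image(type=\"pil\", label=\"Document or image\"),\n", |
|  | + "    outputs=gr.JSON(label=\"Extracted structure\"),\n", |
|  | + "    title=\"Structured extraction with SmolVLM\",\n", |
|  | + ")\n", |
|  | + "\n", |
|  | + "demo.launch()\n", |
|  | + "```" |
|  | + ] |
|  | + }, |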
| 200 | + { |
| 201 | + "cell_type": "markdown", |
| 202 | + "metadata": {}, |
| 203 | + "source": [ |
| 204 | + "## Conclusion\n", |
| 205 | + "\n", |
| 206 | + "We've seen how to extract structured information from documents using a vision language model. We can use similar extractive methods to extract structured information from documents, using somehting like `pdf2image` to convert the document to images and running information extraction on each image pdf of the page.\n", |
| 207 | + "\n", |
| 208 | + "```python\n", |
| 209 | + "pdf_path = \"path/to/your/pdf/file.pdf\"\n", |
| 210 | + "pages = convert_from_path(pdf_path)\n", |
| 211 | + "for page in pages:\n", |
| 212 | + " extract_objects = extract_objects(page, prompt)\n", |
| 213 | + "```\n", |
| 214 | + "\n", |
| 215 | + "## Next Steps\n", |
| 216 | + "\n", |
| 217 | + "- Take a look at the [Outlines](https://github.com/outlines-ai/outlines) library for more information on how to use it. Explore the different methods and parameters.\n", |
| 218 | + "- Explore extraction on your own usecase.\n", |
| 219 | + "- Use a different method of extracting structured information from documents." |
| 220 | + ] |
| 221 | + } |
| 222 | + ], |
| 223 | + "metadata": { |
| 224 | + "kernelspec": { |
| 225 | + "display_name": ".venv", |
| 226 | + "language": "python", |
| 227 | + "name": "python3" |
| 228 | + }, |
| 229 | + "language_info": { |
| 230 | + "codemirror_mode": { |
| 231 | + "name": "ipython", |
| 232 | + "version": 3 |
| 233 | + }, |
| 234 | + "file_extension": ".py", |
| 235 | + "mimetype": "text/x-python", |
| 236 | + "name": "python", |
| 237 | + "nbconvert_exporter": "python", |
| 238 | + "pygments_lexer": "ipython3", |
| 239 | + "version": "3.11.11" |
| 240 | + } |
| 241 | + }, |
| 242 | + "nbformat": 4, |
| 243 | + "nbformat_minor": 2 |
| 244 | +} |