|
6 | 6 | "source": [
|
7 | 7 | "# Structured Generation from Documents Using Vision Language Models\n",
|
8 | 8 | "\n",
|
9 |
| - "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents We will run the VLM using the HuggingFace Transformers library and the Outlines library, which facilitates structured generation based on limiting token sampling probabilities. \n", |
10 |
| - "This approach is based on a [outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/) library.\n", |
| 9 | + "We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents We will run the VLM using the HuggingFace Transformers library and the [Outlines library](https://github.com/dottxt-ai/outlines), which facilitates structured generation based on limiting token sampling probabilities. \n", |
| 10 | + "\n", |
| 11 | + "> This approach is based on a [Outlines tutorial](https://dottxt-ai.github.io/outlines/latest/cookbook/atomic_caption/).\n", |
11 | 12 | "\n",
|
12 | 13 | "## Dependencies and imports\n",
|
13 | 14 | "\n",
|
|
20 | 21 | "metadata": {},
|
21 | 22 | "outputs": [],
|
22 | 23 | "source": [
|
23 |
| - "%pip install accelerate outlines transformers torch flash-attn outlines datasets sentencepiece" |
| 24 | + "%pip install accelerate outlines transformers torch flash-attn datasets sentencepiece" |
24 | 25 | ]
|
25 | 26 | },
|
26 | 27 | {
|
|
39 | 40 | "import outlines\n",
|
40 | 41 | "import torch\n",
|
41 | 42 | "\n",
|
42 |
| - "from io import BytesIO\n", |
43 |
| - "from urllib.request import urlopen\n", |
44 |
| - "from PIL import Image\n", |
| 43 | + "from datasets import load_dataset\n", |
45 | 44 | "from outlines.models.transformers_vision import transformers_vision\n",
|
46 | 45 | "from transformers import AutoModelForImageTextToText, AutoProcessor\n",
|
47 |
| - "from pydantic import BaseModel, Field\n", |
48 |
| - "from typing import List\n", |
49 |
| - "from enum import StrEnum" |
| 46 | + "from pydantic import BaseModel" |
50 | 47 | ]
|
51 | 48 | },
|
52 | 49 | {
|
|
62 | 59 | "cell_type": "code",
|
63 | 60 | "execution_count": null,
|
64 | 61 | "metadata": {},
|
65 |
| - "outputs": [], |
| 62 | + "outputs": [ |
| 63 | + { |
| 64 | + "name": "stderr", |
| 65 | + "output_type": "stream", |
| 66 | + "text": [ |
| 67 | + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n", |
| 68 | + "Some kwargs in processor config are unused and will not have any effect: image_seq_len. \n" |
| 69 | + ] |
| 70 | + } |
| 71 | + ], |
66 | 72 | "source": [
|
67 |
| - "model_name = \"HuggingFaceTB/SmolVLM-Instruct\" # original magnet model is able to be loaded without issue\n", |
| 73 | + "model_name = \"HuggingFaceTB/SmolVLM-Instruct\"\n", |
68 | 74 | "\n",
|
69 | 75 | "\n",
|
70 | 76 | "def get_model_and_processor_class(model_name: str):\n",
|
|
91 | 97 | " model_kwargs={\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"},\n",
|
92 | 98 | " processor_kwargs={\"device\": device},\n",
|
93 | 99 | " processor_class=processor_class,\n",
|
94 |
| - ")\n", |
95 |
| - "model" |
| 100 | + ")" |
96 | 101 | ]
|
97 | 102 | },
|
98 | 103 | {
|
99 | 104 | "cell_type": "markdown",
|
100 | 105 | "metadata": {},
|
101 | 106 | "source": [
|
102 |
| - "Now, we are going to define a function that will define how the output of our model will be structured. We will want to extract Tags for object in the image along with a string and a confidence score." |
| 107 | + "## Structured Generation\n", |
| 108 | + "\n", |
| 109 | + "Now, we are going to define a function that will define how the output of our model will be structured. We will be using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset), which contains a set of images along with questions and their chosen and rejected reponses. This is an okay dataset but we want to create additional text-image-to-text data on top of the images to get our own structured dataset, and potentially fine-tune our model on it. We will use the model to generate a caption, a question and a simple quality tag for the image. " |
103 | 110 | ]
|
104 | 111 | },
|
105 | 112 | {
|
106 | 113 | "cell_type": "code",
|
107 |
| - "execution_count": 93, |
| 114 | + "execution_count": 10, |
108 | 115 | "metadata": {},
|
109 | 116 | "outputs": [],
|
110 | 117 | "source": [
|
111 |
| - "class TagType(StrEnum):\n", |
112 |
| - " ENTITY = \"Entity\"\n", |
113 |
| - " RELATIONSHIP = \"Relationship\"\n", |
114 |
| - " STYLE = \"Style\"\n", |
115 |
| - " ATTRIBUTE = \"Attribute\"\n", |
116 |
| - " COMPOSITION = \"Composition\"\n", |
117 |
| - " CONTEXTUAL = \"Contextual\"\n", |
118 |
| - " TECHNICAL = \"Technical\"\n", |
119 |
| - " SEMANTIC = \"Semantic\"\n", |
120 |
| - "\n", |
121 |
| - "class ImageTag(BaseModel):\n", |
122 |
| - " tag_name: str\n", |
123 |
| - " tag_description: str\n", |
124 |
| - " tag_type: TagType\n", |
125 |
| - " confidence_score: float\n", |
126 |
| - "\n", |
127 |
| - "\n", |
128 | 118 | "class ImageData(BaseModel):\n",
|
129 |
| - " tags_list: List[ImageTag] = Field(min_items=1)\n", |
130 |
| - " short_caption: str\n", |
131 |
| - "\n", |
| 119 | + " quality: str\n", |
| 120 | + " description: str\n", |
| 121 | + " question: str\n", |
132 | 122 | "\n",
|
133 |
| - "image_objects_generator = outlines.generate.json(model, ImageData)" |
| 123 | + "structured_generator = outlines.generate.json(model, ImageData)" |
134 | 124 | ]
|
135 | 125 | },
|
136 | 126 | {
|
137 | 127 | "cell_type": "markdown",
|
138 | 128 | "metadata": {},
|
139 | 129 | "source": [
|
140 |
| - "Now, let's come up with an extraction prompt. We will want to extract Tags for object in the image along with a string and a confidence score and provide some guidance to the model about the different tags and structrue." |
| 130 | + "Now, let's come up with an extraction prompt." |
141 | 131 | ]
|
142 | 132 | },
|
143 | 133 | {
|
144 | 134 | "cell_type": "code",
|
145 |
| - "execution_count": 96, |
| 135 | + "execution_count": 16, |
146 | 136 | "metadata": {},
|
147 | 137 | "outputs": [],
|
148 | 138 | "source": [
|
149 | 139 | "prompt = \"\"\"\n",
|
150 |
| - "You are a structured image analysis assitant. Generate comprehensive tag list for an image classification system. Use at least 1 tag per type. Return the results as a valid JSON object.\n", |
| 140 | + "You are an image analysis assisant.\n", |
| 141 | + "\n", |
| 142 | + "Provide a quality tag, a description and a question.\n", |
| 143 | + "\n", |
| 144 | + "The quality can either be \"good\", \"okay\" or \"bad\".\n", |
| 145 | + "The question should be concise and objective.\n", |
| 146 | + "\n", |
| 147 | + "Return your response as a valid JSON object.\n", |
151 | 148 | "\"\"\".strip()"
|
152 | 149 | ]
|
153 | 150 | },
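|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "As an aside (a sketch only, not used below), the schema itself could enforce the allowed quality values: Outlines honours `typing.Literal` fields on the Pydantic model, so the quality tag can be restricted to exactly \"good\", \"okay\" or \"bad\" instead of relying on the prompt alone:\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "from typing import Literal\n", |
|  | + "\n", |
|  | + "from pydantic import BaseModel\n", |
|  | + "\n", |
|  | + "\n", |
|  | + "class StrictImageData(BaseModel):\n", |
|  | + "    quality: Literal[\"good\", \"okay\", \"bad\"]\n", |
|  | + "    description: str\n", |
|  | + "    question: str\n", |
|  | + "```" |
|  | + ] |
|  | + }, |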
|
| 151 | + { |
| 152 | + "cell_type": "markdown", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "Let's load our image dataset." |
| 156 | + ] |
| 157 | + }, |
154 | 158 | {
|
155 | 159 | "cell_type": "code",
|
156 |
| - "execution_count": 95, |
| 160 | + "execution_count": 15, |
157 | 161 | "metadata": {},
|
158 | 162 | "outputs": [
|
159 | 163 | {
|
160 | 164 | "data": {
|
161 | 165 | "text/plain": [
|
162 |
| - "ImageData(tags_list=[ImageTag(tag_name='spacecraft', tag_description='You are an EVA astronaut standing on the moon', tag_type=<TagType.STYLE: 'Style'>, confidence_score=0.9471130702150571), ImageTag(tag_name='tire track', tag_description='You think tike this used to lead your way here', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=1.0), ImageTag(tag_name='space helmet', tag_description='Ozone spacesuit with white metal visor', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=0.9737292349276361), ImageTag(tag_name='space suit', tag_description='White Astronaut', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=0.9749979480665247), ImageTag(tag_name='astronaut', tag_description='Astronaut', tag_type=<TagType.ENTITY: 'Entity'>, confidence_score=0.8412833526756263)], short_caption=\"An astronaut from space sits on the lunar surface at around 200 feet below him, over a tan lunar ground with bays leading to his original path and some rocks oncrete having a shiny armor. Both left and right have a sphere that is used for eyes and protection. Left is wearing a baseball with playing field across, and other articles, the heavy one having a shiny metal visor drum on top. The astronaut's grin can be seen over the helmet as he comes out with his right arm out of the sat gadget and leaves it as leaving the shining metal bars as he is from the center of the image.\")" |
| 166 | + "Dataset({\n", |
| 167 | + " features: ['ds_name', 'image', 'question', 'chosen', 'rejected', 'origin_dataset', 'origin_split', 'idx', 'image_path'],\n", |
| 168 | + " num_rows: 10\n", |
| 169 | + "})" |
163 | 170 | ]
|
164 | 171 | },
|
165 |
| - "execution_count": 95, |
| 172 | + "execution_count": 15, |
166 | 173 | "metadata": {},
|
167 | 174 | "output_type": "execute_result"
|
168 | 175 | }
|
169 | 176 | ],
|
170 | 177 | "source": [
|
171 |
| - "def img_from_url(url):\n", |
172 |
| - " img_byte_stream = BytesIO(urlopen(url).read())\n", |
173 |
| - " return Image.open(img_byte_stream).convert(\"RGB\")\n", |
174 |
| - "\n", |
175 |
| - "\n", |
176 |
| - "image_url = (\n", |
177 |
| - " \"https://upload.wikimedia.org/wikipedia/commons/9/98/Aldrin_Apollo_11_original.jpg\"\n", |
178 |
| - ")\n", |
179 |
| - "image = img_from_url(image_url)\n", |
180 |
| - "\n", |
181 |
| - "\n", |
182 |
| - "def extract_objects(image, prompt):\n", |
| 178 | + "dataset = load_dataset(\"openbmb/RLAIF-V-Dataset\", split=\"train[:10]\")\n", |
| 179 | + "dataset" |
| 180 | + ] |
| 181 | + }, |
| 182 | + { |
| 183 | + "cell_type": "markdown", |
| 184 | + "metadata": {}, |
| 185 | + "source": [ |
| 186 | + "Now, let's define a function that will extract the structured information from the image. We will format the prompt using the `apply_chat_template` method and pass it to the model along with the image after that." |
| 187 | + ] |
| 188 | + }, |
| 189 | + { |
| 190 | + "cell_type": "code", |
| 191 | + "execution_count": 17, |
| 192 | + "metadata": {}, |
| 193 | + "outputs": [ |
| 194 | + { |
| 195 | + "data": { |
| 196 | + "application/vnd.jupyter.widget-view+json": { |
| 197 | + "model_id": "1caa96c32bc7416ea43c192c0cd88c20", |
| 198 | + "version_major": 2, |
| 199 | + "version_minor": 0 |
| 200 | + }, |
| 201 | + "text/plain": [ |
| 202 | + "Map: 0%| | 0/10 [00:00<?, ? examples/s]" |
| 203 | + ] |
| 204 | + }, |
| 205 | + "metadata": {}, |
| 206 | + "output_type": "display_data" |
| 207 | + }, |
| 208 | + { |
| 209 | + "data": { |
| 210 | + "text/plain": [ |
| 211 | + "Dataset({\n", |
| 212 | + " features: ['ds_name', 'image', 'question', 'chosen', 'rejected', 'origin_dataset', 'origin_split', 'idx', 'image_path', 'synthetic_question', 'synthetic_description', 'synthetic_quality'],\n", |
| 213 | + " num_rows: 10\n", |
| 214 | + "})" |
| 215 | + ] |
| 216 | + }, |
| 217 | + "execution_count": 17, |
| 218 | + "metadata": {}, |
| 219 | + "output_type": "execute_result" |
| 220 | + } |
| 221 | + ], |
| 222 | + "source": [ |
| 223 | + "def extract(row):\n", |
183 | 224 | " messages = [\n",
|
184 | 225 | " {\n",
|
185 | 226 | " \"role\": \"user\",\n",
|
|
191 | 232 | " messages, add_generation_prompt=True\n",
|
192 | 233 | " )\n",
|
193 | 234 | "\n",
|
194 |
| - " result = image_objects_generator(formatted_prompt, [image])\n", |
195 |
| - " return result\n", |
| 235 | + " result = structured_generator(formatted_prompt, [row[\"image\"]])\n", |
| 236 | + " row['synthetic_question'] = result.question\n", |
| 237 | + " row['synthetic_description'] = result.description\n", |
| 238 | + " row['synthetic_quality'] = result.quality\n", |
| 239 | + " return row\n", |
196 | 240 | "\n",
|
197 | 241 | "\n",
|
198 |
| - "extract_objects(image, prompt)" |
| 242 | + "dataset = dataset.map(lambda x: extract(x))\n", |
| 243 | + "dataset" |
| 244 | + ] |
| 245 | + }, |
| 246 | + { |
| 247 | + "cell_type": "markdown", |
| 248 | + "metadata": {}, |
| 249 | + "source": [ |
| 250 | + "Let's now push our new dataset to the Hub." |
| 251 | + ] |
| 252 | + }, |
| 253 | + { |
| 254 | + "cell_type": "code", |
| 255 | + "execution_count": 18, |
| 256 | + "metadata": {}, |
| 257 | + "outputs": [ |
| 258 | + { |
| 259 | + "data": { |
| 260 | + "application/vnd.jupyter.widget-view+json": { |
| 261 | + "model_id": "843b9c88cab54402812f1b936a2dc6e0", |
| 262 | + "version_major": 2, |
| 263 | + "version_minor": 0 |
| 264 | + }, |
| 265 | + "text/plain": [ |
| 266 | + "Uploading the dataset shards: 0%| | 0/1 [00:00<?, ?it/s]" |
| 267 | + ] |
| 268 | + }, |
| 269 | + "metadata": {}, |
| 270 | + "output_type": "display_data" |
| 271 | + }, |
| 272 | + { |
| 273 | + "data": { |
| 274 | + "application/vnd.jupyter.widget-view+json": { |
| 275 | + "model_id": "57e5dea4ae504866b2d93863bcfa4408", |
| 276 | + "version_major": 2, |
| 277 | + "version_minor": 0 |
| 278 | + }, |
| 279 | + "text/plain": [ |
| 280 | + "Map: 0%| | 0/10 [00:00<?, ? examples/s]" |
| 281 | + ] |
| 282 | + }, |
| 283 | + "metadata": {}, |
| 284 | + "output_type": "display_data" |
| 285 | + }, |
| 286 | + { |
| 287 | + "data": { |
| 288 | + "application/vnd.jupyter.widget-view+json": { |
| 289 | + "model_id": "b811febb7c044100bb74bf67016f0d0d", |
| 290 | + "version_major": 2, |
| 291 | + "version_minor": 0 |
| 292 | + }, |
| 293 | + "text/plain": [ |
| 294 | + "Creating parquet from Arrow format: 0%| | 0/1 [00:00<?, ?ba/s]" |
| 295 | + ] |
| 296 | + }, |
| 297 | + "metadata": {}, |
| 298 | + "output_type": "display_data" |
| 299 | + }, |
| 300 | + { |
| 301 | + "data": { |
| 302 | + "application/vnd.jupyter.widget-view+json": { |
| 303 | + "model_id": "1fa44296ea00459b8cbb22e56739117c", |
| 304 | + "version_major": 2, |
| 305 | + "version_minor": 0 |
| 306 | + }, |
| 307 | + "text/plain": [ |
| 308 | + "README.md: 0%| | 0.00/719 [00:00<?, ?B/s]" |
| 309 | + ] |
| 310 | + }, |
| 311 | + "metadata": {}, |
| 312 | + "output_type": "display_data" |
| 313 | + }, |
| 314 | + { |
| 315 | + "data": { |
| 316 | + "text/plain": [ |
| 317 | + "CommitInfo(commit_url='https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset/commit/f72002df2d9aef403afeaf6e27f4407ddd82c89c', commit_message='Upload dataset', commit_description='', oid='f72002df2d9aef403afeaf6e27f4407ddd82c89c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset'), pr_revision=None, pr_num=None)" |
| 318 | + ] |
| 319 | + }, |
| 320 | + "execution_count": 18, |
| 321 | + "metadata": {}, |
| 322 | + "output_type": "execute_result" |
| 323 | + } |
| 324 | + ], |
| 325 | + "source": [ |
| 326 | + "dataset.push_to_hub(\"davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset\", split=\"train\")" |
| 327 | + ] |
| 328 | + }, |
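|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "Once uploaded, the dataset can be reloaded from the Hub by anyone (a quick sketch using the repository id from the cell above):\n", |
|  | + "\n", |
|  | + "```python\n", |
|  | + "from datasets import load_dataset\n", |
|  | + "\n", |
|  | + "dataset = load_dataset(\n", |
|  | + "    \"davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset\",\n", |
|  | + "    split=\"train\",\n", |
|  | + ")\n", |
|  | + "```" |
|  | + ] |
|  | + }, |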
| 329 | + { |
| 330 | + "cell_type": "markdown", |
| 331 | + "metadata": {}, |
| 332 | + "source": [ |
| 333 | + "<iframe\n", |
| 334 | + " src=\"https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset/embed/viewer/default/train?row=3\"\n", |
| 335 | + " frameborder=\"0\"\n", |
| 336 | + " width=\"100%\"\n", |
| 337 | + " height=\"560px\"\n", |
| 338 | + "></iframe>" |
| 339 | + ] |
| 340 | + }, |
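|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "As a quick sanity check, we can print the synthetic fields of a single row (these are the columns added by the `extract` function above):\n" |
|  | + ] |
|  | + }, |
|  | + { |
|  | + "cell_type": "code", |
|  | + "execution_count": null, |
|  | + "metadata": {}, |
|  | + "outputs": [], |
|  | + "source": [ |
|  | + "sample = dataset[0]\n", |
|  | + "print(sample[\"synthetic_quality\"])\n", |
|  | + "print(sample[\"synthetic_description\"])\n", |
|  | + "print(sample[\"synthetic_question\"])" |
|  | + ] |
|  | + }, |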
| 341 | + { |
| 342 | + "cell_type": "markdown", |
| 343 | + "metadata": {}, |
| 344 | + "source": [ |
| 345 | + "The results are not perfect, but they are a good starting point to continue exploring with different models and prompts!" |
199 | 346 | ]
|
200 | 347 | },
|
201 | 348 | {
|
|
216 | 363 | "## Next Steps\n",
|
217 | 364 | "\n",
|
218 | 365 | "- Take a look at the [Outlines](https://github.com/outlines-ai/outlines) library for more information on how to use it. Explore the different methods and parameters.\n",
|
219 |
| - "- Explore extraction on your own usecase.\n", |
| 366 | + "- Explore extraction on your own usecase with your own model.\n", |
220 | 367 | "- Use a different method of extracting structured information from documents."
|
221 | 368 | ]
|
222 | 369 | }
|
|