Commit 2fce134

Support field fileData (direct file URL) for GeminiModel and GoogleModel (#1136)
1 parent 4af2463 commit 2fce134

40 files changed: +3256 -165 lines changed

docs/input.md

Lines changed: 21 additions & 8 deletions
@@ -2,6 +2,7 @@
 
 Some LLMs are now capable of understanding audio, video, image and document content.
 
+
 ## Image Input
 
 !!! info
@@ -64,14 +65,6 @@ You can provide video input using either [`VideoUrl`][pydantic_ai.VideoUrl] or [
 !!! info
     Some models do not support document input. Please check the model's documentation to confirm whether it supports document input.
 
-!!! warning
-    When using Gemini models, the document content will always be sent as binary data, regardless of whether you use `DocumentUrl` or `BinaryContent`. This is due to differences in how Vertex AI and Google AI handle document inputs.
-
-    For more details, see [this discussion](https://discuss.ai.google.dev/t/i-am-using-google-generative-ai-model-gemini-1-5-pro-for-image-analysis-but-getting-error/34866/4).
-
-    If you are unsatisfied with this behavior, please let us know by opening an issue on
-    [GitHub](https://github.com/pydantic/pydantic-ai/issues).
-
 You can provide document input using either [`DocumentUrl`][pydantic_ai.DocumentUrl] or [`BinaryContent`][pydantic_ai.BinaryContent]. The process is similar to the examples above.
 
 If you have a direct URL for the document, you can use [`DocumentUrl`][pydantic_ai.DocumentUrl]:
@@ -109,3 +102,23 @@ result = agent.run_sync(
 print(result.output)
 # > The document discusses...
 ```
+
+## User-side download vs. direct file URL
+
+As a general rule, when you provide a URL using any of `ImageUrl`, `AudioUrl`, `VideoUrl` or `DocumentUrl`, PydanticAI downloads the file content and then sends it as part of the API request.
+
+The situation is different for certain models:
+
+- [`AnthropicModel`][pydantic_ai.models.anthropic.AnthropicModel]: if you provide a PDF document via `DocumentUrl`, the URL is sent directly in the API request, so no download happens on the user side.
+
+- [`GeminiModel`][pydantic_ai.models.gemini.GeminiModel] and [`GoogleModel`][pydantic_ai.models.google.GoogleModel] on Vertex AI: any URL provided using `ImageUrl`, `AudioUrl`, `VideoUrl`, or `DocumentUrl` is sent as-is in the API request and no data is downloaded beforehand.
+
+    See the [Gemini API docs for Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/inference#filedata) to learn more about supported URLs, formats and limitations:
+
+    - Cloud Storage bucket URIs (with protocol `gs://`)
+    - Public HTTP(S) URLs
+    - Public YouTube video URL (maximum one URL per request)
+
+    However, because of crawling restrictions, it may happen that Gemini can't access certain URLs. In that case, you can instruct PydanticAI to download the file content and send that instead of the URL by setting the boolean flag `force_download` to `True`. This attribute is available on all objects that inherit from [`FileUrl`][pydantic_ai.messages.FileUrl].
+
+- [`GeminiModel`][pydantic_ai.models.gemini.GeminiModel] and [`GoogleModel`][pydantic_ai.models.google.GoogleModel] on GLA: YouTube video URLs are sent directly in the request to the model.
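
Reviewer note: the rule documented in this hunk can be sketched in isolation. The snippet below is a minimal illustration, not code from this commit. `FileUrlSketch` and `sends_url_directly` are hypothetical names, and `model_accepts_urls` is an assumed stand-in for "the model supports direct file URLs".

```python
from dataclasses import dataclass


@dataclass
class FileUrlSketch:
    """Hypothetical stand-in for a FileUrl subclass; not the library class."""

    url: str
    force_download: bool = False


def sends_url_directly(item: FileUrlSketch, model_accepts_urls: bool) -> bool:
    """Return True when the URL itself would be placed in the API request."""
    # A model that cannot take URLs always needs the downloaded content.
    if not model_accepts_urls:
        return False
    # force_download=True overrides direct sending, for example when the
    # provider cannot crawl the URL.
    return not item.force_download


doc = FileUrlSketch('https://example.com/report.pdf')
forced = FileUrlSketch('https://example.com/report.pdf', force_download=True)
print(sends_url_directly(doc, model_accepts_urls=True))
print(sends_url_directly(forced, model_accepts_urls=True))
```

With these assumptions, only the first call reports direct sending; the forced item and any item sent to a model without URL support fall back to a download.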

docs/models/google.md

Lines changed: 0 additions & 7 deletions
@@ -161,13 +161,6 @@ See the [Gemini API docs](https://ai.google.dev/gemini-api/docs/safety-settings)
 
 `GoogleModel` supports multi-modal input, including documents, images, audio, and video. See the [input documentation](../input.md) for details and examples.
 
-!!! warning
-    When using Gemini models, document content is always sent as binary data, regardless of whether you use `DocumentUrl` or `BinaryContent`.
-    This is due to differences in how Vertex AI and Google AI handle document inputs.
-
-    See [this discussion](https://discuss.ai.google.dev/t/i-am-using-google-generative-ai-model-gemini-1-5-pro-for-image-analysis-but-getting-error/34866/4)
-    for more details.
-
 ## Model settings
 
 You can use the [`GoogleModelSettings`][pydantic_ai.models.google.GoogleModelSettings] class to customize the model request.

pydantic_ai_slim/pydantic_ai/messages.py

Lines changed: 43 additions & 13 deletions
@@ -2,6 +2,7 @@
 
 import base64
 import uuid
+from abc import ABC, abstractmethod
 from collections.abc import Sequence
 from dataclasses import dataclass, field, replace
 from datetime import datetime
@@ -80,8 +81,35 @@ def otel_event(self, _settings: InstrumentationSettings) -> Event:
 
 
 @dataclass(repr=False)
-class VideoUrl:
-    """A URL to an video."""
+class FileUrl(ABC):
+    """Abstract base class for any URL-based file."""
+
+    url: str
+    """The URL of the file."""
+
+    force_download: bool = False
+    """If the model supports it:
+
+    * If True, the file is downloaded and the data is sent to the model as bytes.
+    * If False, the URL is sent directly to the model and no download is performed.
+    """
+
+    @property
+    @abstractmethod
+    def media_type(self) -> str:
+        """Return the media type of the file, based on the url."""
+
+    @property
+    @abstractmethod
+    def format(self) -> str:
+        """The file format."""
+
+    __repr__ = _utils.dataclasses_no_defaults_repr
+
+
+@dataclass(repr=False)
+class VideoUrl(FileUrl):
+    """A URL to a video."""
 
     url: str
     """The URL of the video."""
@@ -108,9 +136,19 @@ def media_type(self) -> VideoMediaType:
             return 'video/x-ms-wmv'
         elif self.url.endswith('.three_gp'):
             return 'video/3gpp'
+        # Assume that YouTube videos are mp4 because there would be no extension
+        # to infer from. This should not be a problem, as Gemini disregards media
+        # type for YouTube URLs.
+        elif self.is_youtube:
+            return 'video/mp4'
         else:
             raise ValueError(f'Unknown video file extension: {self.url}')
 
+    @property
+    def is_youtube(self) -> bool:
+        """True if the URL has a YouTube domain."""
+        return self.url.startswith(('https://youtu.be/', 'https://youtube.com/', 'https://www.youtube.com/'))
+
     @property
     def format(self) -> VideoFormat:
         """The file format of the video.
@@ -119,11 +157,9 @@ def format(self) -> VideoFormat:
         """
         return _video_format_lookup[self.media_type]
 
-    __repr__ = _utils.dataclasses_no_defaults_repr
-
 
 @dataclass(repr=False)
-class AudioUrl:
+class AudioUrl(FileUrl):
     """A URL to an audio file."""
 
     url: str
@@ -147,11 +183,9 @@ def format(self) -> AudioFormat:
         """The file format of the audio file."""
         return _audio_format_lookup[self.media_type]
 
-    __repr__ = _utils.dataclasses_no_defaults_repr
-
 
 @dataclass(repr=False)
-class ImageUrl:
+class ImageUrl(FileUrl):
     """A URL to an image."""
 
     url: str
@@ -182,11 +216,9 @@ def format(self) -> ImageFormat:
         """
         return _image_format_lookup[self.media_type]
 
-    __repr__ = _utils.dataclasses_no_defaults_repr
-
 
 @dataclass(repr=False)
-class DocumentUrl:
+class DocumentUrl(FileUrl):
     """The URL of the document."""
 
     url: str
@@ -215,8 +247,6 @@ def format(self) -> DocumentFormat:
         except KeyError as e:
             raise ValueError(f'Unknown document media type: {media_type}') from e
 
-    __repr__ = _utils.dataclasses_no_defaults_repr
-
 
 @dataclass(repr=False)
 class BinaryContent:
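
Reviewer note: the prefix check added in `is_youtube` is strict about scheme and host. The standalone sketch below copies the three prefixes from the hunk into a free function so the matching behaviour can be tried without the library. The function form and the `YOUTUBE_PREFIXES` constant are illustrative assumptions, not part of the commit.

```python
# Prefixes copied from the `is_youtube` property in this diff.
YOUTUBE_PREFIXES = ('https://youtu.be/', 'https://youtube.com/', 'https://www.youtube.com/')


def is_youtube(url: str) -> bool:
    """Standalone version of the check: str.startswith accepts a tuple of prefixes."""
    return url.startswith(YOUTUBE_PREFIXES)


examples = [
    'https://youtu.be/abc123',
    'https://www.youtube.com/watch?v=abc123',
    'http://www.youtube.com/watch?v=abc123',  # plain http is not in the prefix list
    'https://m.youtube.com/watch?v=abc123',  # mobile host is not in the prefix list
]
for url in examples:
    print(url, is_youtube(url))
```

Under this check, `http://` links and hosts such as `m.youtube.com` are not treated as YouTube URLs, so `media_type` would fall through to the `ValueError` branch for them.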

pydantic_ai_slim/pydantic_ai/models/__init__.py

Lines changed: 89 additions & 2 deletions
@@ -6,21 +6,23 @@
 
 from __future__ import annotations as _annotations
 
+import base64
 from abc import ABC, abstractmethod
 from collections.abc import AsyncIterator, Iterator
 from contextlib import asynccontextmanager, contextmanager
 from dataclasses import dataclass, field, replace
 from datetime import datetime
 from functools import cache, cached_property
+from typing import Generic, TypeVar, overload
 
 import httpx
-from typing_extensions import Literal, TypeAliasType
+from typing_extensions import Literal, TypeAliasType, TypedDict
 
 from pydantic_ai.profiles import DEFAULT_PROFILE, ModelProfile, ModelProfileSpec
 
 from .._parts_manager import ModelResponsePartsManager
 from ..exceptions import UserError
-from ..messages import ModelMessage, ModelRequest, ModelResponse, ModelResponseStreamEvent
+from ..messages import FileUrl, ModelMessage, ModelRequest, ModelResponse, ModelResponseStreamEvent, VideoUrl
 from ..profiles._json_schema import JsonSchemaTransformer
 from ..settings import ModelSettings
 from ..tools import ToolDefinition
@@ -611,6 +613,91 @@ def _cached_async_http_transport() -> httpx.AsyncHTTPTransport:
     return httpx.AsyncHTTPTransport()
 
 
+DataT = TypeVar('DataT', str, bytes)
+
+
+class DownloadedItem(TypedDict, Generic[DataT]):
+    """The downloaded data and its type."""
+
+    data: DataT
+    """The downloaded data."""
+
+    data_type: str
+    """The type of data that was downloaded.
+
+    Extracted from header "content-type", but defaults to the media type inferred from the file URL if content-type is "application/octet-stream".
+    """
+
+
+@overload
+async def download_item(
+    item: FileUrl,
+    data_format: Literal['bytes'],
+    type_format: Literal['mime', 'extension'] = 'mime',
+) -> DownloadedItem[bytes]: ...
+
+
+@overload
+async def download_item(
+    item: FileUrl,
+    data_format: Literal['base64', 'base64_uri', 'text'],
+    type_format: Literal['mime', 'extension'] = 'mime',
+) -> DownloadedItem[str]: ...
+
+
+async def download_item(
+    item: FileUrl,
+    data_format: Literal['bytes', 'base64', 'base64_uri', 'text'] = 'bytes',
+    type_format: Literal['mime', 'extension'] = 'mime',
+) -> DownloadedItem[str] | DownloadedItem[bytes]:
+    """Download an item by URL and return the content as a bytes object or a (base64-encoded) string.
+
+    Args:
+        item: The item to download.
+        data_format: The format to return the content in:
+            - `bytes`: The raw bytes of the content.
+            - `base64`: The base64-encoded content.
+            - `base64_uri`: The base64-encoded content as a data URI.
+            - `text`: The content as a string.
+        type_format: The format to return the media type in:
+            - `mime`: The media type as a MIME type.
+            - `extension`: The media type as an extension.
+
+    Raises:
+        UserError: If the URL points to a YouTube video or its protocol is gs://.
+    """
+    if item.url.startswith('gs://'):
+        raise UserError('Downloading from protocol "gs://" is not supported.')
+    elif isinstance(item, VideoUrl) and item.is_youtube:
+        raise UserError('Downloading YouTube videos is not supported.')
+
+    client = cached_async_http_client()
+    response = await client.get(item.url, follow_redirects=True)
+    response.raise_for_status()
+
+    if content_type := response.headers.get('content-type'):
+        content_type = content_type.split(';')[0]
+        if content_type == 'application/octet-stream':
+            content_type = None
+
+    media_type = content_type or item.media_type
+
+    data_type = media_type
+    if type_format == 'extension':
+        data_type = data_type.split('/')[1]
+
+    data = response.content
+    if data_format in ('base64', 'base64_uri'):
+        data = base64.b64encode(data).decode('utf-8')
+        if data_format == 'base64_uri':
+            data = f'data:{media_type};base64,{data}'
+        return DownloadedItem[str](data=data, data_type=data_type)
+    elif data_format == 'text':
+        return DownloadedItem[str](data=data.decode('utf-8'), data_type=data_type)
+    else:
+        return DownloadedItem[bytes](data=data, data_type=data_type)
+
+
 @cache
 def get_user_agent() -> str:
     """Get the user agent string for the HTTP client."""

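Reviewer note: the media-type resolution and data-URI steps inside `download_item` can be exercised without any network access. The sketch below re-implements just those two steps with the standard library. The helper names `resolve_media_type` and `to_data_uri` are hypothetical and do not exist in the commit; the logic is intended to mirror the hunk above.

```python
import base64
from typing import Optional


def resolve_media_type(content_type_header: Optional[str], inferred_from_url: str) -> str:
    """Prefer the response header, unless it is missing or the generic octet-stream type."""
    content_type = None
    if content_type_header:
        # Drop parameters such as "; charset=utf-8".
        content_type = content_type_header.split(';')[0]
        if content_type == 'application/octet-stream':
            content_type = None
    return content_type or inferred_from_url


def to_data_uri(data: bytes, media_type: str) -> str:
    """Encode bytes as a base64 data URI, as the 'base64_uri' format does."""
    encoded = base64.b64encode(data).decode('utf-8')
    return f'data:{media_type};base64,{encoded}'


print(resolve_media_type('text/plain; charset=utf-8', 'application/pdf'))
print(resolve_media_type('application/octet-stream', 'application/pdf'))
print(to_data_uri(b'hi', 'text/plain'))
# type_format='extension' keeps only the subtype of the resolved media type.
print(resolve_media_type(None, 'image/png').split('/')[1])
```

One consequence worth noting for callers that pass `type_format='extension'`: the "extension" is the MIME subtype, so a resolved type such as `video/x-ms-wmv` would yield `x-ms-wmv` rather than a file extension.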
pydantic_ai_slim/pydantic_ai/models/anthropic.py

Lines changed: 3 additions & 11 deletions
@@ -31,14 +31,7 @@
 from ..providers import Provider, infer_provider
 from ..settings import ModelSettings
 from ..tools import ToolDefinition
-from . import (
-    Model,
-    ModelRequestParameters,
-    StreamedResponse,
-    cached_async_http_client,
-    check_allow_model_requests,
-    get_user_agent,
-)
+from . import Model, ModelRequestParameters, StreamedResponse, check_allow_model_requests, download_item, get_user_agent
 
 try:
     from anthropic import NOT_GIVEN, APIStatusError, AsyncAnthropic, AsyncStream
@@ -372,11 +365,10 @@ async def _map_user_prompt(
                         if item.media_type == 'application/pdf':
                             yield BetaBase64PDFBlockParam(source={'url': item.url, 'type': 'url'}, type='document')
                         elif item.media_type == 'text/plain':
-                            response = await cached_async_http_client().get(item.url)
-                            response.raise_for_status()
+                            downloaded_item = await download_item(item, data_format='text')
                             yield BetaBase64PDFBlockParam(
                                 source=BetaPlainTextSourceParam(
-                                    data=response.text, media_type=item.media_type, type='text'
+                                    data=downloaded_item['data'], media_type=item.media_type, type='text'
                                 ),
                                 type='document',
                             )

pydantic_ai_slim/pydantic_ai/models/bedrock.py

Lines changed: 23 additions & 15 deletions
@@ -32,12 +32,7 @@
     UserPromptPart,
     VideoUrl,
 )
-from pydantic_ai.models import (
-    Model,
-    ModelRequestParameters,
-    StreamedResponse,
-    cached_async_http_client,
-)
+from pydantic_ai.models import Model, ModelRequestParameters, StreamedResponse, download_item
 from pydantic_ai.profiles import ModelProfileSpec
 from pydantic_ai.providers import Provider, infer_provider
 from pydantic_ai.providers.bedrock import BedrockModelProfile
@@ -55,6 +50,7 @@
         ConverseResponseTypeDef,
         ConverseStreamMetadataEventTypeDef,
         ConverseStreamOutputTypeDef,
+        DocumentBlockTypeDef,
         GuardrailConfigurationTypeDef,
         ImageBlockTypeDef,
         InferenceConfigurationTypeDef,
@@ -507,25 +503,37 @@ async def _map_user_prompt(part: UserPromptPart, document_count: Iterator[int])
                     else:
                         raise NotImplementedError('Binary content is not supported yet.')
                 elif isinstance(item, (ImageUrl, DocumentUrl, VideoUrl)):
-                    response = await cached_async_http_client().get(item.url)
-                    response.raise_for_status()
+                    downloaded_item = await download_item(item, data_format='bytes', type_format='extension')
+                    format = downloaded_item['data_type']
                     if item.kind == 'image-url':
                         format = item.media_type.split('/')[1]
                         assert format in ('jpeg', 'png', 'gif', 'webp'), f'Unsupported image format: {format}'
-                        image: ImageBlockTypeDef = {'format': format, 'source': {'bytes': response.content}}
+                        image: ImageBlockTypeDef = {'format': format, 'source': {'bytes': downloaded_item['data']}}
                         content.append({'image': image})
 
                     elif item.kind == 'document-url':
                         name = f'Document {next(document_count)}'
-                        data = response.content
-                        content.append({'document': {'name': name, 'format': item.format, 'source': {'bytes': data}}})
+                        document: DocumentBlockTypeDef = {
+                            'name': name,
+                            'format': item.format,
+                            'source': {'bytes': downloaded_item['data']},
+                        }
+                        content.append({'document': document})
 
                     elif item.kind == 'video-url':  # pragma: no branch
                         format = item.media_type.split('/')[1]
-                        assert format in ('mkv', 'mov', 'mp4', 'webm', 'flv', 'mpeg', 'mpg', 'wmv', 'three_gp'), (
-                            f'Unsupported video format: {format}'
-                        )
-                        video: VideoBlockTypeDef = {'format': format, 'source': {'bytes': response.content}}
+                        assert format in (
+                            'mkv',
+                            'mov',
+                            'mp4',
+                            'webm',
+                            'flv',
+                            'mpeg',
+                            'mpg',
+                            'wmv',
+                            'three_gp',
+                        ), f'Unsupported video format: {format}'
+                        video: VideoBlockTypeDef = {'format': format, 'source': {'bytes': downloaded_item['data']}}
                         content.append({'video': video})
                 elif isinstance(item, AudioUrl):  # pragma: no cover
                     raise NotImplementedError('Audio is not supported yet.')
