From ec1bc3716431e10a8af079660a1f5f68ab48d6f8 Mon Sep 17 00:00:00 2001
From: data-wombat <88857060+data-wombat@users.noreply.github.com>
Date: Thu, 13 Jun 2024 13:43:33 -0700
Subject: [PATCH 1/8] Create vision.ipynb
Uploading vision capabilities Colab as part of DevRel CY
---
site/en/gemini-api/docs/vision.ipynb | 678 +++++++++++++++++++++++++++
1 file changed, 678 insertions(+)
create mode 100644 site/en/gemini-api/docs/vision.ipynb
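
Note on the video token estimate in the "Technical details (video)" cell: the
figures quoted there (258 tokens per frame at 1 FPS, 32 audio tokens per
second, roughly 300 tokens per second once metadata is included) can be
sanity-checked with a few lines of Python. This is only a sketch of the
arithmetic; the constant and function names below are illustrative and are not
part of the SDK.

```python
# Figures taken from the notebook text; names are hypothetical.
FRAME_TOKENS = 258                # tokens per sampled frame
FRAMES_PER_SECOND = 1             # File API frame sampling rate
AUDIO_TOKENS_PER_SECOND = 32      # audio tokens per second of video
TOKENS_PER_SECOND_ROUNDED = 300   # notebook's rounded figure, metadata included
CONTEXT_WINDOW = 1_000_000        # tokens in a 1M context window


def raw_tokens_per_second() -> int:
    """Frame plus audio tokens for one second of video, before metadata."""
    return FRAME_TOKENS * FRAMES_PER_SECOND + AUDIO_TOKENS_PER_SECOND


def max_video_minutes(context_tokens: int, tokens_per_second: int) -> float:
    """Minutes of video that fit in the given token budget."""
    return context_tokens / tokens_per_second / 60


minutes = max_video_minutes(CONTEXT_WINDOW, TOKENS_PER_SECOND_ROUNDED)
print(f"Raw tokens per second: {raw_tokens_per_second()}")
print(f"Minutes of video in a 1M window: {minutes:.1f}")
```

Frames and audio alone come to 290 tokens per second, so the rounded figure of
300 leaves a small allowance for metadata, and a 1M-token window then holds a
little over 55.5 minutes of video, in line with the estimate in the cell.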
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
new file mode 100644
index 000000000..3bfdd6a6c
--- /dev/null
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -0,0 +1,678 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Tce3stUlHN0L"
+ },
+ "source": [
+ "##### Copyright 2024 Google LLC."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "cellView": "form",
+ "id": "tuOe1ymfHZPu"
+ },
+ "outputs": [],
+ "source": [
+ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+ "# you may not use this file except in compliance with the License.\n",
+ "# You may obtain a copy of the License at\n",
+ "#\n",
+ "# https://www.apache.org/licenses/LICENSE-2.0\n",
+ "#\n",
+ "# Unless required by applicable law or agreed to in writing, software\n",
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+ "# See the License for the specific language governing permissions and\n",
+ "# limitations under the License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "084u8u0DpBlo"
+ },
+ "source": [
+ "# Explore vision capabilities with the Gemini API"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ZFWzQEqNosrS"
+ },
+ "source": [
+ "
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "3c5e92a74e64"
+ },
+ "source": [
+ "The Gemini API can run inference on images and videos passed to it, in many\n",
+ "cases exhibiting the capabilities of [computer vision](https://en.wikipedia.org/wiki/Computer_vision). (Note that generative models differ in technical implementation from historical computer vision methods.) When passed an image, a series of images, or a video, Gemini can:\n",
+ "\n",
+ "* Describe or answer questions about the content\n",
+ "* Summarize the content\n",
+ "* Extrapolate from the content\n",
+ "\n",
+ "This tutorial demonstrates some possible ways to prompt the Gemini API with\n",
+ "images and video input. All output is text-only."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "VxCstRHvpX0r"
+ },
+ "source": [
+ "## Setup\n",
+ "\n",
+ "Before you use the File API, you need to install the Gemini API SDK package and configure an API key. This section describes how to complete these setup steps."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "G6J_rV2ipmj_"
+ },
+ "source": [
+ "### Install the Python SDK and import packages\n",
+ "\n",
+ "The Python SDK for the Gemini API is contained in the [google-generativeai](https://pypi.org/project/google-generativeai/) package. Install the dependency using pip."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "mN8x8DPgu9Kv"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q -U google-generativeai"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "NInUZ4hwDq6d"
+ },
+ "source": [
+ "Import the necessary packages."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "0x3xmmWrDtEH"
+ },
+ "outputs": [],
+ "source": [
+ "import google.generativeai as genai\n",
+ "from IPython.display import Markdown"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "l8g4hTRotheH"
+ },
+ "source": [
+ "### Setup your API key\n",
+ "\n",
+ "The File API uses API keys for authentication and access. Uploaded files are associated with the project linked to the API key. Unlike other Gemini APIs that use API keys, your API key also grants access to data you've uploaded to the File API, so take extra care in keeping your API key secure. For more on keeping your keys\n",
+ "secure, see [Best practices for using API\n",
+ "keys](https://support.google.com/googleapi/answer/6310037).\n",
+ "\n",
+ "Store your API key in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or are unfamiliar with Colab Secrets, refer to the [Authentication quickstart](https://github.com/google-gemini/gemini-api-cookbook/blob/main/quickstarts/Authentication.ipynb)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "d6lYXRcjthKV"
+ },
+ "outputs": [],
+ "source": [
+ "from google.colab import userdata\n",
+ "GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')\n",
+ "\n",
+ "genai.configure(api_key=GOOGLE_API_KEY)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "c-z4zsCUlaru"
+ },
+ "source": [
+ "## Prompting with images\n",
+ "\n",
+ "In this tutorial, you will upload images using the File API or as inline data and generate content based on those images."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Technical details (images)\n",
+ "Gemini 1.5 Pro (release 008 and later) supports a maximum of 3,600 image files. Gemini Pro Vision supports a maximum of 16 image files.\n",
+ "\n",
+ "Images must be in one of the following image data [MIME types](https://developers.google.com/drive/api/guides/ref-export-formats):\n",
+ "\n",
+ "- PNG - image/png\n",
+ "- JPEG - image/jpeg\n",
+ "- WEBP - image/webp\n",
+ "- HEIC - image/heic\n",
+ "- HEIF - image/heif\n",
+ "\n",
+ "Each image is equivalent to 258 tokens.\n",
+ "\n",
+ "While there are no specific limits to the number of pixels in an image besides the model’s context window, larger images are scaled down to a maximum resolution of 3072 x 3072 while preserving their original aspect ratio, while smaller images are scaled up to 768 x 768 pixels. There is no cost reduction for images at lower sizes, other than bandwidth, or performance improvement for images at higher resolution.\n",
+ "\n",
+ "For best results:\n",
+ "\n",
+ "* Rotate images to the correct orientation before uploading.\n",
+ "* Avoid blurry images.\n",
+ "* If using a single image, place the text prompt after the image."
+ ],
+ "metadata": {
+ "id": "AKNehP2tr3Cr"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "rsdNkDszLBmQ"
+ },
+ "source": [
+ "### Upload an image file using the File API\n",
+ "\n",
+ "Use the File API to upload an image of any size. (Images greater than 20MB cannot be handled inline and must be uploaded using the File API.)\n",
+ "\n",
+ "**NOTE**: The File API lets you store up to 20GB of files per project, with a per-file maximum size of 2GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. It is available at no cost in all regions where the Gemini API is available.\n",
+ "\n",
+ "Start by calling this [sketch of a jetpack](https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "lC6sS6DnmGmi"
+ },
+ "outputs": [],
+ "source": [
+ "!curl -o jetpack.jpg https://storage.googleapis.com/generativeai-downloads/images/jetpack.jpg"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Upload the image using [`media.upload`](https://ai.google.dev/api/rest/v1beta/media/upload) and print the URI, which is used as a reference in Gemini API calls."
+ ],
+ "metadata": {
+ "id": "qfa2VSqEsulq"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "N9NxXGZKKusG"
+ },
+ "outputs": [],
+ "source": [
+ "# Upload the file and print a confirmation.\n",
+ "sample_file = genai.upload_file(path=\"jetpack.jpg\",\n",
+ " display_name=\"Jetpack drawing\")\n",
+ "\n",
+ "print(f\"Uploaded file '{sample_file.display_name}' as: {sample_file.uri}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "cto22vhKOvGQ"
+ },
+ "source": [
+ "The `response` shows that the File API stored the specified `display_name` for the uploaded file and a `uri` to reference the file in Gemini API calls. Use `response` to track how uploaded files are mapped to URIs.\n",
+ "\n",
+ "Depending on your use case, you can also store the URIs in structures such as a `dict` or a database."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ds5iJlaembWe"
+ },
+ "source": [
+ "### Verify image file upload and get metadata\n",
+ "\n",
+ "You can verify the API successfully stored the uploaded file and get its metadata by calling [files.get](https://ai.google.dev/api/rest/v1beta/files/get) through the SDK. Only the `name` (and by extension, the `uri`) are unique. Use `display_name` to identify files only if you manage uniqueness yourself."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "kLFsVLFHOWSV"
+ },
+ "outputs": [],
+ "source": [
+ "file = genai.get_file(name=sample_file.name)\n",
+ "print(f\"Retrieved file '{file.display_name}' as: {sample_file.uri}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Depending on your use case, you can store the URIs in structures, such as a `dict` or a database."
+ ],
+ "metadata": {
+ "id": "BqzIGKBmnFoJ"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "EPPOECHzsIGJ"
+ },
+ "source": [
+ "### Prompt with the uploaded image and text\n",
+ "\n",
+ "After uploading the file, you can make GenerateContent requests that reference the File API URI. Select the generative model and provide it with a text prompt and the uploaded image."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ZYVFqmLkl5nE"
+ },
+ "outputs": [],
+ "source": [
+ "# Choose a Gemini model.\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
+ "\n",
+ "# Prompt the model with text and the previously uploaded image.\n",
+ "response = model.generate_content([sample_file, \"Describe how this product might be manufactured.\"])\n",
+ "\n",
+ "Markdown(\">\" + response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Upload one or more locally stored image files\n",
+ "\n",
+ "Alternatively, you can upload your own files. You can download and use our drawings of [piranha-infested waters](https://storage.googleapis.com/generativeai-downloads/images/piranha.jpg) and a [firefighter with a cat](https://storage.googleapis.com/generativeai-downloads/images/firefighter.jpg). First, save these files to your local directory.\n",
+ "\n",
+ "Then click **Files** on the left sidebar. For each file, click the **Upload** button, then navigate to that file's location and upload it:\n",
+ "\n",
+ "
\n",
+ "\n",
+ "When the combination of files and system instructions that you intend to send is larger than 20MB in size, use the File API to upload those files, as previously shown. Smaller files can instead be called locally from the Gemini API: Smaller files can be called locally from the Gemini API:\n"
+ ],
+ "metadata": {
+ "id": "Lm862F3zthiI"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import PIL.Image\n",
+ "\n",
+ "sample_file_2 = PIL.Image.open('piranha.jpg')\n",
+ "sample_file_3 = PIL.Image.open('firefighter.jpg')"
+ ],
+ "metadata": {
+ "id": "XzMhQ8MWub5_"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Note that these inline data calls don't include many of the features available via the File API, such as getting file metadata, [listing](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=VosrkvAyrx-v&line=3&uniqifier=1), or [deleting](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=diCy9BgjLqeS&line=1&uniqifier=1) files."
+ ],
+ "metadata": {
+ "id": "F2N5bLR7wlqL"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Prompt with multiple images\n",
+ "\n",
+ "You can provide the Gemini API with any combination of images and text that fit within the model's context window. This example provides one short text prompt and the three images previously uploaded."
+ ],
+ "metadata": {
+ "id": "X3pl7mWgwt6Q"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Choose a Gemini model.\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
+ "\n",
+ "prompt = \"Write an advertising jingle showing how the product in the first image could solve the problems shown in the second two images.\"\n",
+ "\n",
+ "response = model.generate_content([prompt, sample_file, firefighter, piranha])\n",
+ "\n",
+ "Markdown(\">\" + response.text)"
+ ],
+ "metadata": {
+ "id": "Ou5IVsybcOys"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "TaUZc1mvLkHY"
+ },
+ "source": [
+ "## Prompting with video\n",
+ "\n",
+ "In this tutorial, you will upload a video using the File API and generate content based on those images."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Technical details (video)\n",
+ "\n",
+ "Gemini 1.5 Pro and Flash support up to approximately an hour of video data.\n",
+ "\n",
+ "Video must be in one of the following video format [MIME types](https://developers.google.com/drive/api/guides/ref-export-formats):\n",
+ " - video/mp4\n",
+ " - video/mpeg\n",
+ " - video/mov\n",
+ " - video/avi\n",
+ " - video/x-flv\n",
+ " - video/mpg\n",
+ " - video/webm\n",
+ " - video/wmv\n",
+ " - video/3gpp\n",
+ "\n",
+ "The File API service currently extracts image frames from videos at 1 frame per second (FPS) and audio at 1Kbps, single channel, adding timestamps every second. These rates are subject to change in the future for improvements in inference.\n",
+ "\n",
+ "**NOTE:** The finer details of fast action sequences may be lost at the 1FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality.\n",
+ "\n",
+ "Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit about 55.5 minutes of video.\n",
+ "\n",
+ "To ask questions about time-stamped locations, use the format `MM:SS`, where the first two digits represent minutes and the last two digits represent seconds.\n",
+ "\n",
+ "For best results:\n",
+ "\n",
+ "* Use one video per prompt.\n",
+ "* If using a single video, place the text prompt after the video."
+ ],
+ "metadata": {
+ "id": "nDN32NDPxXGX"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MNvhBdoDFnTC"
+ },
+ "source": [
+ "### Upload a video file to the File API\n",
+ "\n",
+ "**NOTE**: The File API lets you store up to 20GB of files per project, with a per-file maximum size of 2GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but they cannot be downloaded using any API. It is available at no cost in all regions where the Gemini API is available.\n",
+ "\n",
+ "The File API accepts video file formats directly. This example uses the short NASA film [\"Jupiter's Great Red Spot Shrinks and Grows\"](https://www.youtube.com/watch?v=JDi4IdtvDVE0). Credit: Goddard Space Flight Center (GSFC)/David Ladd (2018).\n",
+ "\n",
+ "> \"Jupiter's Great Red Spot Shrinks and Grows\" is in the public domain and does not show identifiable people. ([NASA image and media usage guidelines.](https://www.nasa.gov/nasa-brand-center/images-and-media/))\n",
+ "\n",
+ "Start by retrieving the short video:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "V4XeFdX1rxaE"
+ },
+ "outputs": [],
+ "source": [
+ "!wget https://storage.googleapis.com/generativeai-downloads/images/GreatRedSpot.mp4"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Upload the video to the File API and print the URI."
+ ],
+ "metadata": {
+ "id": "ZusSiIg2T6ls"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_HzrDdp2Q1Cu"
+ },
+ "outputs": [],
+ "source": [
+ "video_file_name = \"GreatRedSpot.mp4\"\n",
+ "\n",
+ "print(f\"Uploading file...\")\n",
+ "video_file = genai.upload_file(path=video_file_name)\n",
+ "print(f\"Completed upload: {video_file.uri}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oOZmTUb4FWOa"
+ },
+ "source": [
+ "### Verify file upload and check state\n",
+ "\n",
+ "Verify the API has successfully received the files by calling the `files.get` method.\n",
+ "\n",
+ "**NOTE**: Video files have a `State` field in the File API. When a video is uploaded, it will be in the `PROCESSING` state until it is ready for inference. Only `ACTIVE` files can be used for model inference."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "SHMVCWHkFhJW"
+ },
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "# Check whether the file is ready to be used.\n",
+ "while video_file.state.name == \"PROCESSING\":\n",
+ " print('.', end='')\n",
+ " time.sleep(10)\n",
+ " video_file = genai.get_file(video_file.name)\n",
+ "\n",
+ "if video_file.state.name == \"FAILED\":\n",
+ " raise ValueError(video_file.state.name)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Prompt with a video and text\n",
+ "\n",
+ "Once the uploaded video is in the `ACTIVE` state, you can make `GenerateContent` requests that specify the File API URI for that video. Select the generative model and provide it with the uploaded video and a text prompt."
+ ],
+ "metadata": {
+ "id": "IYIIHsvQt0_W"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create the prompt.\n",
+ "prompt = \"Summarize this video. Then create a quiz with answer key based on the information in the video.\"\n",
+ "\n",
+ "# Choose a Gemini model.\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
+ "\n",
+ "# Make the LLM request.\n",
+ "print(\"Making LLM inference request...\")\n",
+ "response = model.generate_content([video_file, prompt],\n",
+ " request_options={\"timeout\": 600})\n",
+ "\n",
+ "# Print the response, rendering any Markdown\n",
+ "Markdown(response.text)"
+ ],
+ "metadata": {
+ "id": "sHH0ZR6Yt42S"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "zS5NmQeXLqeS"
+ },
+ "source": [
+ "### Refer to timestamps in the content\n",
+ "\n",
+ "You can use timestamps of the form `MM:SS` to refer to specific moments in the video."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "ypZuGQ-2LqeS"
+ },
+ "outputs": [],
+ "source": [
+ "# Create the prompt.\n",
+ "prompt = \"What are the examples given at 01:05 and 01:19 supposed to show us?\"\n",
+ "\n",
+ "# Choose a Gemini model.\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
+ "\n",
+ "# Make the LLM request.\n",
+ "print(\"Making LLM inference request...\")\n",
+ "response = model.generate_content([prompt, video_file],\n",
+ " request_options={\"timeout\": 600})\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "### Transcribe video and provide visual descriptions\n",
+ "\n",
+ "If the video is not fast-paced (given that frames are sampled at 1 per second), it's possible to transcribe the video with visual descriptions for each shot."
+ ],
+ "metadata": {
+ "id": "JQE0XjgMZSJo"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# Create the prompt.\n",
+ "prompt = \"Transcribe the audio, giving timestamps. Also provide visual descriptions.\"\n",
+ "\n",
+ "# Choose a Gemini model.\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
+ "\n",
+ "# Make the LLM request.\n",
+ "print(\"Making LLM inference request...\")\n",
+ "response = model.generate_content([prompt, video_file],\n",
+ " request_options={\"timeout\": 600})\n",
+ "print(response.text)"
+ ],
+ "metadata": {
+ "id": "_JrcMsYnYXpJ"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## List files\n",
+ "\n",
+ "You can list all uploaded files and their URIs using `files.list_files()`."
+ ],
+ "metadata": {
+ "id": "VosrkvAyrx-v"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "# List all files\n",
+ "for file in genai.list_files():\n",
+ " print(f\"{file.display_name}, URI: {file.uri}\")"
+ ],
+ "metadata": {
+ "id": "O82e6E2Irzlj"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "diCy9BgjLqeS"
+ },
+ "source": [
+ "## Delete files\n",
+ "\n",
+ "Files are automatically deleted after 2 days. You can also manually delete them using `files.delete()`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "YYyi5PrKLqeb"
+ },
+ "outputs": [],
+ "source": [
+ "genai.delete_file(video_file.name)\n",
+ "print(f'Deleted file {video_file.uri}')"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
From e4118cc2fd2160da21db9a90ef6b1c44cc857e1e Mon Sep 17 00:00:00 2001
From: data-wombat <88857060+data-wombat@users.noreply.github.com>
Date: Thu, 13 Jun 2024 21:37:15 +0000
Subject: [PATCH 2/8] Deleting deprecated model and making other model mentions
less specific.
---
site/en/gemini-api/docs/vision.ipynb | 150 +++++++++++++--------------
1 file changed, 75 insertions(+), 75 deletions(-)
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index 3bfdd6a6c..e0e99f0cb 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -171,9 +171,12 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "AKNehP2tr3Cr"
+ },
"source": [
"### Technical details (images)\n",
- "Gemini 1.5 Pro (release 008 and later) supports a maximum of 3,600 image files. Gemini Pro Vision supports a maximum of 16 image files.\n",
+ "Gemini 1.5 Pro and Flash support a maximum of 3,600 image files.\n",
"\n",
"Images must be in one of the following image data [MIME types](https://developers.google.com/drive/api/guides/ref-export-formats):\n",
"\n",
@@ -192,10 +195,7 @@
"* Rotate images to the correct orientation before uploading.\n",
"* Avoid blurry images.\n",
"* If using a single image, place the text prompt after the image."
- ],
- "metadata": {
- "id": "AKNehP2tr3Cr"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -225,12 +225,12 @@
},
{
"cell_type": "markdown",
- "source": [
- "Upload the image using [`media.upload`](https://ai.google.dev/api/rest/v1beta/media/upload) and print the URI, which is used as a reference in Gemini API calls."
- ],
"metadata": {
"id": "qfa2VSqEsulq"
- }
+ },
+ "source": [
+ "Upload the image using [`media.upload`](https://ai.google.dev/api/rest/v1beta/media/upload) and print the URI, which is used as a reference in Gemini API calls."
+ ]
},
{
"cell_type": "code",
@@ -283,12 +283,12 @@
},
{
"cell_type": "markdown",
- "source": [
- "Depending on your use case, you can store the URIs in structures, such as a `dict` or a database."
- ],
"metadata": {
"id": "BqzIGKBmnFoJ"
- }
+ },
+ "source": [
+ "Depending on your use case, you can store the URIs in structures, such as a `dict` or a database."
+ ]
},
{
"cell_type": "markdown",
@@ -320,6 +320,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "Lm862F3zthiI"
+ },
"source": [
"### Upload one or more locally stored image files\n",
"\n",
@@ -330,47 +333,49 @@
"
\n",
"\n",
"When the combination of files and system instructions that you intend to send is larger than 20MB in size, use the File API to upload those files, as previously shown. Smaller files can instead be called locally from the Gemini API: Smaller files can be called locally from the Gemini API:\n"
- ],
- "metadata": {
- "id": "Lm862F3zthiI"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XzMhQ8MWub5_"
+ },
+ "outputs": [],
"source": [
"import PIL.Image\n",
"\n",
"sample_file_2 = PIL.Image.open('piranha.jpg')\n",
"sample_file_3 = PIL.Image.open('firefighter.jpg')"
- ],
- "metadata": {
- "id": "XzMhQ8MWub5_"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
- "source": [
- "Note that these inline data calls don't include many of the features available via the File API, such as getting file metadata, [listing](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=VosrkvAyrx-v&line=3&uniqifier=1), or [deleting](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=diCy9BgjLqeS&line=1&uniqifier=1) files."
- ],
"metadata": {
"id": "F2N5bLR7wlqL"
- }
+ },
+ "source": [
+ "Note that these inline data calls don't include many of the features available via the File API, such as getting file metadata, [listing](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=VosrkvAyrx-v&line=3&uniqifier=1), or [deleting](https://colab.research.google.com/drive/19xeyIMZJIk7Zn9KW5_50iZYv8OfjApL5?resourcekey=0-3JZ6U8oAFX7hqeV7gAXshw#scrollTo=diCy9BgjLqeS&line=1&uniqifier=1) files."
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "X3pl7mWgwt6Q"
+ },
"source": [
"### Prompt with multiple images\n",
"\n",
"You can provide the Gemini API with any combination of images and text that fit within the model's context window. This example provides one short text prompt and the three images previously uploaded."
- ],
- "metadata": {
- "id": "X3pl7mWgwt6Q"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Ou5IVsybcOys"
+ },
+ "outputs": [],
"source": [
"# Choose a Gemini model.\n",
"model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
@@ -380,12 +385,7 @@
"response = model.generate_content([prompt, sample_file, firefighter, piranha])\n",
"\n",
"Markdown(\">\" + response.text)"
- ],
- "metadata": {
- "id": "Ou5IVsybcOys"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -400,6 +400,9 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "nDN32NDPxXGX"
+ },
"source": [
"## Technical details (video)\n",
"\n",
@@ -428,10 +431,7 @@
"\n",
"* Use one video per prompt.\n",
"* If using a single video, place the text prompt after the video."
- ],
- "metadata": {
- "id": "nDN32NDPxXGX"
- }
+ ]
},
{
"cell_type": "markdown",
@@ -463,12 +463,12 @@
},
{
"cell_type": "markdown",
- "source": [
- "Upload the video to the File API and print the URI."
- ],
"metadata": {
"id": "ZusSiIg2T6ls"
- }
+ },
+ "source": [
+ "Upload the video to the File API and print the URI."
+ ]
},
{
"cell_type": "code",
@@ -520,17 +520,22 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "IYIIHsvQt0_W"
+ },
"source": [
"### Prompt with a video and text\n",
"\n",
"Once the uploaded video is in the `ACTIVE` state, you can make `GenerateContent` requests that specify the File API URI for that video. Select the generative model and provide it with the uploaded video and a text prompt."
- ],
- "metadata": {
- "id": "IYIIHsvQt0_W"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "sHH0ZR6Yt42S"
+ },
+ "outputs": [],
"source": [
"# Create the prompt.\n",
"prompt = \"Summarize this video. Then create a quiz with answer key based on the information in the video.\"\n",
@@ -545,12 +550,7 @@
"\n",
"# Print the response, rendering any Markdown\n",
"Markdown(response.text)"
- ],
- "metadata": {
- "id": "sHH0ZR6Yt42S"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
@@ -586,17 +586,22 @@
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "JQE0XjgMZSJo"
+ },
"source": [
"### Transcribe video and provide visual descriptions\n",
"\n",
"If the video is not fast-paced (given that frames are sampled at 1 per second), it's possible to transcribe the video with visual descriptions for each shot."
- ],
- "metadata": {
- "id": "JQE0XjgMZSJo"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "_JrcMsYnYXpJ"
+ },
+ "outputs": [],
"source": [
"# Create the prompt.\n",
"prompt = \"Transcribe the audio, giving timestamps. Also provide visual descriptions.\"\n",
@@ -609,36 +614,31 @@
"response = model.generate_content([prompt, video_file],\n",
" request_options={\"timeout\": 600})\n",
"print(response.text)"
- ],
- "metadata": {
- "id": "_JrcMsYnYXpJ"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
+ "metadata": {
+ "id": "VosrkvAyrx-v"
+ },
"source": [
"## List files\n",
"\n",
"You can list all uploaded files and their URIs using `files.list_files()`."
- ],
- "metadata": {
- "id": "VosrkvAyrx-v"
- }
+ ]
},
{
"cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "O82e6E2Irzlj"
+ },
+ "outputs": [],
"source": [
"# List all files\n",
"for file in genai.list_files():\n",
" print(f\"{file.display_name}, URI: {file.uri}\")"
- ],
- "metadata": {
- "id": "O82e6E2Irzlj"
- },
- "execution_count": null,
- "outputs": []
+ ]
},
{
"cell_type": "markdown",
From 3b454042a14ba1a78caf40d670aee9f5efaf0cc7 Mon Sep 17 00:00:00 2001
From: data-wombat <88857060+data-wombat@users.noreply.github.com>
Date: Thu, 13 Jun 2024 21:41:23 +0000
Subject: [PATCH 3/8] Deleting sentence at internal reviewer request.
---
site/en/gemini-api/docs/vision.ipynb | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index e0e99f0cb..d1e9c1fbc 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -64,8 +64,7 @@
"id": "3c5e92a74e64"
},
"source": [
- "The Gemini API can run inference on images and videos passed to it, in many\n",
- "cases exhibiting the capabilities of [computer vision](https://en.wikipedia.org/wiki/Computer_vision). (Note that generative models differ in technical implementation from historical computer vision methods.) When passed an image, a series of images, or a video, Gemini can:\n",
+ "The Gemini API can run inference on images and videos passed to it. When passed an image, a series of images, or a video, Gemini can:\n",
"\n",
"* Describe or answer questions about the content\n",
"* Summarize the content\n",
From d6d6088c35cae0e23ba0d08b2489effaf545c244 Mon Sep 17 00:00:00 2001
From: data-wombat <88857060+data-wombat@users.noreply.github.com>
Date: Thu, 13 Jun 2024 22:28:12 +0000
Subject: [PATCH 4/8] Added requested bounding box section
---
site/en/gemini-api/docs/vision.ipynb | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
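
Note on consuming the output of the new bounding box cell: the model returns
the box as text, so callers will usually need to parse it before drawing or
cropping. The sketch below pulls the first `[ymin, xmin, ymax, xmax]` group out
of a response string and converts it to pixel coordinates. It assumes the
coordinates are normalized to a 0-1000 scale, which this patch does not state,
so check that against the current API documentation. `parse_box`, `to_pixels`,
and the sample response string are hypothetical and not part of the SDK.

```python
import re

# Matches the first "[a, b, c, d]" group of non-negative integers.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_box(text):
    """Return (ymin, xmin, ymax, xmax) from the response text, or None."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def to_pixels(box, width, height, scale=1000):
    """Convert a normalized box to (left, top, right, bottom) in pixels.

    Assumes coordinates run from 0 to `scale` along each axis.
    """
    ymin, xmin, ymax, xmax = box
    return (
        round(xmin * width / scale),
        round(ymin * height / scale),
        round(xmax * width / scale),
        round(ymax * height / scale),
    )


# Hypothetical response text, standing in for `response.text`.
sample_text = "The piranha is at [250, 100, 750, 900]."
box = parse_box(sample_text)
if box is not None:
    print(to_pixels(box, width=2000, height=1000))
```

Returning `None` when no box is found keeps the caller in charge of retrying
or re-prompting, since the model is not guaranteed to answer in the requested
format.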
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index d1e9c1fbc..aa25c4231 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -381,11 +381,35 @@
"\n",
"prompt = \"Write an advertising jingle showing how the product in the first image could solve the problems shown in the second two images.\"\n",
"\n",
- "response = model.generate_content([prompt, sample_file, firefighter, piranha])\n",
+ "response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])\n",
"\n",
"Markdown(\">\" + response.text)"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Get bounding boxes\n",
+ "\n",
+ "You can ask the model for the coordinates of bounding boxes for objects in images."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Choose a Gemini model.\n",
+ "model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
+ "\n",
+ "prompt = \"Return a bounding box for the piranha. \\n [ymin, xmin, ymax, xmax]\"\n",
+ "response = model.generate_content([sample_file_2, prompt])\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {
From 3a2167f368b72a88152b15d3849bb06d0e7007c7 Mon Sep 17 00:00:00 2001
From: Mark McDonald
Date: Fri, 14 Jun 2024 08:34:27 +0800
Subject: [PATCH 5/8] Some tiny nits
---
site/en/gemini-api/docs/vision.ipynb | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index aa25c4231..a834a4ca2 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -134,7 +134,7 @@
"id": "l8g4hTRotheH"
},
"source": [
- "### Setup your API key\n",
+ "### Set up your API key\n",
"\n",
"The File API uses API keys for authentication and access. Uploaded files are associated with the project linked to the API key. Unlike other Gemini APIs that use API keys, your API key also grants access to data you've uploaded to the File API, so take extra care in keeping your API key secure. For more on keeping your keys\n",
"secure, see [Best practices for using API\n",
@@ -308,7 +308,7 @@
},
"outputs": [],
"source": [
- "# Choose a Gemini model.\n",
+ "# Choose a Gemini API model.\n",
"model = genai.GenerativeModel(model_name=\"gemini-1.5-pro-latest\")\n",
"\n",
"# Prompt the model with text and the previously uploaded image.\n",
From f0a659d9f1906137f8a809d5cf23a2bf63995b52 Mon Sep 17 00:00:00 2001
From: Mark McDonald
Date: Fri, 14 Jun 2024 08:36:38 +0800
Subject: [PATCH 6/8] Fix notebook buttons
---
site/en/gemini-api/docs/vision.ipynb | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index a834a4ca2..9ff651580 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -48,12 +48,12 @@
"source": [
""
]
From c3dcea24024f8e6a305597da21416eb8eff604a3 Mon Sep 17 00:00:00 2001
From: Mark McDonald
Date: Fri, 14 Jun 2024 08:38:06 +0800
Subject: [PATCH 7/8] run nbfmt
---
site/en/gemini-api/docs/vision.ipynb | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index 9ff651580..dbb8ce002 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -388,7 +388,9 @@
},
{
"cell_type": "markdown",
- "metadata": {},
+ "metadata": {
+ "id": "7e16d742407a"
+ },
"source": [
"### Get bounding boxes\n",
"\n",
@@ -398,7 +400,9 @@
{
"cell_type": "code",
"execution_count": null,
- "metadata": {},
+ "metadata": {
+ "id": "778dd36334f4"
+ },
"outputs": [],
"source": [
"# Choose a Gemini model.\n",
@@ -689,7 +693,8 @@
],
"metadata": {
"colab": {
- "provenance": []
+ "name": "vision.ipynb",
+ "toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
From 101aceca7871ebf8c108fcab6dfa9e9e520ddd05 Mon Sep 17 00:00:00 2001
From: data-wombat <88857060+data-wombat@users.noreply.github.com>
Date: Fri, 14 Jun 2024 16:27:06 +0000
Subject: [PATCH 8/8] Applied all changes and commented fixes.
---
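Note on the video estimate reworded below: "slightly less than an hour" follows from the numbers already in the text. Here is a quick check, assuming one sampled frame per second (the 1FPS rate the notebook mentions) and treating the approximate 300 tokens per second as exact. `video_minutes_that_fit` is a hypothetical name used only for this sketch.

```python
FRAME_TOKENS = 258            # tokens per sampled frame (from the notebook text)
AUDIO_TOKENS_PER_SECOND = 32  # tokens per second of audio (from the notebook text)
TOKENS_PER_SECOND = 300       # ~300 with metadata; treated as exact here (assumption)


def video_minutes_that_fit(context_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Return how many minutes of video fit in a context window of the given size."""
    return context_tokens / tokens_per_second / 60


# Frame plus audio cost per second, before metadata is added.
raw_per_second = FRAME_TOKENS + AUDIO_TOKENS_PER_SECOND

# Minutes of video that fit in a 1M-token window at ~300 tokens per second.
minutes = video_minutes_that_fit(1_000_000)
print(raw_per_second, round(minutes, 1))
```

The result is roughly 55.6 minutes, which is why the reworded sentence avoids quoting a precise figure derived from an approximate per-second cost.
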
site/en/gemini-api/docs/vision.ipynb | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/site/en/gemini-api/docs/vision.ipynb b/site/en/gemini-api/docs/vision.ipynb
index dbb8ce002..19cc57f05 100644
--- a/site/en/gemini-api/docs/vision.ipynb
+++ b/site/en/gemini-api/docs/vision.ipynb
@@ -204,7 +204,7 @@
"source": [
"### Upload an image file using the File API\n",
"\n",
- "Use the File API to upload an image of any size. (Images greater than 20MB cannot be handled inline and must be uploaded using the File API.)\n",
+ "Use the File API to upload an image of any size. (Always use the File API when the combination of files and system instructions that you intend to send is larger than 20MB.)\n",
"\n",
"**NOTE**: The File API lets you store up to 20GB of files per project, with a per-file maximum size of 2GB. Files are stored for 48 hours. They can be accessed in that period with your API key, but cannot be downloaded from the API. It is available at no cost in all regions where the Gemini API is available.\n",
"\n",
@@ -331,7 +331,7 @@
"\n",
"\n",
"\n",
- "When the combination of files and system instructions that you intend to send is larger than 20MB in size, use the File API to upload those files, as previously shown. Smaller files can instead be called locally from the Gemini API: Smaller files can be called locally from the Gemini API:\n"
+ "When the combination of files and system instructions that you intend to send is larger than 20MB, use the File API to upload those files, as previously shown. Smaller files can instead be passed to the Gemini API directly from local storage:\n"
]
},
{
@@ -450,7 +450,7 @@
"\n",
"**NOTE:** The finer details of fast action sequences may be lost at the 1FPS frame sampling rate. Consider slowing down high-speed clips for improved inference quality.\n",
"\n",
- "Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M context window can fit about 55.5 minutes of video.\n",
+ "Individual frames are 258 tokens, and audio is 32 tokens per second. With metadata, each second of video becomes ~300 tokens, which means a 1M-token context window can fit slightly less than an hour of video.\n",
"\n",
"To ask questions about time-stamped locations, use the format `MM:SS`, where the first two digits represent minutes and the last two digits represent seconds.\n",
"\n",