Add files via upload

raphaeljafriLB · web-flow · commit af9d32ace284 · 2023-01-04T12:38:51.000-05:00
diff --git a/create_data_rows_example.ipynb b/create_data_rows_example.ipynb
@@ -0,0 +1,349 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f0948f3e-77f5-471f-8062-eb00328636bd",
+   "metadata": {},
+   "source": [
+    "<td>\n",
+    "   <a target=\"_blank\" href=\"https://labelbox.com\" ><img src=\"https://labelbox.com/blog/content/images/2021/02/logo-v4.svg\" width=190/></a>\n",
+    "</td>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e34e7399-7bdc-49b9-9173-c15de66d8ddb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Labelpandas - The Labelbox <> Pandas Connector\n",
+    "***Instantly Load CSVs (and other Tables) into Labelbox***\n",
+    "\n",
+    "---\n",
+    "\n",
+    "This notebook walks through the basic use of the Labelpandas Python SDK.\n",
+    "\n",
+    "**Pandas** is a widely used Python library for loading and manipulating CSVs and other tabular data.\n",
+    "\n",
+    "**Labelpandas** combines Labelbox and Pandas to make uploading CSVs and tabular data to Labelbox straightforward. It handles both local file assets and cloud-hosted assets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5c4bb8a6-44d4-4076-9728-f85bad8aeaba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install labelpandas --upgrade -q"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "7ef12ac7-5c85-497a-987c-1e41af2dd715",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import labelpandas as lbpd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1dfe1eae-e28f-4928-9058-796d097f38cc",
+   "metadata": {},
+   "source": [
+    "# Set up Labelpandas Client"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "de523a2e-484a-4e16-b976-f2c4edd7efe3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "labelbox_api_key = \"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "5faabd84-8ae0-4b5b-8e0e-1a807317b0a5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = lbpd.Client(lb_api_key=labelbox_api_key)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b2255bcf-3df6-43c7-ab78-262df2798e64",
+   "metadata": {},
+   "source": [
+    "# Load CSV\n",
+    "\n",
+    "To upload data rows from a CSV, your CSV **must** have the following:\n",
+    "\n",
+    "- A column containing your **row data** as a string value: either an asset URL (pointing to cloud storage) or a local file path\n",
+    "    \n",
+    "- A column containing your **global key** as a string value: an externally facing ID that must be unique (Labelbox enforces this)\n",
+    "    - If you attempt to upload a data row with an existing global key, Labelpandas either generates a unique key by appending a suffix such as \"_1\" or skips the row entirely, depending on the `skip_duplicates` argument\n",
+    "    \n",
+    "**To upload data rows with metadata, your CSV must have one column per metadata field name**. Labelpandas matches the column names to Labelbox metadata names when uploading metadata."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "83fec8c5-7456-4dd3-a5e1-b0605db3ff35",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from io import StringIO\n",
+    "import uuid\n",
+    "\n",
+    "demo_csv = f\"\"\"global_key,row_data,split\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000569539.jpg,train\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000288451.jpg,train\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000240902.jpg,train\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000428116.jpg,train\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000459566.jpg,train\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000442982.jpg,train\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000569538.jpg,valid\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000022415.jpg,valid\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_val2014_000000146981.jpg,test\n",
+    "{str(uuid.uuid4())},https://storage.googleapis.com/diagnostics-demo-data/coco/COCO_train2014_000000173046.jpg,test\"\"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "fb03e1e3-4682-4738-8e87-974c4acb9a8c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>global_key</th>\n",
+       "      <th>row_data</th>\n",
+       "      <th>split</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>99aad74a-0ce1-41d4-b172-97abbe4ae8b2</td>\n",
+       "      <td>https://storage.googleapis.com/diagnostics-dem...</td>\n",
+       "      <td>train</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>b47b1358-3d81-4384-920a-d4a08cbe7ffe</td>\n",
+       "      <td>https://storage.googleapis.com/diagnostics-dem...</td>\n",
+       "      <td>train</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>2a75b633-2266-4ec0-8441-8e41374a04e2</td>\n",
+       "      <td>https://storage.googleapis.com/diagnostics-dem...</td>\n",
+       "      <td>train</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>7c75170d-26c8-4a8b-8e74-3d1483d7719b</td>\n",
+       "      <td>https://storage.googleapis.com/diagnostics-dem...</td>\n",
+       "      <td>train</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>53d6ecd1-cc80-42ef-9e11-2be9b7cf879d</td>\n",
+       "      <td>https://storage.googleapis.com/diagnostics-dem...</td>\n",
+       "      <td>train</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                             global_key  \\\n",
+       "0  99aad74a-0ce1-41d4-b172-97abbe4ae8b2   \n",
+       "1  b47b1358-3d81-4384-920a-d4a08cbe7ffe   \n",
+       "2  2a75b633-2266-4ec0-8441-8e41374a04e2   \n",
+       "3  7c75170d-26c8-4a8b-8e74-3d1483d7719b   \n",
+       "4  53d6ecd1-cc80-42ef-9e11-2be9b7cf879d   \n",
+       "\n",
+       "                                            row_data  split  \n",
+       "0  https://storage.googleapis.com/diagnostics-dem...  train  \n",
+       "1  https://storage.googleapis.com/diagnostics-dem...  train  \n",
+       "2  https://storage.googleapis.com/diagnostics-dem...  train  \n",
+       "3  https://storage.googleapis.com/diagnostics-dem...  train  \n",
+       "4  https://storage.googleapis.com/diagnostics-dem...  train  "
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# To load a CSV file from disk instead, use df = pd.read_csv(file_path_as_string)\n",
+    "import pandas as pd\n",
+    "\n",
+    "df = pd.read_csv(StringIO(demo_csv))\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8031341e-b42f-4838-9a8a-ad06271fdcdf",
+   "metadata": {},
+   "source": [
+    "# Create a `metadata_index`\n",
+    "\n",
+    "* Your `metadata_index` is a dictionary where {key=`column_name` : value=`metadata_field_type`}\n",
+    "    * `column_name` must match a Labelbox metadata field name. Labelpandas uses these names to sync data.\n",
+    "    * `metadata_field_type` must be one of the following string values: **\"datetime\"**, **\"enum\"**, **\"string\"**, **\"number\"**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "167ceb94-d3aa-40bb-a310-1158fd6d2e71",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "metadata_index = {\n",
+    "    \"split\": \"enum\"\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e7e2121e-a3db-4734-b034-77f365a1c20a",
+   "metadata": {},
+   "source": [
+    "# Get or Create a Labelbox Dataset\n",
+    "\n",
+    "* Labelpandas creates data rows in an existing dataset. If you don't have one yet, the cell below creates it by name."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "7cb9f610-d083-416d-920b-df8cf6240ca6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Creating a Labelbox dataset with name Labelpandas Demo Dataset and the default delegated access integration setting\n",
+      "Created a new dataset with ID clchxm7q011xs073n6kxe3otq\n"
+     ]
+    }
+   ],
+   "source": [
+    "dataset_name = \"Labelpandas Demo Dataset\" # Desired or existing dataset name\n",
+    "integration_name = \"DEFAULT\" # Desired delegated access integration name (ignore if using an existing dataset)\n",
+    "\n",
+    "dataset = client.base_client.get_or_create_dataset(name=dataset_name, integration=integration_name, verbose=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c6ad8272-1eb2-4ac3-ae44-072c40675646",
+   "metadata": {},
+   "source": [
+    "# Upload Data Rows from CSV to Labelbox\n",
+    "\n",
+    "**`client.create_data_rows_from_table()`** has the following arguments:\n",
+    "```\n",
+    "df              :   Required (pandas.core.frame.DataFrame) - Pandas DataFrame\n",
+    "lb_dataset      :   Required (labelbox.schema.dataset.Dataset) - Labelbox dataset to add data rows to\n",
+    "row_data_col    :   Required (str) - Column name containing the asset URL or file path\n",
+    "global_key_col  :   Optional (str) - Column name containing the data row global key - defaults to row data\n",
+    "external_id_col :   Optional (str) - Column name containing the data row external ID - defaults to global key\n",
+    "metadata_index  :   Optional (dict) - Dictionary where {key=column_name : value=metadata_type}\n",
+    "local_files     :   Optional (bool) - If True, creates URLs for local files; if False, uploads `row_data_col` values as URLs\n",
+    "skip_duplicates :   Optional (bool) - If True, skips duplicate global keys; otherwise generates a unique global key with a suffix\n",
+    "verbose         :   Optional (bool) - If True, prints information about code execution\n",
+    "```\n",
+    "This function returns a list of errors, if any."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "849f87af-a809-44ca-b909-4bcb2ed4b74a",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Valid metadata_index\n",
+      "Creating upload list - 10 rows in Pandas DataFrame\n",
+      "Generated upload list - 10 data rows to upload\n",
+      "Beginning data row upload: uploading 10 data rows\n",
+      "Batch #1: 10 data rows\n",
+      "Success: upload batch number 1 complete\n",
+      "Upload complete\n"
+     ]
+    }
+   ],
+   "source": [
+    "upload_results = client.create_data_rows_from_table(\n",
+    "    df=df, \n",
+    "    lb_dataset=dataset, \n",
+    "    row_data_col=\"row_data\", \n",
+    "    global_key_col=\"global_key\", \n",
+    "    external_id_col=None, \n",
+    "    metadata_index=metadata_index,\n",
+    "    local_files=False,\n",
+    "    skip_duplicates=False,\n",
+    "    verbose=True)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}