[PLT-43] Vb/create datarows chunking plt 43 #1627
Conversation
```python
from labelbox.pydantic_compat import BaseModel


class DataRowItemBase(BaseModel, ABC):
```
This class does not look to be abstract - all methods are implemented.
I'm not sure you ultimately need/want inheritance; perhaps we can take advantage of another OO design pattern.
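For illustration, a minimal sketch of one composition-based alternative in the spirit of this suggestion; all names below are hypothetical, not the SDK's actual API:

```python
# Hypothetical composition-based alternative; names are illustrative only.
# The SDK itself imports BaseModel from labelbox.pydantic_compat rather
# than pydantic directly.
from typing import Any, Callable, Dict

from pydantic import BaseModel


class DataRowItem(BaseModel):
    """Plain data holder shared by every upload flavor."""
    payload: Dict[str, Any]


class DataRowProcessor:
    """Composes a DataRowItem with an injected spec builder instead of
    overriding methods in a subclass hierarchy."""

    def __init__(self, item: DataRowItem,
                 build_spec: Callable[[Dict[str, Any]], Dict[str, Any]]):
        self.item = item
        self._build_spec = build_spec

    def to_spec(self) -> Dict[str, Any]:
        return self._build_spec(self.item.payload)


# The create_data_rows flavor is selected at construction time,
# not by picking a subclass.
processor = DataRowProcessor(
    DataRowItem(payload={"row_data": "https://example.com/image.png"}),
    build_spec=lambda p: {"createDataRow": p},
)
print(processor.to_spec())
```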
```python
from labelbox.schema.internal.datarow_upload_constants import MAX_DATAROW_PER_API_OPERATION


class UploadManifest:
```
nit: Why not use a Pydantic model?
I could, ok.
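As a rough sketch of that suggestion, UploadManifest could be declared as a Pydantic model; the field names below are assumptions for illustration, not necessarily the final SDK fields:

```python
# Rough sketch of UploadManifest as a Pydantic model; field names are
# assumptions for illustration.
from typing import List

from pydantic import BaseModel


class UploadManifest(BaseModel):
    source: str            # origin of the upload, e.g. "SDK"
    item_count: int        # number of data rows in the upload
    chunk_uris: List[str]  # URIs of the serialized chunk files
```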
```python
        return self._results_as_list()

    @property
    def errors(self) -> Optional[List[Dict[str, Any]]]:  # type: ignore
```
I'm curious why we are not simply supporting a single `results` and `errors` method which returns a generator?
because it will not be backward-compatible
This is likely going to be confusing for users, i.e. `results` vs. `results_all`.
Returning a generator will likely not break things in practice.
It will, I think, if someone is trying to use the returned value as a `List`.
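To make the trade-off concrete, here is a hedged sketch of why both accessors can coexist: `results` stays a fully materialized list for existing callers, while a separate accessor streams lazily. The helper bodies are stand-in stubs, and `results_all` is illustrative, not necessarily the final API:

```python
# Illustrative stubs only; the real class pages results from the backend.
from typing import Any, Dict, Generator, List, Optional


class DataUpsertTask:
    def _download_results_paginated(self) -> Generator[Dict[str, Any], None, None]:
        # Stand-in for the paginated backend fetch.
        yield from ({"id": str(i)} for i in range(3))

    def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
        return list(self._download_results_paginated()) or None

    @property
    def results(self) -> Optional[List[Dict[str, Any]]]:
        # Backward compatible: existing callers can index or len() the value.
        return self._results_as_list()

    @property
    def results_all(self) -> Generator[Dict[str, Any], None, None]:
        # Lazy alternative: pages are fetched only as the caller iterates.
        yield from self._download_results_paginated()


task = DataUpsertTask()
assert isinstance(task.results, list)  # old code keeps working
for row in task.results_all:           # new code can stream
    print(row)
```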
Commits:
- Extract spec generation
- Extract data row upload logic
- Extract chunk generation and upload
- Update create data row
- Rename DatarowUploader --> DataRowUploader
- Reuse upsert backend for create_data_rows
- Add DataUpsertTask
```python
            cursor_path=['failedDataRowImports', 'after'],
        )

    def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
```
nit: I believe `PaginatedCollection` is an iterator, in which case you can reduce this function to a single line of code (or remove it entirely):

```python
def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
    return list(self._download_results_paginated()) or None
```
Stories:
This pull request introduces two key improvements to the create_data_rows function:
- Increased Upload Limit: We're removing the previous 150,000 data row limit, allowing significantly larger datasets to be uploaded.
- Enhanced Efficiency: The create_data_rows function now leverages our existing upsert code, which chunks data for improved performance. It also incorporates paginated results and error handling through lazy evaluation, already available in the SDK (a minimal chunking sketch follows below).
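As a minimal illustration of the chunking idea; the chunk size and helper below are assumptions, not the SDK's actual constants or functions:

```python
# Minimal chunking sketch; UPSERT_CHUNK_SIZE is an assumed value.
from typing import Any, Dict, Iterator, List

UPSERT_CHUNK_SIZE = 10_000  # illustrative, not the SDK's real constant


def chunk(specs: List[Dict[str, Any]],
          size: int = UPSERT_CHUNK_SIZE) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` specs."""
    for start in range(0, len(specs), size):
        yield specs[start:start + size]


# Each chunk is serialized and uploaded independently, so the total number
# of data rows is no longer capped by a single request payload limit.
specs = [{"row_data": f"https://example.com/{i}.png"} for i in range(25_000)]
for i, piece in enumerate(chunk(specs)):
    print(f"chunk {i}: {len(piece)} rows")
```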
Challenges Addressed:
- Legacy Upsert Code: The upsert functionality relied heavily on outdated code for data generation.
  Solution: The code has been refactored to separate the shared data-generation logic from the differences, and a dedicated subclass was created specifically for the create_data_rows use case.
- Edge Cases in Data Row Creation: The original create_data_rows function contained various edge cases that could lead to errors.
  Solution: These edge cases were identified through our comprehensive integration test suite. To address them, we added logic to support uploading data from local files within create_data_rows (see the sketch below), and we verified that the remaining edge cases are now handled correctly.
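A hedged sketch of how the local-file case can be handled before chunking; `Client.upload_file` is an existing SDK method, but the wrapper below is illustrative, not the PR's actual implementation:

```python
# Illustrative helper: local paths are uploaded first and replaced with the
# returned hosted URL so downstream chunking only ever sees URLs.
import os
from typing import Any, Dict, List


def resolve_local_files(client, items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    resolved = []
    for item in items:
        row_data = item.get("row_data")
        if isinstance(row_data, str) and os.path.isfile(row_data):
            # Swap the local path for a hosted URL.
            item = {**item, "row_data": client.upload_file(row_data)}
        resolved.append(item)
    return resolved
```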
Technical Considerations:
To implement paginated result and error streaming while maintaining backward compatibility for create_data_rows, we created a subclass named DataUpsertTask that inherits from the Task class. Due to limitations in the extensibility of the Task class, some workarounds were necessary. The functionality has been thoroughly tested through extensive SDK integration tests.
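For context, an end-to-end usage sketch under the new behavior; an existing `dataset` is assumed, and the exact result shapes may differ:

```python
# Usage sketch; assumes an existing `dataset` from a configured Client.
# create_data_rows, wait_till_done, status, results, and errors are the
# pieces described in this PR.
task = dataset.create_data_rows(
    [{"row_data": f"https://example.com/{i}.png"} for i in range(200_000)]
)
task.wait_till_done()
print(task.status)

rows = task.results        # materialized list, as before
for err in task.errors or []:
    print(err)
```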