Skip to content

[PLT-43] Vb/create datarows chunking plt 43 #1627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 30, 2024

Conversation

vbrodsky
Copy link
Contributor

@vbrodsky vbrodsky commented May 23, 2024

Stories:

This pull request introduces two key improvements for the create_data_rows function:

Increased Upload Limit: We're removing the previous 150,000 data row limit, allowing for significantly larger datasets to be uploaded.
Enhanced Efficiency: The create_data_rows function now leverages our existing upsert code, which utilizes data chunking for improved performance. Additionally, it incorporates paginated results and error handling through lazy evaluation, already available in the SDK.

Challenges Addressed:

Legacy Upsert Code: The upsert functionality relied heavily on outdated code for data generation.
Solution: The code has been refactored to separate data similarities and differences. A dedicated subclass was created specifically for the create_data_rows use case.
Edge Cases in Data Row Creation: The original create_data_rows function contained various edge cases that could lead to errors.
Solution: These edge cases were identified through our comprehensive integration test suite. To address them, we've added logic to support uploading data from local files within create_data_rows. Additionally, we've verified that the remaining edge cases are now handled correctly.

Technical Considerations:

To implement paginated result and error streaming while maintaining backward compatibility for create_data_rows, we created a subclass named DataUpsertTask that inherits from the Task class. Due to limitations in extensibility of the Task class, some workarounds were necessary. The functionality has been thoroughly tested through extensive SDK integration tests.: The create_data_rows function now leverages our existing upsert code, which utilizes data chunking for improved performance. Additionally, it incorporates paginated results and error handling through lazy evaluation, already available in the SDK.

@vbrodsky vbrodsky requested a review from a team as a code owner May 23, 2024 21:39
@vbrodsky vbrodsky marked this pull request as draft May 23, 2024 21:39
@vbrodsky vbrodsky force-pushed the VB/create_datarows_chunking_PLT-43 branch 14 times, most recently from 58ed7f4 to 5764aac Compare May 29, 2024 18:31
@vbrodsky vbrodsky force-pushed the VB/create_datarows_chunking_PLT-43 branch from 5764aac to 51f33ae Compare May 29, 2024 18:35
@vbrodsky vbrodsky force-pushed the VB/create_datarows_chunking_PLT-43 branch from 51f33ae to 1b0e388 Compare May 29, 2024 18:42
@vbrodsky vbrodsky marked this pull request as ready for review May 29, 2024 18:58
@vbrodsky vbrodsky force-pushed the VB/create_datarows_chunking_PLT-43 branch from 90449d6 to aaa0648 Compare May 29, 2024 20:09
@vbrodsky vbrodsky changed the title Vb/create datarows chunking plt 43 [PLT-43] Vb/create datarows chunking plt 43 May 29, 2024
@vbrodsky vbrodsky force-pushed the VB/create_datarows_chunking_PLT-43 branch from aaa0648 to 06aeeb1 Compare May 29, 2024 20:20
from labelbox.pydantic_compat import BaseModel


class DataRowItemBase(BaseModel, ABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class does not look to be abstract - all methods are implemented.

I'm not sure you ultimately need/want inheritance, perhaps we can take advantage of another OO design pattern.

from labelbox.schema.internal.datarow_upload_constants import MAX_DATAROW_PER_API_OPERATION


class UploadManifest:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Why not use a Pydantic model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could ok

return self._results_as_list()

@property
def errors(self) -> Optional[List[Dict[str, Any]]]: # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why we are not simply supporting a single results and errors method which returns a generator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it will not be backward-compatible

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely going to be confusing for users, i.e. results vs. results_all.

Returning a generator will likely not break things in practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will, I think, if someone is trying to use returns as a List

Val Brodsky added 7 commits May 29, 2024 17:26
Extract spec generation

Extract data row upload logic

Extract chunk generation and upload

Update create data row

Rename DatarowUploader --> DataRowUploader

Reuse upsert backend for create_data_rows

Add DataUpsertTask
@vbrodsky vbrodsky force-pushed the VB/create_datarows_chunking_PLT-43 branch 2 times, most recently from 97f6774 to b7680e8 Compare May 30, 2024 17:29
@vbrodsky vbrodsky force-pushed the VB/create_datarows_chunking_PLT-43 branch from b7680e8 to 6ba61bb Compare May 30, 2024 17:37
cursor_path=['failedDataRowImports', 'after'],
)

def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I believe PaginatedCollection is an iterator, in which case you can reduce this function to a single line of code (or remove it entirely).

def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
    return list(self._download_results_paginated()) or None

@vbrodsky vbrodsky merged commit 5459996 into develop May 30, 2024
20 checks passed
@vbrodsky vbrodsky deleted the VB/create_datarows_chunking_PLT-43 branch May 30, 2024 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants