[PLT-43] Vb/create datarows chunking plt 43 #1627
Conversation
```python
from labelbox.pydantic_compat import BaseModel


class DataRowItemBase(BaseModel, ABC):
```
This class does not look to be abstract - all methods are implemented.
I'm not sure you ultimately need/want inheritance; perhaps we can take advantage of another OO design pattern.
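For illustration, a minimal sketch of one composition-based alternative in the spirit of this suggestion; all names below are hypothetical, not the SDK's actual API:

```python
# Hypothetical composition-based alternative; names are illustrative only.
# The SDK itself imports BaseModel from labelbox.pydantic_compat rather
# than pydantic directly.
from typing import Any, Callable, Dict

from pydantic import BaseModel


class DataRowItem(BaseModel):
    """Plain data holder shared by every upload flavor."""
    payload: Dict[str, Any]


class DataRowProcessor:
    """Composes a DataRowItem with an injected spec builder instead of
    overriding methods in a subclass hierarchy."""

    def __init__(self, item: DataRowItem,
                 build_spec: Callable[[Dict[str, Any]], Dict[str, Any]]):
        self.item = item
        self._build_spec = build_spec

    def to_spec(self) -> Dict[str, Any]:
        return self._build_spec(self.item.payload)


# The create_data_rows flavor is selected at construction time,
# not by picking a subclass.
processor = DataRowProcessor(
    DataRowItem(payload={"row_data": "https://example.com/image.png"}),
    build_spec=lambda p: {"createDataRow": p},
)
print(processor.to_spec())
```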
```python
from labelbox.schema.internal.datarow_upload_constants import MAX_DATAROW_PER_API_OPERATION


class UploadManifest:
```
nit: Why not use a Pydantic model?
I could, ok.
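As a rough sketch of that suggestion, UploadManifest could be declared as a Pydantic model; the field names below are assumptions for illustration, not necessarily the final SDK fields:

```python
# Rough sketch of UploadManifest as a Pydantic model; field names are
# assumptions for illustration.
from typing import List

from pydantic import BaseModel


class UploadManifest(BaseModel):
    source: str            # origin of the upload, e.g. "SDK"
    item_count: int        # number of data rows in the upload
    chunk_uris: List[str]  # URIs of the serialized chunk files
```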
```python
        return self._results_as_list()

    @property
    def errors(self) -> Optional[List[Dict[str, Any]]]:  # type: ignore
```
I'm curious why we are not simply supporting a single `results` and `errors` method which returns a generator?
because it will not be backward-compatible
This is likely going to be confusing for users, i.e. `results` vs. `results_all`.
Returning a generator will likely not break things in practice.
It will, I think, if someone is trying to use the returned value as a `List`.
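To make the trade-off concrete, here is a hedged sketch of why both accessors can coexist: `results` stays a fully materialized list for existing callers, while a separate accessor streams lazily. The helper bodies are stand-in stubs, and `results_all` is illustrative, not necessarily the final API:

```python
# Illustrative stubs only; the real class pages results from the backend.
from typing import Any, Dict, Generator, List, Optional


class DataUpsertTask:
    def _download_results_paginated(self) -> Generator[Dict[str, Any], None, None]:
        # Stand-in for the paginated backend fetch.
        yield from ({"id": str(i)} for i in range(3))

    def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
        return list(self._download_results_paginated()) or None

    @property
    def results(self) -> Optional[List[Dict[str, Any]]]:
        # Backward compatible: existing callers can index or len() the value.
        return self._results_as_list()

    @property
    def results_all(self) -> Generator[Dict[str, Any], None, None]:
        # Lazy alternative: pages are fetched only as the caller iterates.
        yield from self._download_results_paginated()


task = DataUpsertTask()
assert isinstance(task.results, list)  # old code keeps working
for row in task.results_all:           # new code can stream
    print(row)
```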
Commits:
- Extract spec generation
- Extract data row upload logic
- Extract chunk generation and upload
- Update create data row
- Rename DatarowUploader --> DataRowUploader
- Reuse upsert backend for create_data_rows
- Add DataUpsertTask
```python
            cursor_path=['failedDataRowImports', 'after'],
        )

    def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
```
nit: I believe `PaginatedCollection` is an iterator, in which case you can reduce this function to a single line of code (or remove it entirely):

```python
def _results_as_list(self) -> Optional[List[Dict[str, Any]]]:
    return list(self._download_results_paginated()) or None
```
Stories:
This pull request introduces two key improvements to the create_data_rows function:
- Increased Upload Limit: We're removing the previous 150,000 data row limit, allowing significantly larger datasets to be uploaded.
- Enhanced Efficiency: The create_data_rows function now leverages our existing upsert code, which chunks data for improved performance. It also incorporates paginated results and error handling through lazy evaluation, already available in the SDK (a minimal chunking sketch follows below).
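As a minimal illustration of the chunking idea; the chunk size and helper below are assumptions, not the SDK's actual constants or functions:

```python
# Minimal chunking sketch; UPSERT_CHUNK_SIZE is an assumed value.
from typing import Any, Dict, Iterator, List

UPSERT_CHUNK_SIZE = 10_000  # illustrative, not the SDK's real constant


def chunk(specs: List[Dict[str, Any]],
          size: int = UPSERT_CHUNK_SIZE) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` specs."""
    for start in range(0, len(specs), size):
        yield specs[start:start + size]


# Each chunk is serialized and uploaded independently, so the total number
# of data rows is no longer capped by a single request payload limit.
specs = [{"row_data": f"https://example.com/{i}.png"} for i in range(25_000)]
for i, piece in enumerate(chunk(specs)):
    print(f"chunk {i}: {len(piece)} rows")
```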
Challenges Addressed:
- Legacy Upsert Code: The upsert functionality relied heavily on outdated code for data generation.
  Solution: The code has been refactored to separate the shared data-generation logic from the differences, and a dedicated subclass was created specifically for the create_data_rows use case.
- Edge Cases in Data Row Creation: The original create_data_rows function contained various edge cases that could lead to errors.
  Solution: These edge cases were identified through our comprehensive integration test suite. To address them, we added logic to support uploading data from local files within create_data_rows (see the sketch below), and we verified that the remaining edge cases are now handled correctly.
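A hedged sketch of how the local-file case can be handled before chunking; `Client.upload_file` is an existing SDK method, but the wrapper below is illustrative, not the PR's actual implementation:

```python
# Illustrative helper: local paths are uploaded first and replaced with the
# returned hosted URL so downstream chunking only ever sees URLs.
import os
from typing import Any, Dict, List


def resolve_local_files(client, items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    resolved = []
    for item in items:
        row_data = item.get("row_data")
        if isinstance(row_data, str) and os.path.isfile(row_data):
            # Swap the local path for a hosted URL.
            item = {**item, "row_data": client.upload_file(row_data)}
        resolved.append(item)
    return resolved
```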
Technical Considerations:
To implement paginated result and error streaming while maintaining backward compatibility for create_data_rows, we created a subclass named DataUpsertTask that inherits from the Task class. Due to limitations in the extensibility of the Task class, some workarounds were necessary. The functionality has been thoroughly tested through extensive SDK integration tests.
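For context, an end-to-end usage sketch under the new behavior; an existing `dataset` is assumed, and the exact result shapes may differ:

```python
# Usage sketch; assumes an existing `dataset` from a configured Client.
# create_data_rows, wait_till_done, status, results, and errors are the
# pieces described in this PR.
task = dataset.create_data_rows(
    [{"row_data": f"https://example.com/{i}.png"} for i in range(200_000)]
)
task.wait_till_done()
print(task.status)

rows = task.results        # materialized list, as before
for err in task.errors or []:
    print(err)
```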