
Commit 64ada34

Merge pull request #71 from scaleapi/da/async_annotations
Da/async annotations
2 parents b664115 + 690cb9b commit 64ada34

File tree: 6 files changed, +156 −17 lines

.circleci/config.yml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ jobs:
       name: Pytest Test Cases
       command: | # Run test suite, uses NUCLEUS_TEST_API_KEY env variable
         mkdir test_results
-        poetry run coverage run --include=nucleus/* -m pytest --junitxml=test_results/junit.xml
+        poetry run coverage run --include=nucleus/* -m pytest -s -v --junitxml=test_results/junit.xml
         poetry run coverage report
         poetry run coverage html

README.md

Lines changed: 45 additions & 8 deletions
@@ -6,15 +6,13 @@ Aggregate metrics in ML are not good enough. To improve production ML, you need

 Scale Nucleus helps you:

-* Visualize your data
-* Curate interesting slices within your dataset
-* Review and manage annotations
-* Measure and debug your model performance
+- Visualize your data
+- Curate interesting slices within your dataset
+- Review and manage annotations
+- Measure and debug your model performance

 Nucleus is a new way—the right way—to develop ML models, helping us move away from the concept of one dataset and towards a paradigm of collections of scenarios.

-
-
 ## Installation

 `$ pip install scale-nucleus`
@@ -26,65 +24,83 @@ The client abstractions serves to authenticate the user and act as the gateway
 for users to interact with their datasets, models, and model runs.

 ### Create a client object
+
 ```python
 import nucleus
 client = nucleus.NucleusClient("YOUR_API_KEY_HERE")
 ```

 ### Create Dataset
+
 ```python
 dataset = client.create_dataset("My Dataset")
 ```

 ### List Datasets
+
 ```python
 datasets = client.list_datasets()
 ```

 ### Delete a Dataset
+
 By specifying target dataset id.
 A response code of 200 indicates successful deletion.
+
 ```python
 client.delete_dataset("YOUR_DATASET_ID")
 ```

 ### Append Items to a Dataset
+
 You can append both local images and images from the web. Simply specify the location and Nucleus will automatically infer if it's remote or a local file.
+
 ```python
 dataset_item_1 = DatasetItem(image_location="./1.jpeg", reference_id="1", metadata={"key": "value"})
 dataset_item_2 = DatasetItem(image_location="s3://srikanth-nucleus/9-1.jpg", reference_id="2", metadata={"key": "value"})
 ```

 The append function expects a list of `DatasetItem` objects to upload, like this:
+
 ```python
 response = dataset.append([dataset_item_1, dataset_item_2])
 ```

 ### Get Dataset Info
+
 Tells us the dataset name, number of dataset items, model_runs, and slice_ids.
+
 ```python
 dataset.info
 ```

 ### Access Dataset Items
+
 There are three methods to access individual Dataset Items:

 (1) Dataset Items are accessible by reference id
+
 ```python
 item = dataset.refloc("my_img_001.png")
 ```
+
 (2) Dataset Items are accessible by index
+
 ```python
 item = dataset.iloc(0)
 ```
+
 (3) Dataset Items are accessible by the dataset_item_id assigned internally
+
 ```python
 item = dataset.loc("dataset_item_id")
 ```

 ### Add Annotations
+
 Upload groundtruth annotations for the items in your dataset.
 Box2DAnnotation has same format as https://dashboard.scale.com/nucleus/docs/api#add-ground-truth
+
 ```python
 annotation_1 = BoxAnnotation(reference_id="1", label="label", x=0, y=0, width=10, height=10, annotation_id="ann_1", metadata={})
 annotation_2 = BoxAnnotation(reference_id="2", label="label", x=0, y=0, width=10, height=10, annotation_id="ann_2", metadata={})
@@ -94,6 +110,7 @@ response = dataset.annotate([annotation_1, annotation_2])
 For particularly large payloads, please reference the accompanying scripts in **references**

 ### Add Model
+
 The model abstraction is intended to represent a unique architecture.
 Models are independent of any dataset.

@@ -102,10 +119,12 @@ model = client.add_model(name="My Model", reference_id="newest-cnn-its-new", met
 ```

 ### Upload Predictions to ModelRun
+
 This method populates the model_run object with predictions. `ModelRun` objects need to reference a `Dataset` that has been created.
 Returns the associated model_id, human-readable name of the run, status, and user specified metadata.
 Takes a list of Box2DPredictions within the payload, where Box2DPrediction
 is formulated as in https://dashboard.scale.com/nucleus/docs/api#upload-model-outputs
+
 ```python
 prediction_1 = BoxPrediction(reference_id="1", label="label", x=0, y=0, width=10, height=10, annotation_id="pred_1", confidence=0.9)
 prediction_2 = BoxPrediction(reference_id="2", label="label", x=0, y=0, width=10, height=10, annotation_id="pred_2", confidence=0.2)
@@ -114,39 +133,51 @@ model_run = model.create_run(name="My Model Run", metadata={"timestamp": "121012
 ```

 ### Commit ModelRun
+
 The commit action indicates that the user is finished uploading predictions associated
-with this model run. Committing a model run kicks off Nucleus internal processes
+with this model run. Committing a model run kicks off Nucleus internal processes
 to calculate performance metrics like IoU. After being committed, a ModelRun object becomes immutable.
+
 ```python
 model_run.commit()
 ```

 ### Get ModelRun Info
+
 Returns the associated model_id, human-readable name of the run, status, and user specified metadata.
+
 ```python
 model_run.info
 ```

 ### Accessing ModelRun Predictions
+
 You can access the modelRun predictions for an individual dataset_item through three methods:

 (1) user specified reference_id
+
 ```python
 model_run.refloc("my_img_001.png")
 ```
+
 (2) Index
+
 ```python
 model_run.iloc(0)
 ```
+
 (3) Internally maintained dataset_item_id
+
 ```python
 model_run.loc("dataset_item_id")
 ```

 ### Delete ModelRun
+
 Delete a model run using the target model_run_id.

 A response code of 200 indicates successful deletion.
+
 ```python
 client.delete_model_run("model_run_id")
 ```
@@ -163,14 +194,20 @@ poetry install
 ```

 Please install the pre-commit hooks by running the following command:
+
 ```python
 poetry run pre-commit install
 ```

 **Best practices for testing:**
 (1). Please run pytest from the root directory of the repo, i.e.
+
 ```
-poetry pytest tests/test_dataset.py
+poetry run pytest tests/test_dataset.py
 ```

+(2) To skip slow integration tests that have to wait for an async job to start.
+
+```
+poetry run pytest -m "not integration"
+```

nucleus/dataset.py

Lines changed: 17 additions & 2 deletions
@@ -8,14 +8,15 @@
     serialize_and_write_to_presigned_url,
 )

-from .annotation import Annotation
+from .annotation import Annotation, check_all_annotation_paths_remote
 from .constants import (
     DATASET_ITEM_IDS_KEY,
     DATASET_LENGTH_KEY,
     DATASET_MODEL_RUNS_KEY,
     DATASET_NAME_KEY,
     DATASET_SLICES_KEY,
     DEFAULT_ANNOTATION_UPDATE_MODE,
+    JOB_ID_KEY,
     NAME_KEY,
     REFERENCE_IDS_KEY,
     REQUEST_ID_KEY,
@@ -143,7 +144,8 @@ def annotate(
         annotations: List[Annotation],
         update: Optional[bool] = DEFAULT_ANNOTATION_UPDATE_MODE,
         batch_size: int = 5000,
-    ) -> dict:
+        asynchronous: bool = False,
+    ) -> Union[Dict[str, Any], AsyncJob]:
         """
         Uploads ground truth annotations for a given dataset.
         :param annotations: ground truth annotations for a given dataset to upload
@@ -156,6 +158,19 @@ def annotate(
             "ignored_items": int,
         }
         """
+        if asynchronous:
+            check_all_annotation_paths_remote(annotations)
+
+            request_id = serialize_and_write_to_presigned_url(
+                annotations, self.id, self._client
+            )
+            response = self._client.make_request(
+                payload={REQUEST_ID_KEY: request_id, UPDATE_KEY: update},
+                route=f"dataset/{self.id}/annotate?async=1",
+            )
+
+            return AsyncJob(response[JOB_ID_KEY], self._client)
+
         return self._client.annotate_dataset(
             self.id, annotations, update=update, batch_size=batch_size
         )
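
Taken together with the tests below, the new flag lets `Dataset.annotate` return an `AsyncJob` handle instead of a blocking response. A minimal usage sketch, assuming the dataset already contains an item with reference_id "1" and that all annotation payloads are remote (asynchronous uploads reject local paths via `check_all_annotation_paths_remote`):

```python
import nucleus
from nucleus.annotation import BoxAnnotation

client = nucleus.NucleusClient("YOUR_API_KEY_HERE")
dataset = client.create_dataset("My Dataset")
# ... append items with reference_id "1" first, as in the README section above ...

bbox = BoxAnnotation(reference_id="1", label="label", x=0, y=0,
                     width=10, height=10, annotation_id="ann_1", metadata={})

# asynchronous=True serializes the annotations to a presigned URL and returns
# an AsyncJob rather than the usual response dict.
job = dataset.annotate(annotations=[bbox], asynchronous=True)
job.sleep_until_complete()  # poll until the server-side job finishes
print(job.status())         # job_id, status, and per-upload counts
print(job.errors())         # per-item errors, if any
```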

pyproject.toml

Lines changed: 7 additions & 2 deletions
@@ -21,7 +21,7 @@ exclude = '''

 [tool.poetry]
 name = "scale-nucleus"
-version = "0.1.5"
+version = "0.1.16"
 description = "The official Python client library for Nucleus, the Data Platform for AI"
 license = "MIT"
 authors = ["Scale AI Nucleus Team <nucleusapi@scaleapi.com>"]
@@ -35,7 +35,7 @@ packages = [{include="nucleus"}]
 python = "^3.6.2"
 grequests = "^0.6.0"
 requests = "^2.25.1"
-tqdm = "^4.60.0"
+tqdm = "^4.41.0"
 dataclasses = { version = "^0.7", python = "^3.6.1, <3.7" }

 [tool.poetry.dev-dependencies]
@@ -48,6 +48,11 @@ mypy = "^0.812"
 coverage = "^5.5"
 pre-commit = "^2.12.1"

+[tool.pytest.ini_options]
+markers = [
+    "integration: marks tests as slow (deselect with '-m \"not integration\"')",
+]
+

 [build-system]
 requires = ["poetry-core>=1.0.0"]

tests/test_dataset.py

Lines changed: 83 additions & 1 deletion
@@ -1,8 +1,16 @@
-from nucleus.job import JobError
+from nucleus.annotation import (
+    BoxAnnotation,
+    PolygonAnnotation,
+    SegmentationAnnotation,
+)
+from nucleus.job import AsyncJob, JobError
 import pytest
 import os

 from .helpers import (
+    TEST_BOX_ANNOTATIONS,
+    TEST_POLYGON_ANNOTATIONS,
+    TEST_SEGMENTATION_ANNOTATIONS,
     TEST_SLICE_NAME,
     TEST_DATASET_NAME,
     TEST_IMG_URLS,
@@ -136,6 +144,7 @@ def test_dataset_append_local(CLIENT, dataset):
     assert ERROR_PAYLOAD not in resp_json


+@pytest.mark.integration
 def test_dataset_append_async(dataset: Dataset):
     job = dataset.append(make_dataset_items(), asynchronous=True)
     job.sleep_until_complete()
@@ -165,6 +174,7 @@ def test_dataset_append_async_with_local_path(dataset: Dataset):
         dataset.append(ds_items, asynchronous=True)


+@pytest.mark.integration
 def test_dataset_append_async_with_1_bad_url(dataset: Dataset):
     ds_items = make_dataset_items()
     ds_items[0].image_location = "https://looks.ok.but.is.not.accessible"
@@ -238,3 +248,75 @@ def test_dataset_export_autotag_scores(CLIENT):
     for column in ["dataset_item_ids", "ref_ids", "scores"]:
         assert column in scores
         assert len(scores[column]) > 0
+
+
+@pytest.mark.integration
+def test_annotate_async(dataset: Dataset):
+    dataset.append(make_dataset_items())
+    semseg = SegmentationAnnotation.from_json(TEST_SEGMENTATION_ANNOTATIONS[0])
+    polygon = PolygonAnnotation(**TEST_POLYGON_ANNOTATIONS[0])
+    bbox = BoxAnnotation(**TEST_BOX_ANNOTATIONS[0])
+
+    job: AsyncJob = dataset.annotate(
+        annotations=[semseg, polygon, bbox],
+        asynchronous=True,
+    )
+    job.sleep_until_complete()
+    assert job.status() == {
+        "job_id": job.id,
+        "status": "Completed",
+        "message": {
+            "annotation_upload": {
+                "epoch": 1,
+                "total": 2,
+                "errored": 0,
+                "ignored": 0,
+                "datasetId": dataset.id,
+                "processed": 2,
+            },
+            "segmentation_upload": {
+                "errors": [],
+                "ignored": 0,
+                "n_errors": 0,
+                "processed": 1,
+            },
+        },
+    }
+
+
+@pytest.mark.integration
+def test_annotate_async_with_error(dataset: Dataset):
+    dataset.append(make_dataset_items())
+    semseg = SegmentationAnnotation.from_json(TEST_SEGMENTATION_ANNOTATIONS[0])
+    polygon = PolygonAnnotation(**TEST_POLYGON_ANNOTATIONS[0])
+    bbox = BoxAnnotation(**TEST_BOX_ANNOTATIONS[0])
+    bbox.reference_id = "fake_garbage"
+
+    job: AsyncJob = dataset.annotate(
+        annotations=[semseg, polygon, bbox],
+        asynchronous=True,
+    )
+    job.sleep_until_complete()
+
+    assert job.status() == {
+        "job_id": job.id,
+        "status": "Completed",
+        "message": {
+            "annotation_upload": {
+                "epoch": 1,
+                "total": 2,
+                "errored": 1,
+                "ignored": 0,
+                "datasetId": dataset.id,
+                "processed": 1,
+            },
+            "segmentation_upload": {
+                "errors": [],
+                "ignored": 0,
+                "n_errors": 0,
+                "processed": 1,
+            },
+        },
+    }
+
+    assert "Item with id fake_garbage doesn" in str(job.errors())
