Commit 9666866

doc edits and bump to 0.3 (#153)
1 parent 53911fa commit 9666866

File tree

5 files changed, +104 -30 lines changed


Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

README.md

Lines changed: 8 additions & 10 deletions
@@ -13,19 +13,17 @@
 
 Simple, fast integration with object storage services like Amazon S3, Google Cloud Storage, Azure Blob Storage, and S3-compliant APIs like Cloudflare R2.
 
-- Sync and async API.
-- Streaming downloads with configurable chunking.
-- Streaming uploads from async or sync iterators.
-- Streaming `list`, with no need to paginate.
+- Sync and async API with **full type hinting**.
+- **Streaming downloads** with configurable chunking.
+- **Streaming uploads** from async or sync iterators.
+- **Streaming list**, with no need to paginate.
+- Automatically uses [**multipart uploads**](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large file objects.
+- Support for **conditional put** ("put if not exists"), as well as custom tags and attributes.
+- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`s.
 - File-like object API and [fsspec](https://github.com/fsspec/filesystem_spec) integration.
-- Support for conditional put ("put if not exists"), as well as custom tags and attributes.
-- Automatically uses [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) under the hood for large file objects.
-- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`/`list` objects.
 - Easy to install with no required Python dependencies.
 - The [underlying Rust library](https://docs.rs/object_store) is production quality and used in large scale production systems, such as the Rust package registry [crates.io](https://crates.io/).
-- Zero-copy data exchange between Rust and Python in `get_range`, `get_ranges`, `GetResult.bytes`, and `put` via the Python [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).
-- Simple API with static type checking.
-- Helpers for constructing from environment variables and `boto3.Session` objects
+- Zero-copy data exchange between Rust and Python via the [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).
 
 <!-- For Rust developers looking to add object_store support to their Python packages, refer to pyo3-object_store. -->
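To make the updated feature list concrete, here is a minimal sketch of the sync API the bullets describe. It assumes a `store` object has already been constructed from one of the `obstore.store` classes (not shown), and the key name is a placeholder.

```py
import obstore as obs

# Assumes `store` was created elsewhere from one of the obstore.store classes.
path = "data/example.bin"

# put() writes atomically and switches to multipart uploads for large payloads.
obs.put(store, path, b"hello world")

# get() returns a response whose body can be streamed or materialized.
resp = obs.get(store, path)
data = resp.bytes()
assert bytes(data) == b"hello world"
```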

docs/cookbook.md

Lines changed: 92 additions & 15 deletions
@@ -65,13 +65,13 @@ for record_batch in stream:
 
 The Arrow record batch looks like the following:
 
-| path | last_modified | size | e_tag | version |
-|:--------------------------------------------------------------------|:--------------------------|---------:|:-------------------------------------|:----------|
-| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
-| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
-| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
-| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
-| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
+| path | last_modified | size | e_tag | version |
+| :------------------------------------------------------------------ | :------------------------ | -------: | :----------------------------------- | :------ |
+| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
+| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
+| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
+| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
+| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
 
 ## Fetch objects
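For context on where the `stream` of record batches in this hunk comes from: the sketch below follows the cookbook's list example, assuming `obs.list` takes the `return_arrow` flag implied by the README's "list results as Arrow" bullet; the `store` and prefix are placeholders.

```py
import obstore as obs

# Placeholder store and prefix; return_arrow is assumed from the
# "return list results as Arrow" feature described in the README.
stream = obs.list(store, "sentinel-s2-l2a-cogs/", return_arrow=True)

for record_batch in stream:
    # Each batch carries the columns shown in the table above:
    # path, last_modified, size, e_tag, version.
    print(record_batch.num_rows)
```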

@@ -109,6 +109,21 @@ for chunk in stream:
 assert total_buffer_len == meta.size
 ```
 
+### Download to disk
+
+Using the response as an iterator ensures that we don't buffer the entire file
+into memory.
+
+```py
+import obstore as obs
+
+resp = obs.get(store, path)
+
+with open("output/file", "wb") as f:
+    for chunk in resp:
+        f.write(chunk)
+```
+
 ## Put object
 
 Use the [`obstore.put`][] function to atomically write data. `obstore.put` will automatically use [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large input data.
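The paragraph above introduces `obstore.put`, and the hunks that follow exercise it with iterators. As a quick reference, here is a sketch of its two simplest input forms, both of which appear in the copy recipes added later in this commit; the `store` and key names are placeholders.

```py
from pathlib import Path
import obstore as obs

# Placeholder store and keys.
# Upload an in-memory buffer.
obs.put(store, "data/from-bytes", b"some payload")

# Upload a local file by passing a pathlib.Path.
obs.put(store, "data/from-file", Path("local-file.bin"))
```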
@@ -160,7 +175,6 @@ content = bytes_iter()
 obs.put(store, path, content)
 ```
 
-
 Or async iterables:
 
 ```py
@@ -178,10 +192,55 @@ obs.put(store, path, content)
 
 ## Copy objects from one store to another
 
-Perhaps you have data in AWS S3 that you need to copy to Google Cloud Storage. It's easy to **stream** a `get` from one store directly to the `put` of another.
+Perhaps you have data in one store, say AWS S3, that you need to copy to another, say Google Cloud Storage.
+
+### In memory
+
+Download the file, collect its bytes in memory, then upload it. Note that this will materialize the entire file in memory.
+
+```py
+import obstore as obs
+
+store1 = get_object_store()
+store2 = get_object_store()
+
+path1 = "data/file1"
+path2 = "data/file2"
+
+buffer = obs.get(store1, path1).bytes()
+obs.put(store2, path2, buffer)
+```
+
+### Local file
+
+First download the file to disk, then upload it.
+
+```py
+from pathlib import Path
+import obstore as obs
+
+store1 = get_object_store()
+store2 = get_object_store()
+
+path1 = "data/file1"
+path2 = "data/file2"
+
+resp = obs.get(store1, path1)
+
+with open("temporary_file", "wb") as f:
+    for chunk in resp:
+        f.write(chunk)
+
+# Upload the path
+obs.put(store2, path2, Path("temporary_file"))
+```
+
+### Streaming
+
+It's easy to **stream** a download from one store directly as the upload to another. Only the given chunk size is buffered in memory at a time.
 
 !!! note
-    Using the async API is required for this.
+    Using the async API is currently required to use streaming copies.
 
 ```py
 import obstore as obs
@@ -190,16 +249,34 @@ store1 = get_object_store()
 store2 = get_object_store()
 
 path1 = "data/file1"
-path2 = "data/file1"
+path2 = "data/file2"
 
 # This only constructs the stream, it doesn't materialize the data in memory
-resp = await obs.get_async(store1, path1, timeout="2min")
+resp = await obs.get_async(store1, path1)
+# A streaming upload is created to copy the file to path2
+await obs.put_async(store2, path2, resp)
+```
+
+Or, by customizing the chunk size and the upload concurrency, you can control memory overhead.
+
+```py
+resp = await obs.get_async(store1, path1)
+chunk_size = 5 * 1024 * 1024  # 5MB
+stream = resp.stream(min_chunk_size=chunk_size)
 
 # A streaming upload is created to copy the file to path2
-await obs.put_async(store2, path2)
+await obs.put_async(
+    store2,
+    path2,
+    stream,
+    chunk_size=chunk_size,
+    max_concurrency=12
+)
 ```
 
+This will start up to 12 concurrent uploads, each with around 5MB chunks, giving a total memory usage of up to _roughly_ 60MB for this copy.
+
 !!! note
-    You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to upload the file to the destination.
+    You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to upload the file to the destination.
 
-    You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed to the initial `get_async` call.
+    You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed when creating the store.
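Both this note and the `_get.pyi` docstring below now point at configuring the timeout when the store is created rather than per request. Here is a sketch of what that can look like, assuming an S3-style store whose constructor accepts a `client_options` mapping with a `timeout` entry; the bucket name and duration are placeholders.

```py
from obstore.store import S3Store

# Placeholder bucket and illustrative timeout; a longer timeout gives a
# streaming copy of a large object time to finish.
store = S3Store("my-bucket", client_options={"timeout": "120s"})
```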

obstore/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "obstore"
-version = "0.3.0-beta.11"
+version = "0.3.0"
 authors = { workspace = true }
 edition = { workspace = true }
 description = "A Python interface to the Rust object_store crate, providing a uniform API for interacting with object storage services and local files."

obstore/python/obstore/_get.pyi

Lines changed: 2 additions & 3 deletions
@@ -223,9 +223,8 @@ class BytesStream:
     }
     ```
 
-    To fix this, set the `timeout` parameter in the `client_options` passed to the
-    initial `get` or `get_async` call. See
-    [ClientConfig][obstore.store.ClientConfig].
+    To fix this, set the `timeout` parameter in the
+    [`client_options`][obstore.store.ClientConfig] passed when creating the store.
     """
 
     def __aiter__(self) -> BytesStream:
