### README.md (+8 −10)
- Simple, fast integration with object storage services like Amazon S3, Google Cloud Storage, Azure Blob Storage, and S3-compliant APIs like Cloudflare R2.
- Sync and async API with **full type hinting**.
- **Streaming downloads** with configurable chunking.
- **Streaming uploads** from async or sync iterators.
- **Streaming `list`**, with no need to paginate.
- Automatically uses [**multipart uploads**](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large file objects.
- Support for **conditional put** ("put if not exists"), as well as custom tags and attributes.
- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`s.
- File-like object API and [fsspec](https://github.com/fsspec/filesystem_spec) integration.
- Easy to install with no required Python dependencies.
- The [underlying Rust library](https://docs.rs/object_store) is production quality and used in large-scale production systems, such as the Rust package registry [crates.io](https://crates.io/).
- Zero-copy data exchange between Rust and Python via the [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).

<!-- For Rust developers looking to add object_store support to their Python packages, refer to pyo3-object_store. -->
Using the response as an iterator ensures that we don't buffer the entire file
into memory.

```py
import obstore as obs

resp = obs.get(store, path)

with open("output/file", "wb") as f:
    for chunk in resp:
        f.write(chunk)
```
## Put object

Use the [`obstore.put`][] function to atomically write data. `obstore.put` will automatically use [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large input data.
Perhaps you have data in one store, say AWS S3, that you need to copy to another, say Google Cloud Storage.
### In memory

Download the file, collect its bytes in memory, then upload it. Note that this will materialize the entire file in memory.

```py
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file2"

buffer = obs.get(store1, path1).bytes()
obs.put(store2, path2, buffer)
```
### Local file

First download the file to disk, then upload it.

```py
from pathlib import Path

import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file2"

resp = obs.get(store1, path1)

with open("temporary_file", "wb") as f:
    for chunk in resp:
        f.write(chunk)

# Upload the path
obs.put(store2, path2, Path("temporary_file"))
```
### Streaming

It's easy to **stream** a download from one store directly as the upload to another. Only the given chunk size is buffered in memory at a time.

!!! note

    Using the async API is currently required to use streaming copies.

```py
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file2"

# This only constructs the stream, it doesn't materialize the data in memory
resp = await obs.get_async(store1, path1)

# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2, resp)
```

Or, by customizing the chunk size and the upload concurrency you can control memory overhead.

```py
resp = await obs.get_async(store1, path1)
chunk_size = 5 * 1024 * 1024  # 5MB
stream = resp.stream(min_chunk_size=chunk_size)

# A streaming upload is created to copy the file to path2
await obs.put_async(
    store2,
    path2,
    stream,
    chunk_size=chunk_size,
    max_concurrency=12
)
```

This will start up to 12 concurrent uploads, each with around 5MB chunks, giving a total memory usage of up to _roughly_ 60MB for this copy.

!!! note

    You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to upload the file to the destination.

    You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed when creating the store.
### obstore/Cargo.toml (+1 −1)

[package]
name = "obstore"
version = "0.3.0"
authors = { workspace = true }
edition = { workspace = true }
description = "A Python interface to the Rust object_store crate, providing a uniform API for interacting with object storage services and local files."