New knob for stream stall detection on S3 storage (#1170)
* New knob for stream stall detection on S3 storage
  Affects: `s3_storage`, `s3_store`, `tigris_storage`, `r2_storage`.
  They all now support a new `network_stream_timeout_seconds` argument
  with a default of 20. If set to 0, stream stall detection is deactivated.
* Change default to 60
* Fix backwards compatibility
* Fix S3Options constructor in pyi
* Added description of new config settings
* Fix documentation link and typo
* Fix link
---------
Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>
Icechunk uses Zstd compression to compress its metadata files. [`CompressionConfig`](./reference.md#icechunk.CompressionConfig) allows you to configure the [compression level](./reference.md#icechunk.CompressionConfig.level) and [algorithm](./reference.md#icechunk.CompressionConfig.algorithm). Currently, the only algorithm available is [`Zstd`](https://facebook.github.io/zstd/).
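
As a rough sketch, compression could be configured like this when opening a repository. The example assumes a `storage` object created earlier (e.g. with `icechunk.s3_storage(...)`) and that `CompressionConfig` is passed through `RepositoryConfig`'s `compression` field:

```python
import icechunk

# Sketch: tune metadata compression; level 5 is just an illustrative value
config = icechunk.RepositoryConfig(
    compression=icechunk.CompressionConfig(
        level=5,
        algorithm=icechunk.CompressionAlgorithm.Zstd,
    ),
)
repo = icechunk.Repository.open(storage=storage, config=config)
```
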
The level of concurrency used in each request is controlled by the zarr `async.concurrency` config parameter.

```python
import zarr
print(zarr.config.get("async.concurrency"))
# -> 10 (default)
```

Large machines in close proximity to object storage can benefit from much more concurrency. For high-performance configuration, we recommend much higher values, e.g.

```python
zarr.config.set({"async.concurrency": 128})
```

Note that this concurrency limit is _per individual Zarr Array read/write operation_:

```python
# chunks fetched concurrently up to async.concurrency limit
data = array[:]

# chunks written concurrently up to async.concurrency limit
array[:] = data
```

### Dask and Multi-Tiered Concurrency

Using Dask with Zarr introduces _another_ layer of concurrency: the number of Dask threads or workers.
If each Dask task addresses multiple Zarr chunks, the amount of concurrency multiplies.
In these circumstances, it is possible to generate _too much concurrency_.
If there are **thousands** of concurrent HTTP requests in flight, they may start to stall or time out.
To prevent this, Icechunk introduces a global concurrency limit.
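
For a concrete picture of how the layers multiply, here is a sketch that writes a Dask array into the Zarr array used in the snippets above (assuming `array` is two-dimensional and using Dask's default threaded scheduler):

```python
import dask.array as da

# Each Dask task writes one 512x512 block; if a block spans several Zarr chunks,
# every task can issue up to `async.concurrency` concurrent chunk requests, so the
# total number of in-flight requests is roughly (Dask threads) x (async.concurrency).
source = da.random.random(array.shape, chunks=(512, 512))
da.store(source, array, lock=False)
```
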
### Icechunk Global Concurrency Limit

Each Icechunk repo has a cap on the maximum number of concurrent requests that will be made.
The default concurrency limit is 256.

For example, the following code sets this limit to 10:
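
A rough sketch of what this could look like, assuming the cap is exposed through the storage concurrency settings on `RepositoryConfig`; the `max_concurrent_requests` field name below is an assumption, and `storage` is the storage object created elsewhere:

```python
import icechunk

config = icechunk.RepositoryConfig(
    storage=icechunk.StorageSettings(
        concurrency=icechunk.StorageConcurrencySettings(
            max_concurrent_requests=10,  # assumed name for the global cap
        ),
    ),
)
repo = icechunk.Repository.open(storage=storage, config=config)
```
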
In this configuration, even if the upper layers of the stack (Dask and Zarr) issue many more concurrent requests, Icechunk will only open 10 HTTP connections to the object store at once.
### Stalled Network Streams

A stalled network stream is an HTTP connection which does not transfer any data over a certain period.
Stalled connections may occur in the following situations:

- When the client is connecting to a remote object store behind a slow network connection.
- When the client is behind a VPN or proxy server which is limiting the number or throughput of connections between the client and the remote object store.
- When the client tries to issue a high volume of concurrent requests. (Note that the global concurrency limit described above should help avoid this, but the precise limit is hardware- and network-dependent.)

By default, Icechunk detects stalled HTTP connections and raises an error when it sees one.
These errors typically contain lines like:

```
|-> I/O error
|-> streaming error
`-> minimum throughput was specified at 1 B/s, but throughput of 0 B/s was observed
```

This behavior is configurable when creating a new `Storage` object, via the `network_stream_timeout_seconds` parameter.
The default is 60 seconds.
To set a different value, specify it as follows:

```python
storage = icechunk.s3_storage(
    **other_storage_kwargs,
    network_stream_timeout_seconds=50,
)
repo = icechunk.Repository.open(storage=storage)
```

Specifying a value of 0 disables this check entirely.
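
For example, to turn the check off entirely, pass 0 (reusing the `other_storage_kwargs` placeholder from the snippet above):

```python
storage = icechunk.s3_storage(
    **other_storage_kwargs,
    network_stream_timeout_seconds=0,  # disables stream stall detection
)
```
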

## Scalability

## Splitting manifests

!!! info

    This is advanced material, and you will need it only if you have arrays with more than a million chunks.
    Icechunk aims to provide an excellent experience out of the box.

Icechunk stores chunk references in a chunk manifest file stored in `manifests/`.
By default, Icechunk stores all chunk references in a single manifest file per array.
For very large arrays (millions of chunks), these files can get quite large.

Note that the chunk sizes in the following examples are tiny for demonstration purposes.

### Configuring splitting

To solve this issue, Icechunk lets you **split** the manifest files by specifying a ``ManifestSplittingConfig``.
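
A sketch of what a splitting configuration might look like; the `from_dict` constructor and the `ManifestSplitCondition`/`ManifestSplitDimCondition` helpers used below are assumptions, so check the [reference](./reference.md) for the exact API:

```python
import icechunk

# Sketch: split the manifest of any array named "temperature" into pieces holding
# at most 3 chunk refs along its "time" dimension (a tiny size, for illustration)
split_config = icechunk.ManifestSplittingConfig.from_dict(
    {
        icechunk.ManifestSplitCondition.name_matches("temperature"): {
            icechunk.ManifestSplitDimCondition.DimensionName("time"): 3,
        },
    }
)
config = icechunk.RepositoryConfig(
    manifest=icechunk.ManifestConfig(splitting=split_config),
)
```
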
This ends up rewriting all refs to two new manifests.
### Rewriting manifests

At that point, you will want to experiment with different manifest split configurations.
To force Icechunk to rewrite all chunk refs to the current splitting configuration, use [`rewrite_manifests`](./reference.md#icechunk.Repository.rewrite_manifests).
To illustrate, we will use a split size of 3 --- for the current example this will consolidate to two manifests.
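
A sketch of what that call might look like; the positional `message` and the `branch` keyword below are assumptions, so check the [reference](./reference.md#icechunk.Repository.rewrite_manifests) for the exact signature:

```python
# Sketch: rewrite existing manifests using the currently configured split size of 3
repo.rewrite_manifests(
    "rewrite manifests with split size 3",  # commit message (assumed argument)
    branch="main",
)
```
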
This example will preload all manifests that match the regex "x" when opening a Session. While this is a simple example, you can use the `ManifestPreloadCondition` class to create more complex preload conditions using the following options:

- `ManifestPreloadCondition.name_matches` takes a regular expression used to match an array's name;
- `ManifestPreloadCondition.path_matches` takes a regular expression used to match an array's path;
- `ManifestPreloadCondition.and_conditions` to combine (1), (2), and (4) together; and
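
For instance, a sketch combining the first two conditions above (assuming `and_conditions` accepts a list of conditions):

```python
import icechunk

# Preload manifests only for arrays matching the name "x" whose path contains "coords"
preload_if = icechunk.ManifestPreloadCondition.and_conditions(
    [
        icechunk.ManifestPreloadCondition.name_matches("x"),
        icechunk.ManifestPreloadCondition.path_matches(".*coords.*"),
    ]
)
```
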
Once you find a preload configuration you like, remember to persist it on-disk using `repo.save_config`. The saved config can be overridden at runtime for different applications.
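
For example, a minimal sketch (with `repo` being the repository opened earlier):

```python
# Persist the current runtime configuration into the repository so that
# future sessions pick it up by default
repo.save_config()
```
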
#### Default preload configuration
Icechunk has a default `preload_if` configuration that will preload all manifests that match [cf-xarray's coordinate axis regex](https://github.com/xarray-contrib/cf-xarray/blob/1591ff5ea7664a6bdef24055ef75e242cd5bfc8b/cf_xarray/criteria.py#L149-L160).