docs/source/en/guides/manage-cache.md
+42 -8 (42 additions & 8 deletions)
@@ -174,19 +174,30 @@ by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable to true.
174
174
175
175
## Chunk-based caching (Xet)
176
176
177
-
To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. This cache holds chunks, which are immutable byte ranges from files (up to 64KB) that are created using content-defined chunking. For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/storage-backends).
177
+
To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating an additional caching layer that enables chunk-based deduplication. This cache holds chunks (immutable byte ranges of files, roughly 64KB in size) and shards (a data structure that maps files to chunks). For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/storage-backends).
178
178
179
-
The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads with the following structure
179
+
The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads. It has the following structure:
180
180
181
181
```bash
182
182
<CACHE_DIR>
183
-
├─ chunk_cache
184
-
├─ shard_cache
183
+
├─ xet
184
+
│ ├─ environment_identifier
185
+
│ │ ├─ chunk_cache
186
+
│ │ ├─ shard_cache
187
+
│ │ ├─ staging
185
188
```
186
189
187
-
The `xet` cache, like the rest of `hf_xet` is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system.
190
+
The `environment_identifier` directory name is an encoded string (on your machine it may appear as something like `https___cas_serv-tGqkUaZf_CBPHQ6h`). It is used during development so that local and production versions of the cache can exist alongside each other. It is also used when downloading from repositories that reside in different [storage regions](https://huggingface.co/docs/hub/storage-regions). You may see multiple such entries in the `xet` directory, each corresponding to a different environment, but their internal structure is the same.
191
+
192
+
The internal directories serve the following purposes:
193
+
* `chunk_cache` contains cached data chunks that are used to speed up downloads.
194
+
* `shard_cache` contains cached shards that are used on the upload path.
195
+
* `staging` is a workspace designed to support resumable uploads.
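To see which of these directories exist on a given machine, a small script can walk the `xet` directory. This is a minimal sketch: `describe_xet_cache` is a hypothetical helper (not part of `huggingface_hub` or `hf_xet`), and it assumes the default location `~/.cache/huggingface/xet`.

```python
from pathlib import Path

# Sub-directory names taken from the directory tree in this guide.
EXPECTED_SUBDIRS = ("chunk_cache", "shard_cache", "staging")


def describe_xet_cache(xet_dir: Path) -> dict:
    """Map each environment directory to the expected sub-directories it contains.

    Hypothetical helper for illustration only.
    """
    layout = {}
    if not xet_dir.is_dir():
        # No xet cache has been created at this location.
        return layout
    for env_dir in sorted(p for p in xet_dir.iterdir() if p.is_dir()):
        layout[env_dir.name] = {
            name: (env_dir / name).is_dir() for name in EXPECTED_SUBDIRS
        }
    return layout


if __name__ == "__main__":
    # Assumed default location; adjust if your cache lives elsewhere.
    default_xet_dir = Path.home() / ".cache" / "huggingface" / "xet"
    for env_name, subdirs in describe_xet_cache(default_xet_dir).items():
        print(env_name, subdirs)
```

Each key in the result is one environment identifier, so multiple entries simply mean that several environments have been used on this machine.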
196
+
197
+
These are documented below.
198
+
199
+
Note that the `xet` caching system, like the rest of `hf_xet`, is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` caches are built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system.
188
200
189
-
The `chunk-cache` directory contains cached data chunks that are used to speed up downloads while the `shard-cache` directory contains cached shards that are utilized on the upload path.
190
201
191
202
### `chunk_cache`
192
203
@@ -234,11 +245,29 @@ Shards provide a mapping between files and chunks. During uploads, each file is
234
245
235
246
All shards have an expiration date of 3-4 weeks from when they are downloaded. Shards that are expired are not loaded during upload and are deleted one week after expiration.
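The expiration rule can be pictured as a three-state lifecycle. The sketch below is illustrative only: `shard_state` and `DELETE_GRACE` are hypothetical names, and how `hf_xet` actually records and checks expiration is not described here.

```python
from datetime import datetime, timedelta

# From the text: expired shards are deleted one week after expiration.
DELETE_GRACE = timedelta(weeks=1)


def shard_state(expires_at: datetime, now: datetime) -> str:
    """Classify a shard by its expiration date (hypothetical helper)."""
    if now < expires_at:
        return "usable"      # loaded during uploads
    if now < expires_at + DELETE_GRACE:
        return "expired"     # kept on disk, but not loaded during uploads
    return "deletable"       # eligible for removal


if __name__ == "__main__":
    downloaded = datetime(2025, 1, 1)
    # Assumption for the example: this shard expires 3 weeks after download.
    expires = downloaded + timedelta(weeks=3)
    print(shard_state(expires, downloaded + timedelta(weeks=2)))
```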
236
247
248
+
### `staging`
249
+
250
+
When an upload terminates before the new content has been committed to the repository, you will need to resume the file transfer. However, it is possible that some chunks were successfully uploaded prior to the interruption.
251
+
252
+
So that you do not have to restart from the beginning, the `staging` directory acts as a workspace during uploads, storing metadata for successfully uploaded chunks. The `staging` directory has the following shape:
As files are processed and chunks successfully uploaded, their metadata is stored in `xorb-metadata` as a shard. Upon resuming an upload session, each file is processed again and the shards in this directory are consulted. Any content that was successfully uploaded is skipped, and any new content is uploaded (and its metadata saved).
263
+
264
+
Meanwhile, `shard-session` stores file and chunk information for processed files. On successful completion of an upload, the content from these shards is moved to the more persistent `shard_cache`.
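The resume behavior amounts to "skip what is already recorded, upload and record the rest". The following is a simplified sketch of that idea, not the `hf_xet` implementation: `resume_upload` is a hypothetical function, SHA-256 stands in for whatever hashing `hf_xet` uses, and a plain Python set stands in for the metadata kept in `staging`.

```python
import hashlib
from typing import Callable, Iterable, List, Set


def resume_upload(
    chunks: Iterable[bytes],
    already_uploaded: Set[str],
    upload: Callable[[bytes], None],
) -> List[str]:
    """Upload only chunks that are not yet recorded, then record them.

    Returns the digests of the chunks uploaded in this call.
    """
    sent = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()  # stand-in hash
        if digest in already_uploaded:
            continue  # uploaded before the interruption, skip it
        upload(chunk)
        already_uploaded.add(digest)  # save metadata for future resumes
        sent.append(digest)
    return sent


if __name__ == "__main__":
    recorded: Set[str] = set()
    transferred: List[bytes] = []
    # First attempt is "interrupted" after two chunks.
    resume_upload([b"chunk-1", b"chunk-2"], recorded, transferred.append)
    # Resuming processes the whole file again but sends only the new chunk.
    resume_upload([b"chunk-1", b"chunk-2", b"chunk-3"], recorded, transferred.append)
    print(len(transferred))
```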
265
+
237
266
### Limits and Limitations
238
267
239
-
The `chunk_cache` is limited to 10GB in size while the `shard_cache`is technically without limits (in practice, the size and use of shards are such that limiting the cache is unnecessary).
268
+
The `chunk_cache` is limited to 10GB in size, while the `shard_cache` has a soft limit of 4GB. By design, both caches are without high-level APIs, although their size is configurable through the `HF_XET_CHUNK_CACHE_SIZE_BYTES` and `HF_XET_SHARD_CACHE_SIZE_LIMIT` environment variables.
240
269
241
-
By design, both caches are without high-level APIs. These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
270
+
These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
242
271
243
272
If you need to reclaim the space utilized by either cache or need to debug any potential cache-related issues, simply remove the `xet` cache entirely by running `rm -rf <cache_dir>/xet`, where `<cache_dir>` is the location of your Hugging Face cache, typically `~/.cache/huggingface`.
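Before removing the directory, it can be useful to check how much space it occupies. This is a minimal sketch using only the standard library: `directory_size_bytes` is a hypothetical helper, and the script assumes the `xet` directory sits under `HF_HOME` (falling back to `~/.cache/huggingface`).

```python
import os
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below `root` (0 if it does not exist)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Assumption: the xet cache lives in "<cache_dir>/xet".
    cache_dir = Path(os.environ.get("HF_HOME", "~/.cache/huggingface")).expanduser()
    xet_dir = cache_dir / "xet"
    print(f"{xet_dir}: {directory_size_bytes(xet_dir) / 1e9:.2f} GB")
```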
244
273
@@ -257,6 +286,11 @@ Example full `xet`cache directory tree:
docs/source/en/package_reference/environment_variables.md
+8 -2 (8 additions & 2 deletions)
@@ -93,9 +93,15 @@ Integer value to define the number of seconds to wait for server response when d
93
93
94
94
### HF_XET_CHUNK_CACHE_SIZE_BYTES
95
95
96
-
To set the size of the Xet cache locally. Increasing this will give more space for caching terms/chunks fetched from S3. A larger cache can better take advantage of deduplication across repos & files. If your network speed is much greater than your local disk speed (ex 10Gbps vs SSD or worse) then consider disabling the Xet cache for increased performance. To disable the Xet cache, set `HF_XET_CHUNK_CACHE_SIZE_BYTES=0`.
96
+
Sets the size of the local Xet chunk cache. Increasing this will give more space for caching terms/chunks fetched from S3. A larger cache can better take advantage of deduplication across repos & files. If your network speed is much greater than your local disk speed (e.g., 10Gbps versus an SSD or slower), consider disabling the Xet chunk cache for increased performance. To disable it, set `HF_XET_CHUNK_CACHE_SIZE_BYTES=0`.
97
97
98
-
Defaults to `10737418240` (10GiB).
98
+
Defaults to `10000000000` (10GB).
99
+
100
+
### HF_XET_SHARD_CACHE_SIZE_LIMIT
101
+
102
+
Sets the size of the local Xet shard cache. Increasing this will improve upload efficiency, as chunks referenced in cached shard files are not re-uploaded. Note that the default soft limit is likely sufficient for most workloads.
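As an illustration, both variables can be set from Python before any transfer starts. This is a sketch under stated assumptions: `xet_cache_env` is a hypothetical helper, the variables are assumed to be read when `hf_xet` is first used (so they are set before importing `huggingface_hub`), and the shard limit is assumed to be expressed in bytes, which this section does not state.

```python
import os
from typing import Dict, Optional


def xet_cache_env(
    chunk_cache_bytes: int, shard_cache_limit: Optional[int] = None
) -> Dict[str, str]:
    """Build the environment entries for the Xet cache sizes (hypothetical helper)."""
    if chunk_cache_bytes < 0:
        raise ValueError("chunk_cache_bytes must be >= 0 (0 disables the chunk cache)")
    env = {"HF_XET_CHUNK_CACHE_SIZE_BYTES": str(int(chunk_cache_bytes))}
    if shard_cache_limit is not None:
        # Assumption: the limit is given in bytes.
        env["HF_XET_SHARD_CACHE_SIZE_LIMIT"] = str(int(shard_cache_limit))
    return env


# Example: double the default chunk cache size (10 GB -> 20 GB).
os.environ.update(xet_cache_env(20 * 10**9))
```

Passing `0` as the chunk cache size produces `HF_XET_CHUNK_CACHE_SIZE_BYTES=0`, which disables the chunk cache as described above.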