Commit 99baddf

Updating Xet caching docs (#3190)
* cache and environment variable docs updated
* updating staging directory notes
* cleaning up draft
* tidying up
* tidying up some more
* pr feedback
1 parent c26b063 commit 99baddf

2 files changed: +50 -10 lines changed

docs/source/en/guides/manage-cache.md

Lines changed: 42 additions & 8 deletions
@@ -174,19 +174,30 @@ by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable to true.

 ## Chunk-based caching (Xet)

-To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating additional caching layer to enable chunk-based deduplication. This cache holds chunks, which are immutable byte ranges from files (up to 64KB) that are created using content-defined chunking. For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/storage-backends).
+To provide more efficient file transfers, `hf_xet` adds a `xet` directory to the existing `huggingface_hub` cache, creating an additional caching layer to enable chunk-based deduplication. This cache holds chunks (immutable byte ranges of files, ~64KB in size) and shards (a data structure that maps files to chunks). For more information on the Xet Storage system, see this [section](https://huggingface.co/docs/hub/storage-backends).

-The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads with the following structure
+The `xet` directory, located at `~/.cache/huggingface/xet` by default, contains two caches, utilized for uploads and downloads. It has the following structure:

 ```bash
 <CACHE_DIR>
-├─ chunk_cache
-├─ shard_cache
+├─ xet
+│ ├─ environment_identifier
+│ │ ├─ chunk_cache
+│ │ ├─ shard_cache
+│ │ ├─ staging
 ```

-The `xet` cache, like the rest of `hf_xet` is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` cache is built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system.
+The `environment_identifier` directory name is an encoded string (it may appear on your machine as `https___cas_serv-tGqkUaZf_CBPHQ6h`). It is used during development so that local and production versions of the cache can exist alongside each other. It is also used when downloading from repositories that reside in different [storage regions](https://huggingface.co/docs/hub/storage-regions). You may see multiple such entries in the `xet` directory, each corresponding to a different environment, but their internal structure is the same.
+
+The internal directories serve the following purposes:
+* `chunk_cache` contains cached data chunks that are used to speed up downloads.
+* `shard_cache` contains cached shards that are utilized on the upload path.
+* `staging` is a workspace designed to support resumable uploads.
+
+These are documented below.
+
+Note that the `xet` caching system, like the rest of `hf_xet`, is fully integrated with `huggingface_hub`. If you use the existing APIs for interacting with cached assets, there is no need to update your workflow. The `xet` caches are built as an optimization layer on top of the existing `hf_xet` chunk-based deduplication and `huggingface_hub` cache system.

-The `chunk-cache` directory contains cached data chunks that are used to speed up downloads while the `shard-cache` directory contains cached shards that are utilized on the upload path.

 ### `chunk_cache`
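To make the layout described in this hunk concrete, here is a minimal Python sketch that lists each environment directory under a `xet` directory together with its sub-directories. It uses only the standard library; the function name `list_xet_environments` is hypothetical and is not part of `huggingface_hub` or `hf_xet`, and the default path is the one stated in the docs above.

```python
from pathlib import Path


def list_xet_environments(xet_dir: Path) -> dict[str, list[str]]:
    """Map each environment directory under `xet_dir` to its sub-directory names.

    Hypothetical helper for illustration only; it just walks the file system.
    """
    layout: dict[str, list[str]] = {}
    if not xet_dir.is_dir():
        return layout  # no xet cache has been created yet
    for env_dir in sorted(p for p in xet_dir.iterdir() if p.is_dir()):
        layout[env_dir.name] = sorted(
            child.name for child in env_dir.iterdir() if child.is_dir()
        )
    return layout


if __name__ == "__main__":
    # Default location given in the docs; adjust if your cache lives elsewhere.
    default_xet_dir = Path.home() / ".cache" / "huggingface" / "xet"
    for env_name, subdirs in list_xet_environments(default_xet_dir).items():
        print(env_name, subdirs)
```

On a machine that has used `hf_xet`, each entry would be expected to show some of `chunk_cache`, `shard_cache`, and `staging`, depending on which paths have been exercised.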

@@ -234,11 +245,29 @@ Shards provide a mapping between files and chunks. During uploads, each file is

 All shards have an expiration date of 3-4 weeks from when they are downloaded. Shards that are expired are not loaded during upload and are deleted one week after expiration.

+### `staging`
+
+When an upload terminates before the new content has been committed to the repository, you will need to resume the file transfer. However, it is possible that some chunks were successfully uploaded prior to the interruption.
+
+So that you do not have to restart from the beginning, the `staging` directory acts as a workspace during uploads, storing metadata for successfully uploaded chunks. The `staging` directory has the following shape:
+
+<CACHE_DIR>
+├─ xet
+│ ├─ staging
+│ │ ├─ shard-session
+│ │ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb
+│ │ │ ├─ xorb-metadata
+│ │ │ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb
+
+As files are processed and chunks are successfully uploaded, their metadata is stored in `xorb-metadata` as a shard. Upon resuming an upload session, each file is processed again and the shards in this directory are consulted. Any content that was successfully uploaded is skipped, and any new content is uploaded (and its metadata saved).
+
+Meanwhile, `shard-session` stores file and chunk information for processed files. On successful completion of an upload, the content from these shards is moved to the more persistent `shard_cache`.
+
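The resume behavior described for `staging` can be illustrated with a toy model. The sketch below is not how `hf_xet` is implemented: it assumes fixed-size chunks (Xet uses content-defined chunking), a JSON file in place of the `.mdb` shard files, and an in-memory dictionary standing in for remote storage. All names (`upload`, `load_staged`, `fail_after`) are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only


def load_staged(staging_file: Path) -> set:
    """Return hashes of chunks recorded as uploaded by an earlier session."""
    if staging_file.exists():
        return set(json.loads(staging_file.read_text()))
    return set()


def upload(data: bytes, staging_file: Path, remote: dict,
           fail_after: Optional[int] = None) -> int:
    """Upload `data` chunk by chunk, skipping chunks already recorded in staging.

    Returns the number of chunks sent in this call. `fail_after` simulates an
    interruption once that many chunks have been sent.
    """
    staged = load_staged(staging_file)
    sent = 0
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in staged:
            continue  # uploaded in a previous (or this) session: skip it
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated interruption")
        remote[digest] = chunk  # stand-in for the real network transfer
        staged.add(digest)
        # Persist metadata right away so an interruption loses no progress.
        staging_file.write_text(json.dumps(sorted(staged)))
        sent += 1
    return sent
```

Calling `upload` once with `fail_after=2`, catching the `ConnectionError`, and then calling it again on the same data sends only the chunks that were not recorded before the interruption.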
 ### Limits and Limitations

-The `chunk_cache` is limited to 10GB in size while the `shard_cache` is technically without limits (in practice, the size and use of shards are such that limiting the cache is unnecessary).
+The `chunk_cache` is limited to 10GB in size, while the `shard_cache` has a soft limit of 4GB. By design, both caches are without high-level APIs, although their sizes are configurable through the `HF_XET_CHUNK_CACHE_SIZE_BYTES` and `HF_XET_SHARD_CACHE_SIZE_LIMIT` environment variables.

-By design, both caches are without high-level APIs. These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).
+These caches are used primarily to facilitate the reconstruction (download) or upload of a file. To interact with the assets themselves, it’s recommended that you use the [`huggingface_hub` cache system APIs](https://huggingface.co/docs/huggingface_hub/guides/manage-cache).

 If you need to reclaim the space utilized by either cache or need to debug any potential cache-related issues, simply remove the `xet` cache entirely by running `rm -rf <cache_dir>/xet`, where `<cache_dir>` is the location of your Hugging Face cache, typically `~/.cache/huggingface`.
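For readers who prefer to script the cleanup step rather than run `rm -rf` by hand, the following standard-library sketch measures the `xet` directory and then removes it. The helper names `directory_size_bytes` and `remove_xet_cache` are hypothetical, and the sketch assumes the `xet` directory sits directly under the Hugging Face cache directory, as the text above states.

```python
import shutil
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under `root` (0 if it does not exist)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def remove_xet_cache(hf_cache_dir: Path) -> int:
    """Delete `<hf_cache_dir>/xet` and return the number of bytes reclaimed.

    Only the `xet` sub-directory is removed; the rest of the cache is untouched.
    """
    xet_dir = hf_cache_dir / "xet"
    reclaimed = directory_size_bytes(xet_dir)
    if xet_dir.is_dir():
        shutil.rmtree(xet_dir)
    return reclaimed
```

A typical call would be `remove_xet_cache(Path.home() / ".cache" / "huggingface")`. Comparing `directory_size_bytes` for `chunk_cache` against the 10GB limit beforehand can show whether the cache is actually near capacity.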

@@ -257,6 +286,11 @@ Example full `xet`cache directory tree:
 │ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb
 │ │ ├─ ceeeb7ea4cf6c0a8d395a2cf9c08871211fbbd17b9b5dc1005811845307e6b8f.mdb
 │ │ ├─ e8535155b1b11ebd894c908e91a1e14e3461dddd1392695ddc90ae54a548d8b2.mdb
+│ ├─ staging
+│ │ ├─ shard-session
+│ │ │ ├─ 906ee184dc1cd0615164a89ed64e8147b3fdccd1163d80d794c66814b3b09992.mdb
+│ │ │ ├─ xorb-metadata
+│ │ │ │ ├─ 1fe4ffd5cf0c3375f1ef9aec5016cf773ccc5ca294293d3f92d92771dacfc15d.mdb
 ```

 To learn more about Xet Storage, see this [section](https://huggingface.co/docs/hub/storage-backends).

docs/source/en/package_reference/environment_variables.md

Lines changed: 8 additions & 2 deletions
@@ -93,9 +93,15 @@ Integer value to define the number of seconds to wait for server response when d

 ### HF_XET_CHUNK_CACHE_SIZE_BYTES

-To set the size of the Xet cache locally. Increasing this will give more space for caching terms/chunks fetched from S3. A larger cache can better take advantage of deduplication across repos & files. If your network speed is much greater than your local disk speed (ex 10Gbps vs SSD or worse) then consider disabling the Xet cache for increased performance. To disable the Xet cache, set `HF_XET_CHUNK_CACHE_SIZE_BYTES=0`.
+To set the size of the Xet chunk cache locally. Increasing this will give more space for caching terms/chunks fetched from S3. A larger cache can better take advantage of deduplication across repos & files. If your network speed is much greater than your local disk speed (e.g., 10Gbps vs. SSD or worse), then consider disabling the Xet chunk cache for increased performance. To disable the Xet chunk cache, set `HF_XET_CHUNK_CACHE_SIZE_BYTES=0`.

-Defaults to `10737418240` (10GiB).
+Defaults to `10000000000` (10GB).
+
+### HF_XET_SHARD_CACHE_SIZE_LIMIT
+
+To set the size of the Xet shard cache locally. Increasing this will improve upload efficiency, as chunks referenced in cached shard files are not re-uploaded. Note that the default soft limit is likely sufficient for most workloads.
+
+Defaults to `4000000000` (4GB).

 ### HF_XET_NUM_CONCURRENT_RANGE_GETS
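Both variables take a size in bytes, and this hunk moves the chunk cache default from a binary value (10GiB) to a decimal one (10GB). The sketch below shows the arithmetic and one way to set the variables from Python. The helper `xet_cache_env` is hypothetical, and the sketch assumes the variables are read from the process environment when `hf_xet` starts, so they would need to be set before it is first used.

```python
import os

GB = 10**9    # decimal gigabyte, the unit of the new defaults
GiB = 2**30   # binary gibibyte, the unit of the previous chunk cache default


def xet_cache_env(chunk_cache_gb: float, shard_cache_gb: float) -> dict:
    """Build the two environment variable values (in bytes) from sizes in GB.

    Hypothetical helper; only the variable names come from the docs.
    """
    return {
        "HF_XET_CHUNK_CACHE_SIZE_BYTES": str(int(chunk_cache_gb * GB)),
        "HF_XET_SHARD_CACHE_SIZE_LIMIT": str(int(shard_cache_gb * GB)),
    }


# Old and new chunk cache defaults differ because of the unit change.
old_default = 10 * GiB
new_default = 10 * GB

# Example: double the chunk cache and keep the shard cache at its default.
os.environ.update(xet_cache_env(chunk_cache_gb=20, shard_cache_gb=4))
```

Passing `chunk_cache_gb=0` yields `HF_XET_CHUNK_CACHE_SIZE_BYTES=0`, which is the value the docs give for disabling the chunk cache.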
