Sharding: Reducing Load on the Filesystem
In the initial implementation of Precomputed, the primary format CloudVolume and Neuroglancer support, very large datasets generate a commensurately large number of files. Image datasets, depending on how they are divided up into a regular grid of files ("chunked"), can produce tens or hundreds of millions of files per layer. Data derived from the labels in the image, such as meshes and skeletons, can range into the billions of files, as they create at minimum one file for every label in every processed region. This becomes quite taxing on filesystems and object storage systems. In object storage systems, file metadata are frequently replicated three to five times to ensure integrity via voting.
With cloud storage vendors, this complication is passed on to customers in the form of a cost per file created. It is not immediately clear how closely these prices track the actual cost of servicing requests, but as of this writing a typical rate is $5 per million files created and $4 per ten million files read. For file creation, the costs become substantial at around the 100 million file mark ($500), and serious at the 1 billion file mark ($5,000). The dataset size where these considerations start to become important is somewhere around 100 TB of uint8 images.
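To make the file-count arithmetic concrete, here is a back-of-the-envelope sketch. The chunk size is an assumption for illustration (64×64×64 voxels is a common choice, but it is not specified above); the prices are the ones quoted in this section.

```python
# Back-of-the-envelope cost of writing a chunked uint8 image layer.
# Assumes 64x64x64 voxel chunks at 1 byte per voxel -- an illustrative
# choice, not a value taken from this article.
dataset_bytes = 100 * 10**12              # ~100 TB of uint8 image data
chunk_bytes = 64 * 64 * 64                # 262,144 bytes per chunk file
num_files = dataset_bytes / chunk_bytes

write_cost = num_files / 10**6 * 5.00     # $5 per million files created
read_cost = num_files / 10**7 * 4.00      # $4 per ten million files read

print(f"files: {num_files:,.0f}")             # ~381 million files
print(f"write cost: ${write_cost:,.2f}")      # ~$1,900 to create
print(f"one full read: ${read_cost:,.2f}")    # ~$150 per full read
```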
Therefore, to remain cost efficient when producing these large datasets, we need a way of compacting files together while retaining random access to individual files.
The shard file format is a container for arbitrary files, typically images, meshes, and skeletons. It works by grouping a set of files together based on a hash of their file names and storing them such that each can be read in isolation using partial reads at known byte offsets. An individual file is located by consulting two levels of indices. The first index is located at the beginning of the shard file and is of fixed size, so a reader knows in advance how to access it. This first read points into a second index containing an arbitrarily sized listing of segment IDs and byte ranges. A third read then fetches the required data directly.
[Figure: shard file layout and the three-read lookup sequence. Image credit: Oluwaseun Ogedengbe]
See the sharding specification for implementation details. [1]
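Before walking through a read, it helps to see what a reader must know in advance. This layout sketch follows my reading of the sharding specification [1]; the 16-byte fixed-index entries (a pair of uint64le offsets per minishard) come from that spec rather than from this article.

```python
# Shard file layout, per my reading of the sharding spec [1]:
#
#   [ fixed index ]       2**minishard_bits entries, 16 bytes each:
#                         (uint64le start, uint64le end) byte range of
#                         one minishard index, relative to the end of
#                         the fixed index
#   [ minishard indices ] variable size, possibly gzipped
#   [ chunk data ]        variable size, possibly gzipped

def fixed_index_bytes(minishard_bits: int) -> int:
    """Bytes a reader must know how to fetch before anything else."""
    return (2 ** minishard_bits) * 16
```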
There are two aspects to reading a shard file:
- Locating the shard file.
- Reading data from it.
In order to locate the file, you must know what kind of data you're looking for. For Precomputed images, the name of the shard file is based on the bounding box the shard represents. For meshes and skeletons, it is derived from a hash of the segment ID: a manually specified number of bits ("shard bits"), adjusted by a shift factor ("preshift bits"), is extracted from the hash and rendered in hexadecimal, yielding names like 06e.shard (as in the diagram above). The hash function typically used is the non-cryptographically-secure murmurhash3_x86_128. [2][3]
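The sketch below shows one way to compute the shard filename from a segment ID, following my reading of the sharding specification [1]. The mmh3 package stands in for the referenced MurmurHash3 implementations (pymmh3 [2] is a pure Python alternative); the byte-packing and bit-extraction details are my interpretation of the spec, not something stated in this article.

```python
import struct
import mmh3  # C-extension MurmurHash3

def shard_filename(segment_id: int, preshift_bits: int,
                   minishard_bits: int, shard_bits: int) -> str:
    """Map a segment ID to its shard file name (sketch, per my
    reading of the sharding spec [1])."""
    # Drop the preshift bits, then hash the little-endian uint64.
    key = struct.pack('<Q', segment_id >> preshift_bits)
    digest = mmh3.hash128(key, x64arch=False, signed=False)  # murmurhash3_x86_128
    hashed = digest & 0xFFFFFFFFFFFFFFFF  # low 64 bits of the 128-bit hash

    # Minishard number occupies the lowest bits; the shard number
    # sits in the next bits up.
    shard_number = (hashed >> minishard_bits) & ((1 << shard_bits) - 1)

    # Render as zero-padded lowercase hex, e.g. "06e.shard".
    hex_digits = max(1, -(-shard_bits // 4))  # ceil(shard_bits / 4)
    return f"{shard_number:0{hex_digits}x}.shard"
```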
Once you know which file the data are located in, you also need the number of "minishard bits" and "preshift bits" in order to extract the minishard number from the segment ID's hash. This lets you read the correct byte range of the minishard index from the corresponding row of the fixed-width index, as in the figure above. You then read the minishard index, extract the byte range of the segment you care about, and read the data from there. Both the minishard index and the data themselves may be gzip compressed (other compression schemes are possible, but not supported in Neuroglancer).
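Putting the three reads together, here is a sketch of the lookup path. The range_read helper is hypothetical (standing in for an HTTP Range request or a seek-and-read), the indices are assumed uncompressed for brevity, hashed is the same 64-bit value computed in the previous sketch, and the delta-encoding of the minishard index follows my reading of the spec [1].

```python
import struct

def range_read(path: str, start: int, end: int) -> bytes:
    """Hypothetical helper: fetch bytes [start, end) of a file, e.g.
    via an HTTP Range request or a local seek-and-read."""
    with open(path, 'rb') as f:
        f.seek(start)
        return f.read(end - start)

def read_chunk(path: str, segment_id: int, hashed: int,
               minishard_bits: int) -> bytes:
    # All index offsets are relative to the end of the fixed index,
    # which holds 2**minishard_bits entries of 16 bytes each.
    data_offset = (1 << minishard_bits) * 16
    minishard_no = hashed & ((1 << minishard_bits) - 1)

    # Read 1: one row of the fixed index gives the byte range of the
    # minishard index we need.
    row = range_read(path, 16 * minishard_no, 16 * (minishard_no + 1))
    ms_start, ms_end = struct.unpack('<QQ', row)

    # Read 2: the minishard index -- 3*n uint64le values: delta-encoded
    # segment IDs, delta-encoded start offsets, and chunk sizes.
    # (Assumed uncompressed here; it may be gzipped in practice.)
    index = range_read(path, data_offset + ms_start, data_offset + ms_end)
    n = len(index) // 24
    values = struct.unpack(f'<{3 * n}Q', index)
    ids, offset_deltas, sizes = values[:n], values[n:2*n], values[2*n:]

    # Undo the delta encoding, then read 3: fetch the chunk itself.
    seg_id, offset = 0, 0
    for i in range(n):
        seg_id += ids[i]
        offset += offset_deltas[i]
        if seg_id == segment_id:
            start = data_offset + offset
            return range_read(path, start, start + sizes[i])
        offset += sizes[i]
    raise KeyError(f"segment {segment_id} not in this shard")
```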
As of this writing, CloudVolume implements this logic in the ShardReader class. [4] The ShardReader class is integrated into the sharded versions of the accessors that have been implemented. So far, sharded readers have been incorporated into Precomputed images and skeletons. We are designing support for Graphene meshes. [5]
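For a sense of what this looks like from the user's side, here is a usage sketch. The cloud path is a placeholder, and the calls shown are the ordinary CloudVolume interface; sharded layers are meant to be read through the same calls, with the sharded reader selected behind the scenes.

```python
from cloudvolume import CloudVolume

# Placeholder path; applies to any layer whose info file declares sharding.
vol = CloudVolume('gs://example-bucket/example-dataset/segmentation')

# Image cutouts use the ordinary slicing interface; when the layer is
# sharded, CloudVolume performs the index reads and range reads described
# above instead of fetching one file per chunk.
cutout = vol[512:1024, 512:1024, 64:128]

# Skeletons are fetched by segment ID; for sharded skeletons this resolves
# the shard filename from the ID's hash, then does the three-read lookup.
skel = vol.skeleton.get(12345)
```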
- J. Maitin-Shepard. "Sharded Format Specification". Github. Accessed April 11, 2020. (link)
- F. Kihlander et al. "PYMMH3 - a pure python MurmurHash3 implementation". Github. Accessed April 11, 2020. https://github.com/wc-duck/pymmh3
- A. Appleby. "MurmurHash3.cpp". Github. Accessed April 11, 2020. (link)
- W. Silversmith. "ShardReader". Github. Accessed April 11, 2020. ([link](https://github.com/seung-lab/cloud-volume/blob/ae6e2194c92ab205068fdd1372011d5973f4421f/cloudvolume/datasource/precomputed/sharding.py#L178-L297))
- W. Silversmith. "Graphene: Format for Large Scale Proofreading". Accessed April 11, 2020. (link) (Note that this is a simple description of Graphene as relates to CloudVolume. Graphene is the work of many people and my involvement has been peripheral until now.)