
Sharding: Reducing Load on the Filesystem


Motivation

In the initial implementation of Precomputed, the primary format CloudVolume and Neuroglancer support, very large datasets generate a commensurately large number of files. Image datasets, depending on how they are divided into a regular grid of files ("chunked"), can produce tens or hundreds of millions of files per layer. Data derived from the labels in the image, such as meshes and skeletons, range into the billions of files, as they create at minimum one file for every label in every processed region. This becomes quite taxing on filesystems and object storage systems. In object storage systems, the file metadata are frequently replicated three to five times to ensure integrity via voting.

With cloud storage vendors, this complication is passed on to customers in the form of a cost per file created. It is not immediately clear how closely this pricing tracks the actual cost of servicing requests, but as of this writing, typical pricing is $5 per million files created and $4 per ten million files read. For file creation, the costs become substantial at around the 100 million file mark ($500) and serious at the 1 billion file mark ($5,000). The dataset size where these considerations start to become important is somewhere around 100 TB of uint8 images.

Therefore, to remain cost efficient when producing these large datasets, we need a way of compacting files together while retaining random access to individual files.

The Shard File

The shard file format is a container for arbitrary files, but is typically used for images, meshes, and skeletons. It works by grouping a set of files together based on a hash of their file names and storing them such that each can be read in isolation using partial (byte-range) reads. An individual file is located by consulting two levels of indices. The first index is located at the beginning of the shard and is of fixed size, so a reader knows in advance how to access it. This first read points to a second index containing an arbitrarily sized listing of segment IDs and byte ranges. A third read then retrieves the requested data directly.

Figure 7. The sharded file format. Shard files are containers that concatenate files, including meshes, skeletons, and images, while allowing for random access. Three HTTP requests are needed to retrieve data from a shard. First, a header index of predefined size that lists the byte ranges of the minishard indices is queried for the minishard that contains the data. Second, that minishard index is used to determine the byte range of the data for the given segment. Third, the data is loaded from that byte range.

Image Credit: Oluwaseun Ogedengbe

See the sharding specification for implementation details. [1]
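In rough outline, the layout looks like the sketch below. This is a simplified reading of the sharding specification [1]; consult it for the authoritative byte-level details (encodings, delta compression, and so on).

```python
# Byte layout of a single shard file (simplified sketch, as I read the
# specification [1]). The shard index has a fixed, predictable size --
# two uint64le offsets per minishard, i.e. 16 * 2**minishard_bits bytes --
# so a reader knows how to fetch it before anything else.
#
#   +----------------------------+  offset 0
#   | shard index                |  (start, end) byte range of each
#   | (fixed size)               |  minishard index, relative to the
#   |                            |  end of the shard index
#   +----------------------------+  offset 16 * 2**minishard_bits
#   | minishard indices and the  |  each minishard index lists
#   | concatenated file data,    |  (segment ID, offset, size) triples;
#   | located via the byte       |  their placement within this region
#   | offsets above              |  is up to the writer
#   +----------------------------+
```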

Reading a Shard File

There are two aspects to reading a shard file:

  1. Locating the shard file.
  2. Reading a file from it.

In order to locate the file, you must know what kind of data you're looking for. For Precomputed images, the name of the shard file is based on the bounding box the shard represents. For label-derived data such as meshes and skeletons, it is based on a manually specified number of bits ("shard bits") extracted from a hash of the segment ID, after first shifting the ID by an adjustment factor ("preshift bits"); the resulting file name looks like o6e.shard (as in the diagram above). The hash function typically used is the non-cryptographic murmurhash3_x86_128. [2][3]
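As a concrete illustration, here is a minimal sketch of that lookup in Python. It assumes the mmh3 package for MurmurHash3 (the pure-Python pymmh3 [2] works as well) and follows the naming convention as described in the sharding specification [1]; the function and parameter names are illustrative rather than taken from any particular implementation.

```python
import struct

import mmh3  # assumption: the mmh3 package; pymmh3 [2] is a pure-Python alternative

def locate_shard(segment_id, preshift_bits, minishard_bits, shard_bits):
    """Sketch: compute the shard file name and minishard number for a segment ID."""
    shifted = segment_id >> preshift_bits

    # murmurhash3_x86_128 over the 8-byte little-endian representation of the
    # shifted ID; keep the low 8 bytes of the digest as a little-endian uint64.
    digest = mmh3.hash_bytes(struct.pack('<Q', shifted), x64arch=False)
    hashed = struct.unpack('<Q', digest[:8])[0]

    minishard_number = hashed & ((1 << minishard_bits) - 1)
    shard_number = (hashed >> minishard_bits) & ((1 << shard_bits) - 1)

    # Shard files are named by the zero-padded lowercase hex shard number.
    num_hex_digits = -(-shard_bits // 4)  # ceil(shard_bits / 4)
    filename = '{:0{}x}.shard'.format(shard_number, num_hex_digits)
    return filename, minishard_number
```

With, say, preshift_bits=3, minishard_bits=6, and shard_bits=10, a 64-bit segment ID maps to one of 1,024 shard files and one of 64 minishards within that shard.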

Once you know which shard file the data resides in, you also need to know the number of "minishard bits" and "preshift bits" in order to extract the minishard number from the segment ID's hash.
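Putting the pieces together, a read looks roughly like the sketch below. The fetch(start, end) callable is a stand-in for a byte-range read (an HTTP Range request or a file seek); it and the parameter names are hypothetical. The sketch also assumes an uncompressed ("raw") minishard index; the specification [1] permits gzip encoding as well.

```python
import struct
from itertools import accumulate

def read_segment(fetch, segment_id, hashed_id, minishard_bits):
    """Sketch: the three reads needed to pull one segment's data out of a shard.

    fetch(start, end) -> bytes of the half-open range [start, end) of the shard
    file; hashed_id is the hashed, preshifted segment ID from the sketch above.
    """
    num_minishards = 1 << minishard_bits
    minishard_number = hashed_id & (num_minishards - 1)

    # Read 1: the fixed-size shard index (two uint64le offsets per minishard).
    shard_index_size = 16 * num_minishards
    shard_index = fetch(0, shard_index_size)
    idx_start, idx_end = struct.unpack_from('<2Q', shard_index, 16 * minishard_number)

    # All offsets below are relative to the end of the shard index.
    # Read 2: the minishard index, a [3, n] C-order array of uint64le values.
    raw = fetch(shard_index_size + idx_start, shard_index_size + idx_end)
    n = len(raw) // 24
    table = struct.unpack('<{}Q'.format(3 * n), raw)

    ids = list(accumulate(table[0:n]))  # segment IDs are delta-encoded
    sizes = table[2 * n:3 * n]
    offsets = []                        # start offsets are also delta-encoded,
    position = 0                        # relative to the end of the prior chunk
    for delta, size in zip(table[n:2 * n], sizes):
        position += delta
        offsets.append(position)
        position += size

    # Read 3: the segment's data itself.
    i = ids.index(segment_id)
    start = shard_index_size + offsets[i]
    return fetch(start, start + sizes[i])
```

Since many segments share the same shard index and minishard index, a reader can cache the results of the first two reads to avoid repeating them on subsequent lookups.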

Synthesizing a Shard File

Designing a ShardingSpecification

References

  1. J. Maitin-Shepard. "Sharded Format Specification". Github. Accessed April 11, 2020. (link)
  2. F. Kihlander et al. "PYMMH3 - a pure python MurmurHash3 implementation". Github. Accessed April 11, 2020. https://github.com/wc-duck/pymmh3
  3. A. Appleby. "MurmurHash3.cpp". Github. Accessed April 11, 2020. (link)