Feature Proposal: Tiered Storage #29
-
I'll have other comments, but first, my long-held opinion is that we need to purge the notion of "backups" entirely from the Riak lexicon. This has been an issue for a very long time. Riak's operational characteristics and target use cases are fundamentally incompatible with the traditional understanding of backups in the context of databases. A significant factor in the evolution of the eventually-consistent model is the inability to effectively back up (and restore) a viable "snapshot" of the state of a huge dataset - we were running into the limits of physics trying to back up large datasets (generally RDBs, though not always) 30 years ago, and despite faster hardware at every level of the stack we're dealing with datasets and storage requirements many orders of magnitude larger today. The speed of light is unlikely to change.

We store multiple replicas in a cluster specifically to get around node-level hardware failures and maintenance. We support cross-cluster replication specifically to recover from cluster-level outages. Attempting to back up data at the cluster or backend level is not only meaningless, it pollutes design considerations with a fundamentally invalid constraint.

Equally (if not more so) problematic is that operators are lulled into believing that they could back up their dataset if they wanted to, or worse that they are backing up their dataset, only to discover that there's no meaningful way to restore it should it come to that. We need to rid ourselves of the term. It's an illusion that benefits nobody.
-
I think being able to split the ledger and journal onto distinct filesystem paths is desirable and should cover the vast majority of requirements. Conversely, I don't see supporting multiple paths for journal data as being worth the effort. If at some point in the future someone needs this, they can fund its development, but it strikes me as a pretty esoteric use case (especially in light of possible in-memory buckets).
-
Background
Riak is used to meet data storage needs where the volume of data is relatively high. A benefit of using Riak in big-data scenarios is the ability to reduce storage costs by distributing storage workloads over many individual nodes, where each node has relatively inexpensive storage. The distributed nature of the database reduces the risk of bottlenecks on disk access, allowing storage to be lower cost (as increased latency can be tolerated). Combined with the extensive reconciliation and repair features, it also means that reliability can be handled through redundancy and recovery between nodes, not redundancy within a node, again allowing storage to be lower cost (as reduced local disk resilience can be tolerated).
Tiered storage is primarily a solution to the cost efficiency of scaling disk space in vertically-scaled databases, and as such is not a primary requirement within Riak. However, there may be scenarios where having tiered storage features, in addition to the advantages of horizontal scaling, is beneficial. There are currently two tiered-storage options in Riak: the multi-backend configuration (which can map different buckets to different backends, and hence to different storage paths), and the eleveldb tiered-storage feature (which can map different levels of the LSM tree to different storage paths).
These options are functionally different to the operator. One option (multi-backend) assumes there is some designation by bucket that can be aligned to a tier of storage, in some way reflecting frequency of access, frequency of update or volume of data. The other (eleveldb) tries to identify the part of the database which, as a function of the database design, requires more frequent access, in order to determine which data should be stored on the higher tier. In both cases there is no automatic alignment between available storage per tier and data distribution across tiers; the higher tier is not used greedily by the database. Ensuring data is optimally distributed across tiers requires operator intervention (and a rolling replace should the strategy need to be changed); and the controls to support realignment are crude (whole buckets or whole levels need to be re-mapped).
Both solutions complicate backup, recovery and repair of the database on disk failure - given that different tiers of storage may fail independently. A goal for Riak going forward is to improve the operational simplicity of backup, recovery and repair; so there is a need to consider the overheads introduced by any tiered storage solution when assessing options.
Proposal
Given the lack of clarity over the requirement for tiered storage in Riak, a relatively simple change is proposed initially.
The simple solution is to allow the two parts of the leveled backend storage system to be split across tiers. That is to say, allow the ledger (the key/metadata store) to be mapped to "tier 1" storage, whilst the journal (the value store) is mapped to "tier 2" storage. The journal contains the bulk of the data, whereas the ledger normally generates the majority of the I/O activity. There is also no requirement for the data within the ledger to be protected: on startup the ledger can be rebuilt from the journal; the ledger is only persisted to save time at startup, and to allow the size of the ledger to outgrow available memory.
Hot backups in this tiered storage solution will still be possible, but will only require backup of "tier 2" storage. If "tier 1" storage fails on a node, then this will be recovered without intervention at startup (as the ledger is always rebuilt from the journal when it is not present). If "tier 2" storage fails on a node, then the "tier 1" storage should also be wiped - and recovery can then be undertaken from backup to "tier 2", or by intra-cluster transfer via the node repair solution.
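To make these recovery rules concrete, the following is a minimal sketch; the module and function names are invented for illustration and do not exist in leveled or riak_kv.

```erlang
%% Hypothetical sketch only: encodes the per-tier recovery rules above as
%% data, so the decision logic is explicit. Not part of leveled or riak_kv.
-module(tier_recovery_sketch).
-export([recovery_steps/1]).

-type tier() :: tier1 | tier2.

%% Return the ordered recovery steps for a node that has lost one tier.
-spec recovery_steps(tier()) -> [atom()].
recovery_steps(tier1) ->
    %% Ledger lost: the ledger is always rebuilt from the journal when it
    %% is not present, so restarting the node is sufficient.
    [restart_node];
recovery_steps(tier2) ->
    %% Journal lost: the ledger must be wiped as well, then the journal is
    %% restored from a hot backup or by intra-cluster node repair.
    [wipe_tier1, restore_tier2_from_backup_or_repair, restart_node].
```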
In this proposal, write latency to "tier 2" is still on the critical path, in that each PUT must be appended to the journal (in tier 2) on each vnode in the preflist. This is likely to mean that, if write performance to tier 2 is constrained, database performance may be significantly impacted by the choice of "sync_on_write" strategy.
Design
This requires a change to leveled only. Currently a single path for data is passed to leveled, and this change will allow two paths to be passed instead.
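As an illustration of the shape of this change, a sketch of resolving two root paths from the backend options is shown below. The option names (ledger_root_path, journal_root_path, data_root) and the default path are hypothetical; no such options exist in leveled today.

```erlang
%% Hypothetical sketch only: illustrates passing two storage roots to the
%% backend rather than one. Option names and the default path are illustrative.
-module(tiered_paths_sketch).
-export([paths/1]).

%% Resolve the tier 1 (ledger) and tier 2 (journal) root paths from the
%% backend options, falling back to a single shared path for both.
-spec paths(list()) -> {LedgerRoot :: string(), JournalRoot :: string()}.
paths(Opts) ->
    Shared = proplists:get_value(data_root, Opts, "/var/lib/riak/leveled"),
    {proplists:get_value(ledger_root_path, Opts, Shared),
     proplists:get_value(journal_root_path, Opts, Shared)}.
```

For example, paths([{ledger_root_path, "/mnt/ssd/leveled"}, {journal_root_path, "/mnt/hdd/leveled"}]) would place the ledger on the faster tier and the journal on the cheaper tier, while paths([{data_root, "/data/leveled"}]) preserves the current single-path behaviour.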
Alternative Design Ideas
Split Journal Files across Tiers
The primary alternative considered was for each journal file to be split so that it is actually two CDB files, one in tier 1, and one in tier 2. On startup a function would be passed into leveled that takes {Bucket, Key, Size} as an input and returns a tier. When receiving an object, or compacting a file, the function is called to determine which tier the object should be written to - that function could map on bucket for example, but potentially also on object size. When reading from the Journal, the tier 1 file will be read first, and if the object is not present the tier 2 file will be checked.
When writing the Journal (either when it is the active Journal or in compaction mode), the Journal is considered full when either CDB file reaches capacity, regardless of how empty the other file is.
This would mimic the existing multi-backend solution for tiered storage, in allowing for values to be mapped between tiers by bucket. However, it would complicate the handling of backup and recovery on failure, as both tiers would now be required to recover a node in a consistent state - if either tier is lost or corrupted, then both will need to be cleared to be recovered by repair.
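A minimal sketch of such a tiering function is given below; the {Bucket, Key, Size} -> tier shape is taken from the description above, while the hot-bucket list and size threshold are invented for illustration.

```erlang
%% Hypothetical sketch only: builds a tiering fun of the shape described
%% above, which leveled would call when writing or compacting a journal
%% entry to decide which CDB file (tier) the object belongs in.
-module(journal_tiering_sketch).
-export([tier_fun/2]).

-type tier() :: tier1 | tier2.

%% Map objects in "hot" buckets that are small enough to tier 1; map
%% everything else to tier 2.
-spec tier_fun([binary()], pos_integer()) ->
          fun(({binary(), binary(), non_neg_integer()}) -> tier()).
tier_fun(HotBuckets, MaxTier1Size) ->
    fun({Bucket, _Key, Size}) ->
        case lists:member(Bucket, HotBuckets) andalso Size =< MaxTier1Size of
            true  -> tier1;
            false -> tier2
        end
    end.
```

For example, tier_fun([<<"sessions">>], 65536) returns a fun that keeps objects from the sessions bucket of 64KB or less in tier 1, and sends everything else to tier 2.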
Tier 1 per-Bucket Journal Cache
The secondary alternative is to have a cache of journal objects in Tier 1, in addition to the proposed solution (i.e. with the Journal still persisted to Tier 2). The Tier 1 cache would have a cache filter function that, based on {Bucket, Key, Size}, determines whether an object should be cached (so in Riak this can be used to specify buckets that are cache eligible, and to exclude objects which are too big to be cached). When looking for a given Journal Key, if the cache filter function passes, the cache will be checked first - and the object will only be fetched from the Journal when not present in the cache (and in this case will be added to the cache). As Journal Keys contain a SQN there is no need for cache invalidation on PUT to the Journal.
The cache will consist of two on-disk cache files (stale & fresh) identified by GUID, and an in-memory map JournalKey -> {FileGUID, Position, Length}. When a new object becomes cache eligible it is appended to the fresh cache file, and the in-memory map is updated. When the fresh file is at capacity, the stale file is deleted, the fresh file becomes the new stale file, and a new fresh file is created (and the map is GC'd of old stale references).
The operator can choose the size of the cache files (or potentially how many stale files are allowed).
In this alternative read access to tier 2 is reduced, and this can be made specific to certain buckets that require acceleration. However, reads will become writes (so correct cache sizing is important), and the challenge of the potential tier 2 write bottleneck is not addressed.
On startup the cache process should start again, erasing any previous cache files.
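A minimal sketch of the cache bookkeeping described above is shown below (purely illustrative; no such module exists, and the on-disk file I/O is omitted): an in-memory map of JournalKey -> {FileGUID, Position, Length}, plus the fresh/stale file identifiers and the rotation performed when the fresh file reaches capacity.

```erlang
%% Hypothetical sketch only: bookkeeping for the fresh/stale cache rotation
%% described above; appending to and deleting the on-disk cache files
%% themselves is omitted.
-module(journal_cache_sketch).
-export([new/1, add/4, lookup/2]).

-record(cache, {fresh :: binary(),              %% GUID of the fresh cache file
                stale :: binary() | undefined,  %% GUID of the stale cache file
                fresh_bytes = 0 :: non_neg_integer(),
                max_bytes :: pos_integer(),
                index = #{} :: #{term() => {binary(), non_neg_integer(), non_neg_integer()}}}).

new(MaxBytes) ->
    #cache{fresh = new_guid(), stale = undefined, max_bytes = MaxBytes}.

%% Record that JournalKey has been appended to the fresh cache file at
%% Position with Length bytes; rotate first if the fresh file is full.
add(JournalKey, Position, Length, Cache0) ->
    Cache1 = maybe_rotate(Cache0, Length),
    #cache{fresh = Fresh, fresh_bytes = Bytes, index = Idx} = Cache1,
    Cache1#cache{fresh_bytes = Bytes + Length,
                 index = Idx#{JournalKey => {Fresh, Position, Length}}}.

%% Return {FileGUID, Position, Length} for a cached Journal Key, or
%% not_cached so the caller falls back to the tier 2 Journal.
lookup(JournalKey, #cache{index = Idx}) ->
    maps:get(JournalKey, Idx, not_cached).

%% Delete the stale file, demote the fresh file to stale, start a new
%% fresh file, and GC index entries that pointed at the deleted file.
maybe_rotate(#cache{fresh_bytes = B, max_bytes = Max} = C, Len)
        when B + Len =< Max ->
    C;
maybe_rotate(#cache{fresh = Fresh, stale = Stale, index = Idx} = C, _Len) ->
    Idx1 = maps:filter(fun(_K, {Guid, _P, _L}) -> Guid =/= Stale end, Idx),
    C#cache{fresh = new_guid(), stale = Fresh, fresh_bytes = 0, index = Idx1}.

%% Stand-in for a real GUID.
new_guid() ->
    crypto:strong_rand_bytes(16).
```

On startup new/1 would simply be called again (and any existing cache files erased), matching the behaviour described above.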
Testing
Caveats
Pull Requests
Planned Release for Inclusion