Feature Proposal: Tiered Storage #29
-
I'll have other comments, but first, my long-held opinion is that we need to purge the notion of "backups" entirely from the Riak lexicon. This has been an issue for a very long time. Riak's operational characteristics and target use cases are fundamentally incompatible with the traditional understanding of backups in the context of databases. A significant factor in the evolution of the eventually-consistent model is the inability to effectively back up (and restore) a viable "snapshot" of the state of a huge dataset - we were running into the limits of physics trying to back up large datasets (generally RDBs, though not always) 30 years ago, and despite faster hardware at every level of the stack we're dealing with datasets and storage requirements many orders of magnitude larger today. The speed of light is unlikely to change.

We store multiple replicas in a cluster specifically to get around node-level hardware failures and maintenance. We support cross-cluster replication specifically to recover from cluster-level outages. Attempting to back up data at the cluster or backend level is not only meaningless, it pollutes design considerations with a fundamentally invalid constraint.

Equally (if not more so) problematic is that operators are lulled into believing that they could back up their dataset if they wanted to, or worse that they are backing up their dataset, only to discover that there's no meaningful way to restore it should it come to that. We need to rid ourselves of the term. It's an illusion that benefits nobody.
-
I think being able to split the ledger and journal onto distinct filesystem paths is desirable and should cover the vast majority of requirements. Conversely, I don't see supporting multiple paths for journal data as being worth the effort. If at some point in the future someone needs this, they can fund its development, but it strikes me as a pretty esoteric use case (especially in light of possible in-memory buckets).
-
Background
Riak is used to meet data storage needs where the volume of data is relatively high. A benefit of using Riak in big-data scenarios is the ability to reduce storage costs by distributing storage workloads over many individual nodes, where each node has relatively inexpensive storage. The distributed nature of the database reduces the risk of bottlenecks on disk access, allowing storage to be lower cost (as increased latency can be tolerated). Combined with the extensive reconciliation and repair features, it also means that reliability can be handled through redundancy and recovery between nodes, not redundancy within a node, again allowing storage to be lower cost (as reduced local disk resilience can be tolerated).
Tiered storage is primarily a solution to the cost efficiency of scaling disk space in vertically-scaled databases, and as such is not a primary requirement within Riak. However, there may be scenarios where having tiered storage features, in addition to the advantages of horizontal scaling, is beneficial. There are currently two tiered-storage options in Riak: the multi-backend configuration (which can map different buckets to different backends, and hence to different storage paths), and the eleveldb tiered-storage feature (which can map different levels of the LSM tree to different storage paths).
These options are functionally different to the operator. One option (multi-backend) assumes there is some designation by bucket that can be aligned to a tier of storage, in some way reflecting frequency of access, frequency of update or volume of data. The other (eleveldb) tries to identify the part of the database which, as a function of the database design, requires more frequent access, in order to determine which data should be stored on the higher tier. In both cases there is no automatic alignment between available storage per tier and data distribution across tiers; the higher tier is not used greedily by the database. Ensuring data is optimally distributed across tiers requires operator intervention (and a rolling replace should the strategy need to be changed); and the controls to support realignment are crude (whole buckets or whole levels need to be re-mapped).
Both solutions complicate backup, recovery and repair of the database on disk failure - given that different tiers of storage may fail independently. A goal for Riak going forward is to improve the operational simplicity of backup, recovery and repair; so there is a need to consider the overheads introduced by any tiered storage solution when assessing options.
Proposal
Given the lack of clarity over the requirement for tiered storage in Riak, a relatively simple change is proposed initially.
The simple solution is to allow the two parts of the leveled backend storage system to be split across tiers. That is to say, allow the ledger (the key/metadata store) to be mapped to "tier 1" storage, whilst the journal (the value store) is mapped to "tier 2" storage. The journal contains the bulk of the data, whereas the ledger normally generates the majority of the I/O activity. There is also no requirement for the data within the ledger to be protected: on startup the ledger can be rebuilt from the journal; the ledger is only persisted to save time at startup, and to allow the size of the ledger to outgrow available memory.
Hot backups in this tiered storage solution will still be possible, but will only require backup of "tier 2" storage. If "tier 1" storage fails on a node, then this will be recovered without intervention at startup (as the ledger is always rebuilt from the journal when it is not present). If "tier 2" storage fails on a node, then the "tier 1" storage should also be wiped - and recovery can then be undertaken from backup to "tier 2", or by intra-cluster transfer via the node repair solution.
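To make these recovery rules concrete, the following is a minimal sketch; the module and function names are invented for illustration and do not exist in leveled or riak_kv.

```erlang
%% Hypothetical sketch only: encodes the per-tier recovery rules above as
%% data, so the decision logic is explicit. Not part of leveled or riak_kv.
-module(tier_recovery_sketch).
-export([recovery_steps/1]).

-type tier() :: tier1 | tier2.

%% Return the ordered recovery steps for a node that has lost one tier.
-spec recovery_steps(tier()) -> [atom()].
recovery_steps(tier1) ->
    %% Ledger lost: the ledger is always rebuilt from the journal when it
    %% is not present, so restarting the node is sufficient.
    [restart_node];
recovery_steps(tier2) ->
    %% Journal lost: the ledger must be wiped as well, then the journal is
    %% restored from a hot backup or by intra-cluster node repair.
    [wipe_tier1, restore_tier2_from_backup_or_repair, restart_node].
```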
In this proposal, write latency to "tier 2" is still on the critical path, in that each PUT must be appended to the journal (in tier 2) on each vnode in the preflist. This is likely to mean that, if write performance to tier 2 is constrained, database performance may be significantly impacted by the choice of "sync_on_write" strategy.
Design
This requires a change to leveled only. Currently a single path for data is passed to leveled, and this change will allow two paths to be passed instead.
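As an illustration of the shape of this change, a sketch of resolving two root paths from the backend options is shown below. The option names (ledger_root_path, journal_root_path, data_root) and the default path are hypothetical; no such options exist in leveled today.

```erlang
%% Hypothetical sketch only: illustrates passing two storage roots to the
%% backend rather than one. Option names and the default path are illustrative.
-module(tiered_paths_sketch).
-export([paths/1]).

%% Resolve the tier 1 (ledger) and tier 2 (journal) root paths from the
%% backend options, falling back to a single shared path for both.
-spec paths(list()) -> {LedgerRoot :: string(), JournalRoot :: string()}.
paths(Opts) ->
    Shared = proplists:get_value(data_root, Opts, "/var/lib/riak/leveled"),
    {proplists:get_value(ledger_root_path, Opts, Shared),
     proplists:get_value(journal_root_path, Opts, Shared)}.
```

For example, paths([{ledger_root_path, "/mnt/ssd/leveled"}, {journal_root_path, "/mnt/hdd/leveled"}]) would place the ledger on the faster tier and the journal on the cheaper tier, while paths([{data_root, "/data/leveled"}]) preserves the current single-path behaviour.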
Alternative Design Ideas
Split Journal Files across Tiers
The primary alternative considered was for each journal file to be split so that it is actually two CDB files, one in tier 1, and one in tier 2. On startup a function would be passed into leveled that takes {Bucket, Key, Size} as an input and returns a tier. When receiving an object, or compacting a file, the function is called to determine which tier the object should be written to - that function could map on bucket for example, but potentially also on object size. When reading from the Journal, the tier 1 file will be read first, and if the object is not present the tier 2 file will be checked.
When writing the Journal (either when it is the active Journal or in compaction mode), the Journal is considered full when either CDB file reaches capacity, regardless of how empty the other file is.
This would mimic the existing multi-backend solution for tiered storage, in allowing for values to be mapped between tiers by bucket. However, it would complicate the handling of backup and recovery on failure, as both tiers would now be required to recover a node in a consistent state - if either tier is lost or corrupted, then both will need to be cleared to be recovered by repair.
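A minimal sketch of such a tiering function is given below; the {Bucket, Key, Size} -> tier shape is taken from the description above, while the hot-bucket list and size threshold are invented for illustration.

```erlang
%% Hypothetical sketch only: builds a tiering fun of the shape described
%% above, which leveled would call when writing or compacting a journal
%% entry to decide which CDB file (tier) the object belongs in.
-module(journal_tiering_sketch).
-export([tier_fun/2]).

-type tier() :: tier1 | tier2.

%% Map objects in "hot" buckets that are small enough to tier 1; map
%% everything else to tier 2.
-spec tier_fun([binary()], pos_integer()) ->
          fun(({binary(), binary(), non_neg_integer()}) -> tier()).
tier_fun(HotBuckets, MaxTier1Size) ->
    fun({Bucket, _Key, Size}) ->
        case lists:member(Bucket, HotBuckets) andalso Size =< MaxTier1Size of
            true  -> tier1;
            false -> tier2
        end
    end.
```

For example, tier_fun([<<"sessions">>], 65536) returns a fun that keeps objects from the sessions bucket of 64KB or less in tier 1, and sends everything else to tier 2.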
Tier 1 per-Bucket Journal Cache
The secondary alternative is to have a cache of journal objects in Tier 1, in addition to the proposed solution (i.e. with the Journal still persisted to Tier 2). The Tier 1 cache would have a cache filter function that, based on {Bucket, Key, Size}, determines whether an object should be cached (so in Riak this can be used to specify buckets that are cache eligible, and to exclude objects which are too big to be cached). When looking for a given Journal Key, if the cache filter function passes, the cache will be checked first - and the object will only be fetched from the Journal when not present in the cache (and in this case will be added to the cache). As Journal Keys contain a SQN there is no need for cache invalidation on PUT to the Journal.
The cache will consist of two on-disk cache files (stale & fresh) identified by GUID, and an in-memory map JournalKey -> {FileGUID, Position, Length}. When a new object becomes cache eligible it is appended to the fresh cache file, and the in-memory map is updated. When the fresh file is at capacity, the stale file is deleted, the fresh file becomes the new stale file, and a new fresh file is created (and the map is GC'd of old stale references).
The operator can choose the size of the cache files (or potentially how many stale files are allowed).
In this alternative read access to tier 2 is reduced, and this can be made specific to certain buckets that require acceleration. However, reads will become writes (so correct cache sizing is important), and the challenge of the potential tier 2 write bottleneck is not addressed.
On startup the cache process should start again, erasing any previous cache files.
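A minimal sketch of the cache bookkeeping described above is shown below (purely illustrative; no such module exists, and the on-disk file I/O is omitted): an in-memory map of JournalKey -> {FileGUID, Position, Length}, plus the fresh/stale file identifiers and the rotation performed when the fresh file reaches capacity.

```erlang
%% Hypothetical sketch only: bookkeeping for the fresh/stale cache rotation
%% described above; appending to and deleting the on-disk cache files
%% themselves is omitted.
-module(journal_cache_sketch).
-export([new/1, add/4, lookup/2]).

-record(cache, {fresh :: binary(),              %% GUID of the fresh cache file
                stale :: binary() | undefined,  %% GUID of the stale cache file
                fresh_bytes = 0 :: non_neg_integer(),
                max_bytes :: pos_integer(),
                index = #{} :: #{term() => {binary(), non_neg_integer(), non_neg_integer()}}}).

new(MaxBytes) ->
    #cache{fresh = new_guid(), stale = undefined, max_bytes = MaxBytes}.

%% Record that JournalKey has been appended to the fresh cache file at
%% Position with Length bytes; rotate first if the fresh file is full.
add(JournalKey, Position, Length, Cache0) ->
    Cache1 = maybe_rotate(Cache0, Length),
    #cache{fresh = Fresh, fresh_bytes = Bytes, index = Idx} = Cache1,
    Cache1#cache{fresh_bytes = Bytes + Length,
                 index = Idx#{JournalKey => {Fresh, Position, Length}}}.

%% Return {FileGUID, Position, Length} for a cached Journal Key, or
%% not_cached so the caller falls back to the tier 2 Journal.
lookup(JournalKey, #cache{index = Idx}) ->
    maps:get(JournalKey, Idx, not_cached).

%% Delete the stale file, demote the fresh file to stale, start a new
%% fresh file, and GC index entries that pointed at the deleted file.
maybe_rotate(#cache{fresh_bytes = B, max_bytes = Max} = C, Len)
        when B + Len =< Max ->
    C;
maybe_rotate(#cache{fresh = Fresh, stale = Stale, index = Idx} = C, _Len) ->
    Idx1 = maps:filter(fun(_K, {Guid, _P, _L}) -> Guid =/= Stale end, Idx),
    C#cache{fresh = new_guid(), stale = Fresh, fresh_bytes = 0, index = Idx1}.

%% Stand-in for a real GUID.
new_guid() ->
    crypto:strong_rand_bytes(16).
```

On startup new/1 would simply be called again (and any existing cache files erased), matching the behaviour described above.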
Testing
Caveats
Pull Requests
Planned Release for Inclusion