Feature Proposal: In-Memory Buckets #30
Replies: 1 comment
I like the idea of an in-memory bucket property. I've thought for a long time that some of OTP's newer capabilities might be leveraged to create a very fast and efficient memory backend. And as anti-NIF as I am, I think a NIF could be constructed to manage an in-memory hashmap where the leaf node values are Erlang terms instead of Bitcask's C structs, with efficient usage of enif_alloc/realloc/free, and serialized through a proc (whether gen_server or bespoke) to mitigate the need for mutexes. I'm not really concerned about loss of the data on a shutdown/startup cycle; that comes with the territory IMO.
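The serialisation idea in the comment above could be sketched as follows: a gen_server owning an in-memory store, so that all access is funnelled through one process and no mutexes are needed. A plain Erlang map stands in here for the proposed NIF-managed hashmap, and the module and function names are hypothetical:

```erlang
%% Minimal sketch: all reads and writes are serialised through one
%% gen_server, so the underlying store (here a plain map, in the
%% proposal a NIF-managed hashmap) needs no locking.
-module(mem_store).
-behaviour(gen_server).

-export([start_link/0, put/2, get/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Writes are casts, so callers do not block on the store.
put(Key, Value) ->
    gen_server:cast(?MODULE, {put, Key, Value}).

%% Reads are calls, serialised behind any in-flight writes.
get(Key) ->
    gen_server:call(?MODULE, {get, Key}).

init([]) ->
    {ok, #{}}.

handle_call({get, Key}, _From, Map) ->
    {reply, maps:find(Key, Map), Map}.

handle_cast({put, Key, Value}, Map) ->
    {noreply, Map#{Key => Value}}.
```

A bespoke process loop (rather than gen_server) would shave some per-message overhead, at the cost of losing the standard OTP supervision and debugging behaviour.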
Background
Prior to Riak 3.4, Riak supported an in-memory backend as part of the riak_kv_multi_backend approach.
The use of the memory backend in Riak is problematic:

- riak_test tests fail with the backend due to the lack of persistence across restarts, so it is not possible to maintain confidence in the ongoing reliability of the backend across releases (e.g. tests of features that the backend should support may fail as the test coincidentally includes a node stop/start).
- There are known examples of the memory backend being used, but the expectations behind the choice aren't known in most cases.

There are some potential reasons why choosing a memory backend may be a positive choice:
Given the existing limitations, supporting the backend as-is going forward is not an option. The use of the memory backend requires either relaunch, pareto-replacement or retirement.
Proposal
The memory backend should be retired in Riak 4.0. Those currently using the backend can migrate nodes in the cluster to a single leveled backend via a rolling replace, and TTL requirements can be resolved through new TTL bucket properties.
Design
n/a
Alternative Design Ideas
Relaunch
The relaunch of the in-memory backend would require a backlog of issues to be addressed, and also:

- Use of ets:tab2file/3 and ets:file2tab/2 on normal shutdown to give expected behaviour in rolling restarts (or some other documented strategy to support rolling restarts when using the backend).

Even with this work, there will be unresolved issues with regards to:
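The ets persistence step mentioned above could be sketched as follows (module, table and option choices are illustrative, not a settled design):

```erlang
%% Sketch: dump the backend's ets table on normal shutdown and
%% restore it on startup, falling back to an empty table when no
%% dump exists or verification fails.
-module(mem_backend_persist).
-export([load/1, dump/2]).

%% Restore the table from a previous dump, verifying integrity,
%% or start empty when no valid dump is present.
load(File) ->
    case ets:file2tab(File, [{verify, true}]) of
        {ok, Tab} ->
            Tab;
        {error, _Reason} ->
            ets:new(mem_backend, [set, public])
    end.

%% Dump the table on normal shutdown; extended_info enables the
%% verify check on reload, and sync flushes the file to disk.
dump(Tab, File) ->
    ok = ets:tab2file(Tab, File, [{extended_info, [md5sum]},
                                  {sync, true}]).
```

The dump call would sit in the backend's stop/terminate path, guarded so that an abnormal shutdown does not attempt (or trust) a dump.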
A more fundamental change may still be required in the relaunch, with the encapsulation of the memory backend in a new riak_mem_vnode that mimics the riak_kv_vnode. This might allow for:

- A kv_index_tictactree controller, but one optimised for use with an in-memory store (i.e. potentially using a secondary index for segment -> Key/Clock mappings) to allow for faster repair.

Buckets would be mapped to the alternative vnode (by property), just as they are presently mapped to backends.
Pareto-Replacement - Support for Priority Buckets
The current riak_kv_vnode includes a metadata cache which is disabled by default. If it is enabled, it will cache the metadata of recently read (via HEAD) or written (via PUT) objects to accelerate the read before write on PUTs. The cache is trimmed whenever it reaches a maximum size. This could be enhanced to:
The same priority bucket property could also be used to prioritise messages in the riak_kv_vnode queue using the OTP 28.0 priority messages feature. This would need a new riak_core_request message type with a priority field, where the vnode_proxy will read the priority field and use it when sending messages to the vnode.
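The request shape this implies might look as below. The record and field names are hypothetical, and the OTP 28 priority-message plumbing at the vnode_proxy is elided; the selective receive shown is the classic pre-OTP-28 way to get a similar effect inside the vnode loop, draining high-priority requests before normal ones:

```erlang
%% Hypothetical request envelope carrying a priority field set by
%% the vnode_proxy from the bucket property.
-record(riak_core_request, {priority = normal :: high | normal,
                            msg :: term()}).

%% Sketch of a vnode loop that services high-priority requests
%% first; `after 0` makes the first receive non-blocking, so normal
%% requests are only taken when no high-priority one is queued.
vnode_loop(State) ->
    receive
        #riak_core_request{priority = high, msg = Msg} ->
            vnode_loop(handle(Msg, State))
    after 0 ->
        receive
            #riak_core_request{msg = Msg} ->
                vnode_loop(handle(Msg, State))
        end
    end.
```

With OTP 28 priority messages the scan cost of the selective receive disappears, since the runtime places priority messages ahead in the queue, but the request record with its priority field is needed either way.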
This would mean that memory could be prioritised towards accelerating reads for given buckets, avoiding the CPU cost of deserialisation. Overall, frequently accessed buckets could get lower-latency reads at lower CPU cost, without needing to implement a dedicated backend or vnode.
There are some potential issues to consider:
Pareto-Replacement - In-Memory Counters and Small-Sets
The memory backend should (in theory) provide lower-cost changes to individual objects, where individual objects are changed with high frequency (e.g. O(100) updates per second), due to the ability to update in place without generating a backlog of on-disk compaction activity. However, when frequently updating individual objects, there is a set of related problems which are not directly resolved by making the update in-memory:
For small CRDTs (especially operation-based CRDTs) there is potential to efficiently solve the broader problems not addressed directly by making the changes in-memory. Rather than evolving CRDTs as a general solution to data-modelling in Riak, CRDTs would instead be limited to use in a specific riak_mem_vnode for specific tasks associated with low-data-size, high-frequency update problems, e.g. maintaining counters of activity, sets of active session identifiers, etc. The vnode would then evolve specific solutions to other related problems of scale, for example:
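The low-data-size, high-frequency state the riak_mem_vnode would hold could be sketched as below; the function and key shapes are illustrative only. Counters and small sets live in the vnode's process state and are updated in place, so no on-disk compaction backlog is generated:

```erlang
%% Sketch: in-place updates to per-{Bucket, Key} counters and small
%% sets held in a map in vnode state (names hypothetical).

%% Increment a counter, initialising it to N on first update.
update_counter(Bucket, Key, N, Counters) ->
    maps:update_with({Bucket, Key},
                     fun(C) -> C + N end,
                     N,
                     Counters).

%% Add a member to a small set (e.g. active session identifiers),
%% initialising a singleton set on first update.
add_member(Bucket, Key, Member, Sets) ->
    maps:update_with({Bucket, Key},
                     fun(S) -> ordsets:add_element(Member, S) end,
                     ordsets:from_list([Member]),
                     Sets).
```

Operation-based CRDT semantics would layer on top of this: the vnode applies operations locally and propagates them, rather than merging whole object states on every update.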
There are potential issues to consider:
Testing
Caveats
Pull Requests
Planned Release for Inclusion