Feature Proposal: In-Memory Buckets #30
Replies: 1 comment
I like the idea of an in-memory bucket property. I've thought for a long time that some of OTP's newer capabilities might be leveraged to create a very fast and efficient memory backend. And as anti-NIF as I am, I think a NIF could be constructed to manage an in-memory hashmap where the leaf node values are Erlang terms instead of Bitcask's C structs, with efficient usage of enif_alloc/realloc/free, and serialized through a proc (whether gen_server or bespoke) to mitigate the need for mutexes. I'm not really concerned about loss of the data on a shutdown/startup cycle; that comes with the territory IMO.
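The serialisation idea in the comment above could be sketched as follows: a gen_server owning an in-memory store, so that all access is funnelled through one process and no mutexes are needed. A plain Erlang map stands in here for the proposed NIF-managed hashmap, and the module and function names are hypothetical:

```erlang
%% Minimal sketch: all reads and writes are serialised through one
%% gen_server, so the underlying store (here a plain map, in the
%% proposal a NIF-managed hashmap) needs no locking.
-module(mem_store).
-behaviour(gen_server).

-export([start_link/0, put/2, get/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Writes are casts, so callers do not block on the store.
put(Key, Value) ->
    gen_server:cast(?MODULE, {put, Key, Value}).

%% Reads are calls, serialised behind any in-flight writes.
get(Key) ->
    gen_server:call(?MODULE, {get, Key}).

init([]) ->
    {ok, #{}}.

handle_call({get, Key}, _From, Map) ->
    {reply, maps:find(Key, Map), Map}.

handle_cast({put, Key, Value}, Map) ->
    {noreply, Map#{Key => Value}}.
```

A bespoke process loop (rather than gen_server) would shave some per-message overhead, at the cost of losing the standard OTP supervision and debugging behaviour.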
Background
Prior to Riak 3.4, Riak supported an in-memory backend as part of the riak_kv_multi_backend approach.
The use of the memory backend in Riak is problematic:

- riak_test tests fail with the backend due to the lack of persistence across restarts, so it is not possible to maintain confidence in the ongoing reliability of the backend across releases (e.g. tests of features that the backend should support may fail as the test coincidentally includes a node stop/start).
- There are known examples of the memory backend being used, but the expectations behind the choice aren't known in most cases.

There are some potential reasons why choosing a memory backend may be a positive choice:
Given the existing limitations, supporting the backend as-is going forward is not an option. The use of the memory backend requires either relaunch, pareto-replacement or retirement.
Proposal
The memory backend should be retired in Riak 4.0. Those currently using the backend can migrate nodes in the cluster to a single leveled backend via a rolling replace, and TTL requirements can be resolved through new TTL bucket properties.
Design
n/a
Alternative Design Ideas
Relaunch
The relaunch of the in-memory backend would require a backlog of issues to be addressed, and also:

- Use of ets:tab2file/3 and ets:file2tab/2 on normal shutdown to give expected behaviour in rolling restarts (or some other documented strategy to support rolling restarts when using the backend).

Even with this work, there will be unresolved issues with regards to:
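The ets persistence step mentioned above could be sketched as follows (module, table and option choices are illustrative, not a settled design):

```erlang
%% Sketch: dump the backend's ets table on normal shutdown and
%% restore it on startup, falling back to an empty table when no
%% dump exists or verification fails.
-module(mem_backend_persist).
-export([load/1, dump/2]).

%% Restore the table from a previous dump, verifying integrity,
%% or start empty when no valid dump is present.
load(File) ->
    case ets:file2tab(File, [{verify, true}]) of
        {ok, Tab} ->
            Tab;
        {error, _Reason} ->
            ets:new(mem_backend, [set, public])
    end.

%% Dump the table on normal shutdown; extended_info enables the
%% verify check on reload, and sync flushes the file to disk.
dump(Tab, File) ->
    ok = ets:tab2file(Tab, File, [{extended_info, [md5sum]},
                                  {sync, true}]).
```

The dump call would sit in the backend's stop/terminate path, guarded so that an abnormal shutdown does not attempt (or trust) a dump.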
A more fundamental change may still be required in the relaunch, with the encapsulation of the memory backend in a new riak_mem_vnode that mimics the riak_kv_vnode. This might allow for:

- A kv_index_tictactree controller, but one optimised for use with an in-memory store (i.e. potentially using a secondary index for segment -> Key/Clock mappings) to allow for faster repair.

Buckets would be mapped to the alternative vnode (by property), just as they are presently mapped to backends.
Pareto-Replacement - Support for Priority Buckets
The current riak_kv_vnode includes a metadata cache which is disabled by default. If it is enabled, it will cache the metadata of recently read (via HEAD) or written (via PUT) objects to accelerate the read before write on PUTs. The cache is trimmed whenever it reaches a maximum size. This could be enhanced to:
The same priority bucket property could also be used to prioritise messages in the riak_kv_vnode queue using the OTP 28.0 priority messages feature. This would need a new riak_core_request message type with a priority field, where the vnode_proxy will read the priority field and use it when sending messages to the vnode.
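The request shape this implies might look as below. The record and field names are hypothetical, and the OTP 28 priority-message plumbing at the vnode_proxy is elided; the selective receive shown is the classic pre-OTP-28 way to get a similar effect inside the vnode loop, draining high-priority requests before normal ones:

```erlang
%% Hypothetical request envelope carrying a priority field set by
%% the vnode_proxy from the bucket property.
-record(riak_core_request, {priority = normal :: high | normal,
                            msg :: term()}).

%% Sketch of a vnode loop that services high-priority requests
%% first; `after 0` makes the first receive non-blocking, so normal
%% requests are only taken when no high-priority one is queued.
vnode_loop(State) ->
    receive
        #riak_core_request{priority = high, msg = Msg} ->
            vnode_loop(handle(Msg, State))
    after 0 ->
        receive
            #riak_core_request{msg = Msg} ->
                vnode_loop(handle(Msg, State))
        end
    end.
```

With OTP 28 priority messages the scan cost of the selective receive disappears, since the runtime places priority messages ahead in the queue, but the request record with its priority field is needed either way.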
This would mean that memory could be prioritised towards accelerating reads for given buckets, avoiding the CPU cost of deserialisation. Overall, frequently accessed buckets could get lower-latency reads at lower CPU cost, without needing to implement a dedicated backend or vnode.
There are some potential issues to consider:
Pareto-Replacement - In-Memory Counters and Small-Sets
The memory backend should (in theory) provide lower-cost changes to individual objects, where individual objects are changed with high frequency (e.g. O(100) updates per second), due to the ability to update in place without generating a backlog of on-disk compaction activity. However, when frequently updating individual objects, there is a set of related problems which are not directly resolved by making the update in-memory:
For small CRDTs (especially operation-based CRDTs) there is potential to efficiently solve the broader problems not addressed directly by making the changes in-memory. Rather than evolving CRDTs as a general solution to data-modelling in Riak, CRDTs would instead be limited to use in a specific riak_mem_vnode for specific tasks associated with low-data-size, high-frequency update problems, e.g. maintaining counters of activity, sets of active session identifiers, etc. The vnode would then evolve specific solutions to other related problems of scale, for example:
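The low-data-size, high-frequency state the riak_mem_vnode would hold could be sketched as below; the function and key shapes are illustrative only. Counters and small sets live in the vnode's process state and are updated in place, so no on-disk compaction backlog is generated:

```erlang
%% Sketch: in-place updates to per-{Bucket, Key} counters and small
%% sets held in a map in vnode state (names hypothetical).

%% Increment a counter, initialising it to N on first update.
update_counter(Bucket, Key, N, Counters) ->
    maps:update_with({Bucket, Key},
                     fun(C) -> C + N end,
                     N,
                     Counters).

%% Add a member to a small set (e.g. active session identifiers),
%% initialising a singleton set on first update.
add_member(Bucket, Key, Member, Sets) ->
    maps:update_with({Bucket, Key},
                     fun(S) -> ordsets:add_element(Member, S) end,
                     ordsets:from_list([Member]),
                     Sets).
```

Operation-based CRDT semantics would layer on top of this: the vnode applies operations locally and propagates them, rather than merging whole object states on every update.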
There are potential issues to consider:
Testing
Caveats
Pull Requests
Planned Release for Inclusion