Feature Proposal: TTL via Bucket Property #28
Replies: 1 comment
I'm not too concerned about the scenario where a client obtains a reference to an object. I do think that rejecting creation of a bucket type with an invalid combination of properties would be a good thing to have, but we still need a clearly documented resolution of conflicts in extant bucket types. That is, the relationship between properties should be clearly documented and enforced in code.

This is in addition to my previously stated desire that we clearly document the bucket property inheritance graph and ensure that what's documented is, in fact, what the code enforces in all cases.
Background
TTL support presently exists in Riak through different methods.
Object TTL is a hard problem in Riak due to AAE. To build an efficient anti-entropy system, state must be cached in the vnode, and in the case of parallel-mode AAE a secondary store of keys and clocks is required. If the backend "disappears" an object, how is that state synced into the AAE system so that the AAE system correctly represents the state of the vnode backend? When the vnode backend and the AAE state get out of sync, this is only fixed by AAE rebuilds (tree or store rebuilds), but these happen independently by vnode and are deliberately spaced out, and so will prompt false repair work (i.e. object recovery).
There is also the question of when an object should expire. Should this be based on a TTL from the PUT time at the vnode, or a TTL from the last-modified date of the object? In the former case, objects will outlive their expiry on vnodes following non-API update means (e.g. read repair, handoff). In the latter case, a single node with a (slow) skewed clock could prompt PUTs that are immediately expired on all vnodes.
Proposal
The proposal is to add two bucket properties: `sweeper_object_ttl` and `backend_object_ttl`. For a given bucket, only one of these properties (or neither) should be set. Both settings will be false by default.
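For illustration only (the Caveats below assume these rules would be documented rather than enforced), a minimal sketch of the mutual-exclusion check, where `validate_ttl_props/1` is a hypothetical helper:

```erlang
%% Hypothetical helper: reject property sets where both TTL modes are set.
%% Both properties default to false, matching the proposal.
-spec validate_ttl_props(proplists:proplist()) -> ok | {error, atom()}.
validate_ttl_props(Props) ->
    Sweeper = proplists:get_value(sweeper_object_ttl, Props, false),
    Backend = proplists:get_value(backend_object_ttl, Props, false),
    case {Sweeper, Backend} of
        {false, _} -> ok;
        {_, false} -> ok;
        {_, _} -> {error, conflicting_ttl_properties}
    end.
```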
The `sweeper_object_ttl` is a TTL where garbage collection will be handled by a sweeper process. The TTL will be configured in days, as it is assumed that this method will not be used for short-lived objects (except in `riak_test`). Objects in buckets with a `sweeper_object_ttl` will be tracked and recovered via intra-cluster or inter-cluster anti-entropy, and the solution requires that tictacaae is enabled. Garbage collection will be triggered periodically, and will be applied across all vnodes and clusters from the same trigger.
If an object is fetched beyond the TTL but garbage collection has not yet occurred, a not_found response will be returned, but with the vector clock of the current object in the metadata (as would occur when fetching a tombstone), in order to protect against sibling creation when a key is re-used after a not_found. Object keys for expired objects which have not yet been garbage collected may be returned in index queries.
The `backend_object_ttl` is a TTL where garbage collection will be handled by the backend. The TTL will be configured in seconds, as it is assumed that this method will be used for short-lived objects that do not require anti-entropy protection. The backend will be responsible for expiring both index entries and the object at the same time point, so that index queries and object presence will generally appear to be in sync. The `backend_object_ttl` will work when tictacaae is not active; if tictacaae is active, the `aae_tree_exclude` bucket property must also be set.
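As a usage sketch, the property would be set like any other bucket-type property. The `riak_core_bucket_type` API already exists; the property name is the one proposed here, and the 3600-second value is illustrative:

```erlang
%% Sketch: create and activate a bucket type carrying the proposed
%% backend_object_ttl property (one hour, in seconds).
ok = riak_core_bucket_type:create(<<"sessions">>, [{backend_object_ttl, 3600}]),
ok = riak_core_bucket_type:activate(<<"sessions">>).
```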
Design

The changes required to support the `sweeper_object_ttl` are as follows.

The `riak_kv_get_fsm` is to check for the property and, if it is present, before replying to the client any fetched object should have its LMD checked. If the LMD indicates the object is beyond its TTL, then the object's contents will be emptied and the `X-Riak-Deleted` key added to each content's metadata. The `riak_kv_get_fsm` will not prompt any garbage collection. The client should receive a not_found with a vector clock (of the object), just as with a tombstone.
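A minimal sketch of the LMD check described above, where `beyond_ttl/2` is a hypothetical helper and the TTL is in days per the proposal:

```erlang
%% Hypothetical helper: has the object outlived the bucket's TTL?
%% LMD is the object's last-modified datetime; TTLDays comes from the
%% sweeper_object_ttl bucket property.
-spec beyond_ttl(calendar:datetime(), pos_integer()) -> boolean().
beyond_ttl(LMD, TTLDays) ->
    Now = calendar:datetime_to_gregorian_seconds(calendar:universal_time()),
    LMDSecs = calendar:datetime_to_gregorian_seconds(LMD),
    Now - LMDSecs > TTLDays * 86400.
```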
A sweeper process will run on each node, periodically waking to find buckets with a `sweeper_object_ttl` set. A random bucket type or bucket will then be selected, and an `erase_keys` aae_fold run for that selection with a last-modified date range based on the TTL (a sketch of such a fold follows below). The erase_keys fold will then prompt the garbage collection, using the riak_kv_eraser queue. Once this is complete (in that the fold is finished and the deletes are queued), the sweeper will sleep for a period calculated from the number of buckets that need sweeping, so that it can get through the whole sweep load in one day (each node will have its own sweeper, so on average each bucket will be swept once for each node in the cluster).

The existing eraser process handles deletion from AAE, replication etc. Reaping will be dependent on delete_mode, though as part of this change it may make sense to look at reusing the sweeper to also perform per-bucket reap work (i.e. add a `sweeper_tomb_ttl` bucket property to automate the reap of tombstones left in keep mode).
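A sketch of the fold the sweeper might issue. The `erase_keys` aae_fold and the riak_kv_eraser queue already exist in recent Riak releases, but the exact query tuple shape and the `sweep_bucket/3` wrapper shown here are assumptions, to be checked against `riak_kv_clusteraae_fsm`:

```erlang
%% Sketch: erase all keys in a bucket whose last-modified date is older
%% than the TTL cutoff, queueing the deletes on the local eraser.
sweep_bucket(Client, Bucket, TTLDays) ->
    Cutoff = os:system_time(second) - TTLDays * 86400,
    riak_client:aae_fold({erase_keys,
                          Bucket,      %% e.g. {<<"type">>, <<"bucket">>}
                          all,         %% whole key range
                          all,         %% no segment filter
                          {0, Cutoff}, %% last-modified range up to the cutoff
                          local},      %% queue deletes on the local eraser
                         Client).
```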
The `backend_object_ttl` requires a change to `riak_kv_put_fsm` to check for the property, and then an option to be added to the put request being passed to the riak_kv_vnodes. If the riak_kv_vnode sees a TTL option (with a TTL value), then before performing the PUT it should check for the TTL backend capability, and if it is present request a temp_put into the backend rather than a put. The TTL used in the backend should be taken from the option. Only leveled will initially have the TTL backend capability.
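A sketch of the vnode-side dispatch is below. `capabilities/2` and `put/5` follow the existing riak_kv backend behaviour, while `temp_put` is the proposed new callback, so its arity and argument order here are assumptions:

```erlang
%% Sketch: choose between put and temp_put based on the TTL put option
%% and the backend's advertised capabilities.
do_put(Mod, Bucket, Key, IndexSpecs, Val, PutOpts, ModState) ->
    case proplists:get_value(ttl, PutOpts) of
        undefined ->
            Mod:put(Bucket, Key, IndexSpecs, Val, ModState);
        TTLSecs when is_integer(TTLSecs), TTLSecs > 0 ->
            {ok, Caps} = Mod:capabilities(Bucket, ModState),
            case lists:member(ttl, Caps) of
                true ->
                    Mod:temp_put(Bucket, Key, IndexSpecs, Val, TTLSecs,
                                 ModState);
                false ->
                    %% Backend cannot expire objects; fall back to a plain put
                    Mod:put(Bucket, Key, IndexSpecs, Val, ModState)
            end
    end.
```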
The `riak_kv_vnode:handle_handoff_data/2` function will need to check bucket properties for inbound data to set the TTL PUT option, but in this case the TTL must be calculated from the object LMD and the `backend_object_ttl` for the bucket (as only some of the TTL will remain). There will also need to be a change to the diff_index_specs calculation when temp_put is used, to handle TTL objects being mutated: index entries that already exist will need to be re-added (so that the new index entries have the updated TTL).
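The remaining-TTL calculation on handoff is simple arithmetic; in this sketch `remaining_ttl/3` is a hypothetical helper taking times in epoch seconds:

```erlang
%% Hypothetical helper: a handed-off object has already consumed part of
%% its TTL, so the backend must be given only the remainder.
remaining_ttl(LMDSecs, BucketTTLSecs, NowSecs) ->
    max(0, BucketTTLSecs - (NowSecs - LMDSecs)).
```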
Alternative Design Ideas

An alternative is to support only the `sweeper_object_ttl` (and not a `backend_object_ttl`), but the use of a sweeper for both small TTLs (i.e. of the order of 1 hour) and very large TTLs (i.e. more than 1 year) may make scheduling problematic. There may be confusion in having two methods, but the separation between use cases (long-lived objects requiring anti-entropy vs short-lived objects without anti-entropy protection) may make the confusion manageable.

An alternative to supporting `backend_object_ttl` would be a more complete solution to the problem of low-latency access to temporary objects (e.g. for web session storage): a replacement for the multi-backend/memory-backend approach whereby, as well as managing expiry, the solution would guarantee the objects are stored in memory (for performance), and access to the store could be prioritised to avoid the issue of the shared vnode queue (e.g. through OTP 28 priority messages, or by adding a dedicated additional vnode in Riak for this purpose).

Testing
Caveats
For the `sweeper_object_ttl` property, 2i queries will reflect the results within the store, including any expired objects which have not yet been garbage collected. There is no simple way of dealing with this within Riak: the client will need to handle a 2i query result where the object is then discovered to be not_found, but this scenario can happen anyway. This will also be true for aae_fold (although the LMD range could be used there if this proved a problem). The behaviour for 2i queries will therefore differ between `sweeper_object_ttl` and `backend_object_ttl`, where the index entries will expire with the object.
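As an illustration of the client-side handling, a sketch using the existing riakc Erlang client, where `fetch_live/3` is a hypothetical wrapper:

```erlang
%% Sketch: tolerate 2i results whose objects have expired but not yet
%% been swept, by dropping keys that now fetch as not_found.
fetch_live(Pid, Bucket, Keys) ->
    lists:filtermap(
        fun(Key) ->
            case riakc_pb_socket:get(Pid, Bucket, Key) of
                {ok, Obj} -> {true, Obj};
                {error, notfound} -> false  %% expired, awaiting sweep
            end
        end,
        Keys).
```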
There are presently invalid property combinations in Riak (e.g. with allow_mult/lww), and this proposal introduces further potential rules where one property requires another to be either present or absent. It is assumed that documentation of these invalid property combinations continues to be a sufficient answer, and that there is no need for validation when setting the combination.
The solution depends on aae_fold queries for an LMD "older than" a given date. It should be noted that although aae_fold queries run faster when checking for an LMD "more recent than" a given date, all "older than" queries are in effect a full scan of the key range (and there is no way of optimising this).
In multi-cluster environments there is the issue of erase/reap jobs progressing at a slower pace in sink clusters than in the source cluster, which may lead to full-sync activity and false resurrection. There is no workaround for this other than the tuning of `tombstone_pause`. As this feature would now erase objects unprompted by an operator, requiring a workaround via operator tuning is less acceptable.

Pull Requests
Planned Release for Inclusion