Feature Proposal: TTL via Bucket Property #28
Replies: 1 comment
I'm not too concerned about the scenario where a client obtains a reference to an object. I do think that rejecting creation of a bucket type with an invalid combination of properties would be a good thing to have, but we still need a clearly documented resolution of conflicts in extant bucket types. That is, the relationship between properties should be clearly documented and enforced in code.

This is in addition to my previously stated desire that we clearly document the bucket property inheritance graph and ensure that what's documented is, in fact, what the code enforces in all cases.
Background
TTL support presently exists in Riak through different methods.
Object TTL is a hard problem in Riak due to AAE. To build an efficient anti-entropy system, state must be cached in the vnode, and in the case of parallel-mode AAE a secondary store of keys and clocks is required. If the backend "disappears" an object, how is that state synced into the AAE system so that the AAE system correctly represents the state of the vnode backend? When the vnode backend and the AAE state get out of sync, this is only fixed by AAE rebuilds (tree or store rebuilds), but these happen independently by vnode and are deliberately spaced out, and so will prompt false repair work (i.e. object recovery).
There is also the question of when an object should expire. Should this be based on a TTL from the PUT time at the vnode, or a TTL from the last-modified date of the object? In the former case, objects will outlive their expiry on vnodes following non-API update means (e.g. read repair, handoff). In the latter case, a single node with a (slow) skewed clock could prompt PUTs that are immediately expired on all vnodes.
Proposal
The proposal is to add two bucket properties: `sweeper_object_ttl` and `backend_object_ttl`. For a given bucket, only one of these properties (or neither) should be set. Both settings will be false by default.
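For illustration only (the Caveats below assume these rules would be documented rather than enforced), a minimal sketch of the mutual-exclusion check, where `validate_ttl_props/1` is a hypothetical helper:

```erlang
%% Hypothetical helper: reject property sets where both TTL modes are set.
%% Both properties default to false, matching the proposal.
-spec validate_ttl_props(proplists:proplist()) -> ok | {error, atom()}.
validate_ttl_props(Props) ->
    Sweeper = proplists:get_value(sweeper_object_ttl, Props, false),
    Backend = proplists:get_value(backend_object_ttl, Props, false),
    case {Sweeper, Backend} of
        {false, _} -> ok;
        {_, false} -> ok;
        {_, _} -> {error, conflicting_ttl_properties}
    end.
```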
The `sweeper_object_ttl` is a TTL where garbage collection will be handled by a sweeper process. The TTL will be configured in days, as it is assumed that this method will not be used for short-lived objects (except in `riak_test`). Objects in buckets with a `sweeper_object_ttl` will be tracked and recovered via intra-cluster or inter-cluster anti-entropy, and the solution requires that tictacaae is enabled. Garbage collection will be triggered periodically, and will be applied across all vnodes and clusters from the same trigger.
If an object is fetched beyond the TTL but garbage collection has not yet occurred, a not_found response will be returned, but with the vector clock of the current object in the metadata (as would occur when fetching a tombstone), in order to protect against sibling creation when a key is re-used after a not_found. Object keys for expired objects which have not yet been garbage collected may be returned in index queries.
The `backend_object_ttl` is a TTL where garbage collection will be handled by the backend. The TTL will be configured in seconds, as it is assumed that this method will be used for short-lived objects that do not require anti-entropy protection. The backend will be responsible for expiring both index entries and the object at the same time point, so that index queries and object presence will generally appear to be in sync. The `backend_object_ttl` will work when tictacaae is not active; if tictacaae is active, the `aae_tree_exclude` bucket property must also be set.
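As a usage sketch, the property would be set like any other bucket-type property. The `riak_core_bucket_type` API already exists; the property name is the one proposed here, and the 3600-second value is illustrative:

```erlang
%% Sketch: create and activate a bucket type carrying the proposed
%% backend_object_ttl property (one hour, in seconds).
ok = riak_core_bucket_type:create(<<"sessions">>, [{backend_object_ttl, 3600}]),
ok = riak_core_bucket_type:activate(<<"sessions">>).
```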
Design

The changes required to support the `sweeper_object_ttl` are as follows.

The `riak_kv_get_fsm` is to check for the property and, if it is present, before replying to the client any fetched object should have its LMD checked. If the LMD indicates the object is beyond its TTL, then the object's contents will be emptied and the `X-Riak-Deleted` key added to each content's metadata. The `riak_kv_get_fsm` will not prompt any garbage collection. The client should receive a not_found with a vector clock (of the object), just as with a tombstone.
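A minimal sketch of the LMD check described above, where `beyond_ttl/2` is a hypothetical helper and the TTL is in days per the proposal:

```erlang
%% Hypothetical helper: has the object outlived the bucket's TTL?
%% LMD is the object's last-modified datetime; TTLDays comes from the
%% sweeper_object_ttl bucket property.
-spec beyond_ttl(calendar:datetime(), pos_integer()) -> boolean().
beyond_ttl(LMD, TTLDays) ->
    Now = calendar:datetime_to_gregorian_seconds(calendar:universal_time()),
    LMDSecs = calendar:datetime_to_gregorian_seconds(LMD),
    Now - LMDSecs > TTLDays * 86400.
```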
A sweeper process will run on each node, periodically waking to find buckets with a `sweeper_object_ttl` set. A random bucket type or bucket will then be selected, and an `erase_keys` aae_fold run for that selection with a last-modified date range based on the TTL (a sketch of such a fold follows below). The erase_keys fold will then prompt the garbage collection, using the riak_kv_eraser queue. Once this is complete (in that the fold is finished and the deletes are queued), the sweeper will sleep for a period calculated from the number of buckets that need sweeping, so that it can get through the whole sweep load in one day (each node will have its own sweeper, so on average each bucket will be swept once for each node in the cluster).

The existing eraser process handles deletion from AAE, replication etc. Reaping will be dependent on delete_mode, though as part of this change it may make sense to look at reusing the sweeper to also perform per-bucket reap work (i.e. add a `sweeper_tomb_ttl` bucket property to automate the reap of tombstones left in keep mode).
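A sketch of the fold the sweeper might issue. The `erase_keys` aae_fold and the riak_kv_eraser queue already exist in recent Riak releases, but the exact query tuple shape and the `sweep_bucket/3` wrapper shown here are assumptions, to be checked against `riak_kv_clusteraae_fsm`:

```erlang
%% Sketch: erase all keys in a bucket whose last-modified date is older
%% than the TTL cutoff, queueing the deletes on the local eraser.
sweep_bucket(Client, Bucket, TTLDays) ->
    Cutoff = os:system_time(second) - TTLDays * 86400,
    riak_client:aae_fold({erase_keys,
                          Bucket,      %% e.g. {<<"type">>, <<"bucket">>}
                          all,         %% whole key range
                          all,         %% no segment filter
                          {0, Cutoff}, %% last-modified range up to the cutoff
                          local},      %% queue deletes on the local eraser
                         Client).
```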
The `backend_object_ttl` requires a change to `riak_kv_put_fsm` to check for the property, and then an option to be added to the put request being passed to the riak_kv_vnodes. If the riak_kv_vnode sees a TTL option (with a TTL value), then before performing the PUT it should check for the TTL backend capability, and if it is present request a temp_put into the backend rather than a put. The TTL used in the backend should be taken from the option. Only leveled will initially have the TTL backend capability.
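A sketch of the vnode-side dispatch is below. `capabilities/2` and `put/5` follow the existing riak_kv backend behaviour, while `temp_put` is the proposed new callback, so its arity and argument order here are assumptions:

```erlang
%% Sketch: choose between put and temp_put based on the TTL put option
%% and the backend's advertised capabilities.
do_put(Mod, Bucket, Key, IndexSpecs, Val, PutOpts, ModState) ->
    case proplists:get_value(ttl, PutOpts) of
        undefined ->
            Mod:put(Bucket, Key, IndexSpecs, Val, ModState);
        TTLSecs when is_integer(TTLSecs), TTLSecs > 0 ->
            {ok, Caps} = Mod:capabilities(Bucket, ModState),
            case lists:member(ttl, Caps) of
                true ->
                    Mod:temp_put(Bucket, Key, IndexSpecs, Val, TTLSecs,
                                 ModState);
                false ->
                    %% Backend cannot expire objects; fall back to a plain put
                    Mod:put(Bucket, Key, IndexSpecs, Val, ModState)
            end
    end.
```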
The `riak_kv_vnode:handle_handoff_data/2` function will need to check bucket properties for inbound data to set the TTL PUT option, but in this case the TTL must be calculated from the object LMD and the `backend_object_ttl` for the bucket (as only some of the TTL will remain). There will also need to be a change to the diff_index_specs calculation when temp_put is used, to handle TTL objects being mutated: index entries that already exist will need to be re-added (so that the new index entries have the updated TTL).
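The remaining-TTL calculation on handoff is simple arithmetic; in this sketch `remaining_ttl/3` is a hypothetical helper taking times in epoch seconds:

```erlang
%% Hypothetical helper: a handed-off object has already consumed part of
%% its TTL, so the backend must be given only the remainder.
remaining_ttl(LMDSecs, BucketTTLSecs, NowSecs) ->
    max(0, BucketTTLSecs - (NowSecs - LMDSecs)).
```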
Alternative Design Ideas

An alternative is to support only the `sweeper_object_ttl` (and not a `backend_object_ttl`), but the use of a sweeper for both small TTLs (i.e. of the order of 1 hour) and very large TTLs (i.e. more than 1 year) may make scheduling problematic. There may be confusion in having two methods, but the separation between use cases (long-lived objects requiring anti-entropy vs short-lived objects without anti-entropy protection) may make the confusion manageable.

An alternative to supporting `backend_object_ttl` would be a more complete solution to the problem of low-latency access to temporary objects (e.g. for web session storage): a replacement for the multi-backend/memory-backend approach whereby, as well as managing expiry, the solution would guarantee the objects are stored in memory (for performance), and access to the store could be prioritised to avoid the issue of the shared vnode queue (e.g. through OTP 28 priority messages, or by adding a dedicated additional vnode in Riak for this purpose).

Testing
Caveats
For the `sweeper_object_ttl` property, 2i queries will reflect the results within the store, including any expired objects which have not yet been garbage collected. There is no simple way of dealing with this within Riak: the client will need to handle a 2i query result where the object is then discovered to be not_found, but this scenario can happen anyway. This will also be true for aae_fold (although the LMD range could be used there if this proved a problem). The behaviour for 2i queries will therefore differ between `sweeper_object_ttl` and `backend_object_ttl`, where the index entries will expire with the object.
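As an illustration of the client-side handling, a sketch using the existing riakc Erlang client, where `fetch_live/3` is a hypothetical wrapper:

```erlang
%% Sketch: tolerate 2i results whose objects have expired but not yet
%% been swept, by dropping keys that now fetch as not_found.
fetch_live(Pid, Bucket, Keys) ->
    lists:filtermap(
        fun(Key) ->
            case riakc_pb_socket:get(Pid, Bucket, Key) of
                {ok, Obj} -> {true, Obj};
                {error, notfound} -> false  %% expired, awaiting sweep
            end
        end,
        Keys).
```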
There are presently invalid property combinations in Riak (e.g. with allow_mult/lww), and this proposal introduces further potential rules where one property requires another to be either present or absent. It is assumed that documentation of these invalid property combinations continues to be a sufficient answer, and that there is no need for validation when setting the combination.
The solution depends on aae_fold queries for an LMD "older than" a given date. It should be noted that although aae_fold queries run faster when checking for an LMD "more recent than" a given date, all "older than" queries are in effect a full scan of the key range (and there is no way of optimising this).
In multi-cluster environments there is the issue of erase/reap jobs progressing at a slower pace in sink clusters than in the source cluster, which may lead to full-sync activity and false resurrection. There is no workaround for this other than the tuning of `tombstone_pause`. As this feature would now erase objects unprompted by an operator, requiring a workaround via operator tuning is less acceptable.

Pull Requests
Planned Release for Inclusion