Comments Requested
This is nowhere near fully baked, community input will be greatly appreciated.
IF we can come up with something that looks workable, we may commit the resources to implement it.
Overview
We've been kicking around the idea of Large Object (LO) support in Riak's existing KV API.
Conceptually, it would look at the size of an incoming PUT object and, if it exceeds some threshold, stream the incoming data to the LO module (perhaps riak_lo) that would:
Create a manifest object with the original bucket/key ID.
Receive data from the stream, creating chunk objects with generated keys.
Finalize the manifest as the list of chunk keys.
On GET, if the retrieved object is a manifest, the chunks are retrieved and streamed back.
Object metadata would tie all of the pieces together and flag whether the object is "complete" so that a sweeper can clean up incomplete objects regularly.
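To make the moving parts concrete, here's a minimal sketch of what a manifest might carry. This is purely illustrative: riak_lo doesn't exist yet, and the record name, fields, and status values are all assumptions from this proposal.

```erlang
%% Hypothetical manifest for a chunked LO; every name here is
%% illustrative, since riak_lo does not exist yet.
-record(lo_manifest, {
    bucket          :: binary(),           %% original bucket from the PUT
    key             :: binary(),           %% original key from the PUT
    chunk_size      :: pos_integer(),      %% bytes per chunk object
    total_size      :: non_neg_integer(),  %% bytes received so far / final size
    chunk_keys = [] :: [binary()],         %% generated chunk keys, in order
    status = incomplete :: incomplete | complete  %% flipped on finalize
}).
```

A GET that finds a manifest would stream the chunk_keys back in order; anything still flagged incomplete past some age becomes sweeper bait (see Incomplete PUTs below).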
Alternatives
Why Not Riak CS?
The similarity with Riak CS is obvious, but the goal is simplicity and transparency.
We have users who want to simply access the Riak HTTP REST API and put whatever they want in there, whether it's a kilobyte or a gigabyte.
There are some significant efficiencies to be had by operating within the KV service itself, including leveraging the existing AuthN/AuthZ structure and the preflists for chunk objects as they're created and retrieved.
Why Not a Smart Client?
There's been discussion of a smart client that performs the transparent chunking externally.
This is conceptually not unlike the CS approach, and it brings along all of the issues of incomplete PUTs, combined with no persistent server-side knowledge of what has to be cleaned up when, for instance, the client process goes down.
The Devil is in The Details
Some thoughts, not necessarily organized ...
Data Streaming
Streaming input/output data through the HTTP API should not be particularly challenging.
Streaming transparently through the Protobuf API may involve significant complexity on the server side, and would still leave the client with a synchronous API.
Chunking Threshold Flexibility
When an object arrives that is N% above the chunking threshold, when is it worth the overhead to chunk it?
What happens when an update changes an object's size across the chunking threshold?
Incomplete PUTs
This is probably The Big One™.
One of the benefits of KV's current all-or-nothing object receivers is that we have all of the information before we start writing the object.
That's simply not viable for extremely large objects (though it might be for objects that are a small multiple of chunk size), where it could take a lot of memory and time to receive the entire object before beginning the chunking and storage process.
This inevitably means we're going to have incomplete objects that need to be cleaned up.
If the connection is dropped before the LO module gets all of the data, it knows what has been orphaned, but if Riak goes down before cleanup is complete we still need to find those objects.
A detailed metadata schema would need to be developed for some manner of periodic sweeper to find (perhaps using 2i) and delete stranded objects.
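As a strawman, if manifests carried 2i entries for their status and start time, a sweeper pass could look like the sketch below. riakc_pb_socket:get_index_eq/4 and the index_results_v1 record are the real riak-erlang-client API; the index names, cutoff logic, and maybe_reap helper are assumptions.

```erlang
-include_lib("riakc/include/riakc.hrl").

%% Strawman sweeper pass: find manifests still flagged incomplete via
%% 2i, then reap those older than some cutoff. The "lo_status" index
%% name is hypothetical.
sweep_incomplete(Pid, Bucket, MaxAgeSecs) ->
    {ok, #index_results_v1{keys = Keys}} =
        riakc_pb_socket:get_index_eq(Pid, Bucket,
                                     {binary_index, "lo_status"},
                                     <<"incomplete">>),
    Cutoff = erlang:system_time(second) - MaxAgeSecs,
    %% maybe_reap/4 (hypothetical) would fetch the manifest, compare its
    %% start timestamp against Cutoff, and delete manifest plus chunks.
    [maybe_reap(Pid, Bucket, Key, Cutoff) || Key <- Keys].
```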
Load Management
Consider the load on a Riak node when, in a single GET/PUT operation, it becomes the coordinator for hundreds of per-chunk GET/PUT operations.
Worse, consider the load when it writes 80% of those chunks, the connection is dropped, and it has to go back and delete them all, possibly while the client reconnects and retries the operation.
Can LOs be Updated?
How do you update an object that's split across dozens or hundreds of chunks?
Handling siblings would be mind-bendingly complex.
Initial thought is that LOs would have to be flagged as write-once or LWW.
The old luwak code still exists. There are differences in approach here: luwak keeps a tree with hashes of individual content parts, allowing for efficient partial updates and range GETs. There is a paragraph in the README about duplication between parts and also conflict resolution, though I'm not sure how things like hash collisions are handled. I initially prefer the idea in this proposal that, from an API perspective, there shouldn't be a difference between objects above and below the threshold.
One interesting point is how big "too big" actually is at present. Large objects were a big problem in leveldb due to write amplification; however, as of 3.4 we're only focused on leveled/bitcask, and they can currently handle relatively large objects. By default the limits are set to warn at 5MB and block at 50MB - https://github.com/OpenRiak/riak_kv/blob/openriak-3.2/priv/riak_kv.schema#L683-L695. But could larger already be supported?
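For reference, those defaults surface in riak.conf roughly as follows (key names per the linked schema; check your version's schema before relying on them):

```
## Warn in the logs past this size, reject PUTs past the maximum.
object.size.warning_threshold = 5MB
object.size.maximum = 50MB
```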
The HEAD vs GET race already improves our handling of large object fetching significantly: the whole preflist is no longer blocked on a GET reading from disk, and the TCP incast issue is no longer a problem.
There is selective_sync now, to make it easier for large objects to be written un-sync'd to the page cache.
It should be possible to change the vnode/backend relationship with the leveled backend so that, for large objects, neither GET nor PUT blocks the vnode, i.e. use an async version of the backend GET/PUT once over a certain threshold.
Generally available bandwidth and disk speeds in data centres (or cloud) are already 10x better than when the thresholds were set.
The Erlang distribution protocol now has improved handling to interleave large and small objects.
Perhaps thresholds, with some subtle changes, are already an order of magnitude too low.
If large(r) objects are to be supported via the current API, it strengthens the case for changing the API to support pure metadata changes, i.e. updating the metadata without changing the value. That would, for example, make updating indexes on large objects much easier, since it wouldn't require a re-write of the value. This has been discussed before for leveled, potentially to allow for mini-CRDTs in the metadata (e.g. counters).
To the last point first, I'm very much in favor of being able to update metadata separately from values, and in fact I believe that capability to be key to supporting chunked LOs, as I see being able to update records from "incomplete" to "complete" as a core requirement for orphan management.
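For illustration, here's what the incomplete-to-complete flip looks like with today's API: a full object round-trip just to touch usermeta, which is exactly what a metadata-only PUT would avoid. The riakc_obj and riakc_pb_socket calls are real; the lo_status usermeta key is an assumption from this proposal.

```erlang
%% Flip a manifest's status with the current API: we must fetch and
%% re-put the whole object just to change one usermeta entry. The
%% <<"lo_status">> key is hypothetical.
finalize_manifest(Pid, Bucket, Key) ->
    {ok, Obj} = riakc_pb_socket:get(Pid, Bucket, Key),
    MD0 = riakc_obj:get_update_metadata(Obj),
    MD1 = riakc_obj:set_user_metadata_entry(
            MD0, {<<"lo_status">>, <<"complete">>}),
    riakc_pb_socket:put(Pid, riakc_obj:update_metadata(Obj, MD1)).
```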
The more I think about it, the more convinced I am that LOs would need to enforce some constraints, which I believe can be expressed within the current range of error results, to make the implementation manageable.
Assume that riak_kv.chunk_threshold is the primary size we care about. It would almost certainly be the same as riak_kv.chunk_size, and probably riak_kv.max_object_size too. The two chunk settings might not even need to exist: they could simply be derived from riak_kv.max_object_size when riak_kv.large_objects is enabled, removing any indication of the implementation strategy from the published config schema (with additional knobs probably kept under the covers, either in hidden schema or advanced.config).
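In riak.conf terms, something like the sketch below, where only object.size.maximum exists today and the large_objects switch is a hypothetical from this paragraph:

```
## Hypothetical: enables transparent chunking; does not exist today.
large_objects = on
## Existing setting; with large_objects on it would double as the chunk
## threshold and chunk size, per the derivation above.
object.size.maximum = 50MB
```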
Siblings are not allowed for chunked objects. For the user, this is easily expressed as "objects with sizes over riak_kv.max_object_size behave as though their allow_mult bucket property is false." Realistically, who wants their GET result to include multiple copies of their GB object anyway?
I don't think it's worth trying to implement partial updates, a la the "file" model. The complexity is enormous, and I'm reasonably sure that the common use cases for LOs (especially if we bump riak_kv.max_object_size up) don't justify it anyway. We're already dumping a significant added workload on the Riak node (more on that below), and the cost of diffing all the chunk hashes and figuring out what to update could add a lot of complexity for little payoff.
Counter to the above, the added workload could be cheap or nearly free, at least on physical machines vs VMs. With the proliferation of CPU cores, the model of a vnode being a single process (hence using a single core) leads to CPU under-utilization with currently deployed cluster and ring sizes. We have yet to explore ring sizes over 1K to use more CPU, but we have IO constraint concerns going in that direction. Those "free" cores could potentially crunch data in RAM with little observable impact on throughput.