Carrying binary data in `Dict` #56

iboB · 2024-08-27T06:48:32Z

iboB
Aug 27, 2024
Maintainer

@pminev observed that we seem to be doing excessive copying of binary data in ac::Dict.

Currently it's stored in std::vector<uint8_t> typedef-ed as ac::Blob.

While blobs can be moved in and out of a Dict, the most obvious problem is that if the user produces data which is not std::vector<uint8_t> but, say, std::vector<float>, they will have to copy it into the Dict. (Copying further down the line, when the inference code consumes the data, can be avoided, as there we control the lifetime of the Dict itself.)
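A minimal sketch of the copy in question. `Blob` here is a local stand-in for ac::Blob, and `blobFromFloats` is a hypothetical helper name, not something from the codebase:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for ac::Blob as described above.
using Blob = std::vector<std::uint8_t>;

// Hypothetical helper: the producer's data is a vector of floats, not bytes,
// so the only way into a Blob is a byte-wise copy of the whole payload.
inline Blob blobFromFloats(const std::vector<float>& samples) {
    Blob blob(samples.size() * sizeof(float));
    if (!blob.empty()) {
        std::memcpy(blob.data(), samples.data(), blob.size());
    }
    // Moving the result into a Dict afterwards is cheap; the memcpy above is not.
    return blob;
}
```

The cost is linear in the payload size, and the caller ends up holding two copies if they want to keep the original samples around.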

This becomes a non-trivial problem if the data is large. I'd say that our current use case of audio is just on the edge of a size which is annoying to have to copy. If we ever do video inputs, this copy would be really frustrating.

Luckily, the blob type is a customization point of nlohmann::json: we can place practically anything there by supplying our own container type (the BinaryType template parameter of basic_json, which surfaces as binary_t::container_type).

To me, allowing straight-up refs as binary data is a definite no. We can't impose the lifetime management of the data on the user.

A relatively easy solution which would allow us to move any type of data into a Dict would be to use itlib::pod_vector, or other type-punnable or recastable containers that may exist. This, however, is a minuscule improvement, as it would still impose the use of itlib::pod_vector (or something else concrete) as a container. Moreover, it would allow data to be moved inside the Dict, but if the user wants to keep the data for longer (which seems like a common use case, as they would likely want to display the input), they will still have to copy it.

A more powerful solution would be to design our own buffer container type which can hold multiple things through variant or even any and use a shared_ptr of it as a Blob. This is a decent solution if C++ is the only target language.
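A rough sketch of what such a container could look like, using std::variant. The names `Buffer`, `SharedBlob` and `makeSharedBlob`, as well as the particular set of alternatives, are assumptions made for illustration only:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical buffer which owns one of several concrete containers
// and exposes it as a contiguous range of bytes.
class Buffer {
public:
    using Storage = std::variant<
        std::vector<std::uint8_t>,
        std::vector<std::int16_t>,
        std::vector<float>>;

    explicit Buffer(Storage storage) : m_storage(std::move(storage)) {}

    const std::uint8_t* data() const {
        return std::visit([](const auto& v) {
            return reinterpret_cast<const std::uint8_t*>(v.data());
        }, m_storage);
    }

    // Size in bytes, regardless of the element type held.
    std::size_t size() const {
        return std::visit([](const auto& v) {
            return v.size() * sizeof(v[0]);
        }, m_storage);
    }

private:
    Storage m_storage;
};

// What the Dict would carry: shared, read-only ownership of the buffer.
using SharedBlob = std::shared_ptr<const Buffer>;

// The vector is moved all the way in, so its heap allocation is reused, not copied.
template <typename T>
SharedBlob makeSharedBlob(std::vector<T> data) {
    return std::make_shared<const Buffer>(Buffer::Storage(std::move(data)));
}
```

With this, the user can keep their own `SharedBlob` handle (for display, say) while the Dict holds another, and neither side copies the payload.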

However it is not. We expect buffers which come from Java, Swift, and many other languages. For a complete solution here, we need to adapt further. We need to work with external ref counts.

So, we would have to create our own customization point for blobs which allows them to take and adapt pretty much anything to a buffer (dynamic polymorphism, yay).
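One possible shape for such a customization point, as a sketch only. `BlobBase`, `ContainerBlob`, `ForeignBlob` and the release-callback signature are all hypothetical; a real binding for Java or Swift would plug its own release or unpin call into the callback:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical interface: anything that can expose a contiguous byte range.
class BlobBase {
public:
    virtual ~BlobBase() = default;
    virtual const std::uint8_t* data() const = 0;
    virtual std::size_t size() const = 0; // size in bytes
};

// Adapter for C++ containers with data() and size(); takes ownership by move.
template <typename Container>
class ContainerBlob final : public BlobBase {
public:
    explicit ContainerBlob(Container c) : m_container(std::move(c)) {}
    const std::uint8_t* data() const override {
        return reinterpret_cast<const std::uint8_t*>(m_container.data());
    }
    std::size_t size() const override {
        return m_container.size() * sizeof(typename Container::value_type);
    }
private:
    Container m_container;
};

// Adapter for memory owned elsewhere (another language runtime, for example).
// It keeps the external owner alive and calls `release` exactly once on destruction,
// which is where an external ref count would be decremented.
class ForeignBlob final : public BlobBase {
public:
    using ReleaseFn = void (*)(void* owner);

    ForeignBlob(const void* data, std::size_t size, void* owner, ReleaseFn release)
        : m_data(static_cast<const std::uint8_t*>(data))
        , m_size(size)
        , m_owner(owner)
        , m_release(release) {}

    ForeignBlob(const ForeignBlob&) = delete;
    ForeignBlob& operator=(const ForeignBlob&) = delete;

    ~ForeignBlob() override {
        if (m_release) m_release(m_owner);
    }

    const std::uint8_t* data() const override { return m_data; }
    std::size_t size() const override { return m_size; }

private:
    const std::uint8_t* m_data;
    std::size_t m_size;
    void* m_owner;
    ReleaseFn m_release;
};

// What the Dict would carry.
using BlobPtr = std::shared_ptr<const BlobBase>;
```

The shared_ptr handles sharing on the C++ side, while the foreign ref count is touched only once, when the last C++ handle goes away.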

Edit:

Two more things come to mind:

- Sharing the buffer introduces the potential for data races. CoW needs to be implemented for the uber-wrapper.
- The same goes in reverse if a model produces blobs (TTS, image or video generation). Luckily again, the binary blob in Dict is mutable, so given that we have CoW, we should allow the output buffer to come from the caller.
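A minimal sketch of the CoW idea, assuming a plain byte vector as storage. `CowBlob` and its member names are hypothetical. Note the limits of the sketch: each handle detaches before writing, so writers never touch storage that another handle can see, but a single handle must still not be used from several threads without external synchronization.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical copy-on-write blob: copies of the handle share storage
// until one of them asks for write access.
class CowBlob {
public:
    using Bytes = std::vector<std::uint8_t>;

    explicit CowBlob(Bytes bytes)
        : m_bytes(std::make_shared<Bytes>(std::move(bytes))) {}

    // Read access never copies.
    const Bytes& read() const { return *m_bytes; }

    // Write access detaches first if the storage is shared with other handles.
    Bytes& write() {
        if (m_bytes.use_count() > 1) {
            m_bytes = std::make_shared<Bytes>(*m_bytes);
        }
        return *m_bytes;
    }

    bool sharesStorageWith(const CowBlob& other) const {
        return m_bytes == other.m_bytes;
    }

private:
    std::shared_ptr<Bytes> m_bytes;
};
```

A caller-provided output buffer fits the same scheme: if the caller dropped their other handles, the model writes in place; otherwise the first write detaches.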

iboB · 2024-09-21T15:40:54Z

iboB
Sep 21, 2024
Maintainer Author

Allowing the output buffer to come from the caller is no easy task, however. How should we carry it?

As a part of Dict? Possible, but it's not clean. Dicts would have to be treated as I/O arguments of sorts.

Maybe an optional additional argument to ops with buffers... or even a buffer factory?
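To make the buffer factory option concrete, here is a sketch under stated assumptions: `BufferFactory`, `defaultBufferFactory` and `runFillOp` are invented names, and the "op" just fills its output with a constant in place of real inference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical: called by an op to obtain storage for `size` bytes of output.
using BufferFactory = std::function<std::shared_ptr<Bytes>(std::size_t size)>;

// Used when the caller does not care where the output lives.
inline std::shared_ptr<Bytes> defaultBufferFactory(std::size_t size) {
    return std::make_shared<Bytes>(size);
}

// Hypothetical op with the factory as an optional additional argument.
// It writes its result straight into whatever buffer the factory hands back.
inline std::shared_ptr<Bytes> runFillOp(
    std::uint8_t value,
    std::size_t outputSize,
    const BufferFactory& factory = defaultBufferFactory)
{
    std::shared_ptr<Bytes> out = factory(outputSize);
    assert(out && out->size() >= outputSize); // the factory must honor the requested size
    std::fill_n(out->begin(), outputSize, value);
    return out;
}
```

A factory has one advantage over passing a buffer directly: the caller does not need to know the output size up front, since the op asks for exactly what it needs.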

0 replies
