Lance Blob v2 #3996
Replies: 9 comments 28 replies
-
We are very pleased to see this design, as we have a strong demand for managing image files. I would like to know what the usage boundaries of this design are and what file sizes are suitable for this storage method. For example, a high-definition movie may be 20GB; is that suitable?
-
@jackye1995 Is the proposal only a design, or is there a prototype implementation?
-
When an object is relatively large, I prefer to store a reference to the object in Lance rather than the object itself. This lets me load the object asynchronously or in a streaming manner instead of loading the entire object, for example when previewing the dataset. Additionally, I hope the reference to the object can be managed by Lance rather than being a user-defined object storage path.
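For concreteness, here is a minimal sketch of the pattern this comment describes (not an existing Lance feature): the Lance table stores only metadata plus a pointer column, and the blob is streamed on demand. The column names, paths, and the use of fsspec are assumptions for illustration.

```python
import fsspec
import lance
import pyarrow as pa

# Hypothetical table: metadata plus a pointer to the large object, not the bytes.
table = pa.table({
    "id": [1, 2],
    "caption": ["sunset", "city"],
    "video_uri": ["s3://bucket/videos/a.mp4", "s3://bucket/videos/b.mp4"],
})
lance.write_dataset(table, "refs.lance")
ds = lance.dataset("refs.lance")

# Previewing the dataset never touches the blobs.
preview = ds.to_table(columns=["id", "caption"])

# Stream a single blob lazily, only when it is actually needed.
uri = ds.to_table(columns=["video_uri"])["video_uri"][0].as_py()
with fsspec.open(uri, "rb") as f:
    first_chunk = f.read(1 << 20)  # read only the first 1 MiB
```

Note this sketch shows the user-managed variant the comment wants to avoid: the ask is that Lance manage these references internally rather than leaving `video_uri` as a user-defined path, which is what the proposals further down try to address.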
-
I'm considering whether it's feasible for Lance to support different kinds of blobs, ranging from KiB to TiB. We could perform small-blob merges as in our existing implementation while also supporting storing large blobs as a whole. I can see that both kinds of blobs have their own use cases.
I'm not sure whether it's a good idea, but supporting all of these together would make Lance a simple object storage system: we could store, retrieve, and manage blobs of different sizes along with their metadata using Lance's API in Python or SQL. Is this something you expect, @westonpace @jackye1995?
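A purely illustrative sketch of what "Lance as a simple object store" could look like from Python. The wrapper functions, schema, and column names are hypothetical, not existing Lance APIs; only `lance.write_dataset`, `lance.dataset`, and `to_table` are real calls.

```python
import lance
import pyarrow as pa

SCHEMA = pa.schema([
    ("key", pa.string()),
    ("content_type", pa.string()),
    ("payload", pa.large_binary()),  # KiB..GiB values; TiB blobs would need chunking
])

def put_blobs(uri: str, items: list[tuple[str, str, bytes]]) -> None:
    """Store (key, content_type, payload) blobs along with their metadata."""
    table = pa.Table.from_pylist(
        [{"key": k, "content_type": ct, "payload": p} for k, ct, p in items],
        schema=SCHEMA,
    )
    # Appends to an existing dataset; creating it on first write is left out of this sketch.
    lance.write_dataset(table, uri, mode="append")

def get_blob(uri: str, key: str) -> bytes:
    """Fetch a single blob by key via a filtered scan."""
    ds = lance.dataset(uri)
    hit = ds.to_table(columns=["payload"], filter=f"key = '{key}'")
    return hit["payload"][0].as_py()
```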
-
One challenge for "Blob as Special Data" is alignment. Currently, each column is written in fragments, and we require that a fragment has data for all columns; these data files must have the same number of rows. For example, a fragment with 1M rows may have three data files, and each data file must have 1M rows. If we have special blob files, we potentially break this alignment. In my original proposal, I suggested we fix this by putting blob data in a completely separate dataset. This way we don't have to complicate the table format much at all, but it adds a lot of complexity to the compute / query engine (it needs to join the different datasets). Another approach could be to support a DataFile made up of many smaller files, so a "data file" with 1M rows might actually consist of 10 different Lance files, each with 100K rows. This adds more complexity to the table format and the compute engine, but it might be less complexity than making it a separate dataset.
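To make the invariant concrete, a toy model of the alignment rule (not Lance's actual metadata structures): every data file in a fragment must carry the same number of rows, and a blob data file would have to obey the same rule unless the format adopts one of the relaxations described above.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    num_rows: int

@dataclass
class Fragment:
    files: list[DataFile]

    def is_aligned(self) -> bool:
        # All column data files must describe exactly the same rows.
        return len({f.num_rows for f in self.files}) <= 1

frag = Fragment(files=[
    DataFile("scalars.lance", 1_000_000),
    DataFile("embeddings.lance", 1_000_000),
    DataFile("blobs.lance", 1_000_000),  # a blob file must match, or alignment breaks
])
assert frag.is_aligned()
```

Under the "DataFile with many smaller files" variant, the blob entry would instead be a list of smaller Lance files whose row counts sum to the fragment's 1M rows.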
-
One challenge with "linked files" is handling cleanup. When a row is deleted we need to make sure we delete the corresponding linked file (or maybe we don't, depending on user configuration). There's also the question of update detection: if updates are made to linked files, those updates will not be tracked by Lance. We also have to handle the possibility of orphans (the linked file is now gone). Also, for ingestion, we may want to support a unified ingestion mode where a user provides rows with all columns and we take the linked columns and write them out as new standalone files in object storage.
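A hedged sketch of the reconciliation this implies, assuming a hypothetical `blob_path` column and fsspec for listing. It only identifies orphans and dangling links; whether Lance is allowed to delete them is the user-configuration question mentioned above.

```python
import fsspec
import lance

def reconcile_linked_files(dataset_uri: str, blob_prefix: str):
    """Compare the link column against what actually exists in object storage."""
    ds = lance.dataset(dataset_uri)
    # Assumes blob_path values are stored in the filesystem's internal path form
    # (e.g. "bucket/videos/a.mp4"), matching what fs.find() returns.
    referenced = set(ds.to_table(columns=["blob_path"])["blob_path"].to_pylist())

    fs, root = fsspec.core.url_to_fs(blob_prefix)
    stored = set(fs.find(root))

    orphans = stored - referenced    # linked files no row references any more
    dangling = referenced - stored   # rows whose linked file has disappeared
    return sorted(orphans), sorted(dangling)
```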
-
@westonpace Does this mean we need to introduce a new special encoding for this solution? I prefer this design for storing huge blobs (if we can find a way to solve these issues), so that we can manage everything based on transactional semantics.
-
The first solution uses two datasets to manage normal data and special data ("big blobs"). I think the issue ultimately becomes how to perform ACID transactions across multiple datasets; otherwise, the alignment between the two datasets cannot be guaranteed. Do I understand correctly? @westonpace Solution 2 makes use of the alignment inside fragments but introduces splitting rows within fragments, which is interesting to me.
-
Very nice idea! In general, an owned manifest seems like a good solution; RocksDB's blob files and even an SSD's FTL own blobs in this way. Maybe the problem of "compaction is very expensive (or anything else triggering a rewrite)" is that it's a trade-off between space amplification and write amplification. For reference, I previously found https://arxiv.org/abs/2005.00044 a good example of managing space for this.
-
Currently the main multimodal data type support in Lance comes from
In general, we have a very solid strategy for "medium-sized" blobs (>128 bytes, <1MB) that are used for training purposes, but we want to have a better strategy for "large" blobs (>1MB), which are typically raw images and videos.
Some challenges for large blobs:
Questions to answer
I think we need to ask ourselves 2 questions separately:
Q1: should we store blobs in a Lance dataset vs in a separate Volume/Bucket
Regardless of whether we store blobs inline, or out of line with a pointer to the stored location, I think it generally makes sense to store them in a Lance dataset instead of having a combination of a Lance dataset plus a Volume/Bucket that holds the blobs separately. It is in general much cleaner for governance and sharing: the blobs can be managed just like any other column data without the need to worry about working across 2 systems.
Q2: should we store blobs inline or out of line
We could store objects out of line within a directory in the Lance table and point to them, continue to store everything inline, or take a hybrid approach based on some size threshold. The key tradeoff seems to be "number of files" vs "rewrite cost". If we store everything out of line, there will be a lot of objects, which introduces a lot of IOPS that could become a bottleneck in ML training (especially on GPUs) and creates scalability issues for listing-related operations like cleanup. If we store everything inline, we can batch many images and videos into the same Lance file and target ~100GB files for blobs, which would be a ~5000x reduction in the number of files for 20MB videos, or ~100,000x for 1MB images. There are far fewer IOPS and listing is much more efficient, but any rewrite of these files becomes very expensive.
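A quick sanity check of the file-count arithmetic (decimal units assumed):

```python
target_blob_file = 100 * 10**9   # ~100 GB target file size for batched blobs

video = 20 * 10**6               # 20 MB video
image = 1 * 10**6                # 1 MB image

print(target_blob_file // video)  # 5_000   -> ~5000x fewer objects for videos
print(target_blob_file // image)  # 100_000 -> ~100,000x fewer objects for images
```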
In general, I think we are still working towards storing everything inline given these benefits. In the next section, we have 2 proposals. The Blob Store proposal is essentially a hybrid approach where data is still stored inline, but in a separate dataset, and the row ID is used as a pointer to fetch the corresponding data. The Blob Manifest proposal is a pure inline approach that tries to avoid rewriting large data files where possible.
Proposals
There are some ongoing ideas about how to improve the experience:
Blob Store
There is an initial (not fully completed) implementation of the blob store feature from @westonpace that tries to store blobs as a separate internal dataset within a Lance dataset:
The blob dataset can be committed, compacted, updated, deleted, and cleaned up just like a normal Lance dataset, but it should always stay in sync with the source Lance dataset it belongs to.
Commit
When doing a commit, the blob dataset will need to operate in "unversioned" mode; the local dataset is the primary owner of versions. When we mutate the remote dataset we do so as a staged commit: we provide a read version (as a UUID) and an operation, and we get back an output version (as a UUID). We set this output version in the local manifest (as the remote dataset version) and commit the local manifest.
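A sketch of that flow in Python-flavored pseudocode; none of these methods or fields are existing Lance APIs, they just name the steps described above.

```python
def commit_with_blob_dataset(local, blob, operation):
    # 1. The blob ("remote") dataset runs in unversioned mode: we mutate it as a
    #    staged commit against the version we read from.
    read_version = local.manifest.blob_dataset_version            # a UUID
    staged_version = blob.stage_commit(read_version, operation)   # returns a new UUID

    # 2. Record the staged version in the local manifest; the local dataset
    #    stays the single owner of the version history.
    local.manifest.blob_dataset_version = staged_version

    # 3. Commit the local manifest. If this commit fails, the staged blob
    #    version is simply never referenced and can be cleaned up later.
    local.commit_manifest()
    return staged_version
```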
Scan
When scanning data we will need to join the data from the local dataset with the data from the remote dataset. We will do this via a join on the primary key: we scan the local data as normal and treat it as the left input to a left outer join, then do an "indexed scan" of the remote data and set that as the right input. In many cases (at least with our blob store) this outer join can be an ordered outer join (merge join) and thus pretty cheap.
To join efficiently, we will need the move- and update-stable row ID feature, so that we can quickly find the related rows to take from the blob dataset.
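A minimal pyarrow sketch of the join shape. The real implementation would be a streaming merge join on the stable row ID rather than a full in-memory join, and the column names here are assumptions.

```python
import pyarrow as pa

# Output of the normal local scan.
local = pa.table({"_rowid": [0, 1, 2], "caption": ["a", "b", "c"]})

# Output of the "indexed scan" of the blob dataset: only the row ids we need.
blobs = pa.table({"_rowid": [0, 2], "blob": [b"jpeg-bytes", b"mp4-bytes"]})

# Left outer join on the stable row id; row 1 keeps a null blob.
joined = local.join(blobs, keys="_rowid", join_type="left outer")
```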
Blob Manifest
Another approach would be to introduce some sort of blob manifest concept in a fragment, so that we don't always need to rewrite the big blob data files. This would be an alternative to the blob store approach, but I have not thought through the pros and cons yet; I'm just putting it here for people to think about first.
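One possible shape for such a manifest, purely as a sketch to anchor the discussion (not a proposed format): the fragment keeps per-row references into large blob files, so compaction can rewrite the scalar data files and carry the references over without copying blob bytes.

```python
from dataclasses import dataclass

@dataclass
class BlobRef:
    file: str     # large blob data file, written once
    offset: int   # byte offset of the value within that file
    length: int   # byte length of the value

@dataclass
class BlobManifest:
    column: str
    refs: list[BlobRef]  # one entry per row in the fragment

    def slice(self, start: int, stop: int) -> "BlobManifest":
        # Compaction or row-splitting copies references, not blob bytes.
        return BlobManifest(self.column, self.refs[start:stop])
```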
Compaction
Update