Lance Blob v2 #3996
Replies: 9 comments 28 replies
-
We are very pleased to see this design, as we have a strong demand for managing image files. I would like to know what the usage boundaries of this design are and what file sizes are suitable for this storage method. For example, a high-definition movie may be 20GB; is that suitable?
-
@jackye1995 Is the proposal only a design, or is there a prototype implementation?
-
When an object is relatively large, I prefer to store a reference to the object in Lance rather than the object itself. This lets me load the object asynchronously or in a streaming manner instead of loading the entire object, for example when previewing the dataset. Additionally, I hope the reference to the object can be managed by Lance rather than being a user-defined object storage path.
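For concreteness, here is a minimal sketch of the pattern this comment describes (not an existing Lance feature): the Lance table stores only metadata plus a pointer column, and the blob is streamed on demand. The column names, paths, and the use of fsspec are assumptions for illustration.

```python
import fsspec
import lance
import pyarrow as pa

# Hypothetical table: metadata plus a pointer to the large object, not the bytes.
table = pa.table({
    "id": [1, 2],
    "caption": ["sunset", "city"],
    "video_uri": ["s3://bucket/videos/a.mp4", "s3://bucket/videos/b.mp4"],
})
lance.write_dataset(table, "refs.lance")
ds = lance.dataset("refs.lance")

# Previewing the dataset never touches the blobs.
preview = ds.to_table(columns=["id", "caption"])

# Stream a single blob lazily, only when it is actually needed.
uri = ds.to_table(columns=["video_uri"])["video_uri"][0].as_py()
with fsspec.open(uri, "rb") as f:
    first_chunk = f.read(1 << 20)  # read only the first 1 MiB
```

Note this sketch shows the user-managed variant the comment wants to avoid: the ask is that Lance manage these references internally rather than leaving `video_uri` as a user-defined path, which is what the proposals further down try to address.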
-
I'm considering whether it's feasible for Lance to support different kinds of blobs, ranging from KiB to TiB. We could perform small-blob merges as in our existing implementation while also supporting storing large blobs as a whole. I can see that both kinds of blobs have their own use cases.
I'm not sure whether it's a good idea, but supporting all of these together would make Lance a simple object storage system: we could store, retrieve, and manage blobs of different sizes along with their metadata using Lance's API in Python or SQL. Is this something you expect, @westonpace @jackye1995?
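A purely illustrative sketch of what "Lance as a simple object store" could look like from Python. The wrapper functions, schema, and column names are hypothetical, not existing Lance APIs; only `lance.write_dataset`, `lance.dataset`, and `to_table` are real calls.

```python
import lance
import pyarrow as pa

SCHEMA = pa.schema([
    ("key", pa.string()),
    ("content_type", pa.string()),
    ("payload", pa.large_binary()),  # KiB..GiB values; TiB blobs would need chunking
])

def put_blobs(uri: str, items: list[tuple[str, str, bytes]]) -> None:
    """Store (key, content_type, payload) blobs along with their metadata."""
    table = pa.Table.from_pylist(
        [{"key": k, "content_type": ct, "payload": p} for k, ct, p in items],
        schema=SCHEMA,
    )
    # Appends to an existing dataset; creating it on first write is left out of this sketch.
    lance.write_dataset(table, uri, mode="append")

def get_blob(uri: str, key: str) -> bytes:
    """Fetch a single blob by key via a filtered scan."""
    ds = lance.dataset(uri)
    hit = ds.to_table(columns=["payload"], filter=f"key = '{key}'")
    return hit["payload"][0].as_py()
```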
-
One challenge for "Blob as Special Data" is alignment. Currently, each column is written in fragments, and we require that a fragment has data for all columns; these data files must have the same number of rows. For example, a fragment with 1M rows may have three data files, and each data file must have 1M rows. If we have special blob files, we potentially break this alignment. In my original proposal, I suggested we fix this by putting blob data in a completely separate dataset. This way we don't have to complicate the table format much at all, but it adds a lot of complexity to the compute / query engine (it needs to join the different datasets). Another approach could be to support a DataFile made up of many smaller files, so a "data file" with 1M rows might actually consist of 10 different Lance files, each with 100K rows. This adds more complexity to the table format and the compute engine, but it might be less complexity than making it a separate dataset.
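To make the invariant concrete, a toy model of the alignment rule (not Lance's actual metadata structures): every data file in a fragment must carry the same number of rows, and a blob data file would have to obey the same rule unless the format adopts one of the relaxations described above.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    num_rows: int

@dataclass
class Fragment:
    files: list[DataFile]

    def is_aligned(self) -> bool:
        # All column data files must describe exactly the same rows.
        return len({f.num_rows for f in self.files}) <= 1

frag = Fragment(files=[
    DataFile("scalars.lance", 1_000_000),
    DataFile("embeddings.lance", 1_000_000),
    DataFile("blobs.lance", 1_000_000),  # a blob file must match, or alignment breaks
])
assert frag.is_aligned()
```

Under the "DataFile with many smaller files" variant, the blob entry would instead be a list of smaller Lance files whose row counts sum to the fragment's 1M rows.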
-
One challenge with "linked files" is handling cleanup. When a row is deleted we need to make sure we delete the corresponding linked file (or maybe we don't, depending on user configuration). There's also the question of update detection: if updates are made to linked files, those updates will not be tracked by Lance. We also have to handle the possibility of orphans (the linked file is now gone). Also, for ingestion, we may want to support a unified ingestion mode where a user provides rows with all columns and we take the linked columns and write them out as new standalone files in object storage.
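A hedged sketch of the reconciliation this implies, assuming a hypothetical `blob_path` column and fsspec for listing. It only identifies orphans and dangling links; whether Lance is allowed to delete them is the user-configuration question mentioned above.

```python
import fsspec
import lance

def reconcile_linked_files(dataset_uri: str, blob_prefix: str):
    """Compare the link column against what actually exists in object storage."""
    ds = lance.dataset(dataset_uri)
    # Assumes blob_path values are stored in the filesystem's internal path form
    # (e.g. "bucket/videos/a.mp4"), matching what fs.find() returns.
    referenced = set(ds.to_table(columns=["blob_path"])["blob_path"].to_pylist())

    fs, root = fsspec.core.url_to_fs(blob_prefix)
    stored = set(fs.find(root))

    orphans = stored - referenced    # linked files no row references any more
    dangling = referenced - stored   # rows whose linked file has disappeared
    return sorted(orphans), sorted(dangling)
```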
-
@westonpace Does this mean we need to introduce a new special encoding for this solution? I prefer this design for storing huge blobs (if we can find a way to solve these issues), so that we can manage everything based on transactional semantics.
-
The first solution uses two datasets to manage normal data and special data ("big blobs"). I think the issue ultimately becomes how to perform ACID transactions across multiple datasets; otherwise, the alignment between the two datasets cannot be guaranteed. Do I understand correctly? @westonpace Solution 2 makes use of the alignment inside fragments but introduces splitting rows within fragments, which is interesting to me.
-
Very nice idea! In general, an owned manifest seems like a good solution; RocksDB's blob files and even an SSD's FTL own blobs in this way. Maybe the problem of "compaction is very expensive (or anything else triggering a rewrite)" is that it's a trade-off between space amplification and write amplification. For reference, I previously found https://arxiv.org/abs/2005.00044 a good example of managing space for this.
-
Currently the main multimodal data type support in Lance comes from
In general, we have a very solid strategy for "medium-sized" blobs (>128 bytes, <1MB) that are used for training purposes, but we want to have a better strategy for "large" blobs (>1MB), which are typically raw images and videos.
Some challenges for large blobs:
Questions to answer
I think we need to ask ourselves 2 questions separately:
Q1: should we store blobs in a Lance dataset vs in a separate Volume/Bucket
Regardless of whether we store blobs inline, or out of line with a pointer to the stored location, I think it generally makes sense to store them in a Lance dataset instead of having a combination of a Lance dataset plus a Volume/Bucket that holds the blobs separately. It is in general much cleaner for governance and sharing: the blobs can be managed just like any other column data without the need to worry about working across 2 systems.
Q2: should we store blobs inline or out of line
We could store objects out of line within a directory in the Lance table and point to them, continue to store everything inline, or take a hybrid approach based on some size threshold. The key tradeoff seems to be "number of files" vs "rewrite cost". If we store everything out of line, there will be a lot of objects, which introduces a lot of IOPS that could become a bottleneck in ML training (especially on GPUs) and creates scalability issues for listing-related operations like cleanup. If we store everything inline, we can batch many images and videos into the same Lance file and target ~100GB files for blobs, which would be a ~5000x reduction in the number of files for 20MB videos, or ~100,000x for 1MB images. There are far fewer IOPS and listing is much more efficient, but any rewrite of these files becomes very expensive.
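A quick sanity check of the file-count arithmetic (decimal units assumed):

```python
target_blob_file = 100 * 10**9   # ~100 GB target file size for batched blobs

video = 20 * 10**6               # 20 MB video
image = 1 * 10**6                # 1 MB image

print(target_blob_file // video)  # 5_000   -> ~5000x fewer objects for videos
print(target_blob_file // image)  # 100_000 -> ~100,000x fewer objects for images
```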
In general, I think we are still working towards storing everything inline given these benefits. In the next section, we have 2 proposals. The Blob Store proposal is essentially a hybrid approach where data is still stored inline, but in a separate dataset, and the row ID is used as a pointer to fetch the corresponding data. The Blob Manifest proposal is a pure inline approach that tries to avoid rewriting large data files where possible.
Proposals
There are some ongoing ideas about how to improve the experience:
Blob Store
There is an initial (not fully completed) implementation of the blob store feature from @westonpace that tries to store blobs as a separate internal dataset within a Lance dataset:
The blob dataset can be committed, compacted, updated, deleted, and cleaned up just like a normal Lance dataset, but it should always stay in sync with the source Lance dataset it belongs to.
Commit
When doing a commit, the blob dataset will need to operate in "unversioned" mode; the local dataset is the primary owner of versions. When we mutate the remote dataset we do so as a staged commit: we provide a read version (as a UUID) and an operation, and we get back an output version (as a UUID). We set this output version in the local manifest (as the remote dataset version) and commit the local manifest.
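A sketch of that flow in Python-flavored pseudocode; none of these methods or fields are existing Lance APIs, they just name the steps described above.

```python
def commit_with_blob_dataset(local, blob, operation):
    # 1. The blob ("remote") dataset runs in unversioned mode: we mutate it as a
    #    staged commit against the version we read from.
    read_version = local.manifest.blob_dataset_version            # a UUID
    staged_version = blob.stage_commit(read_version, operation)   # returns a new UUID

    # 2. Record the staged version in the local manifest; the local dataset
    #    stays the single owner of the version history.
    local.manifest.blob_dataset_version = staged_version

    # 3. Commit the local manifest. If this commit fails, the staged blob
    #    version is simply never referenced and can be cleaned up later.
    local.commit_manifest()
    return staged_version
```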
Scan
When scanning data we will need to join the data from the local dataset with the data from the remote dataset. We will do this via a join on the primary key: we scan the local data as normal and treat it as the left input to a left outer join, then do an "indexed scan" of the remote data and set that as the right input. In many cases (at least with our blob store) this outer join can be an ordered outer join (merge join) and thus pretty cheap.
To join efficiently, we will need the move- and update-stable row ID feature, so that we can quickly find the related rows to take from the blob dataset.
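A minimal pyarrow sketch of the join shape. The real implementation would be a streaming merge join on the stable row ID rather than a full in-memory join, and the column names here are assumptions.

```python
import pyarrow as pa

# Output of the normal local scan.
local = pa.table({"_rowid": [0, 1, 2], "caption": ["a", "b", "c"]})

# Output of the "indexed scan" of the blob dataset: only the row ids we need.
blobs = pa.table({"_rowid": [0, 2], "blob": [b"jpeg-bytes", b"mp4-bytes"]})

# Left outer join on the stable row id; row 1 keeps a null blob.
joined = local.join(blobs, keys="_rowid", join_type="left outer")
```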
Blob Manifest
Another approach would be to introduce some sort of blob manifest concept in a fragment, so that we don't always need to rewrite the big blob data files. This would be an alternative to the blob store approach, but I have not thought through the pros and cons yet; I'm just putting it here for people to think about first.
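One possible shape for such a manifest, purely as a sketch to anchor the discussion (not a proposed format): the fragment keeps per-row references into large blob files, so compaction can rewrite the scalar data files and carry the references over without copying blob bytes.

```python
from dataclasses import dataclass

@dataclass
class BlobRef:
    file: str     # large blob data file, written once
    offset: int   # byte offset of the value within that file
    length: int   # byte length of the value

@dataclass
class BlobManifest:
    column: str
    refs: list[BlobRef]  # one entry per row in the fragment

    def slice(self, start: int, stop: int) -> "BlobManifest":
        # Compaction or row-splitting copies references, not blob bytes.
        return BlobManifest(self.column, self.refs[start:stop])
```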
Compaction
Update