Question: Integration Plan/Solution for Lance with Iceberg #4110
-
Here I see @westonpace is working on integrating Lance with Iceberg, mainly treating Lance as a basic file format like Parquet/ORC. I would like to discuss, or learn about any prior analysis of, which way to integrate with Iceberg.

Considering performance, it seems the second way (using the Lance table format) would be better. So I would like to discuss the implementation approach: if we use the current way, can we achieve good performance? And have we evaluated the pros and cons of both implementations?

```java
public enum FileFormat {
  PUFFIN("puffin", false),
  ORC("orc", true),
  PARQUET("parquet", true),
  AVRO("avro", true),
  LANCE("lance", true) // add Lance as a file format of Iceberg instead of a table format
}
```

I hope we can discuss this together, as I have also noticed that Lance has good performance in processing/reading multi-dimensional data. @dacort @eddyxu @niyue @rok @dnsco
Replies: 5 comments 2 replies
-
I did run an initial experiment to prove out the file reader API (which worked well). Since then I've also spoken with @jackye1995 on this topic, and he had some interesting ideas too. Regarding approaches 1 & 2, it depends on the goal.

Approach 1 is simpler. It is a very direct implementation of the file reader API and matches the spirit of the API. It has the potential to speed up a standard Iceberg workload by using Lance files instead of Parquet, providing some performance benefits. However, it won't be able to utilize the performance benefits of the table format. For example, secondary indices (vector index, btree index, bitmap index, etc.) won't be used. Also, the performance benefits will depend on the plans made by the compute engine. The generic file reader API does not have APIs for random access, so compute engines will not be able to exploit the file format either. It will not allow existing Lance datasets (tables) to be read by Iceberg, because Lance datasets don't put all columns into a single file.

Approach 2 is more complex. However, it comes with more advantages. It would allow existing Lance datasets to be read by Iceberg. It does not match the file reader API exactly, because we aren't mapping files 1:1, but we could potentially trick the file reader into working here. For example, a URI
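One way to picture the "trick the file reader" idea is to encode a whole Lance fragment reference into a single URI string that Iceberg can carry around as if it were one data file. The `lance+fragment:` scheme and its layout below are invented purely for illustration; they are not part of Lance or Iceberg.

```java
// Hypothetical sketch: pack a (dataset path, fragment id) pair into one URI so
// a reader API that expects a single file path per data file can still address
// a multi-file Lance fragment. The scheme name is made up for this example.
final class FragmentUri {
    final String datasetPath; // root of the Lance dataset
    final int fragmentId;     // fragment within the dataset

    private FragmentUri(String datasetPath, int fragmentId) {
        this.datasetPath = datasetPath;
        this.fragmentId = fragmentId;
    }

    // Parse e.g. "lance+fragment:/warehouse/t1#42" into its two parts.
    static FragmentUri parse(String uri) {
        String prefix = "lance+fragment:";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("not a lance fragment URI: " + uri);
        }
        String rest = uri.substring(prefix.length());
        int hash = rest.lastIndexOf('#');
        if (hash < 0) {
            throw new IllegalArgumentException("missing fragment id: " + uri);
        }
        return new FragmentUri(rest.substring(0, hash),
                               Integer.parseInt(rest.substring(hash + 1)));
    }
}
```

A format plugin could then resolve such a URI back into a fragment and open the fragment's actual column files itself, instead of reading the URI as one physical file.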
-
I have a Google Doc design from earlier about approach 2: https://docs.google.com/document/d/1AUAO2M_kmVvoCA2GYCBFvAcaM8zieU2fJmHVYuiew9o/edit?tab=t.0#heading=h.3fnc1e5nn72y Somehow I always thought I had put it up, but clearly I did not...
-
I think option 2 definitely makes more sense to me. From an implementation perspective, the only missing API feature is that the Iceberg FileIO needs to be accessible in a reader or writer, so that it can be used to open another file when necessary. That is not a big feature to add from the Iceberg API perspective. Given this is not exactly a plugin model but an implement-first, then-get-accepted model, having the community accept this direction might be the more challenging part.
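The missing piece described above can be sketched as a format reader that receives the table's FileIO and uses it to open sibling files, such as the other column files of a Lance fragment. The `FileIO` interface below is a minimal stand-in for Iceberg's `org.apache.iceberg.io.FileIO`, and `LanceFragmentReader` is entirely hypothetical.

```java
import java.util.List;

// Minimal stand-in for Iceberg's FileIO: the one capability the reader needs
// is "open another file by path". The real interface returns InputFile/
// OutputFile objects rather than raw bytes.
interface FileIO {
    byte[] readFully(String path);
}

// Hypothetical reader illustrating why FileIO injection matters: a Lance
// fragment spreads columns across several files, so the reader must be able
// to open files beyond the single URI Iceberg handed it.
final class LanceFragmentReader {
    private final FileIO io;

    LanceFragmentReader(FileIO io) {
        this.io = io; // injected by the engine, not constructed by the format
    }

    byte[][] readColumnFiles(List<String> columnFilePaths) {
        byte[][] buffers = new byte[columnFilePaths.size()][];
        for (int i = 0; i < columnFilePaths.size(); i++) {
            buffers[i] = io.readFully(columnFilePaths.get(i));
        }
        return buffers;
    }
}
```

Today, Iceberg's built-in format readers are handed an already-opened input file, which is exactly why exposing FileIO to the format layer is the gap being discussed.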
-
One other thing to think about: is it a matter of option 1 vs option 2, or are these incremental steps? Maybe doing option 1 first is good enough to unlock some integrations and build a foundation, and then we can do option 2 on top.
-
Hello @westonpace and @jackye1995, thanks for your replies. I have seen the design doc provided by @jackye1995, which clearly shows two design approaches/options for integrating the Lance format with Iceberg, but at different levels: option 1 integrates Lance files with Iceberg; option 2 integrates Lance segments with Iceberg (treating a Lance segment as an Iceberg data file). As I see it, the Iceberg and Paimon communities are first doing the integration with Lance based on option 1, as follows. I have some questions and want to discuss how to implement this well and achieve the primary benefits of option 1. Hope to have a discussion @westonpace @jackye1995
My questions are as follows:
Hoping for your kind replies @westonpace @jackye1995 @dacort @eddyxu
Yes, it's just a demo implementation.
I think you still get better random access, but you don't have the full table index to help with it.
The Iceberg reader API is mostly OLAP-centric, focusing on scan instead of take. It assumes that you have to read a range of the file into memory, perform the scan, and then perform whatever filtering is necessary.
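The scan-vs-take distinction above can be illustrated with a toy example, using a plain `int[]` as a stand-in for a column. Both method names are descriptive only, not real Iceberg or Lance APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustration of the two access patterns: OLAP engines mostly scan; vector
// search workloads mostly take. Lance is designed to make take cheap.
final class AccessPatterns {
    // Scan: read a contiguous range into memory, then filter.
    static List<Integer> scan(int[] column, int start, int end, IntPredicate keep) {
        List<Integer> out = new ArrayList<>();
        for (int i = start; i < end; i++) {
            if (keep.test(column[i])) out.add(column[i]);
        }
        return out;
    }

    // Take: fetch exactly the requested rows (e.g. ids returned by a vector
    // index), touching nothing else.
    static List<Integer> take(int[] column, int[] rowIds) {
        List<Integer> out = new ArrayList<>();
        for (int id : rowIds) out.add(column[id]);
        return out;
    }
}
```

A scan-only API forces a take to be emulated as "read the whole range, then discard most of it", which is exactly the mismatch being pointed out.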