Question: Integration Plan/Solution for Lance with Iceberg #4110
-
Here I see @westonpace is working on integrating Lance with Iceberg, mainly treating Lance as a basic file format like Parquet/ORC. I would like to discuss, or learn about any prior analysis of, which way to integrate with Iceberg.

Considering performance, it seems the second way (using the Lance table format) would be better. So I would like to discuss the implementation approach: if we use the current way, can we achieve good performance? And have we evaluated the pros and cons of both implementations?

```java
public enum FileFormat {
  PUFFIN("puffin", false),
  ORC("orc", true),
  PARQUET("parquet", true),
  AVRO("avro", true),
  LANCE("lance", true) // add Lance as a file format of Iceberg instead of a table format
}
```

I hope we can discuss this together, as I have also noticed that Lance has good performance in processing/reading multi-dimensional data. @dacort @eddyxu @niyue @rok @dnsco
Replies: 5 comments 2 replies
-
I did run an initial experiment to prove out the file reader API (which worked well). Since then I've also spoken with @jackye1995 on this topic, and he had some interesting ideas too. Regarding approaches 1 & 2, it depends on the goal.

Approach 1 is simpler. It is a very direct implementation of the file reader API and matches the spirit of the API. It has the potential to speed up a standard Iceberg workload by using Lance files instead of Parquet, providing some performance benefits. However, it won't be able to utilize the performance benefits of the table format. For example, secondary indices (vector index, btree index, bitmap index, etc.) won't be used. Also, the performance benefits will depend on the plans made by the compute engine. The generic file reader API does not have APIs for random access, so compute engines will not be able to exploit the file format either. It will not allow existing Lance datasets (tables) to be read by Iceberg, because Lance datasets don't put all columns into a single file.

Approach 2 is more complex. However, it comes with more advantages. It would allow existing Lance datasets to be read by Iceberg. It does not match the file reader API exactly, because we aren't mapping files 1:1, but we could potentially trick the file reader into working here. For example, a URI
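One way to picture the "trick the file reader" idea is to encode a whole Lance fragment reference into a single URI string that Iceberg can carry around as if it were one data file. The `lance+fragment:` scheme and its layout below are invented purely for illustration; they are not part of Lance or Iceberg.

```java
// Hypothetical sketch: pack a (dataset path, fragment id) pair into one URI so
// a reader API that expects a single file path per data file can still address
// a multi-file Lance fragment. The scheme name is made up for this example.
final class FragmentUri {
    final String datasetPath; // root of the Lance dataset
    final int fragmentId;     // fragment within the dataset

    private FragmentUri(String datasetPath, int fragmentId) {
        this.datasetPath = datasetPath;
        this.fragmentId = fragmentId;
    }

    // Parse e.g. "lance+fragment:/warehouse/t1#42" into its two parts.
    static FragmentUri parse(String uri) {
        String prefix = "lance+fragment:";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("not a lance fragment URI: " + uri);
        }
        String rest = uri.substring(prefix.length());
        int hash = rest.lastIndexOf('#');
        if (hash < 0) {
            throw new IllegalArgumentException("missing fragment id: " + uri);
        }
        return new FragmentUri(rest.substring(0, hash),
                               Integer.parseInt(rest.substring(hash + 1)));
    }
}
```

A format plugin could then resolve such a URI back into a fragment and open the fragment's actual column files itself, instead of reading the URI as one physical file.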
-
I have a Google Doc design from earlier about approach 2: https://docs.google.com/document/d/1AUAO2M_kmVvoCA2GYCBFvAcaM8zieU2fJmHVYuiew9o/edit?tab=t.0#heading=h.3fnc1e5nn72y Somehow I always thought I had put it up, but clearly I did not...
-
I think option 2 definitely makes more sense to me. From an implementation perspective, the only missing API feature is that the Iceberg FileIO needs to be accessible in a reader or writer, so that it can be used to open another file when necessary. That is not a big feature to add from the Iceberg API perspective. Given this is not exactly a plugin model but an implement-first, then-get-accepted model, having the community accept this direction might be the more challenging part.
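The missing piece described above can be sketched as a format reader that receives the table's FileIO and uses it to open sibling files, such as the other column files of a Lance fragment. The `FileIO` interface below is a minimal stand-in for Iceberg's `org.apache.iceberg.io.FileIO`, and `LanceFragmentReader` is entirely hypothetical.

```java
import java.util.List;

// Minimal stand-in for Iceberg's FileIO: the one capability the reader needs
// is "open another file by path". The real interface returns InputFile/
// OutputFile objects rather than raw bytes.
interface FileIO {
    byte[] readFully(String path);
}

// Hypothetical reader illustrating why FileIO injection matters: a Lance
// fragment spreads columns across several files, so the reader must be able
// to open files beyond the single URI Iceberg handed it.
final class LanceFragmentReader {
    private final FileIO io;

    LanceFragmentReader(FileIO io) {
        this.io = io; // injected by the engine, not constructed by the format
    }

    byte[][] readColumnFiles(List<String> columnFilePaths) {
        byte[][] buffers = new byte[columnFilePaths.size()][];
        for (int i = 0; i < columnFilePaths.size(); i++) {
            buffers[i] = io.readFully(columnFilePaths.get(i));
        }
        return buffers;
    }
}
```

Today, Iceberg's built-in format readers are handed an already-opened input file, which is exactly why exposing FileIO to the format layer is the gap being discussed.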
-
One other thing to think about: is it a matter of option 1 vs option 2, or are these incremental steps? Maybe doing option 1 first is good enough to unlock some integrations and build a foundation, and then we can do option 2 on top.
-
Hello @westonpace and @jackye1995, thanks for your replies. I have seen the design doc provided by @jackye1995, which clearly shows two design approaches/options for integrating the Lance format with Iceberg, but at different levels: option 1 integrates Lance files with Iceberg; option 2 integrates Lance segments with Iceberg (treating a Lance segment as an Iceberg data file). As I see it, the Iceberg and Paimon communities are first doing the integration with Lance based on option 1, as follows. I have some questions and want to discuss how to implement this well and achieve the primary benefits of option 1. Hope to have a discussion @westonpace @jackye1995
My questions are as follows:
Hoping for your kind replies @westonpace @jackye1995 @dacort @eddyxu
Yes, it's just a demo implementation.
I think you still get better random access, but you don't have the full table index to help with it.
The Iceberg reader API is mostly OLAP-centric, focusing on scan instead of take. It assumes that you have to read a range of the file into memory, perform the scan, and then perform whatever filtering is necessary.
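The scan-vs-take distinction above can be illustrated with a toy example, using a plain `int[]` as a stand-in for a column. Both method names are descriptive only, not real Iceberg or Lance APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustration of the two access patterns: OLAP engines mostly scan; vector
// search workloads mostly take. Lance is designed to make take cheap.
final class AccessPatterns {
    // Scan: read a contiguous range into memory, then filter.
    static List<Integer> scan(int[] column, int start, int end, IntPredicate keep) {
        List<Integer> out = new ArrayList<>();
        for (int i = start; i < end; i++) {
            if (keep.test(column[i])) out.add(column[i]);
        }
        return out;
    }

    // Take: fetch exactly the requested rows (e.g. ids returned by a vector
    // index), touching nothing else.
    static List<Integer> take(int[] column, int[] rowIds) {
        List<Integer> out = new ArrayList<>();
        for (int id : rowIds) out.add(column[id]);
        return out;
    }
}
```

A scan-only API forces a take to be emulated as "read the whole range, then discard most of it", which is exactly the mismatch being pointed out.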