Description
Sadly, Parquet doesn't support UnionType yet (apache/parquet-format#44), so I wanted to try out whether Lance supports it, as that would be yet another nice argument for switching to it.
What I built is a bit hacky (an unnecessary Rust -> Python -> Rust switcheroo), but I had to construct the UnionArray in Rust, since pyarrow's from_pylist doesn't seem to support it either (apache/arrow#44182):
#[pyfunction]
pub fn build_union_array() -> PyArrowType<ArrayData> {
    let int_array = Int32Array::from(vec![Some(1), None, Some(34)]);
    let float_array = Float64Array::from(vec![None, Some(3.2), None]);
    let type_ids = [0_i8, 1, 0].into_iter().collect::<ScalarBuffer<i8>>();
    let union_fields = [
        (0, Arc::new(Field::new("A", DataType::Int32, false))),
        (1, Arc::new(Field::new("B", DataType::Float64, false))),
    ]
    .into_iter()
    .collect::<UnionFields>();
    let children = vec![Arc::new(int_array) as Arc<dyn Array>, Arc::new(float_array)];
    let array = UnionArray::try_new(union_fields, type_ids, None, children).unwrap();
    PyArrowType(array.to_data())
}

With that, I was able to build a UnionArray, which I used to generate a pyarrow.Table and tried to write that to a Lance dataset with this code:
import pyarrow as pa
import lib
import lance
array = lib.build_union_array()
print(array)
print(array.type)
table = pa.Table.from_arrays([array], names=["foo"])
lance.write_dataset(table, "lance")

But unfortunately that yielded:
-- is_valid: all not null
-- type_ids: [
0,
1,
0
]
-- child 0 type: int32
[
1,
null,
34
]
-- child 1 type: double
[
null,
3.2,
null
]
sparse_union<A: int32 not null=0, B: double not null=1>
Traceback (most recent call last):
  File "[...]", line 10, in <module>
    lance.write_dataset(table, "lance")
  File "[...]/.venv/lib/python3.12/site-packages/lance/dataset.py", line 5264, in write_dataset
    inner_ds = _write_dataset(reader, uri, params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: LanceError(Schema): Unsupported data type: Union([(0, Field { name: "A", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), (1, Field { name: "B", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })], Sparse), /home/runner/work/lance/lance/rust/lance-core/src/datatypes.rs:174:31
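
For anyone reading the printed array above: in a sparse union every child has the same length as the union itself, and type_ids[i] selects which child supplies the value at slot i. Below is a minimal pure-Python sketch of that lookup (no pyarrow involved). The function name resolve_sparse_union is hypothetical, and it assumes each type id equals the position of its child, as is the case here (0 for A, 1 for B).

```python
from typing import Any, List, Sequence


def resolve_sparse_union(
    type_ids: Sequence[int], children: Sequence[Sequence[Any]]
) -> List[Any]:
    """Return the logical values of a sparse union.

    Assumption: each type id is also the index of its child in `children`.
    """
    # In the sparse layout, every child must be as long as the union itself.
    for child in children:
        if len(child) != len(type_ids):
            raise ValueError("sparse union children must match the union length")
    # Slot i takes its value from the child selected by type_ids[i], at index i.
    return [children[type_id][i] for i, type_id in enumerate(type_ids)]


# Same data as the array built in build_union_array() above.
type_ids = [0, 1, 0]
children = [
    [1, None, 34],       # child 0 ("A", int32)
    [None, 3.2, None],   # child 1 ("B", double)
]

values = resolve_sparse_union(type_ids, children)
print(values)
```

The null entries in each child sit only in slots that the other child covers, so the union has no null at any position.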
It seems that it's not part of this mapping yet (
lance/rust/lance-core/src/datatypes.rs, line 171 in afc0f98: _ => {