
UnionType is not supported #5090

@jonded94

Sadly, Parquet doesn't support UnionType yet (apache/parquet-format#44), so I wanted to try out whether Lance supports it, as that would be yet another nice argument for switching to it.

What I built is a bit hacky (an unnecessary Rust -> Python -> Rust round trip), but I had to construct the UnionArray from Rust, since pyarrow's from_pylist doesn't seem to support union types either (apache/arrow#44182):

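use std::sync::Arc;

use arrow::array::{Array, ArrayData, Float64Array, Int32Array, UnionArray};
use arrow::buffer::ScalarBuffer;
use arrow::datatypes::{DataType, Field, UnionFields};
use arrow::pyarrow::PyArrowType;
use pyo3::prelude::*;

#[pyfunction]
pub fn build_union_array() -> PyArrowType<ArrayData> {
    // Child arrays; in a sparse union each child has the same length as the union.
    let int_array = Int32Array::from(vec![Some(1), None, Some(34)]);
    let float_array = Float64Array::from(vec![None, Some(3.2), None]);
    // One type id per row, selecting which child holds that row's value.
    let type_ids = [0_i8, 1, 0].into_iter().collect::<ScalarBuffer<i8>>();

    // Map each type id to its child field.
    let union_fields = [
        (0, Arc::new(Field::new("A", DataType::Int32, false))),
        (1, Arc::new(Field::new("B", DataType::Float64, false))),
    ]
    .into_iter()
    .collect::<UnionFields>();

    let children = vec![Arc::new(int_array) as Arc<dyn Array>, Arc::new(float_array)];

    // Passing `None` for the offsets makes this a sparse (not dense) union.
    let array = UnionArray::try_new(union_fields, type_ids, None, children).unwrap();

    PyArrowType(array.to_data())
}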

With that, I was able to build a UnionArray, which I used to create a pyarrow.Table that I then tried to write to a Lance dataset with this code:

import pyarrow as pa
import lib
import lance

array = lib.build_union_array()
print(array)
print(array.type)
table = pa.Table.from_arrays([array], names=["foo"])

lance.write_dataset(table, "lance")

But unfortunately that yielded:

-- is_valid: all not null
-- type_ids:   [
    0,
    1,
    0
  ]
-- child 0 type: int32
  [
    1,
    null,
    34
  ]
-- child 1 type: double
  [
    null,
    3.2,
    null
  ]
sparse_union<A: int32 not null=0, B: double not null=1>
Traceback (most recent call last):
  File "[...]", line 10, in <module>
    lance.write_dataset(table, "lance")
  File "[...]/.venv/lib/python3.12/site-packages/lance/dataset.py", line 5264, in write_dataset
    inner_ds = _write_dataset(reader, uri, params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: LanceError(Schema): Unsupported data type: Union([(0, Field { name: "A", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), (1, Field { name: "B", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })], Sparse), /home/runner/work/lance/lance/rust/lance-core/src/datatypes.rs:174:31
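
For readers less familiar with the layout printed above, here is a minimal pure-Python sketch of how a sparse union resolves to logical values: row i is taken from the child selected by type_ids[i], at the same index i. The helper name `resolve_sparse_union` is hypothetical and not part of pyarrow or Lance, and the sketch assumes the type ids coincide with the child positions (0 and 1), as they do in this issue.

```python
def resolve_sparse_union(type_ids, children):
    """Return the logical values of a sparse union.

    Assumption: each type id equals the position of its child in `children`,
    which holds for the example in this issue (ids 0 and 1).
    """
    for child in children:
        # In a sparse union every child spans the full length of the union.
        if len(child) != len(type_ids):
            raise ValueError("sparse union children must match the union length")
    # Row i comes from the child chosen by type_ids[i], at index i.
    return [children[t][i] for i, t in enumerate(type_ids)]


type_ids = [0, 1, 0]
children = [
    [1, None, 34],      # child 0: field "A" (int32)
    [None, 3.2, None],  # child 1: field "B" (double)
]

values = resolve_sparse_union(type_ids, children)
print(values)
```

The nulls in the children sit only in slots that the type ids never select, so the array itself is valid; the failure is in the schema conversion, as the LanceError shows.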

It seems that Union is not part of the Arrow-to-Lance data type mapping yet (see rust/lance-core/src/datatypes.rs, the file named in the error above), but it's probably also not supported by the actual data serializers yet.
