
Parquet Exporter better handling of adaptive OTAP Schema #863


Description

@albertlockett

There are two ways that the OTAP Arrow schema can change between record batches:

  1. A dictionary builder may overflow, causing the column's type to be upgraded to a wider dictionary key type or to fall back to the native array type
  2. Optional fields may or may not be present

Currently the Parquet Exporter does not have the correct logic to handle these cases: it expects every record batch to have the same schema, due to this assumption in the internal ArrowWriter.
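To make the failure concrete, here is a minimal sketch (the column name `attr` is made up) of a dictionary key upgrade between two batches tripping the writer's same-schema assumption:

```rust
// Minimal sketch: ArrowWriter is created from the first batch's schema, and a
// later batch whose dictionary key type was upgraded (u8 -> u16) is rejected.
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray};
use arrow::datatypes::{UInt16Type, UInt8Type};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Batch 1: values fit in a u8-keyed dictionary.
    let dict8: DictionaryArray<UInt8Type> = vec!["a", "b"].into_iter().collect();
    let batch1 = RecordBatch::try_from_iter([("attr", Arc::new(dict8) as ArrayRef)])?;

    // Batch 2: the adaptive builder overflowed, so keys are now u16.
    let dict16: DictionaryArray<UInt16Type> = vec!["a", "b", "c"].into_iter().collect();
    let batch2 = RecordBatch::try_from_iter([("attr", Arc::new(dict16) as ArrayRef)])?;

    let mut writer = ArrowWriter::try_new(Vec::<u8>::new(), batch1.schema(), None)?;
    writer.write(&batch1)?; // ok: matches the writer's schema

    // Same logical values, different physical schema: pre-arrow-rs-56 this
    // fails with a schema/type mismatch error.
    assert!(writer.write(&batch2).is_err());
    Ok(())
}
```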

Fixes Needed:

There are multiple fixes needed to support adaptive schemas:

A. Handle compatible value sources for a single column (Dict8, Dict16, Native) when writing:

We can get this "for free" by upgrading to arrow-rs 56, which has these changes: apache/arrow-rs#8005 <-- I screwed up the impl of this; additional changes are needed here:
https://github.com/apache/arrow-rs/blob/2418c59efa50edfd456dcc042e2bf84692398745/parquet/src/arrow/arrow_writer/levels.rs#L552-L563
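Until that upgrade lands, one interim workaround is to normalize every batch to a fixed target schema before it reaches the writer. A sketch using the public `arrow::compute::cast` kernel (the helper name and the choice of target schema are illustrative, and columns are assumed to align positionally with the target fields):

```rust
// Sketch of an interim workaround: cast each column to the type declared in a
// fixed target schema (e.g. with dictionaries widened or unwrapped to their
// native value type) so ArrowWriter always sees an identical schema.
use arrow::array::ArrayRef;
use arrow::compute::cast;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn normalize(batch: &RecordBatch, target: &SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns = batch
        .columns()
        .iter()
        .zip(target.fields())
        .map(|(column, field)| cast(column, field.data_type()))
        .collect::<Result<Vec<ArrayRef>, _>>()?;
    RecordBatch::try_new(target.clone(), columns)
}
```

The cost is a per-batch cast on the write path; once arrow-rs 56 is in, the writer should accept the compatible dictionary/native representations directly.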

B. Handle optional columns when writing:

There are probably a few options for this.

I think option 2 would be ideal, but I'm not sure whether what we're trying to achieve is too specialized to be accepted into the default ArrowWriter. If so, we should go with option 3.
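Whichever option we pick, the underlying mechanic is likely the same: project each batch onto a fixed superset schema, filling absent optional columns with nulls. A minimal sketch (the `pad_to_superset` helper is hypothetical; it assumes every potentially-absent field is declared nullable in `superset`):

```rust
// Sketch: align a batch with a superset schema by looking columns up by name
// and substituting an all-null array for any optional column this batch lacks.
use arrow::array::{new_null_array, ArrayRef};
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn pad_to_superset(batch: &RecordBatch, superset: &SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns: Vec<ArrayRef> = superset
        .fields()
        .iter()
        .map(|field| {
            batch
                .column_by_name(field.name())
                .cloned()
                // Optional column missing from this batch: fill with nulls.
                .unwrap_or_else(|| new_null_array(field.data_type(), batch.num_rows()))
        })
        .collect();
    RecordBatch::try_new(superset.clone(), columns)
}
```

This also fixes column ordering as a side effect, since columns are emitted in the superset schema's order.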

C. Try to support easier read back

parquet-rs's ArrowWriter writes the Arrow schema into the parquet file's metadata, and on read it will try to convert the parquet data back into the recorded Arrow type if this metadata is present. Currently, the exporter would just embed the schema from the first batch. This is a problem, because the schema we read back may not be the schema we want. For example, if a dictionary column with a U8 key accumulates more than 255 distinct values across batches, the embedded schema can cause a key overflow on read.

Note: this is a problem for parquet-rs generally, and isn't strictly related to otel-arrow. The same thing would happen with any batches that collectively contain more than 255 distinct values.

  • We could disable writing the Arrow schema metadata. The downside is that when reading back, dictionaries might not get used where they otherwise could (see the sketch below).
  • We could dynamically keep track of each column's Arrow type and write either the dictionary type with the highest-cardinality key seen or, if any batch fell back to it, the native type

The solution will probably depend on how we resolve the issue above related to optional fields.
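For the first bullet, skipping the embedded Arrow schema is already supported by the writer options in recent parquet-rs releases (assuming a version that has `ArrowWriterOptions::with_skip_arrow_metadata`); a sketch:

```rust
// Sketch: build an ArrowWriter that does not embed the Arrow schema in the
// parquet metadata, so readers derive types from the parquet schema alone.
use std::io::Write;

use arrow::datatypes::SchemaRef;
use parquet::arrow::arrow_writer::{ArrowWriter, ArrowWriterOptions};
use parquet::errors::Result;

fn writer_without_arrow_metadata<W: Write + Send>(
    sink: W,
    schema: SchemaRef,
) -> Result<ArrowWriter<W>> {
    let options = ArrowWriterOptions::new().with_skip_arrow_metadata(true);
    ArrowWriter::try_new_with_options(sink, schema, options)
}
```

The trade-off is the one noted above: without the embedded schema, the reader won't reconstruct dictionary encodings on its own.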

Labels

arrow-rs, bug, parquet-exporter, pipeline, rust
