Description
There are two ways that the OTAP Arrow schema can change between record batches:
- A dictionary builder may overflow, causing the column's type to be upgraded either to a wider dictionary key type or to the native array type
- Optional fields may or may not be present
Currently the Parquet Exporter does not have the correct logic to handle these cases: it expects every record batch to have the same schema (due to this assumption in the internal `ArrowWriter`).
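To make the failure mode concrete, here is a minimal sketch (the `attr` column name is made up) of a dictionary key upgrade between batches tripping the writer's fixed-schema assumption:

```rust
use std::sync::Arc;

use arrow_array::types::{UInt16Type, UInt8Type};
use arrow_array::{DictionaryArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn dict_schema(key: DataType) -> Arc<Schema> {
    Arc::new(Schema::new(vec![Field::new(
        "attr",
        DataType::Dictionary(Box::new(key), Box::new(DataType::Utf8)),
        true,
    )]))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Batch 1: dictionary column with a u8 key.
    let dict8: DictionaryArray<UInt8Type> = vec!["a", "b"].into_iter().collect();
    let batch1 = RecordBatch::try_new(dict_schema(DataType::UInt8), vec![Arc::new(dict8)])?;

    // Batch 2: the builder overflowed, so the key was upgraded to u16.
    let dict16: DictionaryArray<UInt16Type> = vec!["a", "c"].into_iter().collect();
    let batch2 = RecordBatch::try_new(dict_schema(DataType::UInt16), vec![Arc::new(dict16)])?;

    // The writer is pinned to the first batch's schema...
    let mut writer = ArrowWriter::try_new(Vec::new(), batch1.schema(), None)?;
    writer.write(&batch1)?;
    // ...so without apache/arrow-rs#8005 this write is rejected with an
    // incompatible-type error instead of transparently widening the key.
    println!("{:?}", writer.write(&batch2));
    Ok(())
}
```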
Fixes Needed:
There are multiple fixes needed to support adaptive schemas:
A. Handle compatible value sources for a single column (Dict8, Dict16, Native) when writing:
We can get this "for free" by upgrading to arrow-rs 56, which has these changes: apache/arrow-rs#8005 <-- I screwed up the impl of this; additional changes are needed here:
https://github.com/apache/arrow-rs/blob/2418c59efa50edfd456dcc042e2bf84692398745/parquet/src/arrow/arrow_writer/levels.rs#L552-L563
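Independent of the upgrade, one way to sidestep the mismatch on older arrow-rs is to normalize each batch to the writer's declared schema with arrow's cast kernel before writing. A minimal sketch; `conform_to_schema` is a hypothetical helper, not an existing API, and it assumes the batch's columns align positionally with the writer schema (same fields, only the value encodings differ):

```rust
use arrow_array::{ArrayRef, RecordBatch};
use arrow_cast::cast;
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical helper: cast every column of `batch` to the type declared by
/// the writer's `schema`, e.g. widening Dict8 -> Dict16 or unwrapping a
/// dictionary into its native value type.
fn conform_to_schema(batch: &RecordBatch, schema: &SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns = schema
        .fields()
        .iter()
        .zip(batch.columns())
        .map(|(field, col)| cast(col.as_ref(), field.data_type()))
        .collect::<Result<Vec<ArrayRef>, ArrowError>>()?;
    RecordBatch::try_new(schema.clone(), columns)
}
```

The obvious cost is an extra copy of every column whose type doesn't already match, which the arrow-rs 56 changes avoid.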
B. Handle optional columns when writing:
There are probably a few options for this:
- In otel-arrow, add all the optional columns to the schema as entirely null columns (see the sketch after this list). The downside of doing this is that we'll allocate memory for these all-null arrays.
- Try to support this in parquet-rs:
  - While writing, we may be able to have the `ArrowRowGroupWriter` resolve the `ArrowColumnWriter` by name here, instead of going by schema position. We'd also need to be able to initialize new `ArrowColumnWriter`s adaptively as new columns are added, and be able to give them null prefixes: https://github.com/apache/arrow-rs/blob/a9b6077b2d8d5b5aee0e97e7d4335878e8da1876/parquet/src/arrow/arrow_writer/mod.rs#L811-L820
  - We may need to have `ArrowWriter` keep track of a dynamic arrow schema, instead of using the one it's initialized with: https://github.com/apache/arrow-rs/blob/a9b6077b2d8d5b5aee0e97e7d4335878e8da1876/parquet/src/arrow/arrow_writer/mod.rs#L179
- Have otel-arrow's parquet exporter manage the `ArrowColumnWriter`s and dynamic schema itself, rather than trying to push all this code into `ArrowWriter`.
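For the first option, a minimal sketch of what the padding could look like on the otel-arrow side; `pad_missing_columns` and `superset` are hypothetical names, and every field that can be missing from a batch must be nullable in the superset schema:

```rust
use arrow_array::{new_null_array, ArrayRef, RecordBatch};
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical helper: pad `batch` out to the full superset schema,
/// materializing an all-null array for every optional column the batch lacks.
fn pad_missing_columns(batch: &RecordBatch, superset: &SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns: Vec<ArrayRef> = superset
        .fields()
        .iter()
        .map(|field| match batch.column_by_name(field.name()) {
            Some(col) => col.clone(),
            // This is the downside: we allocate a null array per missing column.
            None => new_null_array(field.data_type(), batch.num_rows()),
        })
        .collect();
    RecordBatch::try_new(superset.clone(), columns)
}
```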
I think option 2 would be ideal, but I'm not sure whether what we're trying to achieve is too specialized to be accepted into the default `ArrowWriter`. If so, we should go with option 3.
C. Try to support easier read-back:
parquet-rs's `ArrowWriter` writes the Arrow schema into the parquet file's metadata, and on read it will try to convert the parquet data back into the original Arrow types if this metadata is present. Currently, the writer just embeds the schema from the first batch. This is a problem, because the embedded schema may not be the schema we want on read: for example, if a dictionary column with a u8 key accumulates more than 255 distinct values across batches, the embedded schema can cause an overflow on read.
Note: this is a problem for parquet-rs generally, and isn't strictly related to otel-arrow. The same thing would happen for any writer whose batches have more than 255 distinct values between them.
- We could disable writing the arrow schema metadata (see the first sketch below). The downside is that when reading back, dictionaries might not get used where they otherwise could.
- We could try to dynamically keep track of the arrow type of each column, and write either the dictionary type with the widest key we observed, or the native type (see the second sketch below).
The solution will probably depend on how we resolve the issue above related to optional fields.
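For the first option, parquet-rs already exposes a knob for this: `ArrowWriterOptions::with_skip_arrow_metadata`. A minimal sketch:

```rust
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_writer::ArrowWriterOptions;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "attr",
        DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
        true,
    )]));

    // Skip embedding the Arrow schema in the file's key-value metadata, so a
    // reader infers types from the Parquet schema instead of trying to force
    // the data back into a u8-keyed dictionary. The trade-off: dictionaries
    // might not get rebuilt on read where they otherwise could.
    let options = ArrowWriterOptions::new().with_skip_arrow_metadata(true);
    let writer = ArrowWriter::try_new_with_options(Vec::new(), schema, options)?;
    writer.close()?;
    Ok(())
}
```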
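For the second option, a rough sketch of the kind of per-column type folding this would need; `widest` and `key_width` are hypothetical helpers, and real code would also have to handle mismatched value types:

```rust
use arrow_schema::DataType;

/// Hypothetical: byte width of a dictionary key type (0 if not primitive).
fn key_width(dt: &DataType) -> usize {
    dt.primitive_width().unwrap_or(0)
}

/// Hypothetical fold over the Arrow type of one column across batches:
/// keep the widest dictionary key seen so far; if any batch stops using a
/// dictionary, fall back to the native value type.
fn widest(a: &DataType, b: &DataType) -> DataType {
    use DataType::*;
    match (a, b) {
        (Dictionary(ka, va), Dictionary(kb, _)) => {
            let key = if key_width(ka) >= key_width(kb) { ka } else { kb };
            Dictionary(key.clone(), va.clone())
        }
        // Either side abandoned the dictionary encoding: use the native type.
        (Dictionary(_, v), _) | (_, Dictionary(_, v)) => (**v).clone(),
        _ => a.clone(),
    }
}
```

The folded type per column would then be what gets written into the arrow schema metadata when the file is closed, rather than the first batch's type.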