Description
There are two ways that the OTAP Arrow schema can change between record batches:
- A dictionary builder may overflow, causing the column's type to be upgraded either to a wider dictionary key type or to the native array type
- Optional fields may or may not be present
Currently the Parquet Exporter does not have the correct logic to handle these cases: it expects every record batch to have the same schema (due to this assumption in the internal `ArrowWriter`).
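To make the failure mode concrete, here is a minimal sketch (the `attr` column name is made up) of a dictionary key upgrade between batches tripping the writer's fixed-schema assumption:

```rust
use std::sync::Arc;

use arrow_array::types::{UInt16Type, UInt8Type};
use arrow_array::{DictionaryArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn dict_schema(key: DataType) -> Arc<Schema> {
    Arc::new(Schema::new(vec![Field::new(
        "attr",
        DataType::Dictionary(Box::new(key), Box::new(DataType::Utf8)),
        true,
    )]))
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Batch 1: dictionary column with a u8 key.
    let dict8: DictionaryArray<UInt8Type> = vec!["a", "b"].into_iter().collect();
    let batch1 = RecordBatch::try_new(dict_schema(DataType::UInt8), vec![Arc::new(dict8)])?;

    // Batch 2: the builder overflowed, so the key was upgraded to u16.
    let dict16: DictionaryArray<UInt16Type> = vec!["a", "c"].into_iter().collect();
    let batch2 = RecordBatch::try_new(dict_schema(DataType::UInt16), vec![Arc::new(dict16)])?;

    // The writer is pinned to the first batch's schema...
    let mut writer = ArrowWriter::try_new(Vec::new(), batch1.schema(), None)?;
    writer.write(&batch1)?;
    // ...so without apache/arrow-rs#8005 this write is rejected with an
    // incompatible-type error instead of transparently widening the key.
    println!("{:?}", writer.write(&batch2));
    Ok(())
}
```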
Fixes Needed:
There are multiple fixes needed to support adaptive schemas:
A. Handle compatible value sources for a single column (Dict8, Dict16, Native) when writing:
We can get this "for free" by upgrading to arrow-rs 56, which has these changes: apache/arrow-rs#8005 <-- I screwed up the impl of this; additional changes are needed here:
https://github.com/apache/arrow-rs/blob/2418c59efa50edfd456dcc042e2bf84692398745/parquet/src/arrow/arrow_writer/levels.rs#L552-L563
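Independent of the upgrade, one way to sidestep the mismatch on older arrow-rs is to normalize each batch to the writer's declared schema with arrow's cast kernel before writing. A minimal sketch; `conform_to_schema` is a hypothetical helper, not an existing API, and it assumes the batch's columns align positionally with the writer schema (same fields, only the value encodings differ):

```rust
use arrow_array::{ArrayRef, RecordBatch};
use arrow_cast::cast;
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical helper: cast every column of `batch` to the type declared by
/// the writer's `schema`, e.g. widening Dict8 -> Dict16 or unwrapping a
/// dictionary into its native value type.
fn conform_to_schema(batch: &RecordBatch, schema: &SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns = schema
        .fields()
        .iter()
        .zip(batch.columns())
        .map(|(field, col)| cast(col.as_ref(), field.data_type()))
        .collect::<Result<Vec<ArrayRef>, ArrowError>>()?;
    RecordBatch::try_new(schema.clone(), columns)
}
```

The obvious cost is an extra copy of every column whose type doesn't already match, which the arrow-rs 56 changes avoid.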
B. Handle optional columns when writing:
There are probably a few options for this:
- In otel-arrow, add all the optional columns to the schema as entirely null columns (see the sketch after this list). The downside of doing this is that we'll allocate memory for these all-null arrays.
- Try to support this in parquet-rs:
  - While writing, we may be able to have the `ArrowRowGroupWriter` resolve the `ArrowColumnWriter` by name here, instead of going by schema position. We'd also need to be able to initialize new `ArrowColumnWriter`s adaptively as new columns are added, and be able to give them null prefixes: https://github.com/apache/arrow-rs/blob/a9b6077b2d8d5b5aee0e97e7d4335878e8da1876/parquet/src/arrow/arrow_writer/mod.rs#L811-L820
  - We may need to have `ArrowWriter` keep track of a dynamic arrow schema, instead of using the one it's initialized with: https://github.com/apache/arrow-rs/blob/a9b6077b2d8d5b5aee0e97e7d4335878e8da1876/parquet/src/arrow/arrow_writer/mod.rs#L179
- Have otel-arrow's parquet exporter manage the `ArrowColumnWriter`s and dynamic schema itself, rather than trying to push all this code into `ArrowWriter`.
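For the first option, a minimal sketch of what the padding could look like on the otel-arrow side; `pad_missing_columns` and `superset` are hypothetical names, and every field that can be missing from a batch must be nullable in the superset schema:

```rust
use arrow_array::{new_null_array, ArrayRef, RecordBatch};
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical helper: pad `batch` out to the full superset schema,
/// materializing an all-null array for every optional column the batch lacks.
fn pad_missing_columns(batch: &RecordBatch, superset: &SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns: Vec<ArrayRef> = superset
        .fields()
        .iter()
        .map(|field| match batch.column_by_name(field.name()) {
            Some(col) => col.clone(),
            // This is the downside: we allocate a null array per missing column.
            None => new_null_array(field.data_type(), batch.num_rows()),
        })
        .collect();
    RecordBatch::try_new(superset.clone(), columns)
}
```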
I think option 2 would be ideal, but I'm not sure whether what we're trying to achieve is too specialized to be accepted into the default `ArrowWriter`. If so, we should go with option 3.
C. Try to support easier read-back:
parquet-rs's `ArrowWriter` writes the Arrow schema into the parquet file's metadata, and on read it will try to convert the parquet data back into the original Arrow types if this metadata is present. Currently, the writer just embeds the schema from the first batch. This is a problem, because the embedded schema may not be the schema we want on read: for example, if a dictionary column with a u8 key accumulates more than 255 distinct values across batches, the embedded schema can cause an overflow on read.
Note: this is a problem for parquet-rs generally, and isn't strictly related to otel-arrow. The same thing would happen for any writer whose batches have more than 255 distinct values between them.
- We could disable writing the arrow schema metadata (see the first sketch below). The downside is that when reading back, dictionaries might not get used where they otherwise could.
- We could try to dynamically keep track of the arrow type of each column, and write either the dictionary type with the widest key we observed, or the native type (see the second sketch below).
The solution will probably depend on how we resolve the issue above related to optional fields.
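For the first option, parquet-rs already exposes a knob for this: `ArrowWriterOptions::with_skip_arrow_metadata`. A minimal sketch:

```rust
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_writer::ArrowWriterOptions;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "attr",
        DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
        true,
    )]));

    // Skip embedding the Arrow schema in the file's key-value metadata, so a
    // reader infers types from the Parquet schema instead of trying to force
    // the data back into a u8-keyed dictionary. The trade-off: dictionaries
    // might not get rebuilt on read where they otherwise could.
    let options = ArrowWriterOptions::new().with_skip_arrow_metadata(true);
    let writer = ArrowWriter::try_new_with_options(Vec::new(), schema, options)?;
    writer.close()?;
    Ok(())
}
```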
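For the second option, a rough sketch of the kind of per-column type folding this would need; `widest` and `key_width` are hypothetical helpers, and real code would also have to handle mismatched value types:

```rust
use arrow_schema::DataType;

/// Hypothetical: byte width of a dictionary key type (0 if not primitive).
fn key_width(dt: &DataType) -> usize {
    dt.primitive_width().unwrap_or(0)
}

/// Hypothetical fold over the Arrow type of one column across batches:
/// keep the widest dictionary key seen so far; if any batch stops using a
/// dictionary, fall back to the native value type.
fn widest(a: &DataType, b: &DataType) -> DataType {
    use DataType::*;
    match (a, b) {
        (Dictionary(ka, va), Dictionary(kb, _)) => {
            let key = if key_width(ka) >= key_width(kb) { ka } else { kb };
            Dictionary(key.clone(), va.clone())
        }
        // Either side abandoned the dictionary encoding: use the native type.
        (Dictionary(_, v), _) | (_, Dictionary(_, v)) => (**v).clone(),
        _ => a.clone(),
    }
}
```

The folded type per column would then be what gets written into the arrow schema metadata when the file is closed, rather than the first batch's type.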