Add Enum type support to arrow-avro and Minor Decimal type fix (#7852)
# Which issue does this PR close?
- Part of #4886
- Related to #6965
# Rationale for this change
The `arrow-avro` crate currently lacks support for the Avro `enum` type,
which is a standard and commonly used type in Avro schemas. This
omission prevents users from reading Avro files containing enums,
limiting the crate's utility.
This change adds support for decoding Avro enums by mapping them to the
Arrow `DictionaryArray` type. The mapping is a natural fit: an Avro enum
value is encoded as an index into the schema's symbol list, which
corresponds directly to dictionary keys and values. Implementing this
feature brings the `arrow-avro` crate closer to full Avro specification
compliance and makes it more usable for real-world data.
# What changes are included in this PR?
This PR introduces comprehensive support for Avro enum decoding along
with a minor Avro decimal decoding fix. The key changes are:
1. **Schema Parsing (`codec.rs`):**
   * A new `Codec::Enum(Arc<[String]>)` variant was added to represent a
     parsed enum and its associated symbols.
   * The `make_data_type` function now parses `ComplexType::Enum` schemas.
     It also stores the original symbols as a JSON string in the `Field`'s
     metadata under the key `"avro.enum.symbols"` to ensure schema fidelity
     and enable lossless round-trip conversions.
   * The `Codec::data_type` method was updated to map the internal
     `Codec::Enum` to the corresponding Arrow
     `DataType::Dictionary(Box<Int32>, Box<Utf8>)`.
2. **Decoding Logic (`reader/record.rs`):**
   * A new `Decoder::Enum(Vec<i32>, Arc<[String]>)` variant was added to
     manage the state of decoding enum values.
   * The `Decoder` was enhanced to create, decode, and flush `Enum` types:
     * `try_new` creates the decoder.
     * `decode` reads the Avro `int` index from the byte buffer.
     * `flush` constructs the final `DictionaryArray<Int32Type>` using the
       collected indices as keys and the stored symbols as the dictionary
       values.
     * `append_null` was extended to handle nullable enums.
3. **Minor Decimal Type Decoding Fix (`codec.rs`):**
   * A minor decimal decoding fix was made in `make_data_type`: the
     `(Some("decimal"), c @ Codec::Fixed(sz))` arm of `match
     (t.attributes.logical_type, &mut field.codec)` was unreachable. The
     new decimal integration tests in `arrow-avro/src/reader/mod.rs`
     caught the issue.
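
The enum decoding steps in item 2 can be illustrated without the Arrow types. The sketch below is not the crate's code: it uses only the Rust standard library, and `read_avro_int`, `EnumColumn`, and `decode_enum_column` are hypothetical names chosen for illustration. It assumes the Avro binary encoding, in which an enum value is written as a zig-zag, variable-length `int` holding the zero-based position of the symbol, and it collects those indices as keys next to the symbol list, mirroring the keys/values split of a dictionary array.

```rust
/// Reads one Avro `int` (zig-zag, variable-length) from the front of `buf`.
/// Returns the decoded value and the number of bytes consumed, or `None`
/// if the buffer ends early or the varint is longer than 5 bytes.
fn read_avro_int(buf: &[u8]) -> Option<(i32, usize)> {
    let mut acc: u32 = 0;
    for (i, &byte) in buf.iter().enumerate().take(5) {
        // The low 7 bits of each byte carry payload, least significant group first.
        acc |= ((byte & 0x7f) as u32) << (7 * i);
        if byte & 0x80 == 0 {
            // High bit clear marks the last byte; undo the zig-zag mapping.
            let value = ((acc >> 1) as i32) ^ -((acc & 1) as i32);
            return Some((value, i + 1));
        }
    }
    None
}

/// Hypothetical stand-in for a dictionary-encoded column:
/// `keys[row]` indexes into `symbols`.
struct EnumColumn {
    keys: Vec<i32>,
    symbols: Vec<String>,
}

impl EnumColumn {
    /// Resolves the symbol stored at `row`.
    fn value_at(&self, row: usize) -> &str {
        &self.symbols[self.keys[row] as usize]
    }
}

/// Decodes `rows` consecutive enum values from `buf`, checking that every
/// index refers to an existing symbol.
fn decode_enum_column(
    mut buf: &[u8],
    symbols: &[&str],
    rows: usize,
) -> Result<EnumColumn, String> {
    let mut keys = Vec::with_capacity(rows);
    for _ in 0..rows {
        let (idx, used) = read_avro_int(buf)
            .ok_or_else(|| "truncated or overlong varint".to_string())?;
        if idx < 0 || idx as usize >= symbols.len() {
            return Err(format!("enum index {} out of range", idx));
        }
        keys.push(idx);
        buf = &buf[used..];
    }
    Ok(EnumColumn {
        keys,
        symbols: symbols.iter().map(|s| (*s).to_string()).collect(),
    })
}
```

For example, with symbols `["SPADES", "HEARTS", "CLUBS"]`, the bytes `0x04 0x00 0x02` decode to keys `[2, 0, 1]`. Keeping the symbols as a separate value list means each row stores only a small integer, which is the property the dictionary representation relies on.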
# Are these changes tested?
* Yes, test coverage was added for the new `Enum` type:
  * New unit tests in `record.rs` validate both non-nullable and nullable
    enum decoding.
  * The existing integration test suite in `arrow-avro/src/reader/mod.rs`
    validates end-to-end behavior with a new `avro/simple_enum.avro` test
    case, ensuring compatibility with the overall reader infrastructure.
* New tests were also added for the `Decimal` and `Fixed` types:
  * The same integration test suite was extended to cover
    `avro/simple_fixed.avro`, `avro/fixed_length_decimal.avro`,
    `avro/fixed_length_decimal_legacy.avro`, `avro/int32_decimal.avro`,
    and `avro/int64_decimal.avro`.
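
As background for the fixed-length decimal cases, the Avro `decimal` logical type stores the unscaled value as a big-endian two's-complement integer, with the scale taken from the schema. The following is a minimal standard-library sketch of that interpretation, not the code in `codec.rs` or `record.rs`; `fixed_decimal_to_i128` and `format_decimal` are hypothetical helper names, and the 16-byte limit is an assumption made so the value fits in an `i128`.

```rust
/// Interprets `bytes` as a big-endian two's-complement integer.
/// Returns `None` for empty input or input wider than 16 bytes.
fn fixed_decimal_to_i128(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend: pad with 0xff for negative values, 0x00 otherwise.
    let fill = if bytes[0] & 0x80 != 0 { 0xff } else { 0x00 };
    let mut out = [fill; 16];
    out[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(out))
}

/// Renders an unscaled value with `scale` digits after the decimal point.
fn format_decimal(unscaled: i128, scale: u32) -> String {
    let sign = if unscaled < 0 { "-" } else { "" };
    let digits = unscaled.unsigned_abs().to_string();
    let scale = scale as usize;
    if scale == 0 {
        return format!("{}{}", sign, digits);
    }
    // Left-pad with zeros so at least one digit precedes the point.
    let padded = format!("{:0>width$}", digits, width = scale + 1);
    let (int_part, frac_part) = padded.split_at(padded.len() - scale);
    format!("{}{}.{}", sign, int_part, frac_part)
}
```

With a scale of 2, the two bytes `0x04 0xD2` represent the unscaled value 1234 and render as `12.34`, while `0xFB 0x2E` represents -1234.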
# Are there any user-facing changes?
N/A