cc @mapleFU
You have to look in the actual column data to find this; it's not in the footer FileMetaData. A column chunk consists of a list of pages, with an optional dictionary page first. Each page, including the dictionary page, is preceded by an uncompressed PageHeader in Thrift format. For data pages (PageHeader.type == DATA_PAGE), the PageHeader contains a DataPageHeader [1], which records the encodings used for the repetition and definition levels.

I've noticed that in Overture Maps Parquet files, which are also written with parquet-mr (aka parquet-java), BIT_PACKED is stated as the encoding for the definition levels when the definition levels are empty. This might be an encoding artifact, but I haven't confirmed it. Given that definition_level_encoding is a required metadata field, this is certainly a possibility.

[1] From the Parquet Thrift definition:
/** Data page header */
struct DataPageHeader {
  /**
   * Number of values, including NULLs, in this data page.
   *
   * If a OffsetIndex is present, a page must begin at a row
   * boundary (repetition_level = 0). Otherwise, pages may begin
   * within a row (repetition_level > 0).
   **/
  1: required i32 num_values
  /** Encoding used for this data page **/
  2: required Encoding encoding
  /** Encoding used for definition levels **/
  3: required Encoding definition_level_encoding;
  /** Encoding used for repetition levels **/
  4: required Encoding repetition_level_encoding;
  /** Optional statistics for the data in this page **/
  5: optional Statistics statistics;
}
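
To make the "look in the page header" step concrete, here is a rough sketch of decoding a PageHeader by hand, assuming it is serialized with the Thrift compact protocol (which is how Parquet writes its metadata). The helper names (`read_struct`, `level_encodings`) are made up for illustration, and the sample bytes are hand-built rather than taken from a real file. Only the Thrift types that occur in page headers are handled; lists, maps and doubles are left out on purpose.

```python
import io

# Thrift compact-protocol type ids (low nibble of each field header byte).
T_TRUE, T_FALSE, T_BYTE, T_I16, T_I32, T_I64 = 1, 2, 3, 4, 5, 6
T_BINARY, T_STRUCT = 8, 12

# Encoding enum values as listed in parquet.thrift.
ENCODINGS = {0: "PLAIN", 2: "PLAIN_DICTIONARY", 3: "RLE", 4: "BIT_PACKED",
             5: "DELTA_BINARY_PACKED", 6: "DELTA_LENGTH_BYTE_ARRAY",
             7: "DELTA_BYTE_ARRAY", 8: "RLE_DICTIONARY", 9: "BYTE_STREAM_SPLIT"}

def read_varint(buf):
    """Read an unsigned LEB128 varint."""
    result, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7

def read_zigzag(buf):
    """Read a zigzag-encoded signed integer (used for i16/i32/i64 and enums)."""
    n = read_varint(buf)
    return (n >> 1) ^ -(n & 1)

def read_value(buf, ttype):
    if ttype in (T_TRUE, T_FALSE):      # field-level bools live in the type nibble
        return ttype == T_TRUE
    if ttype == T_BYTE:
        return int.from_bytes(buf.read(1), "little", signed=True)
    if ttype in (T_I16, T_I32, T_I64):
        return read_zigzag(buf)
    if ttype == T_BINARY:
        return buf.read(read_varint(buf))
    if ttype == T_STRUCT:
        return read_struct(buf)
    raise ValueError(f"Thrift type {ttype} is not handled by this sketch")

def read_struct(buf):
    """Decode one struct into {field_id: value}."""
    fields, last_id = {}, 0
    while True:
        header = buf.read(1)[0]
        if header == 0:                 # STOP byte ends the struct
            return fields
        delta, ttype = header >> 4, header & 0x0F
        field_id = last_id + delta if delta else read_zigzag(buf)
        fields[field_id] = read_value(buf, ttype)
        last_id = field_id

def level_encodings(page_header_bytes):
    """Return the encodings of a v1 data page, or None for other page types."""
    header = read_struct(io.BytesIO(page_header_bytes))
    # PageHeader: field 1 = type (0 is DATA_PAGE), field 5 = data_page_header.
    if header.get(1) != 0 or 5 not in header:
        return None
    dph = header[5]                     # DataPageHeader fields 2, 3, 4
    return {"values": ENCODINGS[dph[2]],
            "definition_levels": ENCODINGS[dph[3]],
            "repetition_levels": ENCODINGS[dph[4]]}

# Hand-built header: DATA_PAGE, 40-byte page, 10 values,
# PLAIN values, BIT_PACKED definition and repetition levels.
SAMPLE = bytes.fromhex("1500155015502c15141500150815080000")
print(level_encodings(SAMPLE))
```

Against a real file you would first seek to the column chunk's first page (the offsets are in the footer's ColumnMetaData) and feed the bytes from there. The sketch only shows how the header itself can be read and does not cover DataPageHeaderV2.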
Hi, I'm working with a Parquet file written via the Java Parquet library (v1.15.0). When I inspect the metadata using PyArrow, I see the following for a single INT32 column:
'encodings': [('PLAIN', 'BIT_PACKED')]
However, my understanding is that for INT32 columns with no dictionary encoding and no RLE usage, only PLAIN should be relevant.
My questions are:
What does it mean when both PLAIN and BIT_PACKED appear in the encodings list?
Is BIT_PACKED actually used in this case (e.g., for definition levels or repetition levels)?
Is there a reliable way to check whether BIT_PACKED was applied to actual data, or just to the repetition/definition levels?
If the column is required and has no nulls, would any level encoding (BIT_PACKED or RLE) still apply?
Thanks in advance for any insights!