
Commit df837a4

zhuqi-lucas and alamb authored
fix: Incorrect inlined string view comparison after Add prefix compare for inlined (#7875)

# Which issue does this PR close?

- Closes #7874

# Rationale for this change

## Change Summary

Rework `inline_key_fast` to avoid reversing the inline data bytes by removing the global `.to_be()` on the entire 128-bit word and instead manually constructing the big-endian key in two parts: the 96-bit data portion and the 32-bit length tiebreaker.

---

### Problem

In the original implementation:

```rust
let inline_u128 = u128::from_le_bytes(raw_bytes).to_be();
```

- **What went wrong**: Calling `.to_be()` on the full 16-byte value flips _all_ bytes, including the 12 bytes of inline data.
- **Consequences**: Multi-byte strings are compared in reverse order — e.g. `"backend one"` would sort as if it were `"eno dnekcab"` — so lexicographical ordering is completely inverted.
- **Corner cases exposed**: `"backend one"` vs. `"backend two"`: the suffixes `"one"`/`"two"` compare incorrectly once reversed.

---

### Solution

```rust
#[inline(always)]
pub fn inline_key_fast(raw: u128) -> u128 {
    // 1. Decompose `raw` into little-endian bytes:
    //    - raw_bytes[0..4]  = length in LE
    //    - raw_bytes[4..16] = inline string data
    let raw_bytes = raw.to_le_bytes();

    // 2. Numerically truncate to get the low 32-bit length (endianness-free).
    let length = raw as u32;

    // 3. Build a 16-byte buffer in big-endian order:
    //    - buf[0..12]  = inline string bytes (in original order)
    //    - buf[12..16] = length.to_be_bytes() (BE)
    let mut buf = [0u8; 16];
    buf[0..12].copy_from_slice(&raw_bytes[4..16]); // inline data

    // Why convert length to big-endian for comparison?
    //
    // Rust (on most platforms) stores integers in little-endian format,
    // meaning the least significant byte is at the lowest memory address.
    // For example, a u32 value like 0x22345677 is stored in memory as:
    //
    //     [0x77, 0x56, 0x34, 0x22]   // little-endian layout
    //      LSB              MSB
    //
    // This layout is efficient for arithmetic but *not* suitable for
    // lexicographic (dictionary-style) comparison of byte arrays.
    //
    // To compare values by byte order—e.g., for sorted keys or binary trees—
    // we must convert them to **big-endian**, where:
    //
    // - The most significant byte (MSB) comes first (index 0)
    // - The least significant byte (LSB) comes last (index N-1)
    //
    // In big-endian, the same u32 = 0x22345677 would be represented as:
    //
    //     [0x22, 0x34, 0x56, 0x77]
    //
    // This ordering aligns with natural string/byte sorting, so calling
    // `.to_be_bytes()` allows us to construct keys where standard numeric
    // comparison (e.g., `<`, `>`) behaves like lexicographic byte comparison.
    buf[12..16].copy_from_slice(&length.to_be_bytes()); // length in BE

    // 4. Deserialize the buffer as a big-endian u128:
    //    buf[0] is MSB, buf[15] is LSB.
    //
    // Although `buf[0]` is stored at the lowest memory address,
    // `u128::from_be_bytes(buf)` interprets it as the **most significant byte (MSB)**,
    // and `buf[15]` as the **least significant byte (LSB)**:
    //
    // - Byte at index 0 maps to bits 127..120 (highest)
    // - Byte at index 1 maps to bits 119..112
    // - ...
    // - Byte at index 15 maps to bits 7..0 (lowest)
    //
    // So even though memory layout goes from low to high, big-endian decoding
    // treats the **first byte** as highest in value. This guarantees that
    // comparing two `u128` keys is equivalent to lexicographically comparing
    // the original inline bytes, followed by length.
    u128::from_be_bytes(buf)
}
```

---

### Testing

All existing tests — including the "backend one" vs. "backend two" and `"bar"` vs. `"bar\0"` cases — now pass, confirming both lexicographical correctness and proper length-based tiebreaking.

# What changes are included in this PR?

# Are these changes tested?

Yes

# Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
1 parent aef3bdd commit df837a4

File tree: 2 files changed, +150 −32 lines


arrow-array/src/array/byte_view_array.rs

Lines changed: 120 additions & 31 deletions
```diff
@@ -583,7 +583,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// little-endian `u128` representation, converts them to big-endian ordering, and packs them
     /// into a single `u128` value suitable for fast, branchless comparisons.
     ///
-    /// ### Why include length?
+    /// # Why include length?
     ///
     /// A pure 96-bit content comparison can’t distinguish between two values whose inline bytes
     /// compare equal—either because one is a true prefix of the other or because zero-padding
@@ -605,29 +605,85 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// key("bar\0") = 0x0000000000000000000062617200000004
     /// ⇒ key("bar") < key("bar\0")
     /// ```
+    /// # Inlining and Endianness
+    ///
+    /// - We start by calling `.to_le_bytes()` on the `raw` `u128`, because Rust’s native in-memory
+    ///   representation is little-endian on x86/ARM.
+    /// - We extract the low 32 bits numerically (`raw as u32`)—this step is endianness-free.
+    /// - We copy the 12 bytes of inline data (original order) into `buf[0..12]`.
+    /// - We serialize `length` as big-endian into `buf[12..16]`.
+    /// - Finally, `u128::from_be_bytes(buf)` treats `buf[0]` as the most significant byte
+    ///   and `buf[15]` as the least significant, producing a `u128` whose integer value
+    ///   directly encodes “inline data then length” in big-endian form.
+    ///
+    /// This ensures that a simple `u128` comparison is equivalent to the desired
+    /// lexicographical comparison of the inline bytes followed by length.
     #[inline(always)]
     pub fn inline_key_fast(raw: u128) -> u128 {
-        // Convert the raw u128 (little-endian) into bytes for manipulation
+        // 1. Decompose `raw` into little-endian bytes:
+        //    - raw_bytes[0..4]  = length in LE
+        //    - raw_bytes[4..16] = inline string data
         let raw_bytes = raw.to_le_bytes();
 
-        // Extract the length (first 4 bytes), convert to big-endian u32, and promote to u128
-        let len_le = &raw_bytes[0..4];
-        let len_be = u32::from_le_bytes(len_le.try_into().unwrap()).to_be() as u128;
-
-        // Extract the inline string bytes (next 12 bytes), place them into the lower 12 bytes of a 16-byte array,
-        // padding the upper 4 bytes with zero to form a little-endian u128 value
-        let mut inline_bytes = [0u8; 16];
-        inline_bytes[4..16].copy_from_slice(&raw_bytes[4..16]);
-
-        // Convert to big-endian to ensure correct lexical ordering
-        let inline_u128 = u128::from_le_bytes(inline_bytes).to_be();
-
-        // Shift right by 32 bits to discard the zero padding (upper 4 bytes),
-        // so that the inline string occupies the high 96 bits
-        let inline_part = inline_u128 >> 32;
-
-        // Combine the inline string part (high 96 bits) and length (low 32 bits) into the final key
-        (inline_part << 32) | len_be
+        // 2. Numerically truncate to get the low 32-bit length (endianness-free).
+        let length = raw as u32;
+
+        // 3. Build a 16-byte buffer in big-endian order:
+        //    - buf[0..12]  = inline string bytes (in original order)
+        //    - buf[12..16] = length.to_be_bytes() (BE)
+        let mut buf = [0u8; 16];
+        buf[0..12].copy_from_slice(&raw_bytes[4..16]); // inline data
+
+        // Why convert length to big-endian for comparison?
+        //
+        // Rust (on most platforms) stores integers in little-endian format,
+        // meaning the least significant byte is at the lowest memory address.
+        // For example, a u32 value like 0x22345677 is stored in memory as:
+        //
+        //     [0x77, 0x56, 0x34, 0x22]   // little-endian layout
+        //      LSB              MSB
+        //
+        // This layout is efficient for arithmetic but *not* suitable for
+        // lexicographic (dictionary-style) comparison of byte arrays.
+        //
+        // To compare values by byte order—e.g., for sorted keys or binary trees—
+        // we must convert them to **big-endian**, where:
+        //
+        // - The most significant byte (MSB) comes first (index 0)
+        // - The least significant byte (LSB) comes last (index N-1)
+        //
+        // In big-endian, the same u32 = 0x22345677 would be represented as:
+        //
+        //     [0x22, 0x34, 0x56, 0x77]
+        //
+        // This ordering aligns with natural string/byte sorting, so calling
+        // `.to_be_bytes()` allows us to construct
+        // keys where standard numeric comparison (e.g., `<`, `>`) behaves
+        // like lexicographic byte comparison.
+        buf[12..16].copy_from_slice(&length.to_be_bytes()); // length in BE
+
+        // 4. Deserialize the buffer as a big-endian u128:
+        //    buf[0] is MSB, buf[15] is LSB.
+        //
+        // Although `buf[0]` is stored at the lowest memory address,
+        // calling `u128::from_be_bytes(buf)` interprets it as the **most significant byte (MSB)**,
+        // and `buf[15]` as the **least significant byte (LSB)**.
+        //
+        // This is the core principle of **big-endian decoding**:
+        // - Byte at index 0 maps to bits 127..120 (highest)
+        // - Byte at index 1 maps to bits 119..112
+        // - ...
+        // - Byte at index 15 maps to bits 7..0 (lowest)
+        //
+        // So even though memory layout goes from low to high (left to right),
+        // big-endian treats the **first byte** as highest in value.
+        //
+        // This guarantees that comparing two `u128` keys is equivalent to lexicographically
+        // comparing the original inline bytes, followed by length.
+        u128::from_be_bytes(buf)
     }
 }
 
@@ -1164,22 +1220,35 @@ mod tests {
     ///
     /// This also includes a specific test for the “bar” vs. “bar\0” case, demonstrating why
     /// the length field is required even when all inline bytes fit in 12 bytes.
+    ///
+    /// The test includes strings that verify correct byte order (preventing reversal bugs)
+    /// and length-based tie-breaking in the composite key.
+    ///
+    /// The test confirms that `inline_key_fast` produces keys which sort consistently
+    /// with the expected lexicographical order of the raw byte arrays.
     #[test]
     fn test_inline_key_fast_various_lengths_and_lexical() {
-        /// Helper to create a raw u128 value representing an inline ByteView
-        /// - `length`: number of meaningful bytes (≤ 12)
-        /// - `data`: the actual inline data
+        /// Helper to create a raw u128 value representing an inline ByteView:
+        /// - `length`: number of meaningful bytes (must be ≤ 12)
+        /// - `data`: the actual inline data bytes
+        ///
+        /// The first 4 bytes encode length in little-endian,
+        /// the following 12 bytes contain the inline string data (unpadded).
        fn make_raw_inline(length: u32, data: &[u8]) -> u128 {
            assert!(length as usize <= 12, "Inline length must be ≤ 12");
-            assert!(data.len() == length as usize, "Data must match length");
+            assert!(
+                data.len() == length as usize,
+                "Data length must match `length`"
+            );
 
            let mut raw_bytes = [0u8; 16];
-            raw_bytes[0..4].copy_from_slice(&length.to_le_bytes()); // little-endian length
+            raw_bytes[0..4].copy_from_slice(&length.to_le_bytes()); // length stored little-endian
            raw_bytes[4..(4 + data.len())].copy_from_slice(data); // inline data
            u128::from_le_bytes(raw_bytes)
        }
 
-        // Test inputs: include the specific "bar" vs "bar\0" case, plus length and lexical variations
+        // Test inputs: various lengths and lexical orders,
+        // plus special cases for byte order and length tie-breaking
        let test_inputs: Vec<&[u8]> = vec![
            b"a",
            b"aa",
@@ -1193,23 +1262,42 @@ mod tests {
            b"abcdefghi",
            b"abcdefghij",
            b"abcdefghijk",
-            b"abcdefghijkl", // 12 bytes, max inline
+            b"abcdefghijkl",
+            // Tests for byte-order reversal bug:
+            // Without the fix, "backend one" would compare as "eno dnekcab",
+            // causing incorrect sort order relative to "backend two".
+            b"backend one",
+            b"backend two",
+            // Tests length-tiebreaker logic:
+            // "bar" (3 bytes) and "bar\0" (4 bytes) have identical inline data,
+            // so only the length differentiates their ordering.
            b"bar",
-            b"bar\0", // special case to test length tiebreaker
+            b"bar\0",
+            // Additional lexical and length tie-breaking cases with the same prefix, in correct lex order:
+            b"than12Byt",
+            b"than12Bytes",
+            b"than12Bytes\0",
+            b"than12Bytesx",
+            b"than12Bytex",
+            b"than12Bytez",
+            // Additional lexical tests
            b"xyy",
            b"xyz",
+            b"xza",
        ];
 
-        // Monotonic key order: content then length, and cross-check against GenericBinaryArray comparison
+        // Create a GenericBinaryArray for cross-comparison of lex order
        let array: GenericBinaryArray<i32> =
            GenericBinaryArray::from(test_inputs.iter().map(|s| Some(*s)).collect::<Vec<_>>());
 
        for i in 0..array.len() - 1 {
            let v1 = array.value(i);
            let v2 = array.value(i + 1);
-            // Ensure lexical ordering matches
+
+            // Assert the array's natural lexical ordering is correct
            assert!(v1 < v2, "Array compare failed: {v1:?} !< {v2:?}");
-            // Ensure fast key compare matches
+
+            // Assert the keys produced by inline_key_fast reflect the same ordering
            let key1 = GenericByteViewArray::<BinaryViewType>::inline_key_fast(make_raw_inline(
                v1.len() as u32,
                v1,
@@ -1218,6 +1306,7 @@ mod tests {
                v2.len() as u32,
                v2,
            ));
+
            assert!(
                key1 < key2,
                "Key compare failed: key({v1:?})=0x{key1:032x} !< key({v2:?})=0x{key2:032x}",
```

arrow-row/src/lib.rs

Lines changed: 30 additions & 1 deletion
```diff
@@ -2800,6 +2800,34 @@ mod tests {
            .collect()
    }
 
+    fn generate_fixed_stringview_column(len: usize) -> StringViewArray {
+        let edge_cases = vec![
+            Some("bar".to_string()),
+            Some("bar\0".to_string()),
+            Some("LongerThan12Bytes".to_string()),
+            Some("LongerThan12Bytez".to_string()),
+            Some("LongerThan12Bytes\0".to_string()),
+            Some("LongerThan12Byt".to_string()),
+            Some("backend one".to_string()),
+            Some("backend two".to_string()),
+            Some("a".repeat(257)),
+            Some("a".repeat(300)),
+        ];
+
+        // Fill up to `len` by repeating edge cases and trimming
+        let mut values = Vec::with_capacity(len);
+        for i in 0..len {
+            values.push(
+                edge_cases
+                    .get(i % edge_cases.len())
+                    .cloned()
+                    .unwrap_or(None),
+            );
+        }
+
+        StringViewArray::from(values)
+    }
+
    fn generate_dictionary<K>(
        values: ArrayRef,
        len: usize,
@@ -2880,7 +2908,7 @@ mod tests {
 
    fn generate_column(len: usize) -> ArrayRef {
        let mut rng = rng();
-        match rng.random_range(0..16) {
+        match rng.random_range(0..17) {
            0 => Arc::new(generate_primitive_array::<Int32Type>(len, 0.8)),
            1 => Arc::new(generate_primitive_array::<UInt32Type>(len, 0.8)),
            2 => Arc::new(generate_primitive_array::<Int64Type>(len, 0.8)),
@@ -2916,6 +2944,7 @@ mod tests {
            })),
            14 => Arc::new(generate_string_view(len, 0.8)),
            15 => Arc::new(generate_byte_view(len, 0.8)),
+            16 => Arc::new(generate_fixed_stringview_column(len)),
            _ => unreachable!(),
        }
    }
```
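The fill loop in `generate_fixed_stringview_column` is a plain round-robin over the edge cases. A minimal sketch of the same pattern (the generic helper name `cycle_fill` is ours, not the library's):

```rust
// Round-robin fill: repeat `edge_cases` cyclically until `len` values
// have been produced (mirrors the `i % edge_cases.len()` indexing above).
fn cycle_fill<T: Clone>(edge_cases: &[T], len: usize) -> Vec<T> {
    (0..len)
        .map(|i| edge_cases[i % edge_cases.len()].clone())
        .collect()
}

fn main() {
    let cases = ["bar", "bar\0", "backend one"];
    let filled = cycle_fill(&cases, 7);
    assert_eq!(filled.len(), 7);
    // Index 3 wraps back to the first edge case.
    assert_eq!(filled[3], "bar");
    println!("ok");
}
```

Note that in the committed version, `edge_cases.get(i % edge_cases.len())` always returns `Some` for a non-empty vec, so the `.unwrap_or(None)` branch is effectively unreachable; it simply keeps the element type `Option<String>` for `StringViewArray::from`.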
