Perf: Optimize comparison kernels for inlined views #7731


Merged · 10 commits · Jun 23, 2025
Changes from 8 commits

40 changes: 36 additions & 4 deletions arrow-ord/src/cmp.rs
@@ -565,24 +565,46 @@ impl<'a, T: ByteViewType> ArrayOrd for &'a GenericByteViewArray<T> {
/// Item.0 is the array, Item.1 is the index
type Item = (&'a GenericByteViewArray<T>, usize);

#[inline(always)]
fn is_eq(l: Self::Item, r: Self::Item) -> bool {
// # Safety
// The index is within bounds as it is checked in value()
let l_view = unsafe { l.0.views().get_unchecked(l.1) };
let r_view = unsafe { r.0.views().get_unchecked(r.1) };
let l_len = *l_view as u32;
let r_len = *r_view as u32;
// This is a fast path for equality check.
// We don't need to look at the actual bytes to determine if they are equal.
if l_len != r_len {
return false;
}

// When both lengths are the same, we can compare the inlined bytes; this handles the corner case after
Contributor

We could also compare earlier based on the actual length instead, like in the other place?

I am also not sure comparing the length here is really a fast path, because slice equality also checks the length first before accessing the elements.

Contributor @Dandandan, Jun 22, 2025

Actually I think we can do the eq case on the u128 value directly @zhuqi-lucas (compare both length and value to be equal)

Contributor Author

I agree with you @Dandandan, thank you for catching this! Addressed in the latest PR.
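The u128 idea discussed above can be sketched standalone. Assuming the Arrow byte-view layout (length in the low 32 bits, up to 12 inlined bytes in the bits above, unused bytes zeroed), equal length plus equal inlined bytes collapses into a single u128 equality check. `pack_inline_view` below is a hypothetical helper for illustration, not arrow-rs API:

```rust
/// Hypothetical helper mirroring the Arrow inline-view layout:
/// low 32 bits = length, bytes 4..16 = inlined data, zero-padded.
fn pack_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only strings up to 12 bytes are inlined");
    let mut view = data.len() as u128;
    for (i, b) in data.iter().enumerate() {
        view |= (*b as u128) << (32 + 8 * i);
    }
    view
}

fn main() {
    // Same length and same bytes => identical views.
    assert_eq!(pack_inline_view(b"abc"), pack_inline_view(b"abc"));
    // Zero padding keeps "\0" and "" distinct: the lengths differ.
    assert_ne!(pack_inline_view(b"\0"), pack_inline_view(b""));
    // One u128 compare covers both the length and the value.
    assert_ne!(pack_inline_view(b"hello"), pack_inline_view(b"hellp"));
}
```

This is why a single integer comparison can replace the separate length check and byte comparison on the equality path.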

Contributor Author

After this change, eq improves a lot: 4.75x faster.

critcmp  main fast_path_view --filter "inlined bytes"
group                                               fast_path_view                         main
-----                                               --------------                         ----
eq StringViewArray StringViewArray inlined bytes    1.00      2.7±0.01ms        ? ?/sec    4.75     12.8±0.10ms        ? ?/sec
lt StringViewArray StringViewArray inlined bytes    1.00     47.3±0.55ms        ? ?/sec    1.12     52.8±1.38ms        ? ?/sec

// the len compare, for example:
// > println!("{:?}", "\0".cmp(""))
// > Greater
if l.0.data_buffers().is_empty() && r.0.data_buffers().is_empty() {
Contributor

I think the code probably depends on inlining / moving this check outside of the loop?
Should we add #[inline(always)] to it to make sure it keeps doing that? 🤔

Contributor Author

Thank you @Dandandan for this good suggestion, I addressed it in the latest PR.

Contributor Author @zhuqi-lucas, Jun 22, 2025

I added #[inline(always)] to eq(), lt() and compare_byte_view(), though I am not sure if it's the right way.

// If the lengths are equal, only the inlined bytes need to be compared
let l_bytes = unsafe { GenericByteViewArray::<T>::inline_value(l_view, 12) };
let r_bytes = unsafe { GenericByteViewArray::<T>::inline_value(r_view, 12) };
return l_bytes.cmp(r_bytes).is_eq();
}

// # Safety
// The index is within bounds as it is checked in value()
Contributor

nit: I think it would be good to move the Safety comment up along with the calls to unsafe

Contributor Author

Thank you @alamb for the review, addressed it in the latest PR.

unsafe { GenericByteViewArray::compare_unchecked(l.0, l.1, r.0, r.1).is_eq() }
}

#[inline(always)]
fn is_lt(l: Self::Item, r: Self::Item) -> bool {
if l.0.data_buffers().is_empty() && r.0.data_buffers().is_empty() {
let l_view = unsafe { l.0.views().get_unchecked(l.1) };
let r_view = unsafe { r.0.views().get_unchecked(r.1) };
let l_len = *l_view as u32 as usize;
let r_len = *r_view as u32 as usize;
let l_bytes = unsafe { GenericByteViewArray::<T>::inline_value(l_view, l_len) };
Contributor @Dandandan, Jun 23, 2025

Perhaps the main thing we could do next is handle inline_value at the u128 level rather than as &[u8], comparing the value and the length together on the u128 when they are equal.

Contributor Author

Thank you @Dandandan, interesting, I will investigate this idea!
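One caveat for this lt path: a plain numeric compare on a packed u128 view would not give lexicographic order, which is why the code still compares byte slices. A standalone sketch, assuming the Arrow inline-view layout (length in the low 32 bits, inlined bytes in increasing significance above it); `pack_inline_view` is a hypothetical helper, not arrow-rs API:

```rust
/// Hypothetical helper mirroring the Arrow inline-view layout:
/// low 32 bits = length, bytes 4..16 = inlined data, zero-padded.
fn pack_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only strings up to 12 bytes are inlined");
    let mut view = data.len() as u128;
    for (i, b) in data.iter().enumerate() {
        view |= (*b as u128) << (32 + 8 * i);
    }
    view
}

fn main() {
    // Lexicographically "aa" < "b", because 'a' < 'b' at the first byte...
    assert!(&b"aa"[..] < &b"b"[..]);
    // ...but the packed views order the other way: the second byte of "aa"
    // sits in higher bits than anything in "b".
    assert!(pack_inline_view(b"aa") > pack_inline_view(b"b"));
}
```

So the u128 trick works directly for equality, while ordering would need extra work (e.g. rearranging the bytes) before an integer compare could replace the slice compare.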

let r_bytes = unsafe { GenericByteViewArray::<T>::inline_value(r_view, r_len) };
return l_bytes.cmp(r_bytes).is_lt();
}
// # Safety
// The index is within bounds as it is checked in value()
unsafe { GenericByteViewArray::compare_unchecked(l.0, l.1, r.0, r.1).is_lt() }
@@ -618,6 +640,7 @@ impl<'a> ArrayOrd for &'a FixedSizeBinaryArray {
}

/// Compares two [`GenericByteViewArray`] at index `left_idx` and `right_idx`
#[inline(always)]
pub fn compare_byte_view<T: ByteViewType>(
left: &GenericByteViewArray<T>,
left_idx: usize,
@@ -626,6 +649,15 @@ pub fn compare_byte_view<T: ByteViewType>(
) -> std::cmp::Ordering {
assert!(left_idx < left.len());
assert!(right_idx < right.len());
if left.data_buffers().is_empty() && right.data_buffers().is_empty() {
let l_view = unsafe { left.views().get_unchecked(left_idx) };
let r_view = unsafe { right.views().get_unchecked(right_idx) };
let l_len = *l_view as u32 as usize;
let r_len = *r_view as u32 as usize;
let l_bytes = unsafe { GenericByteViewArray::<T>::inline_value(l_view, l_len) };
let r_bytes = unsafe { GenericByteViewArray::<T>::inline_value(r_view, r_len) };
return l_bytes.cmp(r_bytes);
}
unsafe { GenericByteViewArray::compare_unchecked(left, left_idx, right, right_idx) }
}

28 changes: 28 additions & 0 deletions arrow/benches/comparison_kernels.rs
@@ -69,6 +69,17 @@ fn make_string_array(size: usize, rng: &mut StdRng) -> impl Iterator<Item = Option<String>>
})
}

fn make_inlined_string_array(
size: usize,
rng: &mut StdRng,
) -> impl Iterator<Item = Option<String>> + '_ {
(0..size).map(|_| {
let len = rng.random_range(0..12);
let bytes = (0..len).map(|_| rng.random_range(0..128)).collect();
Some(String::from_utf8(bytes).unwrap())
})
}

fn add_benchmark(c: &mut Criterion) {
let arr_a = create_primitive_array_with_seed::<Float32Type>(SIZE, 0.0, 42);
let arr_b = create_primitive_array_with_seed::<Float32Type>(SIZE, 0.0, 43);
@@ -226,6 +237,23 @@ fn add_benchmark(c: &mut Criterion) {
b.iter(|| eq(&string_view_left, &string_view_right).unwrap())
});

let array_gen = make_inlined_string_array(1024 * 1024 * 8, &mut rng);
let string_left = StringArray::from_iter(array_gen);
let string_view_inlined_left = StringViewArray::from_iter(string_left.iter());

let array_gen = make_inlined_string_array(1024 * 1024 * 8, &mut rng);
let string_right = StringArray::from_iter(array_gen);
let string_view_inlined_right = StringViewArray::from_iter(string_right.iter());

// Add fast-path benchmarks for StringViewArray where both sides are inlined views (< 12 bytes)
c.bench_function("eq StringViewArray StringViewArray inlined bytes", |b| {
b.iter(|| eq(&string_view_inlined_left, &string_view_inlined_right).unwrap())
});

c.bench_function("lt StringViewArray StringViewArray inlined bytes", |b| {
b.iter(|| lt(&string_view_inlined_left, &string_view_inlined_right).unwrap())
});

// eq benchmarks for long strings with the same prefix
c.bench_function("eq long same prefix strings StringArray", |b| {
b.iter(|| eq(&left_arr_long_string, &right_arr_long_string).unwrap())