perf: speed up StringViewArray gc 1.4 ~5.x faster #7873
Conversation
Testing result: critcmp fast_gc main --filter "gc"

group                        fast_gc                        main
-----                        -------                        ----
gc view types all            1.00  346.3±6.41µs ? ?/sec     1.24  430.5±7.80µs ? ?/sec
gc view types slice half     1.00  162.0±5.75µs ? ?/sec     1.24  201.4±7.30µs ? ?/sec
Converting it to a draft; the benchmark result seems unstable, I need to verify and experiment more.
@@ -473,10 +473,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// Note: this function does not attempt to canonicalize / deduplicate values. For this
     /// feature see [`GenericByteViewBuilder::with_deduplicate_strings`].
     pub fn gc(&self) -> Self {
-        let mut builder = GenericByteViewBuilder::<T>::with_capacity(self.len());
+        let len = self.len();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(len);
I think avoiding using the builder could make it faster
Thank you @Dandandan for the good suggestion, I will try it soon!
It seems there is no improvement when I change it to remove the builder:
diff --git a/arrow-array/src/array/byte_view_array.rs b/arrow-array/src/array/byte_view_array.rs
index b749459f9f..8605eb8108 100644
--- a/arrow-array/src/array/byte_view_array.rs
+++ b/arrow-array/src/array/byte_view_array.rs
@@ -21,7 +21,7 @@ use crate::iterator::ArrayIter;
use crate::types::bytes::ByteArrayNativeType;
use crate::types::{BinaryViewType, ByteViewType, StringViewType};
use crate::{Array, ArrayAccessor, ArrayRef, GenericByteArray, OffsetSizeTrait, Scalar};
-use arrow_buffer::{ArrowNativeType, Buffer, NullBuffer, ScalarBuffer};
+use arrow_buffer::{ArrowNativeType, Buffer, NullBuffer, NullBufferBuilder, ScalarBuffer};
use arrow_data::{ArrayData, ArrayDataBuilder, ByteView, MAX_INLINE_VIEW_LEN};
use arrow_schema::{ArrowError, DataType};
use core::str;
@@ -474,27 +474,65 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
/// feature see [`GenericByteViewBuilder::with_deduplicate_strings`].
pub fn gc(&self) -> Self {
let len = self.len();
- let mut builder = GenericByteViewBuilder::<T>::with_capacity(len);
let views = self.views();
+ let mut total_large_bytes = 0;
+ for i in 0..len {
+ if !self.is_null(i) {
+ let length = views[i] as u32;
+ if length > MAX_INLINE_VIEW_LEN {
+ total_large_bytes += length as usize;
+ }
+ }
+ }
+
+ let mut data_buf = Vec::with_capacity(total_large_bytes);
+ let mut views_buf = Vec::with_capacity(len);
+ let mut null_builder = NullBufferBuilder::new(len);
+
for i in 0..len {
if self.is_null(i) {
- builder.append_null();
+ // null
+ views_buf.push(0);
+ null_builder.append_null();
continue;
}
let native: &T::Native = unsafe { self.value_unchecked(i) };
- let bytes: &[u8] = native.as_ref();
+ let v: &[u8] = native.as_ref();
+ let length = v.len() as u32;
- let length = views[i] as u32;
if length <= MAX_INLINE_VIEW_LEN {
- builder.append_inlined(bytes, length);
+ let mut view_bytes = [0u8; 16];
+ view_bytes[0..4].copy_from_slice(&length.to_le_bytes());
+ view_bytes[4..4 + v.len()].copy_from_slice(v);
+ views_buf.push(u128::from_le_bytes(view_bytes));
} else {
- builder.append_bytes(bytes, length);
+ let offset = data_buf.len() as u32;
+ data_buf.extend_from_slice(v);
+
+ let prefix = u32::from_le_bytes(v[0..4].try_into().unwrap());
+ let bv = ByteView {
+ length,
+ prefix,
+ buffer_index: 0,
+ offset,
+ };
+ views_buf.push(bv.into());
}
+
+ null_builder.append_non_null();
}
- builder.finish()
+ let data_block = Buffer::from_vec(data_buf);
+ let nulls = null_builder.finish();
+ unsafe {
+ GenericByteViewArray::new_unchecked(
+ ScalarBuffer::new(Buffer::from_slice_ref(&views_buf), 0, len),
+ vec![data_block],
+ nulls,
+ )
+ }
}
/// Returns the total number of bytes used by all non inlined views in all
Avoiding the builder can yield a speedup if you:

- clone the nulls instead of rebuilding them using the null builder
- use `.into()` from the `Vec` to create the `ScalarBuffer` (without copy)
- use `Vec::extend` rather than `push`
Oh, and even better is to use `collect` into a `Vec` if possible, rather than using `extend`.
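A minimal sketch of the shape these suggestions point toward (not the exact code that was merged), assuming it sits inside `impl<T: ByteViewType + ?Sized> GenericByteViewArray<T>` and that `process_view` is a hypothetical per-element helper, similar to the method discussed later in this thread, which copies any out-of-line bytes into `data_buf` and returns the rewritten view:

```rust
pub fn gc(&self) -> Self {
    let len = self.len();
    // capacity would come from a pre-scan of out-of-line lengths, elided here
    let mut data_buf: Vec<u8> = Vec::new();

    // 1) clone the existing null bitmap instead of rebuilding it
    let nulls = self.nulls().cloned();

    // 2) collect the compacted views directly into a Vec<u128>
    let views = self.views();
    let views_buf: Vec<u128> = (0..len)
        .map(|i| self.process_view(i, views, &mut data_buf))
        .collect();

    // 3) Vec -> ScalarBuffer / Buffer without an extra copy
    let views: ScalarBuffer<u128> = views_buf.into();
    let buffers = vec![Buffer::from_vec(data_buf)];

    // SAFETY: every view was derived from this array's own views/buffers
    unsafe { Self::new_unchecked(views, buffers, nulls) }
}
```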
Thank you @Dandandan, addressed it in the latest PR. Amazing suggestion! Even better than my original solution!
Now it's 1.3x ~ 1.4x faster!
Thank you @Dandandan for the review and good suggestions; updated with an amazing result, 1.3x ~ 1.4x faster! FYI @alamb

critcmp fast_gc main --filter "gc"
group                        fast_gc                        main
-----                        -------                        ----
gc view types all            1.00  313.2±5.99µs ? ?/sec     1.35  422.0±6.47µs ? ?/sec
gc view types slice half     1.00  142.5±6.21µs ? ?/sec     1.38  197.0±7.04µs ? ?/sec
Continuing to optimize: in the latest PR we no longer use value_unchecked in gc, which was duplicating the length checks, etc. The result is amazing, almost 2x faster:

critcmp fast_gc main --filter "gc"
group                        fast_gc                        main
-----                        -------                        ----
gc view types all            1.00  223.5±4.05µs ? ?/sec     1.89  422.0±6.47µs ? ?/sec
gc view types slice half     1.00  100.0±4.69µs ? ?/sec     1.97  197.0±7.04µs ? ?/sec
let nulls = self.nulls().cloned(); // reuse & clone existing null bitmap

// 2) Pre-scan to determine how many out-of-line bytes we must store
let total_large: usize = (0..len)
    .filter_map(|i| {
Can use `set_indices` here.
Interesting. I tried it now; using `set_indices` does not improve performance in the benchmark.
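For reference, a hedged sketch of what a `set_indices`-style pre-scan could look like, assuming `NullBuffer::valid_indices()` (an iterator over the non-null positions) and the same `views` slice and `MAX_INLINE_VIEW_LEN` constant as in the snippets above; as noted, it did not move this particular benchmark:

```rust
let total_large: usize = match self.nulls() {
    // only visit non-null slots
    Some(nulls) => nulls
        .valid_indices()
        .map(|i| {
            let length = views[i] as u32;
            if length > MAX_INLINE_VIEW_LEN { length as usize } else { 0 }
        })
        .sum(),
    // no null bitmap: every slot is valid
    None => views
        .iter()
        .map(|v| {
            let length = *v as u32;
            if length > MAX_INLINE_VIEW_LEN { length as usize } else { 0 }
        })
        .sum(),
};
```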
    .filter_map(|i| {
        // skip null entries
        if self.is_null(i) {
This could be avoided for arrays without nulls.
// 2) Pre-scan to determine how many out-of-line bytes we must store
let total_large: usize = (0..len)
    .filter_map(|i| {
Can use `0` for `None` instead of using `filter_map`.
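A minimal sketch of that variant, under the same assumptions as the quoted snippet (`views`, `len`, and `MAX_INLINE_VIEW_LEN` in scope): null and inlined entries simply contribute 0 out-of-line bytes, so a plain `map` + `sum` replaces the `filter_map`:

```rust
let total_large: usize = (0..len)
    .map(|i| {
        let length = views[i] as u32;
        // nulls and inlined values contribute no out-of-line bytes
        if !self.is_null(i) && length > MAX_INLINE_VIEW_LEN {
            length as usize
        } else {
            0
        }
    })
    .sum();
```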
let views_buf: Vec<u128> = (0..len)
    .map(|i| {
        // if null, represent as 0
        if self.is_null(i) {
Same here - could use `set_indices` for nulls, and could optimize the non-null case.
Thank you @Dandandan. I tried using `set_indices` for nulls, but the benchmark did not improve. I checked the benchmark and found that it always has nulls, so we should add a benchmark without nulls first; I will submit a PR for that, thanks!
Looking nice, I think there is a bit more opportunity to squeeze out some extra performance.
Thank you @Dandandan for the review and great suggestions! I will try to address them!
LGTM, one comment about reusing `views`.
Thank you @Dandandan for the review, addressed it now!
Hi @Dandandan @alamb, in the latest PR I did more optimization: I removed all the null calculation and the view processing for nulls, because I think it is not needed for the gc operation, and removed all the if/else for better SIMD. It shows a further improvement of more than 20%. What do you think of this change? The nulls are still correct, because we always keep the null bitmap in the constructed array, and our unit tests already check this. Thanks a lot! The result is amazing now, 1.5x ~ 3x faster!

critcmp fast_gc main --filter "gc"
group                                             fast_gc                         main
-----                                             -------                         ----
gc view types all without nulls[100000]           1.00  380.6±23.41µs ? ?/sec     2.83  1078.9±38.41µs ? ?/sec
gc view types all without nulls[8000]             1.00  29.5±4.76µs ? ?/sec       1.84  54.2±5.48µs ? ?/sec
gc view types all[100000]                         1.00  177.6±13.86µs ? ?/sec     2.48  440.8±11.15µs ? ?/sec
gc view types all[8000]                           1.00  15.2±5.76µs ? ?/sec       2.80  42.5±10.59µs ? ?/sec
gc view types slice half without nulls[100000]    1.00  340.1±14.62µs ? ?/sec     1.44  490.3±21.18µs ? ?/sec
gc view types slice half without nulls[8000]      1.00  15.5±3.92µs ? ?/sec       1.76  27.4±4.11µs ? ?/sec
gc view types slice half[100000]                  1.00  70.9±6.12µs ? ?/sec       2.94  208.2±6.17µs ? ?/sec
gc view types slice half[8000]                    1.00  9.5±5.67µs ? ?/sec        1.71  16.3±0.36µs ? ?/sec
🤖
Thank you @alamb for the benchmark. I also want to see the effect of this change on sort_tpch in DataFusion; I submitted a PR to DataFusion, so maybe we can run the benchmark there as well, thanks!
FWIW the benchmarks failed with the following panic:

cargo bench --features=arrow,async,test_common,experimental --bench view_types
...
Thank you @alamb, my bad: another PR of mine that added a new benchmark broke the array length for the view types slice. Submitted a quick fix now:
🤖
🤖: Benchmark completed
😮 -- very nice
Thank you @zhuqi-lucas -- I think this PR could be merged as is.
I think we could improve the safety of `process_view` / make it less dangerous, but we could also do that as a follow on.
Thanks again -- very nice
// extracting the data from the buffers if necessary.
// It is used by the `gc` function to process each view.
#[inline(always)]
fn process_view(&self, i: usize, views: &[u128], data_buf: &mut Vec<u8>) -> u128 {
I think we should mark this method as unsafe, as it makes the assumption that views always point to a valid view in `self.buffers`. It would be good to document the assumptions too:

- `views` must be valid views that point into `self.buffers`
- the returned view is updated to point at buffer "0" and the bytes are copied to `data_buf`

I think we might be able to make it safer by NOT passing in `views` but instead using `self.views` -- that would make it clearer that the views MUST come from self (and thus refer to a valid `self.buffer`).
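A hedged sketch of the safer shape suggested here (not necessarily the code that was merged): the method reads `self.views()` directly instead of taking a `views` slice, so the views are guaranteed to refer to `self.buffers`, and the remaining invariant is documented on the single `unsafe` block inside:

```rust
#[inline(always)]
fn process_view(&self, i: usize, data_buf: &mut Vec<u8>) -> u128 {
    let raw = self.views()[i];
    let length = raw as u32;
    if length <= MAX_INLINE_VIEW_LEN {
        // inlined values carry their bytes in the view itself
        raw
    } else {
        let view = ByteView::from(raw);
        // SAFETY: `raw` comes from `self.views()`, so `buffer_index`,
        // `offset` and `length` describe a valid range of `self.buffers`.
        let data = unsafe {
            self.buffers[view.buffer_index as usize]
                .get_unchecked(view.offset as usize..view.offset as usize + view.length as usize)
        };
        let offset = data_buf.len() as u32;
        data_buf.extend_from_slice(data);
        // the rewritten view points at the single compacted buffer "0"
        ByteView { buffer_index: 0, offset, ..view }.into()
    }
}
```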
Thank you @alamb for the review and good suggestion, addressed in the latest PR.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Amazing, 1.44x ~ 5.4x faster!
Thank you @alamb for the review, I also addressed this good point in this PR.
Interesting, the Rustdocs check seems broken; not related to this PR.
lets gogogogogogo
Which issue does this PR close?
Improve the StringViewArray gc performance
Rationale for this change
Improve the StringViewArray gc performance
What changes are included in this PR?
Improve the StringViewArray gc performance
Are these changes tested?
Yes
Are there any user-facing changes?
No