Skip to content

perf: speed up StringViewArray gc 1.4 ~5.x faster #7873

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
Jul 11, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Jul 5, 2025

Which issue does this PR close?

Improve the StringViewArray gc performance

Rationale for this change

Improve the StringViewArray gc performance

  1. Such as precompute the len and reserve
  2. Split function for inlined and not inlined
  3. Remove builder and construct ourself

What changes are included in this PR?

Improve the StringViewArray gc performance

  1. Such as precompute the len and reserve
  2. Split function for inlined and not inlined
  3. Remove builder and construct ourself

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 5, 2025
@zhuqi-lucas zhuqi-lucas changed the title Fast gc perf: speed up StringViewArray gc 1.2x faster Jul 5, 2025
@zhuqi-lucas
Copy link
Contributor Author

Testing result:

critcmp  fast_gc  main --filter "gc"
group                       fast_gc                                main
-----                       -------                                ----
gc view types all           1.00    346.3±6.41µs        ? ?/sec    1.24    430.5±7.80µs        ? ?/sec
gc view types slice half    1.00    162.0±5.75µs        ? ?/sec    1.24    201.4±7.30µs        ? ?/sec

@zhuqi-lucas
Copy link
Contributor Author

Convert it as draft, the benchmark result seems unstable, i need to verify and experimenting more.

@@ -473,10 +473,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
/// Note: this function does not attempt to canonicalize / deduplicate values. For this
/// feature see [`GenericByteViewBuilder::with_deduplicate_strings`].
pub fn gc(&self) -> Self {
let mut builder = GenericByteViewBuilder::<T>::with_capacity(self.len());
let len = self.len();
let mut builder = GenericByteViewBuilder::<T>::with_capacity(len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think avoiding using the builder could make it faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan for good suggestion, i will try it soon!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems no improvement when i changing to remove builder:

diff --git a/arrow-array/src/array/byte_view_array.rs b/arrow-array/src/array/byte_view_array.rs
index b749459f9f..8605eb8108 100644
--- a/arrow-array/src/array/byte_view_array.rs
+++ b/arrow-array/src/array/byte_view_array.rs
@@ -21,7 +21,7 @@ use crate::iterator::ArrayIter;
 use crate::types::bytes::ByteArrayNativeType;
 use crate::types::{BinaryViewType, ByteViewType, StringViewType};
 use crate::{Array, ArrayAccessor, ArrayRef, GenericByteArray, OffsetSizeTrait, Scalar};
-use arrow_buffer::{ArrowNativeType, Buffer, NullBuffer, ScalarBuffer};
+use arrow_buffer::{ArrowNativeType, Buffer, NullBuffer, NullBufferBuilder, ScalarBuffer};
 use arrow_data::{ArrayData, ArrayDataBuilder, ByteView, MAX_INLINE_VIEW_LEN};
 use arrow_schema::{ArrowError, DataType};
 use core::str;
@@ -474,27 +474,65 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// feature see  [`GenericByteViewBuilder::with_deduplicate_strings`].
     pub fn gc(&self) -> Self {
         let len = self.len();
-        let mut builder = GenericByteViewBuilder::<T>::with_capacity(len);
         let views = self.views();
 
+        let mut total_large_bytes = 0;
+        for i in 0..len {
+            if !self.is_null(i) {
+                let length = views[i] as u32;
+                if length > MAX_INLINE_VIEW_LEN {
+                    total_large_bytes += length as usize;
+                }
+            }
+        }
+
+        let mut data_buf = Vec::with_capacity(total_large_bytes);
+        let mut views_buf = Vec::with_capacity(len);
+        let mut null_builder = NullBufferBuilder::new(len);
+
         for i in 0..len {
             if self.is_null(i) {
-                builder.append_null();
+                // null
+                views_buf.push(0);
+                null_builder.append_null();
                 continue;
             }
 
             let native: &T::Native = unsafe { self.value_unchecked(i) };
-            let bytes: &[u8] = native.as_ref();
+            let v: &[u8] = native.as_ref();
+            let length = v.len() as u32;
 
-            let length = views[i] as u32;
             if length <= MAX_INLINE_VIEW_LEN {
-                builder.append_inlined(bytes, length);
+                let mut view_bytes = [0u8; 16];
+                view_bytes[0..4].copy_from_slice(&length.to_le_bytes());
+                view_bytes[4..4 + v.len()].copy_from_slice(v);
+                views_buf.push(u128::from_le_bytes(view_bytes));
             } else {
-                builder.append_bytes(bytes, length);
+                let offset = data_buf.len() as u32;
+                data_buf.extend_from_slice(v);
+
+                let prefix = u32::from_le_bytes(v[0..4].try_into().unwrap());
+                let bv = ByteView {
+                    length,
+                    prefix,
+                    buffer_index: 0,
+                    offset,
+                };
+                views_buf.push(bv.into());
             }
+
+            null_builder.append_non_null();
         }
 
-        builder.finish()
+        let data_block = Buffer::from_vec(data_buf);
+        let nulls = null_builder.finish();
+        unsafe {
+            GenericByteViewArray::new_unchecked(
+                ScalarBuffer::new(Buffer::from_slice_ref(&views_buf), 0, len),
+                vec![data_block],
+                nulls,
+            )
+        }
     }
 
     /// Returns the total number of bytes used by all non inlined views in all

Copy link
Contributor

@Dandandan Dandandan Jul 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoiding the builder can yield speedup if you

  • clone the nulls instead of rebuilding using null builder
  • Use .into() from the Vec to create ScalarBuffer (without copy)
  • use Vec::extend rather than push

Copy link
Contributor

@Dandandan Dandandan Jul 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and even better is to use collect into Vec if possible rather than using extend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you addressed it in latest PR @Dandandan , amazing suggestion! Even better than my original solution!

Now it's 1.3x ~ 1.4 faster!

@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review July 7, 2025 02:49
@zhuqi-lucas zhuqi-lucas changed the title perf: speed up StringViewArray gc 1.2x faster perf: speed up StringViewArray gc 1.3x faster Jul 7, 2025
@zhuqi-lucas
Copy link
Contributor Author

Thank you @Dandandan for review and good suggestion, updated amazing result, 1.3x ~1.4 faster! FYI @alamb

critcmp  fast_gc  main --filter "gc"
group                       fast_gc                                main
-----                       -------                                ----
gc view types all           1.00    313.2±5.99µs        ? ?/sec    1.35    422.0±6.47µs        ? ?/sec
gc view types slice half    1.00    142.5±6.21µs        ? ?/sec    1.38    197.0±7.04µs        ? ?/sec

@zhuqi-lucas
Copy link
Contributor Author

Continue optimizing, in latest PR, we don't use value_unchecked in gc which is duplicating check len, etc.

The result is amazing, almost 2X faster:

critcmp  fast_gc  main --filter "gc"
group                       fast_gc                                main
-----                       -------                                ----
gc view types all           1.00    223.5±4.05µs        ? ?/sec    1.89    422.0±6.47µs        ? ?/sec
gc view types slice half    1.00    100.0±4.69µs        ? ?/sec    1.97    197.0±7.04µs        ? ?/sec

@zhuqi-lucas zhuqi-lucas changed the title perf: speed up StringViewArray gc 1.3x faster perf: speed up StringViewArray gc 1.8x faster Jul 7, 2025
let nulls = self.nulls().cloned(); // reuse & clone existing null bitmap

// 2) Pre-scan to determine how many out‑of‑line bytes we must store
let total_large: usize = (0..len)
.filter_map(|i| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use set_indices here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, i tried now, using set_indices will not improve performance for benchmark.

.filter_map(|i| {
// skip null entries
if self.is_null(i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be avoided for arrays without nulls.


// 2) Pre-scan to determine how many out‑of‑line bytes we must store
let total_large: usize = (0..len)
.filter_map(|i| {
Copy link
Contributor

@Dandandan Dandandan Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use 0 for None instead of using filter_map

let views_buf: Vec<u128> = (0..len)
.map(|i| {
// if null, represent as 0
if self.is_null(i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - could use set_indices for nulls, could optimize non-null case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan, i try to use set_indices for nulls, but the benchmark not got improvement, i checked the benchmark, i found that our benchmark will always has nulls, so we should add not null benchmark first, i will submit a PR for not null benchmark, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitted a PR for this benchmark:

#7877

Thank you @Dandandan !

@Dandandan
Copy link
Contributor

Looking nice, I think there is a bit more opportunity to squeeze out some extra performance.

@zhuqi-lucas
Copy link
Contributor Author

Looking nice, I think there is a bit more opportunity to squeeze out some extra performance.

Thank you @Dandandan for review and great suggestions! I will try to address it!

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one comment about reusing views

@zhuqi-lucas
Copy link
Contributor Author

LGTM, one comment about reusing views

Thank you @Dandandan for review, addressed it now!

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Jul 10, 2025

Hi @Dandandan @alamb

In latest PR, i do more optimization, i remove all null caculation and process view for null because i think it's not needed for gc operation, and remove all if / else better SIMD, it shows improvement continue for more 20% faster, what do you think for this change? And the null is correct, because we always keep it in the construct, also our unit testing is checking it already. Thanks a lot!

The result is amazing now, 1.5 ~ 3 faster!

 critcmp  fast_gc  main --filter "gc"
group                                             fast_gc                                main
-----                                             -------                                ----
gc view types all without nulls[100000]           1.00   380.6±23.41µs        ? ?/sec    2.83  1078.9±38.41µs        ? ?/sec
gc view types all without nulls[8000]             1.00     29.5±4.76µs        ? ?/sec    1.84     54.2±5.48µs        ? ?/sec
gc view types all[100000]                         1.00   177.6±13.86µs        ? ?/sec    2.48   440.8±11.15µs        ? ?/sec
gc view types all[8000]                           1.00     15.2±5.76µs        ? ?/sec    2.80    42.5±10.59µs        ? ?/sec
gc view types slice half without nulls[100000]    1.00   340.1±14.62µs        ? ?/sec    1.44   490.3±21.18µs        ? ?/sec
gc view types slice half without nulls[8000]      1.00     15.5±3.92µs        ? ?/sec    1.76     27.4±4.11µs        ? ?/sec
gc view types slice half[100000]                  1.00     70.9±6.12µs        ? ?/sec    2.94    208.2±6.17µs        ? ?/sec
gc view types slice half[8000]                    1.00      9.5±5.67µs        ? ?/sec    1.71     16.3±0.36µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_gc (135cbeb) to ff3a2f2 diff
BENCH_NAME=view_types
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench view_types
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_gc
Results will be posted here when complete

@zhuqi-lucas
Copy link
Contributor Author

Thank you @alamb for benchmark, i also try to see this changes to sort_tpch for datafusion, i submitted a PR for datafusion, maybe we can try to run benchmark there also, thanks!

apache/datafusion#16739

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

FWIW the benchmarks failed with the following panic

cargo bench --features=arrow,async,test_common,experimental --bench view_types

...


Benchmarking gc view types slice half without nulls[8000]
Benchmarking gc view types slice half without nulls[8000]: Warming up for 3.0000 s
Benchmarking gc view types slice half without nulls[8000]: Collecting 100 samples in estimated 5.1026 s (187k iterations)
Benchmarking gc view types slice half without nulls[8000]: Analyzing
gc view types slice half without nulls[8000]
                        time:   [27.199 µs 27.268 µs 27.342 µs]

Benchmarking view types slice
Benchmarking view types slice: Warming up for 3.0000 s

thread 'main' panicked at arrow-buffer/src/buffer/immutable.rs:297:9:
the offset of the new Buffer cannot exceed the existing length: slice offset=0 length=800000 selflen=128000
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: bench failed, to rerun pass `-p arrow-array --bench view_types`

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Jul 10, 2025

FWIW the benchmarks failed with the following panic

cargo bench --features=arrow,async,test_common,experimental --bench view_types

...


Benchmarking gc view types slice half without nulls[8000]
Benchmarking gc view types slice half without nulls[8000]: Warming up for 3.0000 s
Benchmarking gc view types slice half without nulls[8000]: Collecting 100 samples in estimated 5.1026 s (187k iterations)
Benchmarking gc view types slice half without nulls[8000]: Analyzing
gc view types slice half without nulls[8000]
                        time:   [27.199 µs 27.268 µs 27.342 µs]

Benchmarking view types slice
Benchmarking view types slice: Warming up for 3.0000 s

thread 'main' panicked at arrow-buffer/src/buffer/immutable.rs:297:9:
the offset of the new Buffer cannot exceed the existing length: slice offset=0 length=800000 selflen=128000
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: bench failed, to rerun pass `-p arrow-array --bench view_types`

Thank you @alamb , my bad for another PR which added new benchmark, it makes the array len broken for view types slice, submitted a quick fix now:

#7892

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_gc (625c421) to 7595417 diff
BENCH_NAME=view_types
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench view_types
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_gc
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

Thank you @alamb , my bad for another PR which added new benchmark, it makes the array len broken for view types slice, submitted a quick fix now:

#7892

merged and I merged up to include it

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

🤖: Benchmark completed

Details

group                                             fast_gc                                main
-----                                             -------                                ----
gc view types all without nulls[100000]           1.00  1562.8±42.69µs        ? ?/sec    4.50      7.0±0.12ms        ? ?/sec
gc view types all without nulls[8000]             1.00     65.7±4.00µs        ? ?/sec    1.44     94.6±0.78µs        ? ?/sec
gc view types all[100000]                         1.00    309.8±6.83µs        ? ?/sec    2.67    828.4±2.50µs        ? ?/sec
gc view types all[8000]                           1.00     23.8±0.05µs        ? ?/sec    2.72     64.8±0.10µs        ? ?/sec
gc view types slice half without nulls[100000]    1.00   525.1±12.12µs        ? ?/sec    5.39      2.8±0.08ms        ? ?/sec
gc view types slice half without nulls[8000]      1.00     27.0±0.16µs        ? ?/sec    1.70     45.8±0.47µs        ? ?/sec
gc view types slice half[100000]                  1.00    151.9±2.16µs        ? ?/sec    2.70    409.7±3.16µs        ? ?/sec
gc view types slice half[8000]                    1.00     12.0±0.02µs        ? ?/sec    2.72     32.7±0.06µs        ? ?/sec
view types slice                                  1.00    705.8±1.54ns        ? ?/sec    1.00    705.2±1.50ns        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jul 10, 2025

🤖: Benchmark completed

😮 -- very nuce

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas -- I think this PR could be merged as is

I think we could improve the safety of process_views / make it less dangerous, but e could also do it as a follow on

Thanks again -- very nice

// extracting the data from the buffers if necessary.
// It used by `gc` function to process each view.
#[inline(always)]
fn process_view(&self, i: usize, views: &[u128], data_buf: &mut Vec<u8>) -> u128 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should mark this method as unsafe as it makes assumptions that views always point to a valid view in self.buffers. It would be good to document assumptions too:

  1. views but be valid views that point to self.buffers
  2. the returned view is updated to point at buffer "0" and the bytes are copued to data_buf

I think we might be able to make it safer by NOT passing in views but instead using self.views -- that would make it clearer that the views MUST come from self (and thus refer to a valid self.buffer)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for review and good suggestion, addressed in latest PR.

zhuqi-lucas and others added 3 commits July 11, 2025 11:29
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@zhuqi-lucas
Copy link
Contributor Author

🤖: Benchmark completed

😮 -- very nuce

Amazing, 1.44 ~ 5.4 faster!

@zhuqi-lucas zhuqi-lucas changed the title perf: speed up StringViewArray gc 1.3x ~2.x faster perf: speed up StringViewArray gc 1.4 ~5.x faster Jul 11, 2025
@zhuqi-lucas
Copy link
Contributor Author

Thank you @zhuqi-lucas -- I think this PR could be merged as is

I think we could improve the safety of process_views / make it less dangerous, but e could also do it as a follow on

Thanks again -- very nice

Thank you @alamb for review, i also addressed this good point in this PR.

@zhuqi-lucas
Copy link
Contributor Author

Interesting the Rustdocs check seems broken, not related to this PR.

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

Interesting the Rustdocs check seems broken, not related to this PR.

@viirya fixed it in #7898

I merged main into this PR to pick up the fix and hopefully get a clean CI run

@alamb
Copy link
Contributor

alamb commented Jul 11, 2025

lets gogogogogogo

@alamb alamb merged commit 7b219f9 into apache:main Jul 11, 2025
29 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants