perf: speed up StringViewArray gc 1.4 ~5.x faster #7873

zhuqi-lucas · 2025-07-05T12:32:28Z

Which issue does this PR close?

Improve the StringViewArray gc performance

Rationale for this change

Improve the StringViewArray gc performance

For example:
Precompute the length and reserve capacity up front
Split the handling of inlined and non-inlined views into separate functions
Remove the builder and construct the array ourselves

What changes are included in this PR?

Improve the StringViewArray gc performance

For example:
Precompute the length and reserve capacity up front
Split the handling of inlined and non-inlined views into separate functions
Remove the builder and construct the array ourselves

Are these changes tested?

Yes

Are there any user-facing changes?

No

zhuqi-lucas · 2025-07-05T12:36:52Z

Testing result:

critcmp  fast_gc  main --filter "gc"
group                       fast_gc                                main
-----                       -------                                ----
gc view types all           1.00    346.3±6.41µs        ? ?/sec    1.24    430.5±7.80µs        ? ?/sec
gc view types slice half    1.00    162.0±5.75µs        ? ?/sec    1.24    201.4±7.30µs        ? ?/sec

zhuqi-lucas · 2025-07-05T13:07:39Z

Converted to draft: the benchmark results seem unstable, so I need to verify and experiment more.

Dandandan · 2025-07-05T14:42:04Z

arrow-array/src/array/byte_view_array.rs

@@ -473,10 +473,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
    /// Note: this function does not attempt to canonicalize / deduplicate values. For this
    /// feature see  [`GenericByteViewBuilder::with_deduplicate_strings`].
    pub fn gc(&self) -> Self {
-        let mut builder = GenericByteViewBuilder::<T>::with_capacity(self.len());
+        let len = self.len();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(len);


I think avoiding using the builder could make it faster

Thank you @Dandandan for the good suggestion, I will try it soon!

I see no improvement after removing the builder:

diff --git a/arrow-array/src/array/byte_view_array.rs b/arrow-array/src/array/byte_view_array.rs
index b749459f9f..8605eb8108 100644
--- a/arrow-array/src/array/byte_view_array.rs
+++ b/arrow-array/src/array/byte_view_array.rs
@@ -21,7 +21,7 @@ use crate::iterator::ArrayIter;
 use crate::types::bytes::ByteArrayNativeType;
 use crate::types::{BinaryViewType, ByteViewType, StringViewType};
 use crate::{Array, ArrayAccessor, ArrayRef, GenericByteArray, OffsetSizeTrait, Scalar};
-use arrow_buffer::{ArrowNativeType, Buffer, NullBuffer, ScalarBuffer};
+use arrow_buffer::{ArrowNativeType, Buffer, NullBuffer, NullBufferBuilder, ScalarBuffer};
 use arrow_data::{ArrayData, ArrayDataBuilder, ByteView, MAX_INLINE_VIEW_LEN};
 use arrow_schema::{ArrowError, DataType};
 use core::str;
@@ -474,27 +474,65 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
     /// feature see [`GenericByteViewBuilder::with_deduplicate_strings`].
     pub fn gc(&self) -> Self {
         let len = self.len();
-        let mut builder = GenericByteViewBuilder::<T>::with_capacity(len);
         let views = self.views();
 
+        let mut total_large_bytes = 0;
+        for i in 0..len {
+            if !self.is_null(i) {
+                let length = views[i] as u32;
+                if length > MAX_INLINE_VIEW_LEN {
+                    total_large_bytes += length as usize;
+                }
+            }
+        }
+
+        let mut data_buf = Vec::with_capacity(total_large_bytes);
+        let mut views_buf = Vec::with_capacity(len);
+        let mut null_builder = NullBufferBuilder::new(len);
+
         for i in 0..len {
             if self.is_null(i) {
-                builder.append_null();
+                // null
+                views_buf.push(0);
+                null_builder.append_null();
                 continue;
             }
 
             let native: &T::Native = unsafe { self.value_unchecked(i) };
-            let bytes: &[u8] = native.as_ref();
+            let v: &[u8] = native.as_ref();
+            let length = v.len() as u32;
 
-            let length = views[i] as u32;
             if length <= MAX_INLINE_VIEW_LEN {
-                builder.append_inlined(bytes, length);
+                let mut view_bytes = [0u8; 16];
+                view_bytes[0..4].copy_from_slice(&length.to_le_bytes());
+                view_bytes[4..4 + v.len()].copy_from_slice(v);
+                views_buf.push(u128::from_le_bytes(view_bytes));
             } else {
-                builder.append_bytes(bytes, length);
+                let offset = data_buf.len() as u32;
+                data_buf.extend_from_slice(v);
+
+                let prefix = u32::from_le_bytes(v[0..4].try_into().unwrap());
+                let bv = ByteView {
+                    length,
+                    prefix,
+                    buffer_index: 0,
+                    offset,
+                };
+                views_buf.push(bv.into());
             }
+
+            null_builder.append_non_null();
         }
 
-        builder.finish()
+        let data_block = Buffer::from_vec(data_buf);
+        let nulls = null_builder.finish();
+        unsafe {
+            GenericByteViewArray::new_unchecked(
+                ScalarBuffer::new(Buffer::from_slice_ref(&views_buf), 0, len),
+                vec![data_block],
+                nulls,
+            )
+        }
     }
 
     /// Returns the total number of bytes used by all non inlined views in all

Avoiding the builder can yield a speedup if you:

clone the nulls instead of rebuilding them with a null builder

use .into() on the Vec to create the ScalarBuffer (without a copy)

use Vec::extend rather than push

Oh, and even better is to collect into a Vec if possible rather than using extend.
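To illustrate the last two suggestions, here is a minimal std-only sketch that builds inline views (the same byte layout as in the diff above) once with `push` and once with `collect`. The function names are made up for illustration and are not part of arrow-rs; the arrow-specific points (cloning the null buffer, converting the Vec into a ScalarBuffer) need the arrow crates and are not shown.

```rust
/// Encode a short value (at most 12 bytes) as an inline 128-bit view:
/// bytes 0..4 hold the little-endian length, the following bytes hold the data.
fn inline_view(v: &[u8]) -> u128 {
    assert!(v.len() <= 12, "only values of up to 12 bytes can be inlined");
    let mut view_bytes = [0u8; 16];
    view_bytes[0..4].copy_from_slice(&(v.len() as u32).to_le_bytes());
    view_bytes[4..4 + v.len()].copy_from_slice(v);
    u128::from_le_bytes(view_bytes)
}

/// Push-based construction: the Vec grows as elements arrive.
fn views_with_push(values: &[&[u8]]) -> Vec<u128> {
    let mut out = Vec::new();
    for v in values {
        out.push(inline_view(v));
    }
    out
}

/// Collect-based construction: the slice iterator reports its exact length,
/// so the Vec can be allocated once up front.
fn views_with_collect(values: &[&[u8]]) -> Vec<u128> {
    values.iter().map(|v| inline_view(v)).collect()
}
```

Both functions produce the same views; the difference is only in how the output Vec is allocated and filled.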

Thank you @Dandandan, addressed in the latest commit. Amazing suggestion, even better than my original solution!

Now it's 1.3x ~ 1.4x faster!

zhuqi-lucas · 2025-07-07T02:57:51Z

Thank you @Dandandan for the review and good suggestions. Updated with an amazing result, 1.3x ~ 1.4x faster! FYI @alamb

critcmp  fast_gc  main --filter "gc"
group                       fast_gc                                main
-----                       -------                                ----
gc view types all           1.00    313.2±5.99µs        ? ?/sec    1.35    422.0±6.47µs        ? ?/sec
gc view types slice half    1.00    142.5±6.21µs        ? ?/sec    1.38    197.0±7.04µs        ? ?/sec

zhuqi-lucas · 2025-07-07T04:09:14Z

Continuing to optimize: in the latest commit, gc no longer uses value_unchecked, which duplicates work such as checking the length.
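As a rough illustration of reading a value straight from its view, here is a std-only sketch. It assumes the Arrow view layout for values longer than 12 bytes (length, prefix, buffer index and offset, each 32 bits, packed little-endian into a u128). `LongView`, `decode`, `encode` and `long_value` are hypothetical names, and plain `Vec<u8>` buffers stand in for arrow buffers.

```rust
/// Fields of a 128-bit view for a value longer than 12 bytes
/// (a local stand-in that mirrors the Arrow view layout).
#[derive(Debug, PartialEq, Clone, Copy)]
struct LongView {
    length: u32,
    prefix: u32,
    buffer_index: u32,
    offset: u32,
}

/// Split a raw view into its four 32-bit fields.
fn decode(view: u128) -> LongView {
    LongView {
        length: view as u32,
        prefix: (view >> 32) as u32,
        buffer_index: (view >> 64) as u32,
        offset: (view >> 96) as u32,
    }
}

/// Pack the four fields back into a raw view.
fn encode(v: LongView) -> u128 {
    (v.length as u128)
        | ((v.prefix as u128) << 32)
        | ((v.buffer_index as u128) << 64)
        | ((v.offset as u128) << 96)
}

/// Look up the bytes of a long value directly in the data buffers,
/// decoding the view exactly once.
fn long_value<'a>(view: u128, buffers: &'a [Vec<u8>]) -> &'a [u8] {
    let v = decode(view);
    let start = v.offset as usize;
    &buffers[v.buffer_index as usize][start..start + v.length as usize]
}
```

The sketch uses checked slicing, so an invalid view panics instead of causing undefined behavior.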

The result is amazing, almost 2X faster:

critcmp  fast_gc  main --filter "gc"
group                       fast_gc                                main
-----                       -------                                ----
gc view types all           1.00    223.5±4.05µs        ? ?/sec    1.89    422.0±6.47µs        ? ?/sec
gc view types slice half    1.00    100.0±4.69µs        ? ?/sec    1.97    197.0±7.04µs        ? ?/sec

Dandandan · 2025-07-07T07:52:41Z

arrow-array/src/array/byte_view_array.rs

+        let nulls = self.nulls().cloned(); // reuse & clone existing null bitmap
+
+        // 2) Pre-scan to determine how many out‑of‑line bytes we must store
+        let total_large: usize = (0..len)
            .filter_map(|i| {


Can use set_indices here.

Interesting. I tried it just now, and using set_indices does not improve performance in the benchmark.

Dandandan · 2025-07-07T07:52:59Z

arrow-array/src/array/byte_view_array.rs

            .filter_map(|i| {
+                // skip null entries
                if self.is_null(i) {


This could be avoided for arrays without nulls.
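A minimal std-only sketch of such a fast path, assuming the nulls are available as an optional mask: `Option<&[bool]>` is a stand-in for arrow's optional null buffer, and `large_len` and `total_large` are illustrative names only.

```rust
/// Local stand-in for arrow's `MAX_INLINE_VIEW_LEN`.
const MAX_INLINE_LEN: u32 = 12;

/// Bytes a single view contributes to the compacted data buffer:
/// its length if the value is stored out of line, otherwise 0.
fn large_len(view: u128) -> usize {
    let len = view as u32; // the length lives in the low 32 bits
    if len > MAX_INLINE_LEN { len as usize } else { 0 }
}

/// `nulls` is `None` when the array has no nulls; otherwise the mask holds
/// `true` for every null slot (a simplification of a validity bitmap).
fn total_large(views: &[u128], nulls: Option<&[bool]>) -> usize {
    match nulls {
        // Fast path: no per-element null check at all.
        None => views.iter().map(|&v| large_len(v)).sum(),
        // Slow path: null slots contribute nothing.
        Some(mask) => views
            .iter()
            .zip(mask)
            .map(|(&v, &is_null)| if is_null { 0 } else { large_len(v) })
            .sum(),
    }
}
```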

Dandandan · 2025-07-07T07:53:39Z

arrow-array/src/array/byte_view_array.rs

+
+        // 2) Pre-scan to determine how many out‑of‑line bytes we must store
+        let total_large: usize = (0..len)
+            .filter_map(|i| {


Can use 0 for None instead of using filter_map
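The two variants can be compared in a small std-only sketch. It assumes a `&[bool]` null mask in place of the array's null buffer, and the function names are hypothetical.

```rust
/// Local stand-in for arrow's `MAX_INLINE_VIEW_LEN`.
const MAX_INLINE_LEN: u32 = 12;

/// `filter_map` version: null slots and short values are dropped via `None`.
fn total_large_filter_map(views: &[u128], is_null: &[bool]) -> usize {
    (0..views.len())
        .filter_map(|i| {
            if is_null[i] {
                return None;
            }
            let len = views[i] as u32;
            (len > MAX_INLINE_LEN).then_some(len as usize)
        })
        .sum()
}

/// `map` version: every slot contributes a number, 0 when nothing is copied.
fn total_large_map(views: &[u128], is_null: &[bool]) -> usize {
    (0..views.len())
        .map(|i| {
            let len = views[i] as u32;
            if is_null[i] || len <= MAX_INLINE_LEN {
                0
            } else {
                len as usize
            }
        })
        .sum()
}
```

Both return the same total; the `map` form avoids the intermediate `Option`.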

Dandandan · 2025-07-07T07:55:56Z

arrow-array/src/array/byte_view_array.rs

+        let views_buf: Vec<u128> = (0..len)
+            .map(|i| {
+                // if null, represent as 0
+                if self.is_null(i) {


Same here - could use set_indices for nulls, could optimize non-null case.

Thank you @Dandandan. I tried using set_indices for the nulls, but the benchmark showed no improvement. I checked the benchmark and found that it always has nulls, so we should add a benchmark without nulls first. I will submit a PR for that benchmark, thanks!

Submitted a PR for this benchmark:

#7877

Thank you @Dandandan !

Dandandan · 2025-07-07T07:56:55Z

Looking nice, I think there is a bit more opportunity to squeeze out some extra performance.

zhuqi-lucas · 2025-07-07T10:41:43Z

Thank you @Dandandan for review and great suggestions! I will try to address it!

Dandandan

LGTM, one comment about reusing views

zhuqi-lucas · 2025-07-10T01:49:11Z

Thank you @Dandandan for review, addressed it now!

zhuqi-lucas · 2025-07-10T02:17:37Z

Hi @Dandandan @alamb

In the latest commit I made further optimizations: I removed all null calculation and the special processing of views for nulls, because I think it is not needed for the gc operation, and I removed all if/else branches, which is better for SIMD. This gives a further improvement of about 20%. What do you think of this change? The nulls are still correct, because we always keep the null buffer when constructing the array, and our unit tests already check this. Thanks a lot!

The result is amazing now, 1.5x ~ 3x faster!

 critcmp  fast_gc  main --filter "gc"
group                                             fast_gc                                main
-----                                             -------                                ----
gc view types all without nulls[100000]           1.00   380.6±23.41µs        ? ?/sec    2.83  1078.9±38.41µs        ? ?/sec
gc view types all without nulls[8000]             1.00     29.5±4.76µs        ? ?/sec    1.84     54.2±5.48µs        ? ?/sec
gc view types all[100000]                         1.00   177.6±13.86µs        ? ?/sec    2.48   440.8±11.15µs        ? ?/sec
gc view types all[8000]                           1.00     15.2±5.76µs        ? ?/sec    2.80    42.5±10.59µs        ? ?/sec
gc view types slice half without nulls[100000]    1.00   340.1±14.62µs        ? ?/sec    1.44   490.3±21.18µs        ? ?/sec
gc view types slice half without nulls[8000]      1.00     15.5±3.92µs        ? ?/sec    1.76     27.4±4.11µs        ? ?/sec
gc view types slice half[100000]                  1.00     70.9±6.12µs        ? ?/sec    2.94    208.2±6.17µs        ? ?/sec
gc view types slice half[8000]                    1.00      9.5±5.67µs        ? ?/sec    1.71     16.3±0.36µs        ? ?/sec

alamb · 2025-07-10T13:12:51Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_gc (135cbeb) to ff3a2f2 diff
BENCH_NAME=view_types
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench view_types
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_gc
Results will be posted here when complete

zhuqi-lucas · 2025-07-10T14:51:56Z

Thank you @alamb for the benchmark. I also want to see how this change affects sort_tpch in DataFusion, so I submitted a PR there; maybe we can run the benchmark there as well, thanks!

apache/datafusion#16739

alamb · 2025-07-10T15:30:04Z

FWIW the benchmarks failed with the following panic

cargo bench --features=arrow,async,test_common,experimental --bench view_types

...


Benchmarking gc view types slice half without nulls[8000]
Benchmarking gc view types slice half without nulls[8000]: Warming up for 3.0000 s
Benchmarking gc view types slice half without nulls[8000]: Collecting 100 samples in estimated 5.1026 s (187k iterations)
Benchmarking gc view types slice half without nulls[8000]: Analyzing
gc view types slice half without nulls[8000]
                        time:   [27.199 µs 27.268 µs 27.342 µs]

Benchmarking view types slice
Benchmarking view types slice: Warming up for 3.0000 s

thread 'main' panicked at arrow-buffer/src/buffer/immutable.rs:297:9:
the offset of the new Buffer cannot exceed the existing length: slice offset=0 length=800000 selflen=128000
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: bench failed, to rerun pass `-p arrow-array --bench view_types`

zhuqi-lucas · 2025-07-10T15:43:23Z

Thank you @alamb, my bad: another PR of mine, which added a new benchmark, broke the array length for the view types slice benchmark. I submitted a quick fix:

#7892
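The numbers in the panic message can be read as view counts, assuming the buffer in question holds 16-byte views (an assumption, since the message itself only reports bytes). Under that reading, 128000 bytes is 8000 views and the requested 800000 bytes is 50000 views, which is half of the 100000-element benchmark size. A std-only sketch of the arithmetic and of a bounds check in the same spirit follows; `VIEW_SIZE`, `views_in` and `slice_fits` are illustrative names.

```rust
/// Size of one 128-bit view in bytes.
const VIEW_SIZE: usize = 16;

/// Number of whole views that fit in `bytes`.
fn views_in(bytes: usize) -> usize {
    bytes / VIEW_SIZE
}

/// A slice is only valid if it ends inside the buffer.
fn slice_fits(offset_bytes: usize, len_bytes: usize, buffer_len_bytes: usize) -> bool {
    offset_bytes.saturating_add(len_bytes) <= buffer_len_bytes
}
```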

alamb · 2025-07-10T16:05:20Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing fast_gc (625c421) to 7595417 diff
BENCH_NAME=view_types
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench view_types
BENCH_FILTER=
BENCH_BRANCH_NAME=fast_gc
Results will be posted here when complete

alamb · 2025-07-10T16:05:41Z

#7892 is merged, and I merged main into this branch to include it.

alamb · 2025-07-10T16:08:36Z

🤖: Benchmark completed

Details

group                                             fast_gc                                main
-----                                             -------                                ----
gc view types all without nulls[100000]           1.00  1562.8±42.69µs        ? ?/sec    4.50      7.0±0.12ms        ? ?/sec
gc view types all without nulls[8000]             1.00     65.7±4.00µs        ? ?/sec    1.44     94.6±0.78µs        ? ?/sec
gc view types all[100000]                         1.00    309.8±6.83µs        ? ?/sec    2.67    828.4±2.50µs        ? ?/sec
gc view types all[8000]                           1.00     23.8±0.05µs        ? ?/sec    2.72     64.8±0.10µs        ? ?/sec
gc view types slice half without nulls[100000]    1.00   525.1±12.12µs        ? ?/sec    5.39      2.8±0.08ms        ? ?/sec
gc view types slice half without nulls[8000]      1.00     27.0±0.16µs        ? ?/sec    1.70     45.8±0.47µs        ? ?/sec
gc view types slice half[100000]                  1.00    151.9±2.16µs        ? ?/sec    2.70    409.7±3.16µs        ? ?/sec
gc view types slice half[8000]                    1.00     12.0±0.02µs        ? ?/sec    2.72     32.7±0.06µs        ? ?/sec
view types slice                                  1.00    705.8±1.54ns        ? ?/sec    1.00    705.2±1.50ns        ? ?/sec

alamb · 2025-07-10T16:10:33Z

🤖: Benchmark completed

😮 -- very nice

alamb

Thank you @zhuqi-lucas -- I think this PR could be merged as is

I think we could improve the safety of process_views / make it less dangerous, but we could also do it as a follow on

Thanks again -- very nice

alamb · 2025-07-10T16:30:13Z

arrow-array/src/array/byte_view_array.rs

+    // extracting the data from the buffers if necessary.
+    // It used by `gc` function to process each view.
+    #[inline(always)]
+    fn process_view(&self, i: usize, views: &[u128], data_buf: &mut Vec<u8>) -> u128 {


I think we should mark this method as unsafe, as it assumes that views always point to a valid view in self.buffers. It would be good to document the assumptions too:

views must be valid views that point to self.buffers

the returned view is updated to point at buffer "0" and the bytes are copied to data_buf

I think we might be able to make it safer by NOT passing in views but instead using self.views -- that would make it clearer that the views MUST come from self (and thus refer to a valid self.buffer)
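A std-only sketch of that shape, under stated assumptions: `ViewArray` and `copy_view` are hypothetical stand-ins (not the arrow-rs types or the method in this PR), plain `Vec<u8>` buffers replace arrow buffers, and checked slicing is used so the sketch needs no unsafe code.

```rust
/// Local stand-in for arrow's `MAX_INLINE_VIEW_LEN`.
const MAX_INLINE_LEN: u32 = 12;

/// Minimal stand-in for a view array: 128-bit views plus the data buffers
/// they point into.
struct ViewArray {
    views: Vec<u128>,
    buffers: Vec<Vec<u8>>,
}

impl ViewArray {
    /// Copy the value behind view `i` into `data_buf` (unless it is inlined)
    /// and return a view that points at buffer 0 at the new offset.
    ///
    /// Reading from `self.views` rather than taking a `views` argument keeps
    /// the invariant local: each view is assumed to reference `self.buffers`.
    fn copy_view(&self, i: usize, data_buf: &mut Vec<u8>) -> u128 {
        let view = self.views[i];
        let len = view as u32;
        if len <= MAX_INLINE_LEN {
            return view; // inlined: the bytes travel with the view
        }
        let buffer_index = (view >> 64) as u32 as usize;
        let offset = (view >> 96) as u32 as usize;
        let bytes = &self.buffers[buffer_index][offset..offset + len as usize];

        // The sketch assumes the new offset fits in 32 bits.
        let new_offset = u32::try_from(data_buf.len()).expect("offset fits in u32");
        data_buf.extend_from_slice(bytes);
        // Keep length and prefix (low 64 bits); buffer_index becomes 0.
        (view & u128::from(u64::MAX)) | (u128::from(new_offset) << 96)
    }

    /// Compact all out-of-line values into a single data buffer.
    fn gc(&self) -> ViewArray {
        let mut data_buf = Vec::new();
        let views: Vec<u128> = (0..self.views.len())
            .map(|i| self.copy_view(i, &mut data_buf))
            .collect();
        ViewArray { views, buffers: vec![data_buf] }
    }
}
```

Because the method indexes `self.views` and `self.buffers` itself, a caller cannot hand it views that belong to a different array.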

Thank you @alamb for the review and good suggestion, addressed in the latest commit.

arrow-array/src/array/byte_view_array.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

zhuqi-lucas · 2025-07-11T03:46:07Z

Amazing, 1.44x ~ 5.4x faster!

zhuqi-lucas · 2025-07-11T03:46:57Z

Thank you @alamb for the review. I also addressed the point about the safety of process_views in this PR.

zhuqi-lucas · 2025-07-11T05:54:36Z

Interestingly, the Rustdocs check seems broken; it is not related to this PR.

alamb · 2025-07-11T19:13:26Z

@viirya fixed it in #7898

I merged main into this PR to pick up the fix and hopefully get a clean CI run

alamb · 2025-07-11T19:45:24Z

lets gogogogogogo

zhuqi-lucas added 4 commits July 4, 2025 23:59

Perf: implement fast gc for string view

e0728ed

Merge remote-tracking branch 'upstream/main' into fast_gc

02f2870

polish

5815519

format

f5488cc

github-actions bot added the arrow Changes to the arrow crate label Jul 5, 2025

zhuqi-lucas changed the title ~~Fast gc~~ perf: speed up StringViewArray gc 1.2x faster Jul 5, 2025

polish code

4c3c7ee

zhuqi-lucas mentioned this pull request Jul 5, 2025

[EPIC] A collection of improvement for the performance for sort and compare and gc, etc #7802

Open

12 tasks

zhuqi-lucas marked this pull request as draft July 5, 2025 13:07

Dandandan reviewed Jul 5, 2025

View reviewed changes

Address comments

1601ab6

zhuqi-lucas marked this pull request as ready for review July 7, 2025 02:49

zhuqi-lucas changed the title ~~perf: speed up StringViewArray gc 1.2x faster~~ perf: speed up StringViewArray gc 1.3x faster Jul 7, 2025

zhuqi-lucas added 2 commits July 7, 2025 10:53

remove unused code

a809c98

Merge remote-tracking branch 'upstream/main' into fast_gc

2bb5b93

don't use value_unchecked which is duplicating check len, etc

5b5a05c

zhuqi-lucas added 2 commits July 7, 2025 12:44

polish code

3cb9431

fix comments

de6a199

zhuqi-lucas changed the title ~~perf: speed up StringViewArray gc 1.3x faster~~ perf: speed up StringViewArray gc 1.8x faster Jul 7, 2025

Dandandan reviewed Jul 7, 2025

View reviewed changes

Merge remote-tracking branch 'upstream/main' into fast_gc

55fb826

Dandandan approved these changes Jul 9, 2025

View reviewed changes

Address comments

219aabf

Don't need null caculation

c16d236

zhuqi-lucas added 2 commits July 10, 2025 11:36

Merge remote-tracking branch 'upstream/main' into fast_gc

6e6387d

fast path for no data buffer

135cbeb

Merge remote-tracking branch 'apache/main' into fast_gc

625c421

alamb approved these changes Jul 10, 2025

View reviewed changes

zhuqi-lucas and others added 3 commits July 11, 2025 11:29

Update arrow-array/src/array/byte_view_array.rs

2097747

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Update arrow-array/src/array/byte_view_array.rs

c1a1065

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

address comments

797b63e

zhuqi-lucas changed the title ~~perf: speed up StringViewArray gc 1.3x ~2.x faster~~ perf: speed up StringViewArray gc 1.4 ~5.x faster Jul 11, 2025

zhuqi-lucas added 2 commits July 11, 2025 12:02

fmt

5e30749

Merge remote-tracking branch 'upstream/main' into fast_gc

0110ef9

Merge remote-tracking branch 'apache/main' into fast_gc

214f844

alamb merged commit 7b219f9 into apache:main Jul 11, 2025
29 of 30 checks passed
