Improve memory usage for `arrow-row -> String/BinaryView` when utf8 validation disabled #7917

ding-young · 2025-07-12T11:54:17Z

Which issue does this PR close?

Related to #6057 (Improve arrow-row --> StringView/BinaryView memory usage).

Rationale for this change

As described in the issue above, when constructing a StringViewArray from rows, we currently store inline strings twice: once in the view itself (through make_view), and again in the values buffer so that UTF-8 can be validated in a single pass. This wastes memory, so ideally we should avoid placing inline strings into the values buffer when UTF-8 validation is disabled.

What changes are included in this PR?

When UTF-8 validation is disabled, this PR modifies the string/bytes view array construction from rows as follows:

The capacity of the values buffer is set to accommodate only long strings plus 12 bytes for a single inline string placeholder.
All decoded strings are initially appended to the values buffer.
If a string turns out to be an inline string, it is included via make_view, and then the corresponding inline portion is truncated from the values buffer, ensuring the inline string does not appear twice in the resulting array.
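
The three steps above can be sketched with plain std types. This is a simplified model, not the code in this PR: `make_view_sketch` and `decode_views_unchecked` are hypothetical names, each "row" is already a decoded byte slice (copying it stands in for decoding), and the view layout is a reduced imitation of what `make_view` produces.

```rust
/// Strings of at most 12 bytes fit inline in a 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical stand-in for `make_view`: packs the length plus either the
/// inline bytes or (4-byte prefix, buffer index, offset) into a u128.
fn make_view_sketch(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= MAX_INLINE_LEN {
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Builds (views, values) without UTF-8 validation, so that only long
/// strings remain in `values`.
fn decode_views_unchecked(rows: &[&[u8]]) -> (Vec<u128>, Vec<u8>) {
    // Step 1: capacity covers the long strings plus one inline placeholder.
    let long_bytes: usize = rows
        .iter()
        .map(|r| r.len())
        .filter(|&len| len > MAX_INLINE_LEN)
        .sum();
    let mut values: Vec<u8> = Vec::with_capacity(long_bytes + MAX_INLINE_LEN);
    let mut views = Vec::with_capacity(rows.len());

    for row in rows {
        // Step 2: every string is first appended to the values buffer.
        let start = values.len();
        values.extend_from_slice(row);
        views.push(make_view_sketch(&values[start..], 0, start as u32));

        // Step 3: an inline string already lives in its view, so drop its bytes.
        if row.len() <= MAX_INLINE_LEN {
            values.truncate(start);
        }
    }
    (views, values)
}
```

Because at most one inline string is ever pending in the buffer, the extra 12 bytes of capacity are enough to avoid reallocation in this model.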

Are these changes tested?

Copied and modified the existing fuzz_test so that it also runs with UTF-8 validation disabled.
Ran the benchmarks and added a benchmark case for arrays that contain both inline strings and long strings.

Are there any user-facing changes?

No.

Considered alternatives

One idea was to support separate buffers for inline strings even when UTF-8 validation is enabled. However, since we need to call decoded_len() first to determine the target buffer, this approach can be tricky or inefficient:

For example, precomputing a boolean flag per string to determine which buffer to use would increase temporary memory usage.
Alternatively, appending to the values buffer first and then moving inline strings to a separate buffer would lead to frequent memcpy overhead.

Given that DataFusion disables UTF-8 validation when using RowConverter, this PR focuses on improving memory efficiency specifically when validation is turned off.

ding-young · 2025-07-14T03:36:23Z

@alamb @XiangpengHao @2010YOUY01

When running the existing benchmarks (cargo bench --bench row_format "convert_rows 4096 string view"), I noticed there might be a slight regression, but it seems relatively minor given the normal level of fluctuation.
I'd love to hear your thoughts on this PR. If you think this direction is useful, I'll run more benchmark experiments and polish it further.

XiangpengHao · 2025-07-14T16:53:54Z

I took a high level look and it looks good to me. Curious to see the perf diff. Does the benchmark report memory usage yet? @ding-young

ding-young · 2025-07-15T09:50:24Z

@XiangpengHao The current benchmark doesn’t report memory usage directly, but I’ve been printing stats manually using jemalloc. It seems like there might be an issue with my implementation, so I’ll double-check that and share the perf once I’ve confirmed.

ding-young · 2025-07-16T07:41:16Z

cargo bench result

Case (str_len, null prob)	main	issue-6057
string view(10, 0)	51.23 µs	52.18 µs
string view(30, 0)	45.47 µs	46.63 µs
string view(100, 0)	64.18 µs	68.54 µs
string view(100, 0.5)	70.11 µs	74.06 µs
string view(1..100, 0)	100.72 µs	103.80 µs
string view(1..100, 0.5)	80.48 µs	86.02 µs

manual memory profiling result (unit: bytes)

I added code to collect jemalloc stats (allocated, resident, active) before and after decoding the binary view. Memory usage improved, especially when short strings are mixed with large strings. When the given rows consist of only large strings, memory usage was the same.

let before = jemalloc_stat();

let view = if !validate_utf8 {
    decode_binary_view_inner_utf8_unchecked(rows, options)
} else {
    decode_binary_view_inner(rows, options, validate_utf8)
};

let after = jemalloc_stat();
// print ( after - before )

(To reproduce, see https://github.com/ding-young/arrow-rs/tree/issue-6057-bench-mem )
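
For readers without a jemalloc setup, a rough before/after measurement can be approximated with a counting wrapper around the system allocator. This is a swapped-in technique, not what the branch above does: `CountingAlloc` and `measure_live_bytes` are hypothetical names, and the counter tracks requested bytes only, so it will not match jemalloc's allocated/active/resident figures.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently requested from the allocator (requested sizes, not resident memory).
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper that forwards to the system allocator and counts bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns its result together with the change in live bytes.
/// The result is returned (not dropped) so its allocations stay counted.
fn measure_live_bytes<T>(f: impl FnOnce() -> T) -> (T, isize) {
    let before = ALLOCATED.load(Ordering::Relaxed) as isize;
    let out = f();
    let after = ALLOCATED.load(Ordering::Relaxed) as isize;
    (out, after - before)
}
```

Wrapping the decode call in `measure_live_bytes` would give a delta comparable in spirit to the `after - before` printout above, though allocations made on other threads during the call are counted too.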

Case	main (alloc / active)	issue-6057 (alloc / active)
string view(10, 0)	102656 / 114688	65536 / 69632
string view(30, 0)	196608 / 204800	196608 / 204800
string view(100, 0)	524288 / 532480	524288 / 532480
string view(100, 0.5)	294912 / 303104	294912 / 303104
string view(1..100, 0)	294912 / 303104	294912 / 303104
string view(1..100, 0.5)	180224 / 188416	163840 / 172032

alamb

Thanks @ding-young -- the basic idea makes sense to me, but this PR contains a lot of duplicated code, which makes it hard to understand what is actually changing here.

If the goal of this PR is to reduce the size of the values buffer when we are not validating utf8, I think that should also be testable.

For example, decode the same rows with and without validation and show that the values buffer is smaller without validation 🤔
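
The property such a test would check can be modelled in a few lines. This is a sketch of the expected buffer sizes only, assuming the 12-byte inline threshold described in the PR; `values_buffer_len` is a hypothetical helper, not an arrow-row function, and a real test would decode rows through the row converter and inspect the array's data buffers.

```rust
/// Strings of at most 12 bytes are stored inline in the view.
const MAX_INLINE_LEN: usize = 12;

/// Models the final length of the values buffer for a batch of strings.
/// With validation, every string is kept in the buffer so it can be checked
/// in one pass; without validation, only long strings are kept.
fn values_buffer_len(rows: &[&[u8]], validate_utf8: bool) -> usize {
    rows.iter()
        .map(|r| r.len())
        .filter(|&len| validate_utf8 || len > MAX_INLINE_LEN)
        .sum()
}

/// Returns true when skipping validation yields a strictly smaller buffer,
/// which should hold whenever at least one non-empty inline string is present.
fn unvalidated_buffer_is_smaller(rows: &[&[u8]]) -> bool {
    values_buffer_len(rows, false) < values_buffer_len(rows, true)
}
```

For a batch of only long strings the two lengths are equal in this model, which is consistent with the unchanged memory figures reported for the long-string cases.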

alamb · 2025-07-22T21:48:19Z

arrow-row/src/variable.rs

@@ -246,10 +246,76 @@ pub fn decode_binary<I: OffsetSizeTrait>(
    unsafe { GenericBinaryArray::from(builder.build_unchecked()) }
 }

-fn decode_binary_view_inner(
+fn decode_binary_view_inner_utf8_unchecked(


I don't understand -- if it is a BinaryViewArray it can never have utf8 data. Maybe we can rename this function.

ding-young · 2025-07-23

decode_string_view also calls decode_binary_view_inner(...), so this function can still be reached when decoding UTF-8 data. I'll think about whether there's a clearer way to rename it.

github-actions bot added the arrow Changes to the arrow crate label Jul 12, 2025

ding-young force-pushed the issue-6057 branch from fe7c44d to ab8cb56 Compare July 16, 2025 07:43

ding-young marked this pull request as ready for review July 16, 2025 07:44

alamb reviewed Jul 22, 2025

ding-young added 6 commits July 23, 2025 08:18

avoid buffering all inline strings when decode_binary_view_inner_unch…

1d4b158

…ecked

add bench for case when short & long string both exists

2360d9e

fix wrong len value

83e74ee

chore: remove unnecessary test

7e2d493

refactor into single function

6b9e69a

add test case to check length of values buffer

0ff6084

ding-young force-pushed the issue-6057 branch from ca3225d to 0ff6084 Compare July 23, 2025 08:21
