Skip to content

Improve memory usage for arrow-row -> String/BinaryView when utf8 validation disabled #7917

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

ding-young
Copy link

@ding-young ding-young commented Jul 12, 2025

Which issue does this PR close?

Rationale for this change

As described in above issue, when constructing a StringViewArray from rows, we currently store inline strings twice: once through make_view, and again in the values buffer so that we can validate utf8 in one go. However, this is suboptimal in terms of memory consumption, so ideally, we should avoid placing inline strings into the values buffer when UTF-8 validation is disabled.

What changes are included in this PR?

When UTF-8 validation is disabled, this PR modifies the string/bytes view array construction from rows as follows:

  1. The capacity of the values buffer is set to accommodate only long strings plus 12 bytes for a single inline string placeholder.

  2. All decoded strings are initially appended to the values buffer.

  3. If a string turns out to be an inline string, it is included via make_view, and then the corresponding inline portion is truncated from the values buffer, ensuring the inline string does not appear twice in the resulting array.

Are these changes tested?

  1. copied & modified existing fuzz_test to set disable utf8 validation.
  2. Run bench & add bench case when array consists of both inline string & long strings

Are there any user-facing changes?

No.

Considered alternatives

One idea was to support separate buffers for inline strings even when UTF-8 validation is enabled. However, since we need to call decoded_len() first to determine the target buffer, this approach can be tricky or inefficient:

  • For example, precomputing a boolean flag per string to determine which buffer to use would increase temporary memory usage.

  • Alternatively, appending to the values buffer first and then moving inline strings to a separate buffer would lead to frequent memcpy overhead.

Given that datafusion disables UTF-8 validation when using RowConverter, this PR focuses on improving memory efficiency specifically when validation is turned off.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 12, 2025
@ding-young
Copy link
Author

@alamb @XiangpengHao @2010YOUY01

When running the existing benchmarks (cargo bench --bench row_format "convert_rows 4096 string view” ), I noticed there might be a slight regression, but it seems relatively minor given the normal level of fluctuation.
I’d love to hear your thoughts on this PR — if you think this direction is useful, I’ll run more benchmark experiments and polish it further.

@XiangpengHao
Copy link
Contributor

I took a high level look and it looks good to me. Curious to see the perf diff. Does the benchmark report memory usage yet? @ding-young

@ding-young
Copy link
Author

@XiangpengHao The current benchmark doesn’t report memory usage directly, but I’ve been printing stats manually using jemalloc. It seems like there might be an issue with my implementation, so I’ll double-check that and share the perf once I’ve confirmed.

@ding-young
Copy link
Author

  • cargo bench result
Case (str_len, null prob) main issue-6057
string view(10, 0) 51.23 µs 52.18 µs
string view(30, 0) 45.47 µs 46.63 µs
string view(100, 0) 64.18 µs 68.54 µs
string view(100, 0.5) 70.11 µs 74.06 µs
string view(1..100, 0) 100.72 µs 103.80 µs
string view(1..100, 0.5) 80.48 µs 86.02 µs
  • manual memory profiling result (*unit = B)

I added code to get jemalloc stats (allocate, resident, active) before and after decoding binary view, and the memory usage actually improved especially when short strings are mixed up with large strings. When given rows consists of only large strings, the memory usage was the same.

let before = jemalloc_stat();

let view = if !validate_utf8 {
    decode_binary_view_inner_utf8_unchecked(rows, options)
} else {
    decode_binary_view_inner(rows, options, validate_utf8)
};

let after = jemalloc_stat();
// print ( after - before ) 

(To reproduce, see https://github.com/ding-young/arrow-rs/tree/issue-6057-bench-mem )

Case main (alloc / active) issue-6057 (alloc / active)
string view(10, 0) 102656 / 114688 65536 / 69632
string view(30, 0) 196608 / 204800 196608 / 204800
string view(100, 0) 524288 / 532480 524288 / 532480
string view(100, 0.5) 294912 / 303104 294912 / 303104
string view(1..100, 0) 294912 / 303104 294912 / 303104
string view(1..100, 0.5) 180224 / 188416 163840 / 172032

@ding-young ding-young marked this pull request as ready for review July 16, 2025 07:44
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ding-young -- the basic idea makes sense to me but this PR contains a lot of duplicated code which makes it hard to understand what is actually changing here

if the goal of this PR is to reduce the size of the values buffer when we are not validating utf8, I think that should also be testable

For example decode the same rows with and without validation and show the buffer without validation is smaller 🤔

@@ -246,10 +246,76 @@ pub fn decode_binary<I: OffsetSizeTrait>(
unsafe { GenericBinaryArray::from(builder.build_unchecked()) }
}

fn decode_binary_view_inner(
fn decode_binary_view_inner_utf8_unchecked(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand -- if it is a BinaryViewArray it can never have utf8 data. Maybe we can rename this function

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decode_string_view also calls decode_binary_view_inner(...), so this function can still be reached when decoding UTF-8 data. I'll think about whether there’s a clearer way to rename it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants