-
Notifications
You must be signed in to change notification settings - Fork 976
Improve memory usage for arrow-row -> String/BinaryView
when utf8 validation disabled
#7917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@alamb @XiangpengHao @2010YOUY01 When running the existing benchmarks ( |
I took a high level look and it looks good to me. Curious to see the perf diff. Does the benchmark report memory usage yet? @ding-young |
@XiangpengHao The current benchmark doesn’t report memory usage directly, but I’ve been printing stats manually using jemalloc. It seems like there might be an issue with my implementation, so I’ll double-check that and share the perf once I’ve confirmed. |
I added code to get jemalloc stats (allocate, resident, active) before and after decoding binary view, and the memory usage actually improved especially when short strings are mixed up with large strings. When given rows consists of only large strings, the memory usage was the same. let before = jemalloc_stat();
let view = if !validate_utf8 {
decode_binary_view_inner_utf8_unchecked(rows, options)
} else {
decode_binary_view_inner(rows, options, validate_utf8)
};
let after = jemalloc_stat();
// print ( after - before ) (To reproduce, see https://github.com/ding-young/arrow-rs/tree/issue-6057-bench-mem )
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ding-young -- the basic idea makes sense to me but this PR contains a lot of duplicated code which makes it hard to understand what is actually changing here
if the goal of this PR is to reduce the size of the values buffer when we are not validating utf8, I think that should also be testable
For example decode the same rows with and without validation and show the buffer without validation is smaller 🤔
arrow-row/src/variable.rs
Outdated
@@ -246,10 +246,76 @@ pub fn decode_binary<I: OffsetSizeTrait>( | |||
unsafe { GenericBinaryArray::from(builder.build_unchecked()) } | |||
} | |||
|
|||
fn decode_binary_view_inner( | |||
fn decode_binary_view_inner_utf8_unchecked( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand -- if it is a BinaryViewArray it can never have utf8 data. Maybe we can rename this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decode_string_view
also calls decode_binary_view_inner(...)
, so this function can still be reached when decoding UTF-8 data. I'll think about whether there’s a clearer way to rename it.
Which issue does this PR close?
Rationale for this change
As described in above issue, when constructing a
StringViewArray
from rows, we currently store inline strings twice: once throughmake_view
, and again in thevalues buffer
so that we can validate utf8 in one go. However, this is suboptimal in terms of memory consumption, so ideally, we should avoid placing inline strings into the values buffer when UTF-8 validation is disabled.What changes are included in this PR?
When UTF-8 validation is disabled, this PR modifies the string/bytes view array construction from rows as follows:
The capacity of the values buffer is set to accommodate only long strings plus 12 bytes for a single inline string placeholder.
All decoded strings are initially appended to the values buffer.
If a string turns out to be an inline string, it is included via
make_view
, and then the corresponding inline portion is truncated from the values buffer, ensuring the inline string does not appear twice in the resulting array.Are these changes tested?
fuzz_test
to set disable utf8 validation.Are there any user-facing changes?
No.
Considered alternatives
One idea was to support separate buffers for inline strings even when UTF-8 validation is enabled. However, since we need to call
decoded_len()
first to determine the target buffer, this approach can be tricky or inefficient:For example, precomputing a boolean flag per string to determine which buffer to use would increase temporary memory usage.
Alternatively, appending to the values buffer first and then moving inline strings to a separate buffer would lead to frequent memcpy overhead.
Given that datafusion disables UTF-8 validation when using RowConverter, this PR focuses on improving memory efficiency specifically when validation is turned off.