-
Notifications
You must be signed in to change notification settings - Fork 1.6k
cache generation of dictionary keys and null arrays for ScalarValue #16789
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @adriangb
With this code, can we now also remove ZeroBufferGenerators
? Perhaps as a follow on PR?
My only suggestion is to move this code into its own module as scalar/mod.rs
is already quite large
datafusion/common/src/scalar/mod.rs
Outdated
@@ -854,6 +854,140 @@ pub fn get_dict_value<K: ArrowDictionaryKeyType>( | |||
Ok((dict_array.values(), dict_array.key(index))) | |||
} | |||
|
|||
/// Cache for dictionary key arrays to avoid repeated allocations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important to also point out that the cached keys are always zero
SO like "cache for dictionary keys for single valued dictionary arrays" or something
datafusion/common/src/scalar/mod.rs
Outdated
/// when the same size is used frequently. | ||
/// | ||
/// Similar to PartitionColumnProjector's ZeroBufferGenerators, this cache | ||
/// stores key arrays for different dictionary key types. The cache is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// stores key arrays for different dictionary key types. The cache is | |
/// stores key arrays with all `0` values for different dictionary key types. The cache is |
datafusion/common/src/scalar/mod.rs
Outdated
} | ||
} | ||
|
||
/// Cache for null arrays to avoid repeated allocations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Cache for null arrays to avoid repeated allocations | |
/// Cache for null arrays to avoid repeated allocations | |
/// of all zeros |
datafusion/common/src/scalar/mod.rs
Outdated
/// limited to 1 entry per type (the last size used). | ||
#[derive(Debug)] | ||
struct KeyArrayCache<K: ArrowDictionaryKeyType> { | ||
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (size, is_null, key_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (size, is_null, key_array) | |
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (num_rows, is_null, key_array) |
datafusion/common/src/scalar/mod.rs
Outdated
} | ||
|
||
impl<K: ArrowDictionaryKeyType> KeyArrayCache<K> { | ||
fn get_or_create(&mut self, size: usize, is_null: bool) -> PrimitiveArray<K> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit is I would use num_rows
here rather than size as the size might be confused with the key size (eg Int16Type
would have size 16??)
datafusion/common/src/scalar/mod.rs
Outdated
} | ||
|
||
/// Get cached null array for the given size | ||
fn get_cached_null_array(size: usize) -> ArrayRef { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn get_cached_null_array(size: usize) -> ArrayRef { | |
fn get_or_create_cached_null_array(size: usize) -> ArrayRef { |
datafusion/common/src/scalar/mod.rs
Outdated
} | ||
|
||
/// Get cached key array for a specific key type | ||
fn get_cached_key_array<K: ArrowDictionaryKeyType>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn get_cached_key_array<K: ArrowDictionaryKeyType>( | |
fn get_or_create_cached_key_array<K: ArrowDictionaryKeyType>( |
Not yet. That will still be used for projection (selecting a partition column) until we fully implement expression pushdown |
Are we worried about memory overhead with this? One thing I think we could do is set a reasonable limit to the cache size - only write to the cache if |
- Move cache code into its own module (cache.rs) instead of scalar/mod.rs - Rename 'size' parameter to 'num_rows' for clarity - Update function names for better clarity: - get_cached_null_array() → get_or_create_cached_null_array() - get_cached_key_array() → get_or_create_cached_key_array() - Add cache size limit (1024 * 1024 items) to prevent memory leaks - Improve documentation comments - Use HashMap instead of single-entry cache for better performance - Use ArrayData conversion instead of unsafe transmute 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
The HashMap approach was overengineered. The original single-entry cache design is actually optimal for the expected usage pattern where the same num_rows is used repeatedly, providing 100% cache hit rates while using minimal memory (1 entry per type). Changes: - Reverted from HashMap<(usize, bool), Array> back to Option<(usize, bool, Array)> - Maintained the num_rows parameter naming and cache size limit - Kept the improved documentation and module organization 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
I've implemented a max number of items of 1M. |
I for one am not really worried about memory as we are talking about |
I agree. Let's more forward with this then. I'll allow a couple more days for review since we're in no rush. |
This was inspired by what is already being done for partition values:
datafusion/datafusion/datasource/src/file_scan_config.rs
Lines 1243 to 1253 in 62dbebd
I am interested in refactoring away the current handling of partition values, but we'd loose that caching. Which made me think, what if we just added that globally? In particular I it's quite a common operation to create a null array of size
batch_size
, this would help with that.