
cache generation of dictionary keys and null arrays for ScalarValue #16789


Merged: 5 commits into apache:main on Jul 20, 2025

Conversation

@adriangb (Contributor)

This was inspired by what is already being done for partition values:

#[derive(Debug, Default)]
struct ZeroBufferGenerators {
    gen_i8: ZeroBufferGenerator<i8>,
    gen_i16: ZeroBufferGenerator<i16>,
    gen_i32: ZeroBufferGenerator<i32>,
    gen_i64: ZeroBufferGenerator<i64>,
    gen_u8: ZeroBufferGenerator<u8>,
    gen_u16: ZeroBufferGenerator<u16>,
    gen_u32: ZeroBufferGenerator<u32>,
    gen_u64: ZeroBufferGenerator<u64>,
}

I am interested in refactoring away the current handling of partition values, but we'd lose that caching. Which made me think: what if we just added that globally? In particular, it's quite a common operation to create a null array of size batch_size; this would help with that.
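To make that concrete, here is a minimal sketch of the uncached operation with stock arrow-rs (new_null_array, Array::len, and Array::null_count are real arrow APIs; the batch size here is illustrative):

use arrow::array::{new_null_array, Array};
use arrow::datatypes::DataType;

fn main() {
    // Each call allocates fresh buffers, even though the result for a
    // given (data_type, length) pair is always the same all-null array.
    let batch_size = 8192; // e.g. one batch worth of rows
    let nulls = new_null_array(&DataType::Int32, batch_size);
    assert_eq!(nulls.len(), batch_size);
    assert_eq!(nulls.null_count(), batch_size);
}

A process-wide cache can hand back the same Arc'd array for repeated requests of the same type and length instead of reallocating each time.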

@github-actions bot added the common (Related to common crate) label on Jul 15, 2025
@alamb (Contributor) left a comment

Thank you @adriangb

With this code, can we now also remove ZeroBufferGenerators? Perhaps as a follow-on PR?

My only suggestion is to move this code into its own module, as scalar/mod.rs is already quite large.

@@ -854,6 +854,140 @@ pub fn get_dict_value<K: ArrowDictionaryKeyType>(
Ok((dict_array.values(), dict_array.key(index)))
}

/// Cache for dictionary key arrays to avoid repeated allocations

I think it is important to also point out that the cached keys are always zero.

So, something like "cache for dictionary keys for single-valued dictionary arrays".
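For context, a minimal sketch of the kind of array in question: a single-valued dictionary array whose keys are all zero, so every row points at the one entry in the values array (the arrow-rs constructors are real; the specific values are illustrative):

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, DictionaryArray, Int32Array, StringArray};

fn main() {
    // Four rows, one dictionary value: all keys are 0, which is why the
    // cached key arrays only ever contain zeros.
    let keys = Int32Array::from(vec![0; 4]);
    let values: ArrayRef = Arc::new(StringArray::from(vec!["a"]));
    let dict = DictionaryArray::new(keys, values);
    assert_eq!(dict.len(), 4);
}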

/// when the same size is used frequently.
///
/// Similar to PartitionColumnProjector's ZeroBufferGenerators, this cache
/// stores key arrays for different dictionary key types. The cache is

Suggested change
/// stores key arrays for different dictionary key types. The cache is
/// stores key arrays with all `0` values for different dictionary key types. The cache is

}
}

/// Cache for null arrays to avoid repeated allocations

Suggested change
/// Cache for null arrays to avoid repeated allocations
/// Cache for null arrays to avoid repeated allocations
/// of all zeros

/// limited to 1 entry per type (the last size used).
#[derive(Debug)]
struct KeyArrayCache<K: ArrowDictionaryKeyType> {
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (size, is_null, key_array)

Suggested change
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (size, is_null, key_array)
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (num_rows, is_null, key_array)

}

impl<K: ArrowDictionaryKeyType> KeyArrayCache<K> {
fn get_or_create(&mut self, size: usize, is_null: bool) -> PrimitiveArray<K> {

Nit: I would use num_rows here rather than size, as size might be confused with the key size (e.g. Int16Type would have size 16?).
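A sketch of how get_or_create might look with the num_rows naming; the struct shape matches the diff above, while the body, the clone-on-hit behavior, and the all-zero/all-null constructors are assumptions based on standard arrow-rs APIs:

use arrow::array::PrimitiveArray;
use arrow::datatypes::{ArrowDictionaryKeyType, ArrowNativeType};

struct KeyArrayCache<K: ArrowDictionaryKeyType> {
    cache: Option<(usize, bool, PrimitiveArray<K>)>, // (num_rows, is_null, key_array)
}

impl<K: ArrowDictionaryKeyType> KeyArrayCache<K> {
    fn get_or_create(&mut self, num_rows: usize, is_null: bool) -> PrimitiveArray<K> {
        // Hit: same shape as the last request, so return a cheap clone
        // (arrow arrays share their underlying buffers via Arc).
        if let Some((n, null, array)) = &self.cache {
            if *n == num_rows && *null == is_null {
                return array.clone();
            }
        }
        // Miss: build either an all-null or an all-zero key array.
        let array = if is_null {
            PrimitiveArray::<K>::new_null(num_rows)
        } else {
            PrimitiveArray::<K>::from_value(K::Native::usize_as(0), num_rows)
        };
        self.cache = Some((num_rows, is_null, array.clone()));
        array
    }
}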

}

/// Get cached null array for the given size
fn get_cached_null_array(size: usize) -> ArrayRef {

Suggested change
fn get_cached_null_array(size: usize) -> ArrayRef {
fn get_or_create_cached_null_array(size: usize) -> ArrayRef {

}

/// Get cached key array for a specific key type
fn get_cached_key_array<K: ArrowDictionaryKeyType>(

Suggested change
fn get_cached_key_array<K: ArrowDictionaryKeyType>(
fn get_or_create_cached_key_array<K: ArrowDictionaryKeyType>(

@adriangb (Contributor, PR author)

> With this code, can we now also remove ZeroBufferGenerators? Perhaps as a follow-on PR?

Not yet. That will still be used for projection (selecting a partition column) until we fully implement expression pushdown.

@adriangb (Contributor, PR author)

Are we worried about memory overhead with this? One thing I think we could do is set a reasonable limit on the cache size: only write to the cache if the size is less than 1024 * 1024 items, limiting memory usage to a couple of MB. The default batch size is 16k, so this should be more than enough to cache batch-size arrays while ensuring that calls for arrays of 100M items don't leak multiple MB of memory.
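A sketch of what such a guarded cache write could look like on the null-array side, assuming a single-entry cache behind a static Mutex; the function name follows the get_or_create_cached_null_array rename adopted later in this PR, while the static layout and the data_type key are assumptions:

use std::sync::{Arc, Mutex};
use arrow::array::{new_null_array, ArrayRef};
use arrow::datatypes::DataType;

/// Cap discussed above: don't cache anything above 1M rows.
const MAX_CACHE_NUM_ROWS: usize = 1024 * 1024;

/// Single-entry cache of the last null array handed out.
static NULL_ARRAY_CACHE: Mutex<Option<(DataType, usize, ArrayRef)>> = Mutex::new(None);

fn get_or_create_cached_null_array(data_type: &DataType, num_rows: usize) -> ArrayRef {
    let mut cache = NULL_ARRAY_CACHE.lock().unwrap();
    if let Some((dt, n, array)) = cache.as_ref() {
        if dt == data_type && *n == num_rows {
            return Arc::clone(array);
        }
    }
    let array = new_null_array(data_type, num_rows);
    // Only remember reasonably sized arrays, so a one-off request for
    // 100M rows doesn't pin multiple MB for the process lifetime.
    if num_rows <= MAX_CACHE_NUM_ROWS {
        *cache = Some((data_type.clone(), num_rows, Arc::clone(&array)));
    }
    array
}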

adriangb and others added 3 commits July 17, 2025 09:22
- Move cache code into its own module (cache.rs) instead of scalar/mod.rs
- Rename 'size' parameter to 'num_rows' for clarity
- Update function names for better clarity:
  - get_cached_null_array() → get_or_create_cached_null_array()
  - get_cached_key_array() → get_or_create_cached_key_array()
- Add cache size limit (1024 * 1024 items) to prevent memory leaks
- Improve documentation comments
- Use HashMap instead of single-entry cache for better performance
- Use ArrayData conversion instead of unsafe transmute

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
The HashMap approach was overengineered. The original single-entry cache
design is actually optimal for the expected usage pattern where the same
num_rows is used repeatedly, providing 100% cache hit rates while using
minimal memory (1 entry per type).

Changes:
- Reverted from HashMap<(usize, bool), Array> back to Option<(usize, bool, Array)>
- Maintained the num_rows parameter naming and cache size limit
- Kept the improved documentation and module organization

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@adriangb (Contributor, PR author)

> Are we worried about memory overhead with this? One thing I think we could do is set a reasonable limit on the cache size: only write to the cache if the size is less than 1024 * 1024 items, limiting memory usage to a couple of MB. The default batch size is 16k, so this should be more than enough to cache batch-size arrays while ensuring that calls for arrays of 100M items don't leak multiple MB of memory.

I've implemented a maximum of 1M items.

@alamb (Contributor)

alamb commented Jul 17, 2025

I for one am not really worried about memory, as we are talking about 8k * Int64 = 64k for the largest index size. I expect most people to use a single index size (e.g. Int32), so that is like 32K per process if they query partition columns.
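(Arithmetic check of those figures: 8,192 rows × 8 bytes per Int64 key = 65,536 bytes = 64 KB; Int32 keys halve that to 32 KB per cached entry.)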

@adriangb (Contributor, PR author)

I agree. Let's move forward with this then. I'll allow a couple more days for review since we're in no rush.

@adriangb merged commit 3869857 into apache:main on Jul 20, 2025
27 checks passed