
cache generation of dictionary keys and null arrays for ScalarValue #16789


Merged: 5 commits into apache:main on Jul 20, 2025

Conversation

@adriangb (Contributor)

This was inspired by what is already being done for partition values:

#[derive(Debug, Default)]
struct ZeroBufferGenerators {
    gen_i8: ZeroBufferGenerator<i8>,
    gen_i16: ZeroBufferGenerator<i16>,
    gen_i32: ZeroBufferGenerator<i32>,
    gen_i64: ZeroBufferGenerator<i64>,
    gen_u8: ZeroBufferGenerator<u8>,
    gen_u16: ZeroBufferGenerator<u16>,
    gen_u32: ZeroBufferGenerator<u32>,
    gen_u64: ZeroBufferGenerator<u64>,
}

I am interested in refactoring away the current handling of partition values, but we'd lose that caching. Which made me think: what if we just added that globally? In particular, it's quite a common operation to create a null array of size batch_size; this would help with that.
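To make that concrete, here is a minimal sketch of the uncached operation with stock arrow-rs (new_null_array, Array::len, and Array::null_count are real arrow APIs; the batch size here is illustrative):

use arrow::array::{new_null_array, Array};
use arrow::datatypes::DataType;

fn main() {
    // Each call allocates fresh buffers, even though the result for a
    // given (data_type, length) pair is always the same all-null array.
    let batch_size = 8192; // e.g. one batch worth of rows
    let nulls = new_null_array(&DataType::Int32, batch_size);
    assert_eq!(nulls.len(), batch_size);
    assert_eq!(nulls.null_count(), batch_size);
}

A process-wide cache can hand back the same Arc'd array for repeated requests of the same type and length instead of reallocating each time.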

@github-actions bot added the common (Related to common crate) label on Jul 15, 2025
@alamb (Contributor) left a comment

Thank you @adriangb

With this code, can we now also remove ZeroBufferGenerators? Perhaps as a follow-on PR?

My only suggestion is to move this code into its own module, as scalar/mod.rs is already quite large.

@@ -854,6 +854,140 @@ pub fn get_dict_value<K: ArrowDictionaryKeyType>(
Ok((dict_array.values(), dict_array.key(index)))
}

/// Cache for dictionary key arrays to avoid repeated allocations

I think it is important to also point out that the cached keys are always zero.

So, something like "cache for dictionary keys for single-valued dictionary arrays".
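For context, a minimal sketch of the kind of array in question: a single-valued dictionary array whose keys are all zero, so every row points at the one entry in the values array (the arrow-rs constructors are real; the specific values are illustrative):

use std::sync::Arc;
use arrow::array::{Array, ArrayRef, DictionaryArray, Int32Array, StringArray};

fn main() {
    // Four rows, one dictionary value: all keys are 0, which is why the
    // cached key arrays only ever contain zeros.
    let keys = Int32Array::from(vec![0; 4]);
    let values: ArrayRef = Arc::new(StringArray::from(vec!["a"]));
    let dict = DictionaryArray::new(keys, values);
    assert_eq!(dict.len(), 4);
}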

/// when the same size is used frequently.
///
/// Similar to PartitionColumnProjector's ZeroBufferGenerators, this cache
/// stores key arrays for different dictionary key types. The cache is

Suggested change
/// stores key arrays for different dictionary key types. The cache is
/// stores key arrays with all `0` values for different dictionary key types. The cache is

}
}

/// Cache for null arrays to avoid repeated allocations

Suggested change
/// Cache for null arrays to avoid repeated allocations
/// Cache for null arrays to avoid repeated allocations
/// of all zeros

/// limited to 1 entry per type (the last size used).
#[derive(Debug)]
struct KeyArrayCache<K: ArrowDictionaryKeyType> {
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (size, is_null, key_array)

Suggested change
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (size, is_null, key_array)
cache: Option<(usize, bool, PrimitiveArray<K>)>, // (num_rows, is_null, key_array)

}

impl<K: ArrowDictionaryKeyType> KeyArrayCache<K> {
fn get_or_create(&mut self, size: usize, is_null: bool) -> PrimitiveArray<K> {

Nit: I would use num_rows here rather than size, as size might be confused with the key size (e.g. Int16Type would have size 16?).
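A sketch of how get_or_create might look with the num_rows naming; the struct shape matches the diff above, while the body, the clone-on-hit behavior, and the all-zero/all-null constructors are assumptions based on standard arrow-rs APIs:

use arrow::array::PrimitiveArray;
use arrow::datatypes::{ArrowDictionaryKeyType, ArrowNativeType};

struct KeyArrayCache<K: ArrowDictionaryKeyType> {
    cache: Option<(usize, bool, PrimitiveArray<K>)>, // (num_rows, is_null, key_array)
}

impl<K: ArrowDictionaryKeyType> KeyArrayCache<K> {
    fn get_or_create(&mut self, num_rows: usize, is_null: bool) -> PrimitiveArray<K> {
        // Hit: same shape as the last request, so return a cheap clone
        // (arrow arrays share their underlying buffers via Arc).
        if let Some((n, null, array)) = &self.cache {
            if *n == num_rows && *null == is_null {
                return array.clone();
            }
        }
        // Miss: build either an all-null or an all-zero key array.
        let array = if is_null {
            PrimitiveArray::<K>::new_null(num_rows)
        } else {
            PrimitiveArray::<K>::from_value(K::Native::usize_as(0), num_rows)
        };
        self.cache = Some((num_rows, is_null, array.clone()));
        array
    }
}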

}

/// Get cached null array for the given size
fn get_cached_null_array(size: usize) -> ArrayRef {

Suggested change
fn get_cached_null_array(size: usize) -> ArrayRef {
fn get_or_create_cached_null_array(size: usize) -> ArrayRef {

}

/// Get cached key array for a specific key type
fn get_cached_key_array<K: ArrowDictionaryKeyType>(

Suggested change
fn get_cached_key_array<K: ArrowDictionaryKeyType>(
fn get_or_create_cached_key_array<K: ArrowDictionaryKeyType>(

@adriangb (Contributor, PR author)

> With this code, can we now also remove ZeroBufferGenerators? Perhaps as a follow-on PR?

Not yet. That will still be used for projection (selecting a partition column) until we fully implement expression pushdown.

@adriangb (Contributor, PR author)

Are we worried about memory overhead with this? One thing I think we could do is set a reasonable limit on the cache size: only write to the cache if the size is less than 1024 * 1024 items, limiting memory usage to a couple of MB. The default batch size is 16k, so this should be more than enough to cache batch-size arrays while ensuring that calls for arrays of 100M items don't leak multiple MB of memory.
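A sketch of what such a guarded cache write could look like on the null-array side, assuming a single-entry cache behind a static Mutex; the function name follows the get_or_create_cached_null_array rename adopted later in this PR, while the static layout and the data_type key are assumptions:

use std::sync::{Arc, Mutex};
use arrow::array::{new_null_array, ArrayRef};
use arrow::datatypes::DataType;

/// Cap discussed above: don't cache anything above 1M rows.
const MAX_CACHE_NUM_ROWS: usize = 1024 * 1024;

/// Single-entry cache of the last null array handed out.
static NULL_ARRAY_CACHE: Mutex<Option<(DataType, usize, ArrayRef)>> = Mutex::new(None);

fn get_or_create_cached_null_array(data_type: &DataType, num_rows: usize) -> ArrayRef {
    let mut cache = NULL_ARRAY_CACHE.lock().unwrap();
    if let Some((dt, n, array)) = cache.as_ref() {
        if dt == data_type && *n == num_rows {
            return Arc::clone(array);
        }
    }
    let array = new_null_array(data_type, num_rows);
    // Only remember reasonably sized arrays, so a one-off request for
    // 100M rows doesn't pin multiple MB for the process lifetime.
    if num_rows <= MAX_CACHE_NUM_ROWS {
        *cache = Some((data_type.clone(), num_rows, Arc::clone(&array)));
    }
    array
}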

adriangb and others added 3 commits July 17, 2025 09:22
- Move cache code into its own module (cache.rs) instead of scalar/mod.rs
- Rename 'size' parameter to 'num_rows' for clarity
- Update function names for better clarity:
  - get_cached_null_array() → get_or_create_cached_null_array()
  - get_cached_key_array() → get_or_create_cached_key_array()
- Add cache size limit (1024 * 1024 items) to prevent memory leaks
- Improve documentation comments
- Use HashMap instead of single-entry cache for better performance
- Use ArrayData conversion instead of unsafe transmute

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
The HashMap approach was overengineered. The original single-entry cache
design is actually optimal for the expected usage pattern where the same
num_rows is used repeatedly, providing 100% cache hit rates while using
minimal memory (1 entry per type).

Changes:
- Reverted from HashMap<(usize, bool), Array> back to Option<(usize, bool, Array)>
- Maintained the num_rows parameter naming and cache size limit
- Kept the improved documentation and module organization

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@adriangb (Contributor, PR author)

> Are we worried about memory overhead with this? One thing I think we could do is set a reasonable limit on the cache size: only write to the cache if the size is less than 1024 * 1024 items, limiting memory usage to a couple of MB. The default batch size is 16k, so this should be more than enough to cache batch-size arrays while ensuring that calls for arrays of 100M items don't leak multiple MB of memory.

I've implemented a maximum of 1M items.

@alamb (Contributor)

alamb commented Jul 17, 2025

I for one am not really worried about memory, as we are talking about 8k * Int64 = 64k for the largest index size. I expect most people to use a single index size (e.g. Int32), so that is like 32K per process if they query partition columns.
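(Arithmetic check of those figures: 8,192 rows × 8 bytes per Int64 key = 65,536 bytes = 64 KB; Int32 keys halve that to 32 KB per cached entry.)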

@adriangb (Contributor, PR author)

I agree. Let's move forward with this then. I'll allow a couple more days for review since we're in no rush.

@adriangb merged commit 3869857 into apache:main on Jul 20, 2025
27 checks passed