Accept Iterable for Performance #31

@william-silversmith

Description

Hi, thanks so much for this very useful library! I'm using it to randomize keys for billions of objects to create condensed files that contain groups of thousands to millions of objects at a time.

https://github.com/google/neuroglancer/blob/056a3548abffc3c76c93c7a906f1603ce02b5fa3/src/neuroglancer/datasource/precomputed/sharded.md

It's not critical, but there is a bottleneck at the front of my Python processing pipeline where the hash is applied to all object labels at once to figure out how to assign them for further processing. The hash function dominates this step.

Function: murmur at line 25
Time per Hit in microseconds
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           @profile
    26                                           def murmur(x):
    27     69734      74332.0      1.1      1.6    y = uint64(x).tobytes()
    28     69734    4381932.0     62.8     97.2    y = mmh3.hash64(y, x64arch=False)
    29     69733      52731.0      0.8      1.2    return uint64(y[0])

Total time: 5.44635 s
File: REDACTED
Function: compute_shard_location at line 145
Time per Hit in microseconds
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   145                                             @profile
   146                                             def compute_shard_location(self, key):
   147     69734      99805.0      1.4      1.8      chunkid = uint64(key) >> uint64(self.preshift_bits)
   148     69734    4703010.0     67.4     86.4      chunkid = self.hashfn(chunkid)
   149     69733      60072.0      0.9      1.1      minishard_number = uint64(chunkid & self.minishard_mask)
   150     69733      97034.0      1.4      1.8      shard_number = uint64((chunkid & self.shard_mask) >> uint64(self.minishard_bits))
   151     69733     312626.0      4.5      5.7      shard_number = format(shard_number, 'x').zfill(int(np.ceil(self.shard_bits / 4.0)))
   152     69733     102740.0      1.5      1.9      remainder = chunkid >> uint64(self.minishard_bits + self.shard_bits)
   153                                           
   154     69733      71060.0      1.0      1.3      return ShardLocation(shard_number, minishard_number, remainder)
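For context, here is a self-contained sketch of the shard-location computation profiled above. The bit widths and the splitmix64 finalizer (used as a stdlib stand-in for mmh3.hash64 so the snippet runs on its own) are illustrative assumptions, not the real pipeline's parameters:

```python
# Illustrative sketch of the shard-location path profiled above.
# ASSUMPTIONS: the bit widths below are made up, and mix64 (splitmix64
# finalizer) stands in for mmh3.hash64 so this runs with stdlib only.
from collections import namedtuple

ShardLocation = namedtuple(
    "ShardLocation", ["shard_number", "minishard_number", "remainder"]
)

PRESHIFT_BITS = 3
MINISHARD_BITS = 4
SHARD_BITS = 8
MINISHARD_MASK = (1 << MINISHARD_BITS) - 1
SHARD_MASK = ((1 << SHARD_BITS) - 1) << MINISHARD_BITS

def mix64(x):
    """splitmix64 finalizer -- a stand-in for the murmur hash step."""
    x = (x + 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    return x ^ (x >> 31)

def compute_shard_location(key):
    chunkid = key >> PRESHIFT_BITS
    chunkid = mix64(chunkid)
    minishard_number = chunkid & MINISHARD_MASK
    shard_number = (chunkid & SHARD_MASK) >> MINISHARD_BITS
    # hex-encode, zero-padded to ceil(SHARD_BITS / 4) digits
    shard_number = format(shard_number, "x").zfill(-(-SHARD_BITS // 4))
    remainder = chunkid >> (MINISHARD_BITS + SHARD_BITS)
    return ShardLocation(shard_number, minishard_number, remainder)
```

Everything outside the hash call is cheap bit arithmetic, which is consistent with the profile attributing ~86% of the time to `self.hashfn(chunkid)`.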

It might be possible to thread this processing, but Python's GIL gets in the way. Multiprocessing could work, though pickling/unpickling the data would also take some time. I was thinking that a neat way to increase throughput would be to process multiple hashes at once in C, that is, accept either a scalar or an iterable as input to the function. This would allow the compiler to autovectorize and would also amortize the Python/C call overhead. I'm currently getting ~66.5k hashes/sec on an Apple Silicon M1 (ARM64).
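As a partial mitigation available today (independent of this feature request), the per-key `uint64(x).tobytes()` conversion can be done in bulk with one `struct.pack` call, then sliced into 8-byte views; a stdlib sketch (the helper name is hypothetical, and the hash call itself would still be per-key):

```python
# Sketch: convert all labels to little-endian uint64 bytes in one
# struct.pack call, then hand out 8-byte slices for hashing.
# This amortizes the per-key tobytes() overhead seen in the profile;
# each slice would still be passed to mmh3.hash64 individually.
import struct

def keys_to_buffers(keys):
    buf = struct.pack("<%dQ" % len(keys), *keys)  # one C-level pack
    view = memoryview(buf)
    return [view[i:i + 8] for i in range(0, len(buf), 8)]

chunks = keys_to_buffers([1, 2, 1 << 40])
# each entry is 8 bytes, e.g. mmh3.hash64(bytes(chunks[0]), x64arch=False)
```

This only trims the ~1.6% conversion overhead; the proposed batched C entry point is what would address the remaining ~97%.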

I'm thinking of an interface similar to the one below, where the second form returns some buffer that is easy to read into numpy.

(lower, upper) = mmh3.hash64(some_bytes, x64arch=False)
[l1, h1, l2, h2] = mmh3.hash64(iterable_containing_bytes, x64arch=False)
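To make the "easy to read into numpy" part concrete, here is a sketch of how a caller might consume the proposed batched return value. `fake_batched_hash64` is NOT a real mmh3 API; it only simulates the hypothetical output buffer (a contiguous run of little-endian signed 64-bit pairs):

```python
# Hypothetical consumption of the proposed batched return value.
# fake_batched_hash64 stands in for the proposed mmh3.hash64(iterable)
# and is NOT a real mmh3 API; the real call would run MurmurHash3 in C.
import struct
import numpy as np

def fake_batched_hash64(keys):
    # Pretend key k hashes to the pair (k, ~k masked to int64 range).
    out = bytearray()
    for k in keys:
        out += struct.pack("<qq", k, ~k & 0x7FFFFFFFFFFFFFFF)
    return bytes(out)

buf = fake_batched_hash64([1, 2, 3])
pairs = np.frombuffer(buf, dtype=np.int64).reshape(-1, 2)
lowers, uppers = pairs[:, 0], pairs[:, 1]
```

Returning one flat buffer keeps the result zero-copy on the numpy side (`np.frombuffer` just reinterprets the bytes), which is exactly the property that makes a batched interface attractive here.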

Thank you for your consideration and for all the effort you've put into this library!
