Hi, thanks so much for this very useful library! I'm using it to randomize keys for billions of objects to create condensed files that contain groups of thousands to millions of objects at a time.
It's not critical, but there is a bottleneck step at the front of my Python processing pipeline, where the hash is applied to all object labels at once to decide how to assign them for further processing. The hash function dominates this calculation.
```
Function: murmur at line 25
Time per Hit in microseconds

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    25                                           @profile
    26                                           def murmur(x):
    27     69734      74332.0      1.1      1.6      y = uint64(x).tobytes()
    28     69734    4381932.0     62.8     97.2      y = mmh3.hash64(y, x64arch=False)
    29     69733      52731.0      0.8      1.2      return uint64(y[0])
```
```
Total time: 5.44635 s
File: REDACTED
Function: compute_shard_location at line 145
Time per Hit in microseconds

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   145                                           @profile
   146                                           def compute_shard_location(self, key):
   147     69734      99805.0      1.4      1.8      chunkid = uint64(key) >> uint64(self.preshift_bits)
   148     69734    4703010.0     67.4     86.4      chunkid = self.hashfn(chunkid)
   149     69733      60072.0      0.9      1.1      minishard_number = uint64(chunkid & self.minishard_mask)
   150     69733      97034.0      1.4      1.8      shard_number = uint64((chunkid & self.shard_mask) >> uint64(self.minishard_bits))
   151     69733     312626.0      4.5      5.7      shard_number = format(shard_number, 'x').zfill(int(np.ceil(self.shard_bits / 4.0)))
   152     69733     102740.0      1.5      1.9      remainder = chunkid >> uint64(self.minishard_bits + self.shard_bits)
   153
   154     69733      71060.0      1.0      1.3      return ShardLocation(shard_number, minishard_number, remainder)
```
It could be possible to thread this processing, but Python has the GIL. Multiprocessing could work, though the pickle/unpickle round trip will also take some time. I was thinking that a neat way to increase throughput would be to process multiple hashes at once in C, that is, to accept either a scalar or an iterable as input to the function. This would allow the compiler to autovectorize and would also avoid per-call Python/C overhead. I'm currently getting ~66.5k hashes/sec on Apple Silicon M1 ARM64.
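For context, the multiprocessing workaround I mean is roughly the sketch below (the `murmur` wrapper is the one from the profile above; the worker count and chunk size are placeholder values). Every key gets pickled out to a worker and every result pickled back, which is the overhead I'd like to avoid.

```python
import multiprocessing as mp

import mmh3
from numpy import uint64


def murmur(x):
    # Same wrapper as in the profile above; it has to be a top-level
    # function so multiprocessing can pickle it.
    y = uint64(x).tobytes()
    y = mmh3.hash64(y, x64arch=False)
    return uint64(y[0])


def hash_many(keys, processes=8, chunksize=10_000):
    # Fan the keys out to worker processes. The keys are pickled to the
    # workers and the results pickled back, so the serialization cost
    # eats into whatever the extra cores buy.
    with mp.Pool(processes) as pool:
        return pool.map(murmur, keys, chunksize=chunksize)
```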
I'm thinking of an interface similar to the one below. The second form should return some buffer that is easy to read into numpy.
```python
(lower, upper) = mmh3.hash64(some_bytes, x64arch=False)
[l1, h1, l2, h2] = mmh3.hash64(iterable_containing_bytes, x64arch=False)
```
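To make the "easy to read into numpy" part concrete, here is a minimal sketch of how I'd consume the batched result, assuming the second form returned a flat bytes-like buffer of alternating low/high 64-bit words rather than a Python list (none of this is an existing mmh3 API):

```python
import numpy as np


def split_hash_buffer(buf):
    # buf: a bytes-like buffer of alternating low/high 64-bit words, one
    # pair per input item, e.g. what the proposed
    #   mmh3.hash64(iterable_containing_bytes, x64arch=False)
    # could return. np.frombuffer views the buffer without copying it.
    words = np.frombuffer(buf, dtype=np.uint64).reshape(-1, 2)
    lower = words[:, 0]
    upper = words[:, 1]
    return lower, upper
```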
Thank you for your consideration and for all the effort you've put into this library!