Revisit the dict object #219

lpereira · 2022-01-11T18:48:53Z

lpereira
Jan 11, 2022

I was reading about Rust's new hash table the other day, and found some articles about the hash table it derives from. Might be a good idea to look into this and incorporate some of the ideas, given how hot hash tables are in Python:

Blog post by Aria Beingessner about hashbrown & swisstable: https://gankra.github.io/blah/hashbrown-tldr/
Here's a talk by Matt Kulukundis about SwissTable, and how they came up with the current implementation (plus some insights that could be useful for us): https://www.youtube.com/watch?v=ncHmEUmJZf4

gvanrossum · 2022-01-11T20:04:51Z

gvanrossum
Jan 11, 2022
Maintainer

@methane ^^

0 replies

brandtbucher · 2022-01-11T22:40:10Z

brandtbucher
Jan 11, 2022
Maintainer

At one point in Matt's talk, he discusses a rejected cache-line-friendly design that requires 8-byte keys, and explicitly says that if he were designing a hash map for Python, he would use that design. (Note, however, that Python's insertion ordering guarantees make a lot of this stuff a bit trickier.)

Basically, keys are collected in groups of seven, with the struct for each group looking something like:

typedef struct {
    uint8_t ever_full : 1;
    uint8_t tombstones : 7;
    uint8_t hashes[7];  // Low byte of each hash.
    PyObject *keys[7];
} seven_keys;

...then use intrinsics on the first 8 bytes to probe for hits.
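As a rough illustration of probing the first 8 bytes in one step, here is a minimal Python sketch that uses a portable bit trick in place of real SIMD intrinsics. The names `pack_group` and `candidate_slots` are hypothetical, and placing the flags byte in lane 0 is an assumption, since C bitfield layout is implementation-defined. Every reported slot is only a candidate; the key itself still has to be compared.

```python
ONES = 0x0101010101010101    # 0x01 repeated in every byte lane
LOW7 = 0x7F7F7F7F7F7F7F7F    # low seven bits of every byte lane
MASK64 = (1 << 64) - 1


def pack_group(flags, hashes):
    """Pack the flags byte plus seven hash bytes into one little-endian word."""
    assert len(hashes) == 7
    return int.from_bytes(bytes([flags, *hashes]), "little")


def candidate_slots(word, hash_byte):
    """Return key slots (0..6) whose stored hash byte equals hash_byte."""
    # XOR turns every lane that equals hash_byte into a zero byte.
    x = word ^ (hash_byte * ONES)
    # High bit of each lane is set iff that lane of x is non-zero.
    # No carry crosses a lane boundary, so the result is exact per lane.
    nonzero = ((x & LOW7) + LOW7) | x | LOW7
    hits = ~nonzero & MASK64
    # Lane 0 holds the flags byte, so only lanes 1..7 map to key slots.
    return [lane - 1 for lane in range(1, 8) if (hits >> (8 * lane + 7)) & 1]


group = pack_group(0x14, [0x14, 0x68, 0x6C, 0x14, 0x00, 0x00, 0x00])
print(candidate_slots(group, 0x14))
```

The flags byte in this example deliberately equals the probed value to show that lane 0 is ignored.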

2 replies

sweeneyde Jan 12, 2022

Note, however, that Python's insertion ordering guarantees make a lot of this stuff a bit trickier.

Wouldn't it mostly be a matter of switching to PyDictKeyEntry *keys[7] that point to the right place in the PyDictKeyEntry array? Although I would worry that that could hurt memory usage on small dicts: from (a power of two larger than) n bytes to (a power of two larger than) n * (8/7) (??) words in the hash table proper.

I guess variants with uint8_t, uint16_t, uint32_t and uint64_t like in the current CPython dict could help with that memory usage, but then it would probably be preferred to use the standard Swiss Table structure (not the "Cache Line Aware Groups" version). Something like

struct _dictkeysobject {
    Py_ssize_t dk_refcnt;
    uint8_t dk_log2_size;
    ...
    /* Everything below could be computed dynamically, not actual struct members */
    uint8_t CONTROL_BYTES[CURRENT_NUM_BINS][16];
    union {
        uint8_t SMALL_INDICES[CURRENT_NUM_BINS][16];
        uint16_t MEDIUM_INDICES[CURRENT_NUM_BINS][16];
        uint32_t LARGE_INDICES[CURRENT_NUM_BINS][16];
        uint64_t HUGE_INDICES[CURRENT_NUM_BINS][16];
    };
    DictKeysEntry items_in_insertion_order[16 * CURRENT_NUM_BINS /* capacity */];
};

Memory overhead would be a bit larger for small dicts (2 bytes per entry instead of 1 in the main hash table, and the minimum size would be bigger), but probably not that bad.
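To make the "2 bytes per entry" estimate concrete, here is a small sketch of the arithmetic. It assumes unsigned indices and that empty slots are tracked only in the control bytes; neither is stated in the thread, and the helper names are hypothetical.

```python
SLOTS_PER_BIN = 16  # group width used in the struct sketch above


def index_width(capacity):
    """Smallest unsigned index width, in bytes, that can address `capacity` entries."""
    for width in (1, 2, 4, 8):
        if capacity <= 1 << (8 * width):
            return width
    raise ValueError("capacity too large")


def swiss_table_bytes(num_bins):
    """Bytes used by control bytes plus indices, excluding the entries array."""
    capacity = num_bins * SLOTS_PER_BIN
    # One control byte and one index per slot.
    return capacity * (1 + index_width(capacity))


for bins in (1, 16, 32):
    slots = bins * SLOTS_PER_BIN
    print(bins, slots, swiss_table_bytes(bins) / slots)
```

Under these assumptions the smallest table has 16 slots at 2 bytes each, and the per-slot cost rises to 3 bytes once the capacity exceeds 256.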

sweeneyde Jan 12, 2022

One could even attempt to write a C++ extension module that uses pre-existing SwissTable<SomeSortOfPyObjectWithCachedHash, intxx_t> code, from keys to the index into a values array, just to test the performance characteristics.

methane · 2022-01-12T04:32:01Z

methane
Jan 12, 2022

I was impressed by how they look up a value in a group: https://youtu.be/ncHmEUmJZf4?t=1739.
We may be able to optimize small dicts using that idea.

Currently, our dk_indices stores the index of the dk_entry. But when the dict is small (len<=8), we can store hash%127 instead.

$ PYTHONHASHSEED=0 python3
>>> d={"foo":1, "bar":2, "baz":3}
>>> for k in d:
...   h = hash(k)
...   print(hex(h & ((1<<64)-1)), h%8, h%128)
...
0xdde38bd48eae7414 4 20
0x8783542c0804c968 0 104
0x665e159a7e18356c 4 108

Current:

One conflict between "foo" and "baz".

Idea:

Zero conflict
8 entries can be stored.

(When SSE2 can be used, we can use the same idea for len<=16 dicts.)
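The conflict counts above can be checked directly from the three hash values printed in the session. This is only a sketch of the counting; `shared_slots` is a hypothetical helper, not part of any implementation.

```python
from collections import Counter

# Hash values copied from the PYTHONHASHSEED=0 session above.
HASHES = {
    "foo": 0xDDE38BD48EAE7414,
    "bar": 0x8783542C0804C968,
    "baz": 0x665E159A7E18356C,
}


def shared_slots(modulus):
    """Map each residue that more than one key lands on to its key count."""
    counts = Counter(h % modulus for h in HASHES.values())
    return {slot: n for slot, n in counts.items() if n > 1}


print(shared_slots(8))    # residues shared under the current 8-slot index
print(shared_slots(128))  # residues shared when 7 bits of hash are stored
```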

5 replies

methane Feb 7, 2022

This is proof of concept code.
methane/cpython#40

I don't have a good name for this idea, but I call it "vector", because it is not a hash table, just a vector with hashes.
The code is ugly. It does not support Windows yet.
Only the 8-entry vector is implemented.
The minimum dict (vector8) can contain 8 entries, instead of 5 (status quo).

Microbenchmark for the best scenario:

$ vector8/python -m pyperf timeit --compare-to main/python --duplicate 10 -s 'd=dict.fromkeys((1,9,17,25))' -- '25 in d'

main/python: ..................... 25.5 ns +- 0.4 ns
vector8/python: ..................... 19.9 ns +- 0.1 ns
 
Mean +- std dev: [main/python] 25.5 ns +- 0.4 ns -> [vector8/python] 19.9 ns +- 0.1 ns: 1.28x faster

Note that all of (1,9,17,25) are 1 mod 8, so searching for 25 needs four comparisons.
On the other hand, all of them differ mod 128, so searching for any member needs only one comparison.
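The four comparisons can be reproduced with a simplified model of open addressing. The sketch below assumes an 8-slot table and the `i = i*5 + perturb + 1` recurrence with a perturb shift of 5, and it uses each small int as its own hash value. It ignores everything else about the real dict, and the helper names are hypothetical.

```python
PERTURB_SHIFT = 5


def probe_sequence(h, mask):
    """Yield slot numbers in the order a lookup for hash h would visit them."""
    i = h & mask
    perturb = h
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask


def build(keys, size=8):
    """Insert small non-negative ints into the first free slot on their probe path."""
    slots = [None] * size
    for key in keys:
        for i in probe_sequence(key, size - 1):
            if slots[i] is None:
                slots[i] = key
                break
    return slots


def comparisons(slots, key):
    """Count the slots inspected before `key` is found."""
    for n, i in enumerate(probe_sequence(key, len(slots) - 1), start=1):
        if slots[i] == key:
            return n
        if slots[i] is None:
            raise KeyError(key)


KEYS = (1, 9, 17, 25)
table = build(KEYS)
print(comparisons(table, 25))
print(len({k % 128 for k in KEYS}))
```

In this model the lookup for 25 visits slots 1, 6, 7 and 4, while all four keys have distinct residues mod 128.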

However, a more realistic microbenchmark shows only a very small difference.

$ vector8/python -m pyperf timeit --compare-to main/python --duplicate 10 -s 'd=dict.fromkeys("foo bar baz fizz".split())' -- '"fizz" in d'
 main/python: ..................... 27.0 ns +- 1.6 ns
 vector8/python: ..................... 26.1 ns +- 0.1 ns
 
 Mean +- std dev: [main/python] 27.0 ns +- 1.6 ns -> [vector8/python] 26.1 ns +- 0.1 ns: 1.03x faster

And I can not see any difference for pyperformance.

methane Feb 14, 2022

And I can not see any difference for pyperformance.

I had used the wrong version for benchmarking. I reran pyperformance and got "1% faster".

Slower (10):
- scimark_sparse_mat_mult: 6.70 ms +- 0.10 ms -> 7.26 ms +- 0.05 ms: 1.08x slower
- html5lib: 82.6 ms +- 3.4 ms -> 85.2 ms +- 3.5 ms: 1.03x slower
- logging_simple: 8.42 us +- 0.19 us -> 8.64 us +- 0.22 us: 1.03x slower
- chaos: 96.8 ms +- 1.8 ms -> 99.1 ms +- 1.1 ms: 1.02x slower
- logging_silent: 141 ns +- 1 ns -> 144 ns +- 3 ns: 1.02x slower
- pyflate: 570 ms +- 5 ms -> 580 ms +- 6 ms: 1.02x slower
- meteor_contest: 126 ms +- 1 ms -> 127 ms +- 1 ms: 1.01x slower
- regex_dna: 238 ms +- 2 ms -> 241 ms +- 2 ms: 1.01x slower
- float: 95.9 ms +- 1.6 ms -> 96.9 ms +- 1.2 ms: 1.01x slower
- scimark_sor: 156 ms +- 2 ms -> 158 ms +- 2 ms: 1.01x slower

Faster (19):
- pickle_dict: 39.2 us +- 4.6 us -> 36.7 us +- 0.2 us: 1.07x faster
- sympy_expand: 694 ms +- 11 ms -> 664 ms +- 4 ms: 1.04x faster
- deltablue: 5.79 ms +- 0.42 ms -> 5.57 ms +- 0.07 ms: 1.04x faster
- unpickle_list: 6.06 us +- 0.21 us -> 5.87 us +- 0.02 us: 1.03x faster
- sympy_str: 416 ms +- 7 ms -> 404 ms +- 4 ms: 1.03x faster
- scimark_lu: 153 ms +- 7 ms -> 149 ms +- 1 ms: 1.03x faster
- unpack_sequence: 56.1 ns +- 1.4 ns -> 54.8 ns +- 0.8 ns: 1.02x faster
- sympy_sum: 230 ms +- 4 ms -> 225 ms +- 2 ms: 1.02x faster
- richards: 66.8 ms +- 1.8 ms -> 65.4 ms +- 1.8 ms: 1.02x faster
- json_dumps: 16.6 ms +- 0.3 ms -> 16.2 ms +- 0.1 ms: 1.02x faster
- unpickle: 19.2 us +- 0.3 us -> 18.8 us +- 0.2 us: 1.02x faster
- sympy_integrate: 27.6 ms +- 0.4 ms -> 27.1 ms +- 0.1 ms: 1.02x faster
- django_template: 51.0 ms +- 1.2 ms -> 50.2 ms +- 0.6 ms: 1.02x faster
- pathlib: 26.5 ms +- 0.5 ms -> 26.0 ms +- 0.3 ms: 1.02x faster
- regex_v8: 29.9 ms +- 0.4 ms -> 29.5 ms +- 0.2 ms: 1.02x faster
- python_startup: 9.94 ms +- 0.09 ms -> 9.79 ms +- 0.06 ms: 1.02x faster
- chameleon: 9.18 ms +- 0.08 ms -> 9.05 ms +- 0.26 ms: 1.01x faster
- python_startup_no_site: 7.16 ms +- 0.08 ms -> 7.06 ms +- 0.02 ms: 1.01x faster
- hexiom: 8.40 ms +- 0.08 ms -> 8.30 ms +- 0.06 ms: 1.01x faster

Benchmark hidden because not significant (30): 2to3, crypto_pyaes, dulwich_log, fannkuch, go, json_loads, logging_format, mako, nbody, nqueens, pickle, pickle_list, pickle_pure_python, pidigits, raytrace, regex_compile, regex_effbot, scimark_fft, scimark_monte_carlo, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, telco, tornado_http, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

And I implemented vector16 that works only with SSE2:

# vector8 vs vector16

$ cpython/python -m pyperf compare_to dict-vector8.json dict-vector16.json -G --min-speed=1
Slower (5):
- unpickle_list: 5.87 us +- 0.02 us -> 5.99 us +- 0.13 us: 1.02x slower
- pickle_dict: 36.7 us +- 0.2 us -> 37.4 us +- 0.2 us: 1.02x slower
- scimark_lu: 149 ms +- 1 ms -> 151 ms +- 1 ms: 1.01x slower
- unpickle: 18.8 us +- 0.2 us -> 19.0 us +- 0.7 us: 1.01x slower
- sqlite_synth: 3.22 us +- 0.06 us -> 3.26 us +- 0.06 us: 1.01x slower

Faster (18):
- regex_effbot: 4.01 ms +- 0.03 ms -> 3.72 ms +- 0.05 ms: 1.08x faster
- scimark_sparse_mat_mult: 7.26 ms +- 0.05 ms -> 6.95 ms +- 0.04 ms: 1.04x faster
- logging_simple: 8.64 us +- 0.22 us -> 8.28 us +- 0.16 us: 1.04x faster
- regex_dna: 241 ms +- 2 ms -> 232 ms +- 1 ms: 1.04x faster
- regex_v8: 29.5 ms +- 0.2 ms -> 28.5 ms +- 0.2 ms: 1.03x faster
- chaos: 99.1 ms +- 1.1 ms -> 96.3 ms +- 1.1 ms: 1.03x faster
- json_dumps: 16.2 ms +- 0.1 ms -> 15.8 ms +- 0.1 ms: 1.03x faster
- logging_format: 9.55 us +- 0.15 us -> 9.34 us +- 0.19 us: 1.02x faster
- html5lib: 85.2 ms +- 3.5 ms -> 83.3 ms +- 3.4 ms: 1.02x faster
- pathlib: 26.0 ms +- 0.3 ms -> 25.5 ms +- 0.7 ms: 1.02x faster
- crypto_pyaes: 111 ms +- 1 ms -> 109 ms +- 1 ms: 1.02x faster
- logging_silent: 144 ns +- 3 ns -> 142 ns +- 1 ns: 1.02x faster
- pickle_list: 5.43 us +- 0.17 us -> 5.34 us +- 0.06 us: 1.02x faster
- scimark_fft: 471 ms +- 3 ms -> 463 ms +- 3 ms: 1.02x faster
- pyflate: 580 ms +- 6 ms -> 572 ms +- 5 ms: 1.01x faster
- json_loads: 35.5 us +- 0.5 us -> 35.0 us +- 0.4 us: 1.01x faster
- mako: 13.6 ms +- 0.1 ms -> 13.4 ms +- 0.2 ms: 1.01x faster
- meteor_contest: 127 ms +- 1 ms -> 126 ms +- 1 ms: 1.01x faster

Benchmark hidden because not significant (36): 2to3, chameleon, deltablue, django_template, dulwich_log, fannkuch, float, go, hexiom, nbody, nqueens, pickle, pickle_pure_python, pidigits, python_startup, python_startup_no_site, raytrace, regex_compile, richards, scimark_monte_carlo, scimark_sor, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

## main vs vector16

Slower (1):
- scimark_sparse_mat_mult: 6.70 ms +- 0.10 ms -> 6.95 ms +- 0.04 ms: 1.04x slower

Faster (30):
- regex_effbot: 3.98 ms +- 0.04 ms -> 3.72 ms +- 0.05 ms: 1.07x faster
- regex_v8: 29.9 ms +- 0.4 ms -> 28.5 ms +- 0.2 ms: 1.05x faster
- deltablue: 5.79 ms +- 0.42 ms -> 5.52 ms +- 0.06 ms: 1.05x faster
- pickle_dict: 39.2 us +- 4.6 us -> 37.4 us +- 0.2 us: 1.05x faster
- json_dumps: 16.6 ms +- 0.3 ms -> 15.8 ms +- 0.1 ms: 1.05x faster
- sympy_expand: 694 ms +- 11 ms -> 664 ms +- 5 ms: 1.05x faster
- pathlib: 26.5 ms +- 0.5 ms -> 25.5 ms +- 0.7 ms: 1.04x faster
- sympy_str: 416 ms +- 7 ms -> 402 ms +- 4 ms: 1.03x faster
- django_template: 51.0 ms +- 1.2 ms -> 49.7 ms +- 0.5 ms: 1.03x faster
- sympy_sum: 230 ms +- 4 ms -> 224 ms +- 2 ms: 1.03x faster
- richards: 66.8 ms +- 1.8 ms -> 65.1 ms +- 2.2 ms: 1.03x faster
- sympy_integrate: 27.6 ms +- 0.4 ms -> 26.9 ms +- 0.2 ms: 1.03x faster
- scimark_fft: 475 ms +- 24 ms -> 463 ms +- 3 ms: 1.03x faster
- regex_dna: 238 ms +- 2 ms -> 232 ms +- 1 ms: 1.02x faster
- pickle_list: 5.45 us +- 0.07 us -> 5.34 us +- 0.06 us: 1.02x faster
- crypto_pyaes: 111 ms +- 2 ms -> 109 ms +- 1 ms: 1.02x faster
- json_loads: 35.7 us +- 0.4 us -> 35.0 us +- 0.4 us: 1.02x faster
- mako: 13.7 ms +- 0.1 ms -> 13.4 ms +- 0.2 ms: 1.02x faster
- logging_simple: 8.42 us +- 0.19 us -> 8.28 us +- 0.16 us: 1.02x faster
- dulwich_log: 100 ms +- 5 ms -> 98.8 ms +- 0.7 ms: 1.02x faster
- sqlalchemy_imperative: 27.6 ms +- 1.4 ms -> 27.1 ms +- 0.9 ms: 1.02x faster
- unpack_sequence: 56.1 ns +- 1.4 ns -> 55.2 ns +- 1.2 ns: 1.02x faster
- logging_format: 9.49 us +- 0.19 us -> 9.34 us +- 0.19 us: 1.02x faster
- regex_compile: 185 ms +- 3 ms -> 183 ms +- 2 ms: 1.01x faster
- hexiom: 8.40 ms +- 0.08 ms -> 8.29 ms +- 0.08 ms: 1.01x faster
- tornado_http: 153 ms +- 4 ms -> 151 ms +- 2 ms: 1.01x faster
- scimark_lu: 153 ms +- 7 ms -> 151 ms +- 1 ms: 1.01x faster
- spectral_norm: 147 ms +- 3 ms -> 145 ms +- 3 ms: 1.01x faster
- unpickle_list: 6.06 us +- 0.21 us -> 5.99 us +- 0.13 us: 1.01x faster
- chameleon: 9.18 ms +- 0.08 ms -> 9.08 ms +- 0.09 ms: 1.01x faster

Benchmark hidden because not significant (28): 2to3, chaos, fannkuch, float, go, html5lib, logging_silent, meteor_contest, nbody, nqueens, pickle, pickle_pure_python, pidigits, pyflate, python_startup, python_startup_no_site, raytrace, scimark_monte_carlo, scimark_sor, sqlalchemy_declarative, sqlite_synth, telco, unpickle, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

methane Feb 14, 2022

vector16 is implemented in methane/cpython#40 too.

It seems the performance improvement is significant. But I want to try implementing #219 (reply in thread) before finishing this dict-vector branch.

methane Feb 15, 2022

I noticed that optimizing for small dicts is not so important for now.

Lookup in small dicts was important because small dicts are very common as instance namespaces. But now:

ceval.c now caches the position in the dict in some cases.
In other cases (e.g. LOAD_METHOD checks that the method is not overridden by an instance variable), a key-sharing dict uses a 2^6=64 keysize by default. (Since this commit)

That explains why vector8 and vector16 give only a 1% benefit.

To optimize 2^5 ~ 2^10 cases, especially 2^6, implementing swisstable for all sizes seems much better to me.

markshannon Feb 15, 2022
Collaborator

Yes.
If we optimize the VM well, then almost all accesses of dicts as namespaces should be removed at runtime.
Anything more than a 1% speedup seems unlikely, as we shouldn't be spending much more than 1% of execution time doing dict accesses.

sthagen · 2022-01-12T05:36:55Z

sthagen
Jan 12, 2022

Is this discussion overlapping with #133 - could we relate or join these explicitly?

revisiting #133 … also, there may be performance trade-offs to consider depending on when and how we select the best fitting CPU instruction sets.

I see four open issues in the aHash project that may be of interest here:

1 reply

methane Jan 12, 2022

I don't think the two issues are tightly coupled.

Unlike Rust, a string in Python caches its hash value, and a dict in Python stores the hash values of its keys too.
So the speed of calculating hash values is not tightly coupled with dict performance.

markshannon · 2022-01-13T09:54:33Z

markshannon
Jan 13, 2022
Collaborator

How would this hypothetical new design handle split dictionaries?
How would lazy creation of an object's dict work?

1 reply

methane Feb 7, 2022

In my idea, there is no change for split dictionaries.
My PoC supports only the 8-vector for now, so a split dictionary still uses the 32-wide hash table.

I will try supporting the 16-vector and making split dictionaries use it instead of the hash table.
But since the overall performance gain is not so great, this idea has low priority for me.
I am a bit busy at my company these days, so I don't think it can be done before 3.11beta.

gvanrossum · 2022-02-07T01:23:02Z

gvanrossum
Feb 7, 2022
Maintainer

Yeah, it doesn't sound like this is going to be a killer change

0 replies

markshannon · 2022-02-07T13:07:05Z

markshannon
Feb 7, 2022
Collaborator

For small dict-keys, it may well be faster to perform a linear scan over the keys. It is also a lot simpler than fancy SSE stuff.
What "small" means in this case would need some experimentation, but I would expect something like 12-16 given how expensive cache misses are.

0 replies

markshannon · 2022-02-07T13:16:13Z

markshannon
Feb 7, 2022
Collaborator

One other possibility is to use bi-directional layout, as we do for the quickened code (and values array in python/cpython#31191).
The dict-keys pointer would point to the start of the entries, and the indices would be laid out backwards from there.

Now:

    meta-data <--- points here
    index[0]
    ...
    index[N-1]
    entry[0]
    ...
    entry[N-1]

With bi-directional layout:

    index[N-1]
    ...
    index[0]
    meta-data <--- points here
    entry[0]
    ...
    entry[N-1]

This removes the overhead of computing the offset of entry[0] for each read and write, at the cost of computing it once at allocation, and once at deallocation.
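The difference between the two layouts can be sketched with a flat Python list standing in for the allocation. This is only a model: real code would use pointer arithmetic and a fixed-size header, and all function names here are hypothetical.

```python
def make_current(indices, entries):
    """Current layout: metadata first, then indices, then entries."""
    buf = ["meta", *indices, *entries]
    return buf, 0  # the pointer refers to the metadata at the start


def entry_current(buf, ptr, num_indices, j):
    # The offset of entry[0] depends on how many indices precede it.
    return buf[ptr + 1 + num_indices + j]


def make_bidirectional(indices, entries):
    """Bi-directional layout: indices grow backwards from the metadata."""
    buf = [*reversed(indices), "meta", *entries]
    return buf, len(indices)  # the pointer refers to the metadata in the middle


def index_bidirectional(buf, ptr, i):
    return buf[ptr - 1 - i]  # fixed negative offset, independent of table size


def entry_bidirectional(buf, ptr, j):
    return buf[ptr + 1 + j]  # fixed positive offset, independent of table size


indices = [2, 0, 1, 3]
entries = ["e0", "e1", "e2"]
buf, ptr = make_bidirectional(indices, entries)
print(buf[ptr], index_bidirectional(buf, ptr, 0), entry_bidirectional(buf, ptr, 0))
```

In the model, `entry_current` needs the number of indices on every access, while `entry_bidirectional` does not. The table size is only needed to recover the start of the allocation.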

3 replies

methane Feb 20, 2022

python/cpython#31439

sweeneyde Feb 20, 2022

Since dk_version is a hot field in ceval (LOAD_METHOD_CACHED, LOAD_ATTR_MODULE, LOAD_METHOD_MODULE, LOAD_GLOBAL_MODULE, LOAD_GLOBAL_BUILTIN), but dk_refcnt isn't, would it make sense to put it at the top?

Before

8byte dk_refcnt <------- dictkeysobject
1byte dk_log2_size
1byte dk_log2_bytes
1byte dk_kind
1byte (padding)
4byte dk_version

After

4byte dk_version <-------- dictkeysobject
1byte dk_log2_size
1byte dk_log2_bytes
1byte dk_kind
1byte (padding)
8byte dk_refcnt

methane Feb 21, 2022

I don't think it makes any difference, because the dictkeys object is 16-byte aligned and a cache line is 64 bytes.
All of these 16 bytes always live in the same cache line.
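Both header orderings can be modelled with `ctypes` to check the sizes and offsets. The sketch uses `c_int64` to stand in for `Py_ssize_t` on a 64-bit build (an assumption), and the class and function names are hypothetical.

```python
import ctypes


class KeysBefore(ctypes.Structure):
    _fields_ = [
        ("dk_refcnt", ctypes.c_int64),
        ("dk_log2_size", ctypes.c_uint8),
        ("dk_log2_bytes", ctypes.c_uint8),
        ("dk_kind", ctypes.c_uint8),
        ("dk_version", ctypes.c_uint32),  # ctypes adds 1 byte of padding before this
    ]


class KeysAfter(ctypes.Structure):
    _fields_ = [
        ("dk_version", ctypes.c_uint32),
        ("dk_log2_size", ctypes.c_uint8),
        ("dk_log2_bytes", ctypes.c_uint8),
        ("dk_kind", ctypes.c_uint8),
        ("dk_refcnt", ctypes.c_int64),  # ctypes adds 1 byte of padding before this
    ]


def same_cache_line(addr, size=16, line=64):
    """True if the bytes [addr, addr + size) fall inside one cache line."""
    return addr // line == (addr + size - 1) // line


print(ctypes.sizeof(KeysBefore), KeysBefore.dk_version.offset)
print(ctypes.sizeof(KeysAfter), KeysAfter.dk_refcnt.offset)
print(all(same_cache_line(addr) for addr in range(0, 4096, 16)))
```

Both orderings occupy 16 bytes with a single padding byte, and a 16-byte header at any 16-byte-aligned address stays within one 64-byte line.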

gvanrossum · 2022-02-21T01:38:54Z

gvanrossum
Feb 21, 2022
Maintainer

But wouldn’t that leave two 4-byte holes?

2 replies

methane Feb 21, 2022

Would you use "Write a reply" textarea instead of "Write a comment"?

gvanrossum Feb 21, 2022
Maintainer

Next time. :-) Anyway looks like he edited the comment.

methane · 2022-02-22T08:53:22Z

methane
Feb 22, 2022

I implemented Swisstable: methane/cpython#41
Simple lookup benchmark: https://gist.github.com/methane/9d2a35a519d89e657454f278e49fd4ae

Benchmark	main	swiss
dict[str, 10]	20.8 ns	23.2 ns: 1.11x slower
dict[str, 1,000]	20.6 ns	22.5 ns: 1.09x slower
dict[str, 1,000,000]	30.2 ns	27.9 ns: 1.08x faster
dict[int, 10]	20.2 ns	23.6 ns: 1.16x slower
dict[int, 1,000]	19.7 ns	23.4 ns: 1.18x slower
dict[int, 1,000,000]	21.0 ns	24.5 ns: 1.17x slower
Geometric mean	(ref)	1.10x slower

The Swisstable hash reduces conflicts, but it adds one more lookup layer.
Its overhead is a few nanoseconds, which is significant when looking up small dicts.

The Swisstable hash is cache friendly, but our compact dict is cache friendly too.

I am suspending my work on Swisstable. If someone is interested in Swisstable, please try changing set instead of dict.

set uses the classical data layout, so Swisstable will be much more cache friendly.
set is very sparse, and thus memory inefficient. Since Swisstable reduces conflict overhead, we can make set denser and more memory efficient.

0 replies

markshannon · 2022-02-22T12:13:21Z

markshannon
Feb 22, 2022
Collaborator

Thanks for trying that out.

0 replies
