Reorder instruction cases in _PyEval_EvalFrameDefault
#149
Replies: 12 comments
-
A suggestion: put the shuffling script in the faster-cpython/tools repo. That way you don't have to work on a branch or get code review to improve your script. I'll give you access.
-
Thanks! I did have to make some tiny changes to the main loop to get this to work, though. Should those be upstreamed?
-
Can you send a separate PR for those? I'm a little bit worried that the change to
-
I also realize that my suggestion actually makes your workflow more complicated, so feel free to ignore it.
-
The changes to ceval.c look fine, if you want to upstream them.
-
Cool. Then I'll try benchmarking a random shuffle and see if that creates any significant differences.
-
The advice I've heard is to order code by density (execution count divided by instruction bytes), though I was under the impression that PGO did branch reordering and would do this sort of thing automatically.
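In script terms, that strategy might look something like the minimal sketch below; the opcode names, case sizes, and execution counts are made-up placeholders, not real measurements:

```python
from collections import Counter

def order_by_density(case_sizes, exec_counts):
    """Return opcode names sorted by execution count per byte of case body,
    densest first."""
    def density(op):
        size = case_sizes.get(op, 1)  # default to 1 if the size is unknown
        return exec_counts.get(op, 0) / size
    return sorted(case_sizes, key=density, reverse=True)

# Hypothetical inputs: machine-code bytes per case, and dynamic execution counts.
case_sizes = {"LOAD_FAST": 64, "CALL_FUNCTION": 420, "BINARY_ADD": 180}
exec_counts = Counter({"LOAD_FAST": 9_500_000, "BINARY_ADD": 2_100_000,
                       "CALL_FUNCTION": 800_000})
print(order_by_density(case_sizes, exec_counts))
# ['LOAD_FAST', 'BINARY_ADD', 'CALL_FUNCTION']
```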
-
Interesting... that's easy enough to add to the script.
Yeah, I'm expecting that the difference between any two orderings will be negligible. We'll see.
-
The results are in, and... nothing significant seems to happen when reordering the ceval cases (as expected). I benchmarked the following changes against
The noise seems much better now that we've properly isolated our benchmarking machine, but it'll never be perfect... we're still seeing individual results ranging from 11% slower to 7% faster on some runs.
-
Oh, this is interesting. We could compile a list of how big the variance is for each of the benchmarks over several randomized runs. That will give us a sense of how much to panic (or celebrate) if a certain benchmark suddenly sees a big swing. The fact that pidigits reports no stdev for a given build suggests that it really is about code alignment in the i-cache. It looks bimodal (listen to me using big statistics words :-), either 1.00 or around 1.10. That could be due to either code alignment of the cases for the few opcodes it actually exercises, or the integer code being pushed to a slightly different boundary. But my vote's on alignment of the cases. So it would be useful to get a dynamic execution profile of the core loop.
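A minimal sketch of what that list could look like, using the statistics module with invented per-benchmark numbers purely for illustration:

```python
import statistics

# Invented data: normalized mean time per benchmark for five builds, each
# with a different randomized case ordering (1.00 == the baseline build).
runs = {
    "pidigits": [1.00, 1.10, 1.00, 1.09, 1.00],
    "richards": [0.99, 1.01, 1.00, 1.02, 0.98],
    "unpickle": [1.03, 0.97, 1.00, 1.01, 0.99],
}

# Report the benchmarks with the widest variance first.
for name, times in sorted(runs.items(), key=lambda kv: statistics.stdev(kv[1]),
                          reverse=True):
    print(f"{name:10} stdev={statistics.stdev(times):.3f} "
          f"spread={max(times) - min(times):.2f}")
```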
-
Looking at pidigits some more, the inner loop calls compose() and extract(), both of which use these opcodes a lot:
and these a little less:
So the spread of those opcodes (esp. the first four) across the i-cache might have a huge effect. Not sure we should let that guide us too much -- this is just an example of highly mathematical code; I'd rather optimize something like mypy. :-)
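As a crude stand-in for a dynamic profile, one could at least count opcodes statically with the dis module; the compose()/extract() bodies below are rough approximations for illustration, not the benchmark's actual code:

```python
import dis
from collections import Counter

def opcode_counts(func):
    """Count how often each opcode appears in a function's bytecode -- a
    static approximation of a dynamic execution profile."""
    return Counter(instr.opname for instr in dis.get_instructions(func))

# Rough stand-ins for the pidigits helpers, not the benchmark's actual code.
def compose(a, b):
    aq, ar, as_, at = a
    bq, br, bs, bt = b
    return (aq * bq, aq * br + ar * bt, as_ * bq + at * bs, as_ * br + at * bt)

def extract(z, j):
    q, r, s, t = z
    return (q * j + r) // (s * j + t)

for f in (compose, extract):
    print(f.__name__, opcode_counts(f).most_common(5))
```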
-
Done! You can check out the results in #109.
-
Following our discussion yesterday, I wrote a small script for reordering the main interpreter loop in-place, as well as implementing a few simple strategies:
You can find it on the shuffle-ceval branch of my fork. I wanted to stop here and gather feedback before figuring out the next steps with this project.
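For context, the heart of such a script might look roughly like the sketch below; the regex, the 8-space case indentation, and the strategy interface are assumptions for illustration, not the actual code on the branch:

```python
import random
import re

# Assumes `switch_body` contains *only* the "case TARGET(...)" blocks of the
# big switch in ceval.c, so everything between two split points is one case.
CASE_RE = re.compile(r"(?=^ {8}case TARGET\()", re.MULTILINE)

def reorder_cases(switch_body, strategy):
    """Reassemble the eval-loop switch with its case blocks reordered.

    `strategy` takes the list of case-block strings and returns a new order.
    """
    head, *cases = CASE_RE.split(switch_body)
    return head + "".join(strategy(cases))

def random_shuffle(seed=0):
    """One simple strategy: a seeded random permutation of the cases."""
    def strategy(cases):
        cases = list(cases)
        random.Random(seed).shuffle(cases)
        return cases
    return strategy
```

A driver script would then slice the case-block span out of Python/ceval.c, run it through reorder_cases(), and write the result back in place before rebuilding.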