How noisy is each benchmark? #142
Replies: 2 comments
-
This is a difficult and highly technical question to answer. To get a good overview I would strongly recommend watching "Performance Matters" by Emery Berger (https://www.youtube.com/watch?v=r-TLSBdHe1A), which describes the issues in detail. His work in this area (and many others) is top notch.

A key thing to remember with benchmarks — and one that I frequently have to remind my students of — is that a benchmark is a means to an end and not the end itself. What you want is an understanding of why a certain change has a certain effect. If you find that changing one part of the interpreter improves one test case by 5% but slows another down by 10%, what matters is why. It is only by digging into this that one gains an appreciation of, and a working mental model for, the system. This, in turn, allows one to start proposing meaningful hypotheses about the system that will actually lead to improvements. Too often people have what seems like a good idea, try it, find it does not yield an improvement, and bin it without ever sitting down to figure out why it did not work out. As such they are really no better placed to come up with better proposals in the future.
-
Interesting/related to this may be this paper: https://arxiv.org/abs/1602.00602. It's called "Virtual Machine Warmup Blows Hot and Cold", but the authors also benchmark executables compiled directly with GCC and find that even those can end up being extremely noisy, suggesting that, depending on the workload, you're going to have to live with +/- 10% noise.
-
I want to preface this with a disclaimer that I'm not a statistician, so my methodology here may be flawed. I'm just curious roughly what sort of "noise floor" we have for each individual benchmark.
These results take the slowest and fastest mean times from 6 almost-identical runs of the benchmark suite (details below), and divide them to approximate what sort of spread we can expect to see in each individual benchmark.
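To make that concrete, here is a minimal sketch of that calculation, assuming the six runs were saved as pyperf JSON result files named after their branches (the file names are my assumption) and using pyperf's Python API:

```python
# Sketch: per-benchmark spread across several near-identical runs.
# File names are hypothetical; each is a pyperformance/pyperf JSON results file.
from statistics import geometric_mean  # Python 3.8+

import pyperf

RUN_FILES = [
    "main.json",
    "main-again.json",
    "shuffle-ceval-random-a.json",
    "shuffle-ceval-random-b.json",
    "shuffle-ceval-random-c.json",
    "shuffle-ceval-random-d.json",
]

# Collect the mean time of every benchmark in every run.
means = {}  # benchmark name -> list of per-run mean times
for path in RUN_FILES:
    suite = pyperf.BenchmarkSuite.load(path)
    for bench in suite.get_benchmarks():
        means.setdefault(bench.get_name(), []).append(bench.mean())

# Spread = slowest mean / fastest mean: the worst-case ratio you could
# see between two of these "identical" runs for that benchmark.
spreads = {name: max(times) / min(times) for name, times in means.items()}

for name, ratio in sorted(spreads.items(), key=lambda item: item[1]):
    print(f"{name:30} {ratio:.3f}x")

# The suite-wide "worst-case" figure quoted below is (presumably) just the
# geometric mean of these per-benchmark ratios.
print(f"geometric mean: {geometric_mean(spreads.values()):.2f}x")
```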
My goal with this is to get a better intuition of what a 5% change in `2to3` means vs. a 5% change in `pidigits` (for example). Here are the results for each benchmark, sorted in order of stability:

The geometric mean for this worst-case scenario is 1.03x faster. More details for anybody curious:
These results were obtained by running the `pyperformance` suite with PGO/LTO on our benchmarking machine for several similar commits:

- `main`: A fairly recent commit from `main`.
- `main-again`: No changes, just the same exact commit (see the sketch below).
- `shuffle-ceval-random-a`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.
- `shuffle-ceval-random-b`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.
- `shuffle-ceval-random-c`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.
- `shuffle-ceval-random-d`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.

Output of `pyperf system tune`:

Full results:
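Since `main` and `main-again` are literally the same commit, comparing just those two result files gives a read on pure run-to-run noise with no code change involved at all. A minimal sketch of that pairwise comparison, under the same file-name assumptions as above:

```python
# Sketch: per-benchmark ratio between two runs of the *same* commit,
# i.e. pure run-to-run noise. File names are hypothetical.
import pyperf

baseline = pyperf.BenchmarkSuite.load("main.json")
repeat = pyperf.BenchmarkSuite.load("main-again.json")

means_a = {b.get_name(): b.mean() for b in baseline.get_benchmarks()}
means_b = {b.get_name(): b.mean() for b in repeat.get_benchmarks()}

# Ratios close to 1.000x mean the benchmark is stable run-to-run;
# anything far from 1.000x is noise, since the code is identical.
for name in sorted(means_a.keys() & means_b.keys()):
    print(f"{name:30} {means_b[name] / means_a[name]:.3f}x")
```

In practice pyperf's built-in `compare_to` command (`python -m pyperf compare_to main.json main-again.json --table`) does much the same thing and also flags which differences are statistically significant.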