How noisy is each benchmark? #142
Replies: 2 comments
-
This is a difficult and highly technical question to answer. To get a good overview I would strongly recommend watching "Performance Matters" by Emery Berger (https://www.youtube.com/watch?v=r-TLSBdHe1A), which describes the issues in detail. His work in this area (and many others) is top notch.

A key thing to remember with benchmarks — and one that I frequently have to remind my students of — is that a benchmark is a means to an end and not the end itself. What you want is an understanding of why a certain change has a certain effect. If you find that changing one part of the interpreter improves one test case by 5% but slows another down by 10%, what matters is why. It is only by digging into this that one gains an appreciation of, and a working mental model for, the system. This, in turn, allows one to start proposing meaningful hypotheses about the system that will actually lead to improvements. Too often people have what seems like a good idea, try it, find it does not yield an improvement, and bin it without ever sitting down to figure out why it did not work out. As such they are really no better placed to come up with better proposals in the future.
-
Interesting/related to this may be this paper: https://arxiv.org/abs/1602.00602. It's called "Virtual Machine Warmup Blows Hot and Cold", but the authors also benchmark executables compiled directly with GCC and find that even those can end up being extremely noisy, suggesting that, depending on the workload, you're going to have to live with +/- 10% noise.
-
I want to preface this with a disclaimer that I'm not a statistician, so my methodology here may be flawed. I'm just curious roughly what sort of "noise floor" we have for each individual benchmark.
These results take the slowest and fastest mean times from 6 almost-identical runs of the benchmark suite (details below), and divide them to approximate what sort of spread we can expect to see in each individual benchmark.
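To make that concrete, here is a minimal sketch of that calculation, assuming the six runs were saved as pyperf JSON result files named after their branches (the file names are my assumption) and using pyperf's Python API:

```python
# Sketch: per-benchmark spread across several near-identical runs.
# File names are hypothetical; each is a pyperformance/pyperf JSON results file.
from statistics import geometric_mean  # Python 3.8+

import pyperf

RUN_FILES = [
    "main.json",
    "main-again.json",
    "shuffle-ceval-random-a.json",
    "shuffle-ceval-random-b.json",
    "shuffle-ceval-random-c.json",
    "shuffle-ceval-random-d.json",
]

# Collect the mean time of every benchmark in every run.
means = {}  # benchmark name -> list of per-run mean times
for path in RUN_FILES:
    suite = pyperf.BenchmarkSuite.load(path)
    for bench in suite.get_benchmarks():
        means.setdefault(bench.get_name(), []).append(bench.mean())

# Spread = slowest mean / fastest mean: the worst-case ratio you could
# see between two of these "identical" runs for that benchmark.
spreads = {name: max(times) / min(times) for name, times in means.items()}

for name, ratio in sorted(spreads.items(), key=lambda item: item[1]):
    print(f"{name:30} {ratio:.3f}x")

# The suite-wide "worst-case" figure quoted below is (presumably) just the
# geometric mean of these per-benchmark ratios.
print(f"geometric mean: {geometric_mean(spreads.values()):.2f}x")
```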
My goal with this is to get a better intuition of what a 5% change in `2to3` means vs. a 5% change in `pidigits` (for example). Here are the results for each benchmark, sorted in order of stability:

The geometric mean for this worst-case scenario is 1.03x faster. More details for anybody curious:
These results were obtained by running the `pyperformance` suite with PGO/LTO on our benchmarking machine for several similar commits:

- `main`: A fairly recent commit from `main`.
- `main-again`: No changes, just the same exact commit (see the sketch below).
- `shuffle-ceval-random-a`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.
- `shuffle-ceval-random-b`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.
- `shuffle-ceval-random-c`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.
- `shuffle-ceval-random-d`: A random shuffle of the cases in `_PyEval_EvalFrameDefault`.

Output of `pyperf system tune`:

Full results:
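Since `main` and `main-again` are literally the same commit, comparing just those two result files gives a read on pure run-to-run noise with no code change involved at all. A minimal sketch of that pairwise comparison, under the same file-name assumptions as above:

```python
# Sketch: per-benchmark ratio between two runs of the *same* commit,
# i.e. pure run-to-run noise. File names are hypothetical.
import pyperf

baseline = pyperf.BenchmarkSuite.load("main.json")
repeat = pyperf.BenchmarkSuite.load("main-again.json")

means_a = {b.get_name(): b.mean() for b in baseline.get_benchmarks()}
means_b = {b.get_name(): b.mean() for b in repeat.get_benchmarks()}

# Ratios close to 1.000x mean the benchmark is stable run-to-run;
# anything far from 1.000x is noise, since the code is identical.
for name in sorted(means_a.keys() & means_b.keys()):
    print(f"{name:30} {means_b[name] / means_a[name]:.3f}x")
```

In practice pyperf's built-in `compare_to` command (`python -m pyperf compare_to main.json main-again.json --table`) does much the same thing and also flags which differences are statistically significant.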