Replies: 18 comments 12 replies
-
I wonder how feasible it would be, at the scale of a large university or perhaps beyond, to have a dedicated pool of many different machines that automates testing across platforms, perhaps introducing other forms of randomization. This may be completely unrealistic, but it seems like it would be nice to "factor out" a solution to these kinds of measurement bias that could easily be reused across projects.
-
I wonder what impact RISC-V and other open-source ISAs will have on mitigating measurement bias. For example, "Intel does not provide full description of the LSD nor does it provide any performance hardware monitors that directly measure the behavior the LSD". With an open ISA, researchers would be able to understand how the LSD works, and if someone wanted to contribute performance hardware monitors, they would be able to (open source FTW). One problem I have with all of this is that abstraction is not doing its job here. A software developer or researcher should not have to worry about the details of the hardware or about Unix environment variables when they build an application that should run on many different hardware and OS configurations. For a researcher it's probably important to deliberately look beneath the abstraction, but I wonder at what point that becomes unproductively cumbersome. Where and how do we find this balance?
-
Apologies if this is a bit flippant, but I struggled a little to see why CS/Systems researchers are facing this problem anew when, as the authors say, this problem has long been recognized in the natural and social sciences. What has caused us to miss this? Systems research is done by experts in quantitative reasoning. What has caused us to need this paper (and, for that matter, the SIGPLAN guidelines)?
-
The link between systems research and natural science feels a bit tenuous to me. In natural science, things usually need to be measured in the real world, where we have little control over the environment. A computer is something we created, so in theory it shouldn't be so hard to set up a precise environment and eliminate issues such as the stack start address. Performance testing does have a "science-y" part to it, in that we have to try to instrument programs (that we created) based on how we think they will be used in the real world (which we can't control). This is in contrast to the more theoretical parts of CS like algorithms where we reason about performance mathematically.
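As an aside, the stack-placement effect the paper describes is easy to poke at directly. Below is a minimal sketch (Python, Linux-oriented); `./bench` and the `PADDING` variable are placeholders, not anything from the paper's own setup. It reruns one binary while padding the environment, which on Linux shifts the program's initial stack addresses, and reports the median runtime per padding size.

```python
import os
import statistics
import subprocess
import time

BENCH = "./bench"                   # placeholder: any deterministic benchmark binary
PAD_SIZES = [0, 1024, 2048, 4096]   # bytes of junk added to the environment
RUNS_PER_SIZE = 10

for pad in PAD_SIZES:
    env = dict(os.environ)
    # Environment strings live above the stack, so padding them shifts the
    # program's initial stack addresses.
    env["PADDING"] = "x" * pad
    times = []
    for _ in range(RUNS_PER_SIZE):
        start = time.perf_counter()
        subprocess.run([BENCH], env=env, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    print(f"env padding {pad:5d} bytes -> median {statistics.median(times):.4f}s")
```

If the medians move with the padding size even though the workload is identical, that difference is exactly the kind of bias the paper is warning about.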
-
There's a recent development in the systems community of encouraging people to submit artifacts for independent evaluation, motivated partly by the findings of papers like this one. However, most artifact evaluations involve the authors giving the evaluators access to their own setup, so what evaluators end up doing is just running the authors' scripts on the authors' machines. That does nothing to curtail the effects of measurement bias, which at least partially defeats the purpose of independent evaluation.
-
We all know that hardware and environment affect benchmark results. I think one of the biggest values of this work lies in demonstrating how big the effect can be. Sections 7 and 8 articulate the possible solutions pretty well. As researchers, we can try to identify the use cases of our work and run experiments that simulate real scenarios as closely as possible. The review process can also benefit from this work by being aware of the issue.
-
This paper is from 2009 and mentions that in ASPLOS 2008, PACT 2007, PLDI 2007, and CGO 2007, none of the papers did a good job of addressing measurement bias. I'm wondering whether the standard has changed since 2009 and more papers now address this issue satisfactorily, or whether things are about the same. In Section 6.2 the authors note that these papers reported a median speedup of 10%, and Section 4 shows that this number is small enough to be caused by measurement bias. Another question I have is whether there might be some N such that any speedup above N% is definitely the result of the ideas and not of measurement bias, making it unnecessary to address possible bias during evaluation.
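On the "is there a safe N%" question: one way to read the paper's advice is that there is no universal threshold, only a statistical comparison against the spread that the setup alone produces. Here is a minimal sketch of that comparison, assuming you already have per-setup median runtimes for a baseline and an optimized build (the numbers below are made-up placeholders):

```python
import math
import statistics

# Placeholder data: median runtimes (seconds) of the same workload measured
# under several randomized setups (different link orders, environment sizes).
baseline  = [10.2, 9.8, 10.6, 9.9, 10.4, 10.1, 9.7, 10.3]
optimized = [9.6, 9.5, 9.9, 9.4, 10.0, 9.7, 9.3, 9.8]

speedups = [b / o for b, o in zip(baseline, optimized)]
mean = statistics.mean(speedups)
sem = statistics.stdev(speedups) / math.sqrt(len(speedups))

# Rough 95% interval using a normal approximation; a more careful analysis
# would use a t-distribution or a bootstrap.
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean speedup {mean:.3f}x, ~95% CI [{low:.3f}, {high:.3f}]")
print("distinguishable from setup noise" if low > 1.0
      else "cannot rule out measurement bias at this sample size")
```

Whether 10% is "enough" then depends entirely on how much the setups themselves move the numbers, which is why the paper pushes for evaluating across diverse setups rather than quoting a single figure.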
-
Would it be possible to make a broad enough evaluation suite? It seems like there would always be the problem that, no matter what you do, the evaluation suite may not match the workload the reader has in mind. Maybe it would be better to vary different parameters of the workload (like making it more memory- or CPU-intensive) and show how those changes impact performance? Prompted by the paper but not related to its main points: I wonder if people ever rearrange code at link time to put frequently accessed functions onto the same page?
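On the last question: yes, rearranging code for locality is a real practice, usually called profile-guided function layout, and compilers and post-link tools offer variants of it. As a rough, hypothetical sketch of the manual version, assuming clang with the lld linker and made-up file and function names, one can give each function its own section and hand the linker an ordering file so hot functions land next to each other:

```python
import subprocess

# Hypothetical hot functions, e.g. the top entries from a profiler run.
hot_functions = ["parse_header", "lookup_symbol", "hash_bucket"]

# lld lays out the symbols named in --symbol-ordering-file first, in order;
# -ffunction-sections gives each function its own section so it can move freely.
with open("hot_symbols.txt", "w") as f:
    f.write("\n".join(hot_functions) + "\n")

subprocess.run(
    ["clang", "-O2", "-ffunction-sections", "main.c", "util.c",
     "-fuse-ld=lld", "-Wl,--symbol-ordering-file=hot_symbols.txt",
     "-o", "bench"],
    check=True,
)
```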
-
This paper very explicitly shows the necessity for computer science to adopt better scientific methodology. Computer scientists are often fooled into believing computers are deterministic, but computer systems are actually becoming less deterministic over time. Networking is a big concern here: nowadays, much systems research involves data-center-scale optimization that is prone to unpredictable behavior because of networking and virtualization. As computer science continues to develop, we need to learn to accept a certain level of non-determinism at the systems level; there are just too many variables and real-world performance hacks to account for if you were to try to create a truly deterministic system. Another enlightening part of this article is the difficulty of writing proper benchmark suites. As a field we need to be more careful in our choice of benchmark suites; especially for small, synthetic suites, the benchmarks chosen can have a huge impact on the conclusion.
-
One solution this paper gives for avoiding measurement bias is to experiment with many randomized measurement setups to mitigate the effect. But the prerequisite is that we know the potential sources of measurement bias before the experiment. This paper mentions two sources related to memory layout, and different problem domains have their own sources of measurement bias. It still seems unclear, and there is no consensus, as to which factors lead to measurement bias in computer systems (e.g., gender and age would be factors to control in some social-science experiments). We usually just pick some benchmarks and experiment on different hardware; for finer-grained experiments, we don't know which common factors we should consider.
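For the layout-related sources the paper does identify, the randomization it recommends is mechanical enough to script. Here is a minimal sketch under assumed names (the object files and `gcc` invocation are placeholders for whatever the real build uses): shuffle the link order, rebuild, time the result, and look at the spread across orders.

```python
import random
import statistics
import subprocess
import time

OBJECTS = ["main.o", "kernel.o", "io.o", "util.o"]   # placeholder object files
ORDERS = 20   # distinct random link orders to try
RUNS = 5      # timed runs per link order

medians = []
for _ in range(ORDERS):
    order = OBJECTS[:]
    random.shuffle(order)            # link order changes code layout/alignment
    subprocess.run(["gcc", "-O2", *order, "-o", "bench"], check=True)

    times = []
    for _ in range(RUNS):
        start = time.perf_counter()
        subprocess.run(["./bench"], check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    medians.append(statistics.median(times))

spread = (max(medians) - min(medians)) / min(medians)
print(f"layout-induced spread across link orders: {spread:.1%}")
```

Reporting results aggregated over randomized layouts like this, rather than over one lucky layout, is the setup randomization the paper advocates; the harder problem raised above, knowing which other factors to randomize in the first place, does seem to be left open.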
-
I think this paper gives surprising results: even the UNIX environment size or the link order can have a big impact on final performance, which is not what I first imagined. While the paper definitely points out an important issue, I don't think this kind of measurement bias can be thoroughly eliminated. It reminds me of Heisenberg's uncertainty principle, which applies in a loose analogy to performance analysis: once we want to measure the performance of some piece of code, we need to add measurement code to the original program, and that added code may unavoidably introduce bias (e.g., increasing program size and causing memory-locality issues). The tricky thing is that you cannot observe the real performance without introducing new bias into the system. A common example is that instrumentation tools (e.g., Pin tools) always make the original program run slower, since they add instructions to it.

Another thing is that compilers, operating systems, and profiling tools are themselves software made by humans, so they naturally contain bugs and biases introduced by their programmers. It is hard to find a pure environment without any systematic bias in which to conduct experiments. In practice, it is also impossible for a researcher, say an eager-to-graduate PhD student, to run that many experiments on different machines while controlling all the small variables. I think evaluating on different datasets, comparing against different baseline systems, and using correct statistical methods is about the best we can do.

Nevertheless, the final goal of this paper is to call on systems researchers to think more about their experiments and make their results reproducible. Lots of conferences nowadays, like PLDI, SOSP, and OSDI, have artifact evaluation processes for accepted papers: the artifacts are executed by different people on different machines with different environment settings, though the datasets remain the same. I think this is already a great advance in helping the community judge whether the conclusions of systems research are valid and whether the proposed systems can be applied to different scenarios.
-
I find the paper interesting in how it finds link order and environment size to play a role in measurement bias. But I am intrigued that there is not more discussion in the paper about the effect of the selected benchmarks on measurement bias. I imagine benchmarks are chosen from a specific area of study due to their importance, or because of what they can demonstrate. These benchmarks might lead to implicit bias in the results, because perhaps they all share similarities, such as small working sets, as the paper suggests. I wonder if randomly mutating the benchmark programs (while keeping the mutated programs syntactically well-formed) would help create more representative programs to test systems on. This way, the mutated benchmarks are still similar enough to be evaluated in the same area of study, but the randomness added by the mutation may decrease the measurement bias. It is also interesting to me that we cannot account for all the bias that the system we test on creates. The underlying system is too complex to understand completely, perhaps much like our physical reality. This may be further encouragement to adopt the practices other sciences use to limit measurement bias.
-
First of all, it would be interesting to know how the computer science research community reacted to this paper, because it somewhat discredits, or at least casts doubt on, the results that claimed "10 percent speedups". According to this paper, the root source of measurement bias in these CS experiments is an insufficient number of benchmarks or test cases; the remedy is essentially applying the law of large numbers to reduce the randomness and perturbation that arise naturally in all kinds of scientific experiments. When testing is performed on a "black box", such as a chip about which the manufacturer withholds information, increasing the number of test cases is probably the best one can do. However, if one has access to more information, such as when the compiler involved is open source, there is more hope of giving a finer-grained characterization of the experimental results, i.e., one might be able to claim that the result still holds when certain conditions are satisfied. That said, I don't think it's always wrong to favor a certain experimental setting over another. Assume, for example, that we are concerned about the effectiveness of a compiler's optimization for some language. Programs written by humans are just one small portion of the space of all valid programs, but they are much more likely to appear in practice. We can go further and say that certain programming habits and patterns are more likely to show up in real code because they are the way most people are taught. Thus it could be helpful to collect data in order to infer this "probability distribution" of real-life programs.
-
One key takeaway for me from this paper is that, when you argue that a general optimization method for complex systems is useful, you need to take into account the inevitable measurement bias and try to minimize it with diverse settings, randomization, and statistical analysis. I think this is important when analyzing complex systems, particularly in a context where people are focused on building larger and more complex general-purpose processors like CPUs. Now that systems are becoming more specialized and modular, I wonder whether the non-determinism problem is alleviated for specialized accelerators: for example, a neural-network accelerator does not have a complex cache design and is typically deterministic at runtime. In real life, I think it is generally unrealistic to consider every aspect of a system and conduct a causal analysis. On the compiler side, it might be useful to develop better profilers that provide meaningful feedback to help researchers understand which factors cause result sensitivity.
-
This paper raises awareness about the effects and frequency of measurement bias by providing two examples: UNIX environment size and link order. Additionally, the authors provide some suggestions for methods to detect and avoid measurement bias. To me, this paper conveyed the difficulty of backing up conclusions with evidence in an experimental science where unseen/unintended factors can have tremendous impacts on the results. Since the paper was written in 2009 (almost 13 years ago!), I wonder how the systems research community has changed since then. Has a new benchmark suite emerged that is diverse enough to minimize measurement bias? Have researchers shifted away from benchmarks that were deemed biased (e.g., SPEC JVM98)? Have papers focused more on measurement bias in their evaluations? Has a similar evaluation been done for a different optimization/technique (the authors chose to investigate the effectiveness of O3 optimizations, but I suspect one could carry out a similar investigation on other important optimizations/techniques)? Has measurement bias caused certain conclusions (that were actually made prematurely via biased experimentation) to be discredited? Another question I have after reading this paper is: what does a diverse benchmark suite that minimizes bias look like? The paper emphasizes that the diversity of the benchmark suite is more important than its size, but there are no guidelines for how one can conclude that a suite is diverse.
-
Following up on @zilberstein's comment about the link between systems and natural sciences, I agree that the link might be somewhat weak in practice because of how little control we have over nature when conducting experiments, compared to computer systems. From a philosophical perspective, though, measurement bias can exist in almost any science, and avoiding it is a common goal of the scientific method in general. On that note, the ideas presented in the paper, especially the part about randomizing the setup to eliminate bias, remind me of the concept of suspension of judgment introduced by René Descartes in his book Meditations on First Philosophy back in 1641. As described on Wikipedia, suspension of judgment means to "systematically doubt all beliefs and do a ground-up rebuild of only definitely true things as an undoubted basis". In other words, when we conduct an experiment we might be biased during the procedure by the result we want to achieve, losing sight of the actual, unbiased result we should get. To avoid that, we should practice suspension of judgment so that we are completely unbiased when designing the experiment and can obtain a valid result. I see a connection between this and the setup randomization presented in the paper: randomization takes control away both from the human and (not completely, but well enough) from the machine, and enforces a random environment (settings and so on) that can lead to more realistic and unbiased results.
-
alaiasolkobreslin talked about how this paper is more than a decade old at this point and wondered whether more attention has been paid to measurement bias in literature published since. I also wonder whether there have been any changes to the way hardware manufacturers disclose data about their processors, or whether any concerted effort has been made to create benchmark suites that are diverse enough to account for layout bias. It may be that hardware manufacturers now provide enough tools for measuring the effect of each hardware optimization (the paper mentions, for example, allowing major optional features to be disabled), but creating a benchmark suite diverse enough to account for layout bias seems key to reducing the number of experiments researchers need to run to have confidence in their results. This has the obvious benefit of saving computational power, but it also makes it easier for researchers to improve their results without changing their methodology. So even though many papers published since this one do not do extensive testing and setup randomization, I think researchers would be likely to incorporate better benchmarks into their evaluations.
-
While it's surprising to learn that link order can lead to significant measurement bias, I don't find the existence of (uncaptured) measurement bias in general to be surprising. The environment we rely on to reproduce systems research is far too complex, and it's impossible to account for every dependency. What I believe is that as long as researchers act in good faith (i.e., they are not fabricating their data) and put reasonably good effort into mitigating measurement bias (e.g., applying a randomization standard agreed on by the community, or running experiments in real-world settings where possible), they should not be penalized for it. The scientific community also needs to pay less attention to evaluating performance metrics (especially single-digit percentage differences) and more to evaluating the method or idea itself. After all, it might be easier to accept and embrace randomization than to make every effort to control every source of bias.
-
This thread is for discussing the famous "Producing Wrong Data!" paper by Mytkowicz et al. I (@sampsyo) am the discussion leader and will try to answer all your questions!