Replies: 18 comments 12 replies
-
I wonder how feasible it would be, at the scale of a large university or perhaps beyond, to have a dedicated pool of many different machines that automates testing across platforms, perhaps introducing other forms of randomization. This may be completely unrealistic, but it seems like it would be nice to "factor out" a solution to these kinds of measurement bias that could easily be reused across projects.
-
I wonder what impact RISC-V and other open-source ISAs will have on mitigating measurement bias. For example, "Intel does not provide full description of the LSD nor does it provide any performance hardware monitors that directly measure the behavior the LSD". With an open ISA, researchers would be able to understand how the LSD works, and if someone wanted to contribute performance hardware monitors, they would be able to (open source FTW). One problem I have with all of this is that abstraction is not doing its job here. A software developer or researcher should not have to worry about the details of the hardware or about Unix environment variables when they build an application that should run on many different hardware and OS configurations. For a researcher it's probably important to deliberately look beneath the abstraction, but I wonder at what point that becomes unproductively cumbersome. Where and how do we find this balance?
-
Apologies if this is a bit flippant, but I struggled a little to see why CS/Systems researchers are facing this problem anew when, as the authors say, this problem has long been recognized in the natural and social sciences. What has caused us to miss this? Systems research is done by experts in quantitative reasoning. What has caused us to need this paper (and, for that matter, the SIGPLAN guidelines)?
-
The link between systems research and natural science feels a bit tenuous to me. In natural science, things usually need to be measured in the real world, where we have little control over the environment. A computer is something we created, so in theory it shouldn't be so hard to set up a precise environment and eliminate issues such as the stack start address. Performance testing does have a "science-y" part to it, in that we have to try to instrument programs (that we created) based on how we think they will be used in the real world (which we can't control). This is in contrast to the more theoretical parts of CS like algorithms where we reason about performance mathematically.
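As an aside, the stack-placement effect the paper describes is easy to poke at directly. Below is a minimal sketch (Python, Linux-oriented); `./bench` and the `PADDING` variable are placeholders, not anything from the paper's own setup. It reruns one binary while padding the environment, which on Linux shifts the program's initial stack addresses, and reports the median runtime per padding size.

```python
import os
import statistics
import subprocess
import time

BENCH = "./bench"                   # placeholder: any deterministic benchmark binary
PAD_SIZES = [0, 1024, 2048, 4096]   # bytes of junk added to the environment
RUNS_PER_SIZE = 10

for pad in PAD_SIZES:
    env = dict(os.environ)
    # Environment strings live above the stack, so padding them shifts the
    # program's initial stack addresses.
    env["PADDING"] = "x" * pad
    times = []
    for _ in range(RUNS_PER_SIZE):
        start = time.perf_counter()
        subprocess.run([BENCH], env=env, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    print(f"env padding {pad:5d} bytes -> median {statistics.median(times):.4f}s")
```

If the medians move with the padding size even though the workload is identical, that difference is exactly the kind of bias the paper is warning about.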
-
There's a recent development in the systems community of encouraging people to submit artifacts for independent evaluation, motivated partly by the findings of papers like this one. However, most artifact evaluations involve the authors giving the evaluators access to their own setup, so what evaluators end up doing is just running the authors' scripts on the authors' machines. That does nothing to curtail the effects of measurement bias, which at least partially defeats the purpose of independent evaluation.
-
We all know that hardware and environment affect benchmark results. I think one of the biggest values of this work lies in demonstrating how big the effect can be. Sections 7 and 8 articulate the possible solutions pretty well. As researchers, we can try to identify the use cases of our work and run experiments that simulate real scenarios as closely as possible. The review process can also benefit from this work by being aware of the issue.
-
This paper is from 2009 and mentions that in ASPLOS 2008, PACT 2007, PLDI 2007, and CGO 2007, none of the papers did a good job of addressing measurement bias. I'm wondering whether the standard has changed since 2009 and more papers now address this issue satisfactorily, or whether things are about the same. In Section 6.2 the authors note that these papers reported a median speedup of 10%, and Section 4 shows that this number is small enough to be caused by measurement bias. Another question I have is whether there might be some N such that any speedup above N% is definitely the result of the ideas and not of measurement bias, making it unnecessary to address possible bias during evaluation.
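On the "is there a safe N%" question: one way to read the paper's advice is that there is no universal threshold, only a statistical comparison against the spread that the setup alone produces. Here is a minimal sketch of that comparison, assuming you already have per-setup median runtimes for a baseline and an optimized build (the numbers below are made-up placeholders):

```python
import math
import statistics

# Placeholder data: median runtimes (seconds) of the same workload measured
# under several randomized setups (different link orders, environment sizes).
baseline  = [10.2, 9.8, 10.6, 9.9, 10.4, 10.1, 9.7, 10.3]
optimized = [9.6, 9.5, 9.9, 9.4, 10.0, 9.7, 9.3, 9.8]

speedups = [b / o for b, o in zip(baseline, optimized)]
mean = statistics.mean(speedups)
sem = statistics.stdev(speedups) / math.sqrt(len(speedups))

# Rough 95% interval using a normal approximation; a more careful analysis
# would use a t-distribution or a bootstrap.
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean speedup {mean:.3f}x, ~95% CI [{low:.3f}, {high:.3f}]")
print("distinguishable from setup noise" if low > 1.0
      else "cannot rule out measurement bias at this sample size")
```

Whether 10% is "enough" then depends entirely on how much the setups themselves move the numbers, which is why the paper pushes for evaluating across diverse setups rather than quoting a single figure.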
-
Would it be possible to make a broad enough evaluation suite? It seems like there would always be the problem that, no matter what you do, the evaluation suite may not match the workload the reader has in mind. Maybe it would be better to vary different parameters of the workload (like making it more memory- or CPU-intensive) and show how those changes impact performance? Prompted by the paper but not related to its main points: I wonder if people ever rearrange code at link time to put frequently accessed functions onto the same page?
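On the last question: yes, rearranging code for locality is a real practice, usually called profile-guided function layout, and compilers and post-link tools offer variants of it. As a rough, hypothetical sketch of the manual version, assuming clang with the lld linker and made-up file and function names, one can give each function its own section and hand the linker an ordering file so hot functions land next to each other:

```python
import subprocess

# Hypothetical hot functions, e.g. the top entries from a profiler run.
hot_functions = ["parse_header", "lookup_symbol", "hash_bucket"]

# lld lays out the symbols named in --symbol-ordering-file first, in order;
# -ffunction-sections gives each function its own section so it can move freely.
with open("hot_symbols.txt", "w") as f:
    f.write("\n".join(hot_functions) + "\n")

subprocess.run(
    ["clang", "-O2", "-ffunction-sections", "main.c", "util.c",
     "-fuse-ld=lld", "-Wl,--symbol-ordering-file=hot_symbols.txt",
     "-o", "bench"],
    check=True,
)
```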
-
This paper very explicitly shows the necessity for computer science to adopt better scientific methodology. Computer scientists are often fooled into believing computers are deterministic, but computer systems are actually becoming less deterministic over time. Networking is a big concern here: nowadays, much systems research involves data-center-scale optimization that is prone to unpredictable behavior because of networking and virtualization. As computer science continues to develop, we need to learn to accept a certain level of non-determinism at the systems level; there are just too many variables and real-world performance hacks to account for if you were to try to create a truly deterministic system. Another enlightening part of this article is the difficulty of writing proper benchmark suites. As a field we need to be more careful in our choice of benchmark suites; especially for small, synthetic suites, the benchmarks chosen can have a huge impact on the conclusion.
-
One solution this paper gives for avoiding measurement bias is to experiment with many randomized measurement setups to mitigate the effect. But the prerequisite is that we know the potential sources of measurement bias before the experiment. This paper mentions two sources related to memory layout, and different problem domains have their own sources of measurement bias. It still seems unclear, and there is no consensus, as to which factors lead to measurement bias in computer systems (e.g., gender and age would be factors to control in some social-science experiments). We usually just pick some benchmarks and experiment on different hardware; for finer-grained experiments, we don't know which common factors we should consider.
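For the layout-related sources the paper does identify, the randomization it recommends is mechanical enough to script. Here is a minimal sketch under assumed names (the object files and `gcc` invocation are placeholders for whatever the real build uses): shuffle the link order, rebuild, time the result, and look at the spread across orders.

```python
import random
import statistics
import subprocess
import time

OBJECTS = ["main.o", "kernel.o", "io.o", "util.o"]   # placeholder object files
ORDERS = 20   # distinct random link orders to try
RUNS = 5      # timed runs per link order

medians = []
for _ in range(ORDERS):
    order = OBJECTS[:]
    random.shuffle(order)            # link order changes code layout/alignment
    subprocess.run(["gcc", "-O2", *order, "-o", "bench"], check=True)

    times = []
    for _ in range(RUNS):
        start = time.perf_counter()
        subprocess.run(["./bench"], check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    medians.append(statistics.median(times))

spread = (max(medians) - min(medians)) / min(medians)
print(f"layout-induced spread across link orders: {spread:.1%}")
```

Reporting results aggregated over randomized layouts like this, rather than over one lucky layout, is the setup randomization the paper advocates; the harder problem raised above, knowing which other factors to randomize in the first place, does seem to be left open.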
-
I think this paper gives surprising results: even the UNIX environment size or the link order can have a big impact on final performance, which is not what I first imagined. While the paper definitely points out an important issue, I don't think this kind of measurement bias can be thoroughly eliminated. It reminds me of Heisenberg's uncertainty principle, which applies in a loose analogy to performance analysis: once we want to measure the performance of some piece of code, we need to add measurement code to the original program, and that added code may unavoidably introduce bias (e.g., increasing program size and causing memory-locality issues). The tricky thing is that you cannot observe the real performance without introducing new bias into the system. A common example is that instrumentation tools (e.g., Pin tools) always make the original program run slower, since they add instructions to it.

Another thing is that compilers, operating systems, and profiling tools are themselves software made by humans, so they naturally contain bugs and biases introduced by their programmers. It is hard to find a pure environment without any systematic bias in which to conduct experiments. In practice, it is also impossible for a researcher, say an eager-to-graduate PhD student, to run that many experiments on different machines while controlling all the small variables. I think evaluating on different datasets, comparing against different baseline systems, and using correct statistical methods is about the best we can do.

Nevertheless, the final goal of this paper is to call on systems researchers to think more about their experiments and make their results reproducible. Lots of conferences nowadays, like PLDI, SOSP, and OSDI, have artifact evaluation processes for accepted papers: the artifacts are executed by different people on different machines with different environment settings, though the datasets remain the same. I think this is already a great advance in helping the community judge whether the conclusions of systems research are valid and whether the proposed systems can be applied to different scenarios.
-
I find the paper interesting in how it finds link order and environment size to play a role in measurement bias. But I am intrigued that there is not more discussion in the paper about the effect of the selected benchmarks on measurement bias. I imagine benchmarks are chosen from a specific area of study due to their importance, or because of what they can demonstrate. These benchmarks might lead to implicit bias in the results, because perhaps they all share similarities, such as small working sets, as the paper suggests. I wonder if randomly mutating the benchmark programs (while keeping the mutated programs syntactically well-formed) would help create more representative programs to test systems on. This way, the mutated benchmarks are still similar enough to be evaluated in the same area of study, but the randomness added by the mutation may decrease the measurement bias. It is also interesting to me that we cannot account for all the bias that the system we test on creates. The underlying system is too complex to understand completely, perhaps much like our physical reality. This may be further encouragement to adopt the practices other sciences use to limit measurement bias.
-
First of all, it would be interesting to know how the computer science research community reacted to this paper, because it somewhat discredits, or at least casts doubt on, the results that claimed "10 percent speedups". According to this paper, the root source of measurement bias in these CS experiments is an insufficient number of benchmarks or test cases; the remedy is essentially applying the law of large numbers to reduce the randomness and perturbation that arise naturally in all kinds of scientific experiments. When testing is performed on a "black box", such as a chip about which the manufacturer withholds information, increasing the number of test cases is probably the best one can do. However, if one has access to more information, such as when the compiler involved is open source, there is more hope of giving a finer-grained characterization of the experimental results, i.e., one might be able to claim that the result still holds when certain conditions are satisfied. That said, I don't think it's always wrong to favor a certain experimental setting over another. Assume, for example, that we are concerned about the effectiveness of a compiler's optimization for some language. Programs written by humans are just one small portion of the space of all valid programs, but they are much more likely to appear in practice. We can go further and say that certain programming habits and patterns are more likely to show up in real code because they are the way most people are taught. Thus it could be helpful to collect data in order to infer this "probability distribution" of real-life programs.
-
One key takeaway for me from this paper is that, when you argue that a general optimization method for complex systems is useful, you need to take into account the inevitable measurement bias and try to minimize it with diverse settings, randomization, and statistical analysis. I think this is important when analyzing complex systems, particularly in a context where people are focused on building larger and more complex general-purpose processors like CPUs. Now that systems are becoming more specialized and modular, I wonder whether the non-determinism problem is alleviated for specialized accelerators: for example, a neural-network accelerator does not have a complex cache design and is typically deterministic at runtime. In real life, I think it is generally unrealistic to consider every aspect of a system and conduct a causal analysis. On the compiler side, it might be useful to develop better profilers that provide meaningful feedback to help researchers understand which factors cause result sensitivity.
-
This paper raises awareness about the effects and frequency of measurement bias by providing two examples: UNIX environment size and link order. Additionally, the authors provide some suggestions for methods to detect and avoid measurement bias. To me, this paper conveyed the difficulty of backing up conclusions with evidence in an experimental science where unseen/unintended factors can have tremendous impacts on the results. Since the paper was written in 2009 (almost 13 years ago!), I wonder how the systems research community has changed since then. Has a new benchmark suite emerged that is diverse enough to minimize measurement bias? Have researchers shifted away from benchmarks that were deemed biased (e.g., SPEC JVM98)? Have papers focused more on measurement bias in their evaluations? Has a similar evaluation been done for a different optimization/technique (the authors chose to investigate the effectiveness of O3 optimizations, but I suspect one could carry out a similar investigation on other important optimizations/techniques)? Has measurement bias caused certain conclusions (that were actually made prematurely via biased experimentation) to be discredited? Another question I have after reading this paper is: what does a diverse benchmark suite that minimizes bias look like? The paper emphasizes that the diversity of the benchmark suite is more important than its size, but there are no guidelines for how one can conclude that a suite is diverse.
-
Following up on @zilberstein's comment about the link between systems and natural sciences, I agree that the link might be somewhat weak in practice because of how little control we have over nature when conducting experiments, compared to computer systems. From a philosophical perspective, though, measurement bias can exist in almost any science, and avoiding it is a common goal of the scientific method in general. On that note, the ideas presented in the paper, especially the part about randomizing the setup to eliminate bias, remind me of the concept of suspension of judgment introduced by René Descartes in his book Meditations on First Philosophy back in 1641. As described on Wikipedia, suspension of judgment means to "systematically doubt all beliefs and do a ground-up rebuild of only definitely true things as an undoubted basis". In other words, when we conduct an experiment we might be biased during the procedure by the result we want to achieve, losing sight of the actual, unbiased result we should get. To avoid that, we should practice suspension of judgment so that we are completely unbiased when designing the experiment and can obtain a valid result. I see a connection between this and the setup randomization presented in the paper: randomization takes control away both from the human and (not completely, but well enough) from the machine, and enforces a random environment (settings and so on) that can lead to more realistic and unbiased results.
-
alaiasolkobreslin talked about how this paper is more than a decade old at this point and wondered whether more attention has been paid to measurement bias in literature published since. I also wonder whether there have been any changes to the way hardware manufacturers disclose data about their processors, or whether any concerted effort has been made to create benchmark suites that are diverse enough to account for layout bias. It may be that hardware manufacturers now provide enough tools for measuring the effect of each hardware optimization (the paper mentions, for example, allowing major optional features to be disabled), but creating a benchmark suite diverse enough to account for layout bias seems key to reducing the number of experiments researchers need to run to have confidence in their results. This has the obvious benefit of saving computational power, but it also makes it easier for researchers to improve their results without changing their methodology. So even though many papers published since this one do not do extensive testing and setup randomization, I think researchers would be likely to incorporate better benchmarks into their evaluations.
-
While it's surprising to learn that link order can lead to significant measurement bias, I don't find the existence of (uncaptured) measurement bias in general to be surprising. The environment we rely on to reproduce systems research is far too complex, and it's impossible to account for every dependency. What I believe is that as long as researchers act in good faith (i.e., they are not fabricating their data) and put reasonably good effort into mitigating measurement bias (e.g., applying a randomization standard agreed on by the community, or running experiments in real-world settings where possible), they should not be penalized for it. The scientific community also needs to pay less attention to evaluating performance metrics (especially single-digit percentage differences) and more to evaluating the method or idea itself. After all, it might be easier to accept and embrace randomization than to make every effort to control every source of bias.
-
This thread is for discussing the famous "Producing Wrong Data!" paper by Mytkowicz et al. I (@sampsyo) am the discussion leader and will try to answer all your questions!