Replies: 23 comments 25 replies
-
I've never thought too much about the mechanics of profiling, so I found the discussion of the specific optimizations to minimize profiling overhead very interesting. In section 3.4, when the authors present optimizations to minimize instrumentation (e.g. folding an addition into a memory address calculation if the chord is the last before EXIT — see the sketch below), I was surprised at the thought that went into optimizing something that seems pretty insignificant, especially because the times I've profiled something I have been pretty indifferent to overhead. This focus on optimization comes up a few other times in the paper as well, notably section 5.2. These seemingly minor optimizations make more sense when I remind myself that the presentation in the paper covers only the micro case of one CFG, and when profiling a real program these optimizations may add up to a significant impact on overhead. I could also just be ignorant of cases where users want to profile something while being more sensitive to overhead. Similarly, the main overhead result in the paper (31% vs. 16% for edge profiling) makes me curious: (1) where would path profiling be with some of the optimizations discussed in the paper disabled, and (2) at what hypothetical overhead for path profiling would edge profiling become preferable just because of the cost (as alluded to in section 7)?

Discussion questions:
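(Sketch referenced above.) Here is a minimal Python illustration of the section 3.4 folding trick, just to make the idea concrete; the paper actually instruments SPARC machine code via EEL, and the names below are hypothetical.

```python
# Minimal sketch of the section 3.4 folding trick, in Python for illustration
# only (the paper instruments SPARC machine code via EEL; the names here are
# hypothetical). `r` is the running path sum, `c` is the chord's increment.

def last_chord_unfolded(count, r, c):
    # Straightforward instrumentation: update r, then bump the counter.
    r += c
    count[r] += 1

def last_chord_folded(count, r, c):
    # If this chord is the last one before EXIT, r is never needed again, so
    # the final addition can be folded into the counter index, saving an
    # instruction on every traversal of this chord.
    count[r + c] += 1
```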
-
Path profiling was something I had never really used myself before, so I found the description of the limitations of edge/basic block profiling and the benefits of path profiling very interesting. The algorithm was broken into several nice steps, and I liked the way each challenge (like loops) was tackled one by one to build up the final algorithm. My main critique is that I wish the description of the algorithm were a bit more in-depth, as some of the conclusions (like those about MSTs) seemed non-trivial to me. As a whole, though, I liked how thoroughly the paper describes the optimizations and considerations in place to minimize profiling overhead, though the first paper discussion does make me wonder how accurately and without bias the overhead caused by profiling can be measured. Furthermore, the paper seems to imply that the difference between 31% and 16% profiling overhead is small and unimportant, but is that actually true? Affecting performance by ~30% seems quite significant to me.

My discussion question is about the motivation of the paper: we know that edge profiling is less accurate at predicting paths, but for which optimizations does this actually matter (making path profiling's extra overhead worth it)?
-
Critique: The authors explain that path profiling revealed some targets for profile-driven optimizations. It would have been interesting to see what impact such optimizations would have had.

Discussion Question:
-
The algorithm presented in the paper seems quite efficient in practice -- from both the experimental results and the intuition that adding is cheap. However, I wonder if it is provably the most efficient (for some hazy notion of efficiency). Below are some thoughts and questions regarding efficiency.

Comparison vs. modified bit tracing. The paper briefly discusses bit tracing. It's unclear to me what would happen if one were to perform a modified bit-tracing approach where, rather than tracing every branch, one traces only the branches that would be instrumented in the efficient path profiling approach (using the maximum spanning tree + chord technique). Would this provide a space-competitive representation for each unique path?

Hardware-supported path profiling. It seems that one of the key assumptions made in this paper is that addition is particularly fast and cheap (primarily due to hardware support, I presume). But addition is not purpose-built to help with path profiling -- it is merely coincidental. Could more specialized hardware make path profiling (and other types of profiling) even faster than addition? Are there other ways of tracking paths taken that more specialized hardware could support with even lower overhead? By these two questions, I mean to distinguish between what is fast and efficient in practice (potentially due to hardware design decisions made for profit in industry) and what would be fast and efficient in theory (which, imo, is the more interesting question).
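On the space question, here is a back-of-the-envelope sketch in Python with entirely made-up numbers (none come from the paper): a trace grows with the number of dynamic branch executions, while the Ball-Larus counters grow with the number of potential paths (or executed paths, when a hash table is used), independent of run length.

```python
# Back-of-the-envelope space comparison, with made-up numbers (not from the
# paper): a bit trace grows with the *number of branch executions*, while the
# Ball-Larus counter array grows with the *number of potential paths* (or the
# number of executed paths, if a hash table is used), no matter how long the
# program runs.

branch_executions = 10_000_000   # dynamic conditional branches executed
potential_paths   = 1500         # static acyclic paths in the routine
executed_paths    = 40           # distinct paths actually taken

trace_bits   = branch_executions   # ~1.25 MB of trace to post-process
array_words  = potential_paths     # one counter per potential path
hash_entries = executed_paths      # one entry per path actually executed

print(trace_bits // 8, array_words, hash_entries)
```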
-
I thought the way that the authors introduced the problem was convincing. The example CFGs helped me understand what they were talking about, although I did have to stare at them for a while until it clicked. One thing that caught my attention was that, in the introduction, the authors state that the inaccuracy in edge profiling was usually ignored because it was assumed that path profiling is much more expensive. I'm curious whether this statement is based on existing path profiling tools at the time? I think it would help me contextualize the paper a bit more.

Also, I think the motivation for hashing is to make a space tradeoff (storing the dynamic paths that actually execute vs. all the possible static paths). The authors identify hashing as one of the main causes of the increased overhead of their implementation, so I'm interested in whether they investigated an implementation that doesn't use hashing and instead allocates counters for all the static paths. Maybe there are some programs where this isn't possible or feasible, but for any programs where it is, I wonder what the overhead would be.

On another note: maybe path profiling isn't the right word, but my understanding is that processors do something sort of similar when trying to predict branches. For example, a common and pretty simple type of branch prediction seeks to find which branches are correlated with each other, where the outcome of one branch accurately predicts the outcome of another. In an imaginary world where we could inspect the hardware structures that compute this sort of thing, could this inform path profiling techniques? I don't really have an answer, but I thought it was interesting to think about.

Discussion question
-
Critique: I like how the paper mentions that the number of executed paths is fewer than 2,300 for most of the programs in the test suite, while the potential number of paths for each of these programs is hundreds of millions to tens of billions. That is an incredibly small proportion!

Discussion Question: The paper establishes that edge profiling does not accurately identify the most frequently executed paths, but it doesn't say why it is inaccurate. 38% accuracy for predicting paths from edge profiles seems low. It seems like it could be higher, because you know exactly how many times each edge in a path was taken, so there should be a way to reconstruct how many times each path was taken. I am curious whether there are algorithms or heuristics for predicting paths from edge profiles that are more accurate than the heuristic used in this paper, which always follows the most frequently executed outgoing edge. Always taking the most frequent edge seems too black-or-white; a heuristic could also follow less frequently executed edges in proportion to their counts.
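To make the comparison concrete, here is a small Python sketch of the two kinds of heuristics being contrasted above: greedily following the most frequent outgoing edge versus sampling edges in proportion to their counts. The graph representation and numbers are illustrative, not the paper's exact procedure.

```python
import random

# Minimal sketch of two ways to guess a hot path from an edge profile, assuming
# a DAG represented as {block: [(successor, edge_count), ...]}. This illustrates
# the idea discussed above, not the paper's exact heuristic.

def predict_path_greedy(cfg, entry, exit):
    """Always follow the most frequently executed outgoing edge."""
    path, node = [entry], entry
    while node != exit:
        node = max(cfg[node], key=lambda e: e[1])[0]
        path.append(node)
    return path

def sample_path_proportional(cfg, entry, exit):
    """Follow outgoing edges with probability proportional to their counts."""
    path, node = [entry], entry
    while node != exit:
        succs, counts = zip(*cfg[node])
        node = random.choices(succs, weights=counts)[0]
        path.append(node)
    return path

# Hypothetical edge profile for a small diamond-shaped CFG.
cfg = {"A": [("B", 90), ("C", 10)], "B": [("D", 90)], "C": [("D", 10)], "D": []}
print(predict_path_greedy(cfg, "A", "D"))
```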
-
Critique: I thought the authors did a great job justifying and differentiating their path profiling approach from previous profiling techniques, and I was surprised at how straightforward their solution is, especially for finding a minimal set of chords on which to apply instrumentation. One thought I had was in regard to the overhead for procedures with many potential paths, as the authors mention that the most significant profiling overhead occurs when hashing becomes necessary, which is also pretty notable in the experimental results presented in Section 6. However, they also mention that often the number of actually executed paths is less than the number of potential paths by many orders of magnitude, which makes me wonder if this could be leveraged to reduce the number of explored/tracked paths in some modified path profiling algorithm and further decrease the overhead. I also thought the discussion on the applications of path profiling for software testing was interesting, although I'm not sure in what cases it would be considered necessary or practical.

Discussion Question: The authors frequently mention how path profiling allows for the identification of longer paths (in terms of instruction count) as compared to edge profiling; on average, its paths are reported to be around 2x longer. I'm curious how well this metric correlates with the actual usefulness of path profiling results—how does it impact the quality of the resulting optimizations?
-
I think this is the first time I've seen minimum spanning trees used since learning about them freshman fall. 😅

Critique: The biggest thing I wish the authors had done was to expound more on why path profiling specifically is useful compared to simply knowing which branch is taken. I can imagine plenty of cases where knowing which branch is taken is useful (e.g., CPU branch prediction, JIT compilers choosing to emit optimized code for a frequently taken branch, etc.), but it's less immediately obvious to me what types of optimizations are available with path information that are not available with branch information, even though the authors allude to the fact that they exist.

Discussion Questions
-
Critique: Overall, I was impressed with the paper. I think they sold the necessity of their algorithm very well, to the point where I almost didn't want to believe it; my notes in the margins keep saying things like "This sounds too good to be true...", especially when they were emphasizing how path profiling not only is more accurate and collects more information than e.g. edge profiling, but performs well against it too. In some ways, I think my margin notes were right-- it is a bit too good to be true. They claim the average overhead of path profiling "can be lower and is usually comparable" to that of edge profiling, but then immediately follow it up by saying that path profiling's average overhead is about double (31% vs. 16%) that of edge profiling (p. 47). Though, looking at Table 1 in section 6, this average seems to be skewed by some outliers; as they outline in the paper, any time the routines were complex enough to require hashing, PP did pretty abysmally, performance-wise.

This leads me to wonder whether PP really scales; it seems to be great for analyzing smaller routines, but for larger ones that would require hashing, would it be better to just jump ship entirely and switch back to QPT2? I don't know enough about QPT2 to know the details, but it seems to perform better in the worst case than PP-- 53% max overhead rather than 97% max overhead. Maybe instead of the contingency plan discussed in section 5.3, it would be better to have the emergency brake be switching back to QPT2. Maybe even "every time a hash table is needed, switch back to QPT2". Though they did mention that sometimes even programs with considerable hashing still did well in PP, so maybe switching back to QPT2 every time hashing kicks in is a bit too extreme.

Question: They mention the disparity between executed paths and potential paths (e.g. on page 54). Would it be worth trying to restrict the calculation of potential paths to be more aligned with the executed paths? Would it even be possible to do so-- to not enumerate all potential paths, only the ones we believe are likely to be used (using some heuristic to approximate which will be likely)-- without compromising correctness?
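To make the hashing trade-off above concrete, here is a minimal Python sketch of the counter-storage decision: a dense array when the number of potential paths is small enough, and a hash table keyed by path ID otherwise. The threshold and class are my own illustrative assumptions, not the paper's implementation (which instruments machine code and has its own table and overflow strategy).

```python
# Illustrative counter storage for one routine with `num_paths` potential
# acyclic paths. The threshold is a made-up number, not the paper's.
ARRAY_THRESHOLD = 4096

class PathCounters:
    def __init__(self, num_paths):
        self.dense = num_paths <= ARRAY_THRESHOLD
        # Dense array: O(1) increment, but space proportional to *potential*
        # paths. Hash table: space proportional to *executed* paths, but every
        # increment pays for hashing -- this is where the big overheads in
        # Table 1 come from.
        self.counts = [0] * num_paths if self.dense else {}

    def record(self, path_id):
        if self.dense:
            self.counts[path_id] += 1
        else:
            self.counts[path_id] = self.counts.get(path_id, 0) + 1
```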
-
Critique: Overall I really liked this paper, and I think it did a great job of taking me from not knowing what path profiling was to understanding both the need behind it and how their efficient implementation works. Something that I would have liked more of in this paper is some rough quantification of statements like "In most routines, EEL found unused local registers" (pg. 53) and "most routines require far less [counter] space" (pg. 54). I also would have liked to see a bit more on how effective path profiling-based optimizations were, but I can see how that would probably fall outside the scope of the paper.

Question: One point that stood out to me was that oftentimes, larger routines have too many states to use an array of counters, so some sort of hashing is required, which results in a lot more overhead. However, the paper also mentions how, in practice, the number of dynamic paths can be much smaller than the number of potential static paths. PP is built on a library that analyzes compiled binary executables, not source code. My question is: what sort of information (if any) that a compiler has access to from the source code of a program would help in predicting which paths through a program would never be taken?
-
Critique:
Discussion Question:
-
The authors do a good job demonstrating the limitations of basic block and edge profiling, and then clearly deriving and proving the efficient path profiling algorithm. In their evaluation, though (and this ties into the first paper about performance measurement), I wish the authors had provided more information about the “cache interference caused by profiling code and data.” I may be misinterpreting Table 1, but on the 145.fpppp benchmark, the fact that the overhead with QPT2 was negative (-2.6%) makes me think that the cache interference (and/or other uncontrolled variables) has an effect large enough that it shouldn't be ignored. (I'm assuming that correctly instrumenting code should not make a program execute in less time.) The authors mention that path profiles have many uses, such as “program performance tuning, profile-directed compilation, and software test coverage.” I imagine some overhead isn't much of a problem in performance tuning and software test coverage, where it's more about evaluating the program, but for profile-directed compilation (JIT compilers), I wonder what metrics exist to balance overhead against how much a path profile with better coverage can improve the compilation. Relatedly, how much does increased path coverage actually help with profile-directed optimizations?
-
I appreciated the strong attention to detail that Ball and Larus paid to making sure the path profiling algorithm was efficient, especially in the lower-level efforts to improve the runtime. From a high-level point of view, the use of spanning trees, compact identifiers for paths, and regeneration of paths from a frequency table seems enough to describe efficient path profiling; but they went the extra mile with early termination and optimizing the operations that update the r counter. I wish they'd explained the spanning tree algorithm in more detail, although I understand that most of the details are described in an earlier paper by both authors.

Discussion questions:
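Going back to the "regeneration of paths" step mentioned above: once each edge has its value, a recorded path sum can be decoded by walking from ENTRY and, at each block, taking the outgoing edge with the largest value that still fits in the remaining sum. Below is a small Python sketch of that decoding loop; the DAG and edge values are a hypothetical assignment I made up (consistent with the Ball-Larus numbering for this small graph), not a figure from the paper.

```python
# Hypothetical DAG and edge values (what the paper calls Val(e)); not taken from
# the paper's figures. Decoding assumes values at each block were assigned the
# Ball-Larus way, so they carve the path IDs into disjoint ranges.
val = {
    ("A", "B"): 0, ("A", "C"): 2,
    ("B", "D"): 0, ("B", "E"): 1,
    ("C", "D"): 0,
    ("D", "F"): 0, ("E", "F"): 0,
}
succs = {"A": ["B", "C"], "B": ["D", "E"], "C": ["D"], "D": ["F"], "E": ["F"], "F": []}

def regenerate(path_id, entry="A", exit="F"):
    """Turn a recorded path sum back into the sequence of blocks it identifies."""
    node, remaining, path = entry, path_id, [entry]
    while node != exit:
        # Pick the outgoing edge with the largest value not exceeding what's left.
        nxt = max(succs[node], key=lambda s: val[(node, s)]
                  if val[(node, s)] <= remaining else -1)
        remaining -= val[(node, nxt)]
        path.append(nxt)
        node = nxt
    return path

print(regenerate(2))  # ['A', 'C', 'D', 'F'] under this hypothetical assignment
```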
-
Critique
Question
-
Critique:

Discussion question: My main question is: what are the practical applications of path profiling besides test coverage? The authors mention that it can be used for "program optimization and performance tuning", but how does that work in practice? Ordinary profiling helps identify the blocks of code that take the most time and thus should be optimized, but how do I use the knowledge of which code paths are most frequently executed? It seems that it could be useful for JIT compilation, but I do not know enough about JIT compilers to be certain.
-
Like some other posters here, I've never seen any explanation of how runtime profiling actually works, and it was really interesting to read a paper that not only discussed better heuristics for profiling but also an efficient implementation. In my opinion, the paper does a good job explaining the background of why existing profiling methods aren't that great and why path profiling is a better solution. What really stood out to me was how, on the SPEC95 benchmarks, existing profiling metrics could only predict 38% of the taken paths on acyclic CFGs. As a programmer, if you used those baseline methods to gauge the efficiency of your code, you could make some changes and not even realize any benefits.

Not much to say about this, but I found the algorithms and implementation details pretty cool. It was interesting how they found a way to uniquely and minimally index paths within a DAG (and efficiently update path usage by just indexing into an array); there's a sketch of that numbering idea below. Their use of a spanning tree to find optimized spots within the DAG to compute each path's usage was also cool.

The reported number of times a path is taken is definitely a great help, especially if a programmer knows there's some inefficiency in a loop body or something; I can see how this might paint a realistic picture of the dynamic execution behavior of the code. However, while there are often changes programmers can immediately make, I'd argue it's probably more helpful and realistic to see the microarchitectural behavior of the code. For example, if a commonly used path uses a heap-based data structure with a bunch of nested pointers to random parts of memory, you might see a lot of cache misses, which would clue the programmer in to maybe use better data structures, if at all possible. I know the authors mentioned that some processors have an interface to record this behavior-- and this isn't a criticism of the paper itself, but it would be cool if processors exposed more of this information. That's probably unrealistic, unfortunately, because the microarchitecture is probably proprietary and it might reveal some unintended implementation details.

Question: Path profiling sounds really important, and the authors (IMO) made a good case for why this specific profiling method may be better than existing, lower-overhead methods. If the high-level goal is to paint a picture of the dynamic behavior of a program, are there better ways to do this than displaying the number of times an acyclic path of your code runs? I'm not really sure what the alternatives would be... how do you convey to a programmer where they should focus their attention, if not with individual paths? At the same time, as I discussed above, it would definitely be helpful to have even more information than just a basic count, such as microarchitectural details.
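(Sketch referenced above.) For anyone else trying to internalize the "uniquely and minimally index paths" step, here is a small Python sketch of the core numbering idea: process the DAG in reverse topological order, give each block the count of acyclic paths from it to EXIT, and assign each edge a value so that summing edge values along any ENTRY-to-EXIT path yields a distinct ID in [0, NumPaths(ENTRY)). The graph representation and names are my own, and this omits the spanning-tree step that moves increments onto chords.

```python
# Core Ball-Larus path numbering on a DAG: succs maps a block to its successors.
# Summing val[(v, w)] along any ENTRY->EXIT path gives a unique ID in
# [0, num_paths[ENTRY]). Representation and variable names are illustrative.

def topo_order(succs):
    seen, order = set(), []
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for w in succs[v]:
            visit(w)
        order.append(v)
    for v in succs:
        visit(v)
    return order[::-1]                           # topological order (sources first)

def number_paths(succs, exit_block):
    num_paths, val = {}, {}
    for v in reversed(topo_order(succs)):        # successors are numbered first
        if v == exit_block:
            num_paths[v] = 1
            continue
        total = 0
        for w in succs[v]:
            val[(v, w)] = total                  # offset = paths already numbered
            total += num_paths[w]
        num_paths[v] = total
    return num_paths, val

succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
num_paths, val = number_paths(succs, "D")
print(num_paths["A"], val)                       # 2 distinct ENTRY->EXIT paths
```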
-
I thought that the path profiling idea and implementation were super cool, but one thing that kind of bothered me (because I don't understand it enough) is that the EEL library they used (which I assume modifies executable files instead of doing runtime patching) does a pass over the code to determine unused registers. In any moderately "complex" program that has undergone compiler optimization, how likely is it for there to be abundant unused registers?
-
I thought this paper was very well written; it felt smooth to read. It was very refreshing to see an efficient-algorithms paper that talks not about time complexity but only in terms of registers and instructions. The motivation was very clear too. It was interesting that the algorithm examples read more like code than the mathy pseudo-code in other papers; I wonder if this is a consequence of the time the paper was written, whether it's just a compilers-paper type thing, or some other reason. Considering there are optimizations that depend on specific details like the number of bits in the immediate field (Section 5.2), I wonder what the codebase of optimizing compilers looks like... do they make optimizations for different scenarios at the ISA level? At the microarchitectural level? Or do they just not care and leave that to a more backend compiler? What kind of compiler engineer would have to care about these optimizations?

More Questions:
-
Reading this paper was a pretty interesting dive into profiling, and I liked how it framed the problem. The authors made a strong case for why path profiling is useful, especially compared to basic block and edge profiling. I hadn’t thought about how edge profiling could be misleading—like in the example with the control-flow graph in Figure 1, where following the most frequent edges doesn’t necessarily reveal the most frequent paths. That was a really clear way to show why you’d want more precise profiling.

The core idea of the algorithm—assigning unique numbers to paths and using arithmetic operations to track execution frequencies—was actually pretty intuitive once I got past the notation. It reminded me a bit of state machines and automata, where each branch updates the state in a systematic way. The spanning tree part took me a bit longer to grasp, though. I understood that they were trying to minimize instrumentation overhead, but the details of how they placed the updates efficiently weren’t immediately obvious. I had to go back and remind myself how spanning trees work in graph algorithms to fully get why that was a smart way to reduce redundant updates. The event counting step was another thing that wasn’t super intuitive at first—I saw what they were doing with propagating values, but it took some effort to really follow how it all fit together.

One thing I really appreciated was how the paper was structured. The authors built up the motivation really well, making it easy to follow along with their reasoning. Also, the fact that they implemented the algorithm and tested it on actual benchmarks was super useful. Seeing the SPEC95 benchmarks was reassuring—though at the same time, it made me wonder how relevant these results are today. That was actually one of my biggest questions while reading. This paper is from 1996, and profiling tools have come a long way since then. Modern JIT compilers and sampling-based profilers (like perf or Intel VTune) don’t seem to work quite like this, so I was curious whether this method is still used in some form. The overhead seemed reasonable for the time, but would it still hold up with today’s larger and more complex software?

Also, the way they handle loops—by stripping out backedges—felt like a bit of a simplification. I get that it makes path profiling feasible, but loops are a huge part of performance optimization, so I wondered if this limits its usefulness for loop-heavy programs. Another thing that came to mind was scalability. They mentioned that if a function has too many possible paths, they have to fall back on a hash table, which introduces extra overhead. That made me wonder if there’s a way to make path profiling more scalable, maybe using probabilistic techniques like Bloom filters. And what about interprocedural profiling? The paper only looks at paths within individual functions, but profiling across function calls could be even more useful for real-world optimization.
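On the loop point above: as I understand it, the paper doesn't simply delete backedges; each backedge v→w is conceptually replaced by dummy edges ENTRY→w and v→EXIT, so loop iterations still show up as (shorter) acyclic paths that end where the backedge was taken and restart at the loop head. Here is a rough Python sketch of that DAG transformation under my own graph representation; it glosses over the paper's details about how the dummy edges are instrumented.

```python
# Rough sketch of the loop transformation described above: remove each backedge
# v->w and add dummy edges ENTRY->w and v->EXIT, so every loop iteration is
# captured as an acyclic path. Graph format: set of (src, dst) edges.

def to_acyclic(edges, backedges, entry="ENTRY", exit="EXIT"):
    dag = set(edges) - set(backedges)
    for v, w in backedges:
        dag.add((entry, w))   # a "fresh" path starts at the loop head
        dag.add((v, exit))    # the previous path ends where the backedge was taken
    return dag

# Hypothetical CFG with a loop B -> C -> B:
edges = {("ENTRY", "A"), ("A", "B"), ("B", "C"), ("C", "B"), ("B", "EXIT")}
print(sorted(to_acyclic(edges, backedges={("C", "B")})))
```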
-
Overall, the paper clearly describes how the path profiling algorithm improves on existing profiling methods, i.e., approximating path frequencies with edge frequencies. In the experiments, the authors show that path profiling achieves a 31% overhead, compared with edge profiling's 16% overhead, and that the gains in accuracy from path profiling were significant. It seems as though their test suite was rather small, though, consisting of only 18 test programs, which makes their results somewhat difficult to generalize. But since SPEC95 is a standard benchmark, I believe this is not really a problem. However, one thing I am questioning is whether their algorithm is really that efficient: to determine where to place the instrumentation, we would need to run edge profiling first, which is not that efficient or accurate. I am wondering if there are other ways to identify instrumentation points more efficiently without runtime analysis.

My questions have more to do with the trade-offs between accuracy and efficiency in profiling. Given the trade-offs in accuracy and runtime overhead between edge and path profiling (i.e., 31% PP vs. 16% QPT2 overhead as demonstrated in the experiments), under what circumstances would the improved accuracy of path profiling justify its use? How might emerging technologies like better hardware support for profiling influence this decision?
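For reference, the instrumentation-placement step being questioned above boils down to picking a maximum-weight spanning tree of the CFG and instrumenting only the leftover edges (the chords). Here is a minimal Kruskal-style sketch in Python with hypothetical weights; in practice the weights might come from a prior edge profile or from static estimates, which is part of the concern raised above.

```python
# Minimal Kruskal-style sketch: pick a maximum-weight spanning tree of the CFG
# (treated as undirected for tree-building), then instrument only the leftover
# edges (the chords). Edge weights are hypothetical frequencies.

def max_spanning_tree(nodes, weighted_edges):
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree, chords = [], []
    for w, u, v in sorted(weighted_edges, reverse=True):   # heaviest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
        else:
            chords.append((u, v))          # not in the tree -> gets instrumented
    return tree, chords

nodes = ["ENTRY", "A", "B", "C", "EXIT"]
weighted_edges = [(100, "ENTRY", "A"), (90, "A", "B"), (10, "A", "C"),
                  (90, "B", "EXIT"), (10, "C", "EXIT"), (100, "EXIT", "ENTRY")]
tree, chords = max_spanning_tree(nodes, weighted_edges)
print("instrument these chords:", chords)
```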
-
The paper was interesting to read. I've used profiling tools before (mostly to create flame graphs), and I've seen them advertise how little of a performance impact they have, which always piqued my curiosity, because it seems initially impossible to glean meaningful information about a program without spending a lot of compute on it, especially if you consider how computationally heavy a naive attempt would be. I found the solution of assigning numbers to edges so that their sum acts almost like a hash value for the path very elegant, and it seems like a solution that must have been difficult to come up with, because it isn't even initially obvious that it should be possible to programmatically label the edges in such a way. I wonder if there are other possible ways to label edges. For example, would it be possible to come up with a labeling of the edges that uses XOR instead of the sum to identify the path, and still avoid any conflicts?

Discussion Question:
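On the XOR question above, here is a quick brute-force experiment one could run (my own sketch, not from the paper): enumerate all ENTRY-to-EXIT paths of a tiny DAG and search small edge labelings for one whose XOR along each path is unique. Labeling every edge with a distinct power of two always avoids conflicts, so collision-free XOR labelings certainly exist; the harder question is whether they can match the sum scheme's compact ID range [0, numPaths), and this toy search doesn't settle that.

```python
from itertools import product
from functools import reduce

# My own brute-force experiment (not from the paper): for a tiny DAG, search for
# an edge labeling whose XOR along each ENTRY->EXIT path is unique, using labels
# drawn from a small range. Existence here proves nothing in general; the hard
# part is matching the sum scheme's compact ID range.

succs = {"A": ["B", "C"], "B": ["D", "E"], "C": ["E"], "D": ["F"], "E": ["F"], "F": []}
edges = [(u, v) for u in succs for v in succs[u]]

def all_paths(node="A", exit="F"):
    if node == exit:
        return [[node]]
    return [[node] + rest for nxt in succs[node] for rest in all_paths(nxt, exit)]

paths = all_paths()
for labels in product(range(4), repeat=len(edges)):          # try labels 0..3
    lab = dict(zip(edges, labels))
    ids = [reduce(lambda x, e: x ^ lab[e], zip(p, p[1:]), 0) for p in paths]
    if len(set(ids)) == len(ids):
        print("collision-free XOR labeling:", lab, "path ids:", ids)
        break
```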
-
The authors' description of the algorithm and its use cases was very clear, and I thought their approach was pretty clever as well. I also found it surprising that the train dataset was so representative of the reference one, especially because I don't think it was actually constructed to fulfill that purpose. I'd love to know whether optimizing against relatively simple tests is still a good proxy for today's benchmarks / workloads, and what the other barriers to more widespread path profiling for performance tuning are today. Is it just more difficult to make use of?
-
I’m sorry this discussion response is late! Critique: Question: |
-
Led by @KabirSamsi, @aw578, @noschiff, @scober
The blog post is finally up! #493