Replies: 23 comments 25 replies
-
I've never thought too much about the mechanics of profiling, so I found the discussion of the specific optimizations to minimize profiling overhead very interesting. In section 3.4, when the authors present optimizations to minimize instrumentation (e.g. folding an addition into a memory address calculation if the chord is the last before EXIT — see the sketch below), I was surprised at the thought that went into optimizing something that seems pretty insignificant, especially because the times I've profiled something I have been pretty indifferent to overhead. This focus on optimization comes up a few other times in the paper as well, notably section 5.2. These seemingly minor optimizations make more sense when I remind myself that the presentation in the paper covers only the micro case of one CFG, and when profiling a real program these optimizations may add up to a significant impact on overhead. I could also just be ignorant of cases where users want to profile something while being more sensitive to overhead. Similarly, the main overhead result in the paper (31% vs. 16% for edge profiling) makes me curious: (1) where would path profiling be with some of the optimizations discussed in the paper disabled, and (2) at what hypothetical overhead for path profiling would edge profiling become preferable just because of the cost (as alluded to in section 7)?

Discussion questions:
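(Sketch referenced above.) Here is a minimal Python illustration of the section 3.4 folding trick, just to make the idea concrete; the paper actually instruments SPARC machine code via EEL, and the names below are hypothetical.

```python
# Minimal sketch of the section 3.4 folding trick, in Python for illustration
# only (the paper instruments SPARC machine code via EEL; the names here are
# hypothetical). `r` is the running path sum, `c` is the chord's increment.

def last_chord_unfolded(count, r, c):
    # Straightforward instrumentation: update r, then bump the counter.
    r += c
    count[r] += 1

def last_chord_folded(count, r, c):
    # If this chord is the last one before EXIT, r is never needed again, so
    # the final addition can be folded into the counter index, saving an
    # instruction on every traversal of this chord.
    count[r + c] += 1
```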
-
Path profiling was something I had never really used myself before, so I found the description of the limitations of edge/basic block profiling and the benefits of path profiling very interesting. The algorithm was broken into several nice steps, and I liked the way each challenge (like loops) was tackled one by one to build up the final algorithm. My main critique is that I wish the description of the algorithm were a bit more in-depth, as some of the conclusions (like those about MSTs) seemed non-trivial to me. As a whole, though, I liked how thoroughly the paper describes the optimizations and considerations in place to minimize profiling overhead, though the first paper discussion does make me wonder how accurately and without bias the overhead caused by profiling can be measured. Furthermore, the paper seems to imply that the difference between 31% and 16% profiling overhead is small and unimportant, but is that actually true? Affecting performance by ~30% seems quite significant to me.

My discussion question is about the motivation of the paper: we know that edge profiling is less accurate at predicting paths, but for which optimizations does this actually matter (making path profiling's extra overhead worth it)?
-
Critique: The authors explain that path profiling revealed some targets for profile-driven optimizations. It would have been interesting to see what impact such optimizations would have had.

Discussion Question:
-
The algorithm presented in the paper seems quite efficient in practice -- from both the experimental results and the intuition that adding is cheap. However, I wonder if it is provably the most efficient (for some hazy notion of efficiency). Below are some thoughts and questions regarding efficiency.

Comparison vs. modified bit tracing. The paper briefly discusses bit tracing. It's unclear to me what would happen if one were to perform a modified bit-tracing approach where, rather than tracing every branch, one traces only the branches that would be instrumented in the efficient path profiling approach (using the maximum spanning tree + chord technique). Would this provide a space-competitive representation for each unique path?

Hardware-supported path profiling. It seems that one of the key assumptions made in this paper is that addition is particularly fast and cheap (primarily due to hardware support, I presume). But addition is not purpose-built to help with path profiling -- it is merely coincidental. Could more specialized hardware make path profiling (and other types of profiling) even faster than addition? Are there other ways of tracking paths taken that more specialized hardware could support with even lower overhead? By these two questions, I mean to distinguish between what is fast and efficient in practice (potentially due to hardware design decisions made for profit in industry) and what would be fast and efficient in theory (which, imo, is the more interesting question).
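On the space question, here is a back-of-the-envelope sketch in Python with entirely made-up numbers (none come from the paper): a trace grows with the number of dynamic branch executions, while the Ball-Larus counters grow with the number of potential paths (or executed paths, when a hash table is used), independent of run length.

```python
# Back-of-the-envelope space comparison, with made-up numbers (not from the
# paper): a bit trace grows with the *number of branch executions*, while the
# Ball-Larus counter array grows with the *number of potential paths* (or the
# number of executed paths, if a hash table is used), no matter how long the
# program runs.

branch_executions = 10_000_000   # dynamic conditional branches executed
potential_paths   = 1500         # static acyclic paths in the routine
executed_paths    = 40           # distinct paths actually taken

trace_bits   = branch_executions   # ~1.25 MB of trace to post-process
array_words  = potential_paths     # one counter per potential path
hash_entries = executed_paths      # one entry per path actually executed

print(trace_bits // 8, array_words, hash_entries)
```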
-
I thought the way that the authors introduced the problem was convincing. The example CFGs helped me understand what they were talking about, although I did have to stare at them for a while until it clicked. One thing that caught my attention was that, in the introduction, the authors state that the inaccuracy in edge profiling was usually ignored because it was assumed that path profiling is much more expensive. I'm curious whether this statement is based on existing path profiling tools at the time? I think it would help me contextualize the paper a bit more.

Also, I think the motivation for hashing is to make a space tradeoff (storing the dynamic paths that actually execute vs. all the possible static paths). The authors identify hashing as one of the main causes of the increased overhead of their implementation, so I'm interested in whether they investigated an implementation that doesn't use hashing and instead allocates counters for all the static paths. Maybe there are some programs where this isn't possible or feasible, but for any programs where it is, I wonder what the overhead would be.

On another note: maybe path profiling isn't the right word, but my understanding is that processors do something sort of similar when trying to predict branches. For example, a common and pretty simple type of branch prediction seeks to find which branches are correlated with each other, where the outcome of one branch accurately predicts the outcome of another. In an imaginary world where we could inspect the hardware structures that compute this sort of thing, could this inform path profiling techniques? I don't really have an answer, but I thought it was interesting to think about.

Discussion question
-
Critique: I like how the paper mentions that the number of executed paths is fewer than 2,300 for most of the programs in the test suite, while the potential number of paths for each of these programs is hundreds of millions to tens of billions. That is an incredibly small proportion!

Discussion Question: The paper establishes that edge profiling does not accurately identify the most frequently executed paths, but it doesn't say why it is inaccurate. 38% accuracy for predicting paths from edge profiles seems low. It seems like it could be higher, because you know exactly how many times each edge in a path was taken, so there should be a way to reconstruct how many times each path was taken. I am curious whether there are algorithms or heuristics for predicting paths from edge profiles that are more accurate than the heuristic used in this paper, which always follows the most frequently executed outgoing edge. Always taking the most frequent edge seems too black-or-white; a heuristic could also follow less frequently executed edges in proportion to their counts.
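To make the comparison concrete, here is a small Python sketch of the two kinds of heuristics being contrasted above: greedily following the most frequent outgoing edge versus sampling edges in proportion to their counts. The graph representation and numbers are illustrative, not the paper's exact procedure.

```python
import random

# Minimal sketch of two ways to guess a hot path from an edge profile, assuming
# a DAG represented as {block: [(successor, edge_count), ...]}. This illustrates
# the idea discussed above, not the paper's exact heuristic.

def predict_path_greedy(cfg, entry, exit):
    """Always follow the most frequently executed outgoing edge."""
    path, node = [entry], entry
    while node != exit:
        node = max(cfg[node], key=lambda e: e[1])[0]
        path.append(node)
    return path

def sample_path_proportional(cfg, entry, exit):
    """Follow outgoing edges with probability proportional to their counts."""
    path, node = [entry], entry
    while node != exit:
        succs, counts = zip(*cfg[node])
        node = random.choices(succs, weights=counts)[0]
        path.append(node)
    return path

# Hypothetical edge profile for a small diamond-shaped CFG.
cfg = {"A": [("B", 90), ("C", 10)], "B": [("D", 90)], "C": [("D", 10)], "D": []}
print(predict_path_greedy(cfg, "A", "D"))
```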
-
Critique: I thought the authors did a great job justifying and differentiating their path profiling approach from previous profiling techniques, and I was surprised at how straightforward their solution is, especially for finding a minimal set of chords on which to apply instrumentation. One thought I had was in regard to the overhead for procedures with many potential paths, as the authors mention that the most significant profiling overhead occurs when hashing becomes necessary, which is also pretty notable in the experimental results presented in Section 6. However, they also mention that often the number of actually executed paths is less than the number of potential paths by many orders of magnitude, which makes me wonder if this could be leveraged to reduce the number of explored/tracked paths in some modified path profiling algorithm and further decrease the overhead. I also thought the discussion on the applications of path profiling for software testing was interesting, although I'm not sure in what cases it would be considered necessary or practical.

Discussion Question: The authors frequently mention how path profiling allows for the identification of longer paths (in terms of instruction count) as compared to edge profiling; on average, its paths are reported to be around 2x longer. I'm curious how well this metric correlates with the actual usefulness of path profiling results—how does it impact the quality of the resulting optimizations?
-
I think this is the first time I've seen minimum spanning trees used since learning about them freshman fall. 😅

Critique: The biggest thing I wish the authors had done was to expound more on why path profiling specifically is useful compared to simply knowing which branch is taken. I can imagine plenty of cases where knowing which branch is taken is useful (e.g., CPU branch prediction, JIT compilers choosing to emit optimized code for a frequently taken branch, etc.), but it's less immediately obvious to me what types of optimizations are available with path information that are not available with branch information, even though the authors allude to the fact that they exist.

Discussion Questions
-
Critique: Overall, I was impressed with the paper. I think they sold the necessity of their algorithm very well, to the point where I almost didn't want to believe it; my notes in the margins keep saying things like "This sounds too good to be true...", especially when they were emphasizing how path profiling not only is more accurate and collects more information than e.g. edge profiling, but performs well against it too. In some ways, I think my margin notes were right-- it is a bit too good to be true. They claim the average overhead of path profiling "can be lower and is usually comparable" to that of edge profiling, but then immediately follow it up by saying that path profiling's average overhead is about double (31% vs. 16%) that of edge profiling (p. 47). Though, looking at Table 1 in section 6, this average seems to be skewed by some outliers; as they outline in the paper, any time the routines were complex enough to require hashing, PP did pretty abysmally, performance-wise.

This leads me to wonder whether PP really scales; it seems to be great for analyzing smaller routines, but for larger ones that would require hashing, would it be better to just jump ship entirely and switch back to QPT2? I don't know enough about QPT2 to know the details, but it seems to perform better in the worst case than PP-- 53% max overhead rather than 97% max overhead. Maybe instead of the contingency plan discussed in section 5.3, it would be better to have the emergency brake be switching back to QPT2. Maybe even "every time a hash table is needed, switch back to QPT2". Though they did mention that sometimes even programs with considerable hashing still did well in PP, so maybe switching back to QPT2 every time hashing kicks in is a bit too extreme.

Question: They mention the disparity between executed paths and potential paths (e.g. on page 54). Would it be worth trying to restrict the calculation of potential paths to be more aligned with the executed paths? Would it even be possible to do so-- to not enumerate all potential paths, only the ones we believe are likely to be used (using some heuristic to approximate which will be likely)-- without compromising correctness?
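To make the hashing trade-off above concrete, here is a minimal Python sketch of the counter-storage decision: a dense array when the number of potential paths is small enough, and a hash table keyed by path ID otherwise. The threshold and class are my own illustrative assumptions, not the paper's implementation (which instruments machine code and has its own table and overflow strategy).

```python
# Illustrative counter storage for one routine with `num_paths` potential
# acyclic paths. The threshold is a made-up number, not the paper's.
ARRAY_THRESHOLD = 4096

class PathCounters:
    def __init__(self, num_paths):
        self.dense = num_paths <= ARRAY_THRESHOLD
        # Dense array: O(1) increment, but space proportional to *potential*
        # paths. Hash table: space proportional to *executed* paths, but every
        # increment pays for hashing -- this is where the big overheads in
        # Table 1 come from.
        self.counts = [0] * num_paths if self.dense else {}

    def record(self, path_id):
        if self.dense:
            self.counts[path_id] += 1
        else:
            self.counts[path_id] = self.counts.get(path_id, 0) + 1
```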
-
Critique: Overall I really liked this paper, and I think it did a great job of taking me from not knowing what path profiling was to understanding both the need behind it and how their efficient implementation works. Something that I would have liked more of in this paper is some rough quantification of statements like "In most routines, EEL found unused local registers" (pg. 53) and "most routines require far less [counter] space" (pg. 54). I also would have liked to see a bit more on how effective path profiling-based optimizations were, but I can see how that would probably fall outside the scope of the paper.

Question: One point that stood out to me was that oftentimes, larger routines have too many states to use an array of counters, so some sort of hashing is required, which results in a lot more overhead. However, the paper also mentions how, in practice, the number of dynamic paths can be much smaller than the number of potential static paths. PP is built on a library that analyzes compiled binary executables, not source code. My question is: what sort of information (if any) that a compiler has access to from the source code of a program would help in predicting which paths through a program would never be taken?
-
Critique:
Discussion Question:
-
The authors do a good job demonstrating the limitations of basic block and edge profiling, and then clearly deriving and proving the efficient path profiling algorithm. In their evaluation, though (and this ties into the first paper about performance measurement), I wish the authors had provided more information about the “cache interference caused by profiling code and data.” I may be misinterpreting Table 1, but on the 145.fpppp benchmark, the fact that the overhead with QPT2 was negative (-2.6%) makes me think that the cache interference (and/or other uncontrolled variables) has an effect large enough that it shouldn't be ignored. (I'm assuming that correctly instrumenting code should not make a program execute in less time.) The authors mention that path profiles have many uses, such as “program performance tuning, profile-directed compilation, and software test coverage.” I imagine some overhead isn't much of a problem in performance tuning and software test coverage, where it's more about evaluating the program, but for profile-directed compilation (JIT compilers), I wonder what metrics exist to balance overhead against how much a path profile with better coverage can improve the compilation. Relatedly, how much does increased path coverage actually help with profile-directed optimizations?
-
I appreciated the strong attention to detail that Ball and Larus paid to making sure the path profiling algorithm was efficient, especially in the lower-level efforts to improve the runtime. From a high-level point of view, the use of spanning trees, compact identifiers for paths, and regeneration of paths from a frequency table seems enough to describe efficient path profiling; but they went the extra mile with early termination and optimizing the operations that update the r counter. I wish they'd explained the spanning tree algorithm in more detail, although I understand that most of the details are described in an earlier paper by both authors.

Discussion questions:
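Going back to the "regeneration of paths" step mentioned above: once each edge has its value, a recorded path sum can be decoded by walking from ENTRY and, at each block, taking the outgoing edge with the largest value that still fits in the remaining sum. Below is a small Python sketch of that decoding loop; the DAG and edge values are a hypothetical assignment I made up (consistent with the Ball-Larus numbering for this small graph), not a figure from the paper.

```python
# Hypothetical DAG and edge values (what the paper calls Val(e)); not taken from
# the paper's figures. Decoding assumes values at each block were assigned the
# Ball-Larus way, so they carve the path IDs into disjoint ranges.
val = {
    ("A", "B"): 0, ("A", "C"): 2,
    ("B", "D"): 0, ("B", "E"): 1,
    ("C", "D"): 0,
    ("D", "F"): 0, ("E", "F"): 0,
}
succs = {"A": ["B", "C"], "B": ["D", "E"], "C": ["D"], "D": ["F"], "E": ["F"], "F": []}

def regenerate(path_id, entry="A", exit="F"):
    """Turn a recorded path sum back into the sequence of blocks it identifies."""
    node, remaining, path = entry, path_id, [entry]
    while node != exit:
        # Pick the outgoing edge with the largest value not exceeding what's left.
        nxt = max(succs[node], key=lambda s: val[(node, s)]
                  if val[(node, s)] <= remaining else -1)
        remaining -= val[(node, nxt)]
        path.append(nxt)
        node = nxt
    return path

print(regenerate(2))  # ['A', 'C', 'D', 'F'] under this hypothetical assignment
```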
-
Critique
Question
-
Critique:

Discussion question: My main question is: what are the practical applications of path profiling besides test coverage? The authors mention that it can be used for "program optimization and performance tuning", but how does that work in practice? Ordinary profiling helps identify the blocks of code that take the most time and thus should be optimized, but how do I use the knowledge of which code paths are most frequently executed? It seems that it could be useful for JIT compilation, but I do not know enough about JIT compilers to be certain.
-
Like some other posters here, I've never seen any explanation of how runtime profiling actually works, and it was really interesting to read a paper that not only discussed better heuristics for profiling but also an efficient implementation. In my opinion, the paper does a good job explaining the background of why existing profiling methods aren't that great and why path profiling is a better solution. What really stood out to me was how, on the SPEC95 benchmarks, existing profiling metrics could only predict 38% of the taken paths on acyclic CFGs. As a programmer, if you used those baseline methods to gauge the efficiency of your code, you could make some changes and not even realize any benefits.

Not much to say about this, but I found the algorithms and implementation details pretty cool. It was interesting how they found a way to uniquely and minimally index paths within a DAG (and efficiently update path usage by just indexing into an array); there's a sketch of that numbering idea below. Their use of a spanning tree to find optimized spots within the DAG to compute each path's usage was also cool.

The reported number of times a path is taken is definitely a great help, especially if a programmer knows there's some inefficiency in a loop body or something; I can see how this might paint a realistic picture of the dynamic execution behavior of the code. However, while there are often changes programmers can immediately make, I'd argue it's probably more helpful and realistic to see the microarchitectural behavior of the code. For example, if a commonly used path uses a heap-based data structure with a bunch of nested pointers to random parts of memory, you might see a lot of cache misses, which would clue the programmer in to maybe use better data structures, if at all possible. I know the authors mentioned that some processors have an interface to record this behavior-- and this isn't a criticism of the paper itself, but it would be cool if processors exposed more of this information. That's probably unrealistic, unfortunately, because the microarchitecture is probably proprietary and it might reveal some unintended implementation details.

Question: Path profiling sounds really important, and the authors (IMO) made a good case for why this specific profiling method may be better than existing, lower-overhead methods. If the high-level goal is to paint a picture of the dynamic behavior of a program, are there better ways to do this than displaying the number of times an acyclic path of your code runs? I'm not really sure what the alternatives would be... how do you convey to a programmer where they should focus their attention, if not with individual paths? At the same time, as I discussed above, it would definitely be helpful to have even more information than just a basic count, such as microarchitectural details.
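(Sketch referenced above.) For anyone else trying to internalize the "uniquely and minimally index paths" step, here is a small Python sketch of the core numbering idea: process the DAG in reverse topological order, give each block the count of acyclic paths from it to EXIT, and assign each edge a value so that summing edge values along any ENTRY-to-EXIT path yields a distinct ID in [0, NumPaths(ENTRY)). The graph representation and names are my own, and this omits the spanning-tree step that moves increments onto chords.

```python
# Core Ball-Larus path numbering on a DAG: succs maps a block to its successors.
# Summing val[(v, w)] along any ENTRY->EXIT path gives a unique ID in
# [0, num_paths[ENTRY]). Representation and variable names are illustrative.

def topo_order(succs):
    seen, order = set(), []
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for w in succs[v]:
            visit(w)
        order.append(v)
    for v in succs:
        visit(v)
    return order[::-1]                           # topological order (sources first)

def number_paths(succs, exit_block):
    num_paths, val = {}, {}
    for v in reversed(topo_order(succs)):        # successors are numbered first
        if v == exit_block:
            num_paths[v] = 1
            continue
        total = 0
        for w in succs[v]:
            val[(v, w)] = total                  # offset = paths already numbered
            total += num_paths[w]
        num_paths[v] = total
    return num_paths, val

succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
num_paths, val = number_paths(succs, "D")
print(num_paths["A"], val)                       # 2 distinct ENTRY->EXIT paths
```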
-
I thought that the path profiling idea and implementation were super cool, but one thing that kind of bothered me (because I don't understand it enough) is that the EEL library they used (which I assume modifies executable files instead of doing runtime patching) does a pass over the code to determine unused registers. In any moderately "complex" program that has undergone compiler optimization, how likely is it for there to be abundant unused registers?
-
I thought this paper was very well written; it felt smooth to read. It was very refreshing to see an efficient-algorithms paper that talks not about time complexity but only in terms of registers and instructions. The motivation was very clear too. It was interesting that the algorithm examples read more like code than the mathy pseudo-code in other papers; I wonder if this is a consequence of the time the paper was written, whether it's just a compilers-paper type thing, or some other reason. Considering there are optimizations that depend on specific details like the number of bits in the immediate field (Section 5.2), I wonder what the codebase of optimizing compilers looks like... do they make optimizations for different scenarios at the ISA level? At the microarchitectural level? Or do they just not care and leave that to a more backend compiler? What kind of compiler engineer would have to care about these optimizations?

More Questions:
-
Reading this paper was a pretty interesting dive into profiling, and I liked how it framed the problem. The authors made a strong case for why path profiling is useful, especially compared to basic block and edge profiling. I hadn’t thought about how edge profiling could be misleading—like in the example with the control-flow graph in Figure 1, where following the most frequent edges doesn’t necessarily reveal the most frequent paths. That was a really clear way to show why you’d want more precise profiling.

The core idea of the algorithm—assigning unique numbers to paths and using arithmetic operations to track execution frequencies—was actually pretty intuitive once I got past the notation. It reminded me a bit of state machines and automata, where each branch updates the state in a systematic way. The spanning tree part took me a bit longer to grasp, though. I understood that they were trying to minimize instrumentation overhead, but the details of how they placed the updates efficiently weren’t immediately obvious. I had to go back and remind myself how spanning trees work in graph algorithms to fully get why that was a smart way to reduce redundant updates. The event counting step was another thing that wasn’t super intuitive at first—I saw what they were doing with propagating values, but it took some effort to really follow how it all fit together.

One thing I really appreciated was how the paper was structured. The authors built up the motivation really well, making it easy to follow along with their reasoning. Also, the fact that they implemented the algorithm and tested it on actual benchmarks was super useful. Seeing the SPEC95 benchmarks was reassuring—though at the same time, it made me wonder how relevant these results are today. That was actually one of my biggest questions while reading. This paper is from 1996, and profiling tools have come a long way since then. Modern JIT compilers and sampling-based profilers (like perf or Intel VTune) don’t seem to work quite like this, so I was curious whether this method is still used in some form. The overhead seemed reasonable for the time, but would it still hold up with today’s larger and more complex software?

Also, the way they handle loops—by stripping out backedges—felt like a bit of a simplification. I get that it makes path profiling feasible, but loops are a huge part of performance optimization, so I wondered if this limits its usefulness for loop-heavy programs. Another thing that came to mind was scalability. They mentioned that if a function has too many possible paths, they have to fall back on a hash table, which introduces extra overhead. That made me wonder if there’s a way to make path profiling more scalable, maybe using probabilistic techniques like Bloom filters. And what about interprocedural profiling? The paper only looks at paths within individual functions, but profiling across function calls could be even more useful for real-world optimization.
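On the loop point above: as I understand it, the paper doesn't simply delete backedges; each backedge v→w is conceptually replaced by dummy edges ENTRY→w and v→EXIT, so loop iterations still show up as (shorter) acyclic paths that end where the backedge was taken and restart at the loop head. Here is a rough Python sketch of that DAG transformation under my own graph representation; it glosses over the paper's details about how the dummy edges are instrumented.

```python
# Rough sketch of the loop transformation described above: remove each backedge
# v->w and add dummy edges ENTRY->w and v->EXIT, so every loop iteration is
# captured as an acyclic path. Graph format: set of (src, dst) edges.

def to_acyclic(edges, backedges, entry="ENTRY", exit="EXIT"):
    dag = set(edges) - set(backedges)
    for v, w in backedges:
        dag.add((entry, w))   # a "fresh" path starts at the loop head
        dag.add((v, exit))    # the previous path ends where the backedge was taken
    return dag

# Hypothetical CFG with a loop B -> C -> B:
edges = {("ENTRY", "A"), ("A", "B"), ("B", "C"), ("C", "B"), ("B", "EXIT")}
print(sorted(to_acyclic(edges, backedges={("C", "B")})))
```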
-
Overall, the paper clearly describes how the path profiling algorithm improves on existing profiling methods, i.e., approximating path frequencies with edge frequencies. In the experiments, the authors show that path profiling achieves a 31% overhead, compared with edge profiling's 16% overhead, and that the gains in accuracy from path profiling were significant. It seems as though their test suite was rather small, though, consisting of only 18 test programs, which makes their results somewhat difficult to generalize. But since SPEC95 is a standard benchmark, I believe this is not really a problem. However, one thing I am questioning is whether their algorithm is really that efficient: to determine where to place the instrumentation, we would need to run edge profiling first, which is not that efficient or accurate. I am wondering if there are other ways to identify instrumentation points more efficiently without runtime analysis.

My questions have more to do with the trade-offs between accuracy and efficiency in profiling. Given the trade-offs in accuracy and runtime overhead between edge and path profiling (i.e., 31% PP vs. 16% QPT2 overhead as demonstrated in the experiments), under what circumstances would the improved accuracy of path profiling justify its use? How might emerging technologies like better hardware support for profiling influence this decision?
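For reference, the instrumentation-placement step being questioned above boils down to picking a maximum-weight spanning tree of the CFG and instrumenting only the leftover edges (the chords). Here is a minimal Kruskal-style sketch in Python with hypothetical weights; in practice the weights might come from a prior edge profile or from static estimates, which is part of the concern raised above.

```python
# Minimal Kruskal-style sketch: pick a maximum-weight spanning tree of the CFG
# (treated as undirected for tree-building), then instrument only the leftover
# edges (the chords). Edge weights are hypothetical frequencies.

def max_spanning_tree(nodes, weighted_edges):
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree, chords = [], []
    for w, u, v in sorted(weighted_edges, reverse=True):   # heaviest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
        else:
            chords.append((u, v))          # not in the tree -> gets instrumented
    return tree, chords

nodes = ["ENTRY", "A", "B", "C", "EXIT"]
weighted_edges = [(100, "ENTRY", "A"), (90, "A", "B"), (10, "A", "C"),
                  (90, "B", "EXIT"), (10, "C", "EXIT"), (100, "EXIT", "ENTRY")]
tree, chords = max_spanning_tree(nodes, weighted_edges)
print("instrument these chords:", chords)
```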
-
The paper was interesting to read. I've used profiling tools before (mostly to create flame graphs), and I've seen them advertise how little of a performance impact they have, which always piqued my curiosity, because it seems initially impossible to glean meaningful information about a program without spending a lot of compute on it, especially if you consider how computationally heavy a naive attempt would be. I found the solution of assigning numbers to edges so that their sum acts almost like a hash value for the path very elegant, and it seems like a solution that must have been difficult to come up with, because it isn't even initially obvious that it should be possible to programmatically label the edges in such a way. I wonder if there are other possible ways to label edges. For example, would it be possible to come up with a labeling of the edges that uses XOR instead of the sum to identify the path, and still avoid any conflicts?

Discussion Question:
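On the XOR question above, here is a quick brute-force experiment one could run (my own sketch, not from the paper): enumerate all ENTRY-to-EXIT paths of a tiny DAG and search small edge labelings for one whose XOR along each path is unique. Labeling every edge with a distinct power of two always avoids conflicts, so collision-free XOR labelings certainly exist; the harder question is whether they can match the sum scheme's compact ID range [0, numPaths), and this toy search doesn't settle that.

```python
from itertools import product
from functools import reduce

# My own brute-force experiment (not from the paper): for a tiny DAG, search for
# an edge labeling whose XOR along each ENTRY->EXIT path is unique, using labels
# drawn from a small range. Existence here proves nothing in general; the hard
# part is matching the sum scheme's compact ID range.

succs = {"A": ["B", "C"], "B": ["D", "E"], "C": ["E"], "D": ["F"], "E": ["F"], "F": []}
edges = [(u, v) for u in succs for v in succs[u]]

def all_paths(node="A", exit="F"):
    if node == exit:
        return [[node]]
    return [[node] + rest for nxt in succs[node] for rest in all_paths(nxt, exit)]

paths = all_paths()
for labels in product(range(4), repeat=len(edges)):          # try labels 0..3
    lab = dict(zip(edges, labels))
    ids = [reduce(lambda x, e: x ^ lab[e], zip(p, p[1:]), 0) for p in paths]
    if len(set(ids)) == len(ids):
        print("collision-free XOR labeling:", lab, "path ids:", ids)
        break
```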
-
The authors' description of the algorithm and its use cases was very clear, and I thought their approach was pretty clever as well. I also found it surprising that the train dataset was so representative of the reference one, especially because I don't think it was actually constructed to fulfill that purpose. I'd love to know whether optimizing against relatively simple tests is still a good proxy for today's benchmarks / workloads, and what the other barriers to more widespread path profiling for performance tuning are today. Is it just more difficult to make use of?
-
I’m sorry this discussion response is late! Critique: Question: |
-
Led by @KabirSamsi, @aw578, @noschiff, @scober
The blog post is finally up! #493