Simple and Efficient SSA Construction #488
Replies: 18 comments 26 replies
-
I thought the paper generally did a good job of explaining the proposed SSA construction, but I had a hard time understanding what distinguished the construction from previous approaches, mostly because I was unfamiliar with the Cytron et al. algorithm. My current understanding is that a main difference is that the Cytron et al. algorithm operates on an already-built CFG, whereas the proposed approach avoids the inefficiency of first producing a non-SSA CFG and then converting it, by constructing the CFG and the SSA form simultaneously. The construction of the CFG was not very explicit in the paper's algorithm (in particular in the pseudocode), so it took me a few passes to realize that phi node and CFG construction were simultaneous. I'm still a bit confused about how we can know when to seal a block without a completed CFG. For instance, in the Figure 3 example the authors write "Since the jump from the body exit needs to be added later, we cannot seal the while header yet", but how can we know about all jumps from other blocks without having constructed the CFG? My confusion makes me think that maintaining a clear delineation between CFG and SSA construction could be an advantage of the Cytron et al. approach in terms of interpretability, and maybe as a consequence extensibility. Discussion question: In section 3.1 the paper discusses making IR optimizations during the SSA construction and mentions that their algorithm allows conservative (local) optimizations. Why only conservative optimizations? Is this also true of other algorithms? It seems like in general many optimizations are possible both on-the-fly and after SSA construction; how should we choose when to apply an optimization?
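To make the sealing question concrete for myself, here is a minimal Python sketch of the bookkeeping I believe the paper is describing (my own simplified rendering of the readVariable/sealBlock pseudocode in Section 2, with trivial-phi removal omitted): a block that is not yet sealed gets a placeholder phi recorded in `incomplete_phis`, and its operands are filled in only once the front end promises no more predecessors will be added.

```python
# Minimal sketch, not the paper's exact code: sealing lets construction proceed
# while the CFG is still growing. Reads in an unsealed block get placeholder
# phis; seal_block() fills in their operands once all predecessors exist.

class Phi:
    def __init__(self, block, var):
        self.block, self.var, self.operands = block, var, []

class Block:
    def __init__(self, name):
        self.name = name
        self.preds = []            # filled in as the front end emits jumps
        self.sealed = False
        self.incomplete_phis = {}  # var -> placeholder Phi awaiting operands
        self.defs = {}             # var -> value known at the end of this block

def write_variable(var, block, value):
    block.defs[var] = value

def read_variable(var, block):
    if var in block.defs:                      # local value numbering
        return block.defs[var]
    if not block.sealed:                       # CFG still growing: defer operands
        val = Phi(block, var)
        block.incomplete_phis[var] = val
    elif len(block.preds) == 1:                # single predecessor: no phi needed
        val = read_variable(var, block.preds[0])
    else:                                      # sealed join: recurse into preds
        phi = Phi(block, var)
        block.defs[var] = phi                  # break potential cycles first
        phi.operands = [read_variable(var, p) for p in block.preds]
        val = phi
    block.defs[var] = val
    return val

def seal_block(block):
    # Called by the front end once every jump into `block` has been emitted.
    for var, phi in block.incomplete_phis.items():
        phi.operands = [read_variable(var, p) for p in block.preds]
    block.incomplete_phis.clear()
    block.sealed = True

entry, merge = Block("entry"), Block("merge")
write_variable("x", entry, "x0")
use = read_variable("x", merge)   # merge not sealed yet: placeholder phi recorded
merge.preds.append(entry)         # the front end later emits the jump entry -> merge
seal_block(merge)                 # now the phi's operands can be filled in
print(use.operands)               # ['x0'] -- a trivial phi the full algorithm would remove
```

The trivial phi in the demo is exactly the kind the full algorithm would then clean up with its trivial-phi removal step.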
-
This paper provides a seemingly strictly superior method of constructing SSA compared to the dominator-tree approach we saw in class. Of note, the paper's benchmarks seem to show marginally better final compiled code in terms of the number of instructions. In terms of critiques, I had some trouble understanding what the paper was trying to say in some of its more technical aspects. For instance, the paper describes its algorithm as not needing a CFG to be constructed ahead of time, but then I struggled to understand how it could perform basic CFG operations like finding predecessors. It seems like the pseudocode given in places like section 2.2 already assumes the CFG was computed. There were likewise some tables I struggled to understand at first, namely Table 1, where it took me multiple minutes to figure out which numbers were from the LLVM implementation and which ones were from the paper's. The label on the table says it compares the two, but there are three groupings of columns, not two, and the headings were not intuitive to me. Maybe just a skill issue on my end, but I felt there was opportunity to describe core aspects of the paper more accessibly.
-
In Braun et al.'s paper, an algorithm is presented to build SSA form directly from an AST representation, efficiently and minimally. The algorithm is contrasted with Cytron's popular algorithm, which converts a non-SSA CFG into an SSA CFG; the goal is to avoid the detour AST -> non-SSA CFG -> SSA CFG by skipping the middle transformation. As others in this thread have mentioned (Michael, Neel, Samuel), I found the presentation of the algorithm confusing, as it felt to me that information from the CFG would be needed to determine whether a block was sealed or not; but the clarification that the semantics of the language constructs let us determine the number of expected predecessors helped clarify the idea. Finally, I appreciated the description of minimality in SSA form. In class we've talked about SSA conversion as first inserting phi nodes and then pruning them using the dominance frontier, so I found the use of a recursive algorithm on strongly connected components a clever alternative. Discussion questions: What's the tradeoff in scalability/maintainability between Braun's and Cytron's algorithms? Although Cytron relies on a middle step to build the SSA CFG, the separation into layers can make it easier to understand and easier to maintain from an engineering perspective. Are the benefits strong enough to prefer this over Cytron's approach? Similarly, I wonder if the use of on-the-fly optimizations while building the SSA CFG could lead to subtle bugs being introduced in the optimizations, which could make them difficult to debug or even verify for correctness. Once again, are the benefits of building the SSA in one pass worth it in this context? How would the dominance-frontier vs. strongly-connected-components approaches to pruning phi nodes compare? I.e., what would be the difference between Cytron's pruning of an SSA CFG using the dominance frontier vs. using strongly connected components, both in terms of number of phi nodes and dynamic instructions, and in complexity of the code?
-
I understood this paper as essentially computing SSA form from an AST without needing to do the intermediate analyses that we had to (basic blocks, CFGs, domination) - which led to a faster conversion to SSA that also had comparable performance in terms of minimality to other algorithms. I think the mentioned goal at the end of JIT compilers really helped me see why this could be quite useful, as for most of the paper I was honestly quite confused as to why so many different optimizations were being merged into SSA conversion. I think the core algorithm and its main benefit (getting rid of unnecessary intermediate IRs) is quite useful. My main critique is the paper's categorization of this algorithm as "simple" - I think the paper overall seemed to me more of an exercise in optimizing for a specific use case rather than necessarily creating a better "general" algorithm. The whole section on merging constant propagation and other optimizations into SSA construction, for example, seemed like a recipe for very spaghettified AST conversion code (and AST conversion is already hard enough as-is). Especially when targeting something meant to be very extensible (with target-specific optimizations, etc.) like LLVM, many of the optimizations like constant propagation will likely be run again on LLVM IR anyway, so it seems somewhat not worth the developer cost to me. Additionally, I felt that the algorithm felt somewhat frankensteined together - mainly with the multiple passes to get rid of trivial phi nodes, and the optional additions to generate minimal SSA form for irreducible CFGs. I know that the paper proved the algorithm to be correct, but I sort of felt that the algorithm was just not very clean, and it made me wonder if this was truly the best way to be doing SSA. I don't know if I believe the paper's claim that this general algorithm is comparable to Cytron's and could be helpful in JIT compilation - maybe I missed something, but the paper doesn't seem to show raw compile time comparisons between Cytron and their proposed algorithm, and I don't know if AST-IR conversion is even the bottleneck in most JIT compilers anyway. The added speedups to on-the-fly optimizations also seemed quite small; saving only ~1 second out of ~2 minutes seems like it might fail a cost-benefit test. Discussion questions: When is it worth it to design what amounts (to me) to an "inlined" compiler implementation, where we have minimal scalability but maximum performance, and are there more scalable/easier-to-understand methods to design these compiler passes? One of the big wins of this paper is skipping the necessity of generating CFGs/dominators/etc., while maintaining comparable performance. I wonder if this is actually as big of a win in practice - how often could the "work" spent generating CFGs/dominators be reused in other passes, and is that factored into this comparison?
-
Critique: The paper did an excellent job presenting the algorithm. It does come across as a simple algorithm, as they claim. However, they should have tuned their algorithm to the level that the Cytron implementation is tuned before comparing, instead of speculating that if they did tune it, we would see significant improvements over Cytron. Question: I am curious about what threshold of improvement (over a current implementation) a new compiler pass algorithm must achieve before industry and academia will adopt it. For example, is the 0.28% reduction in instructions for Marker compared with Cytron enough to justify modifying current compilers? Is a 1.49 s faster overall compilation time on the SPEC CINT2000 benchmark suite significant enough?
-
I like that this paper presented an alternative method of constructing SSA to what we learned in class, it was fun to read and think about. I never really considered that there is overhead we want to avoid when we construct a non-SSA IR and then convert it into SSA. So, at first I bought the argument that we might want to use this on-the-fly algorithm for JIT compilers where the overhead is perhaps a significant bottleneck. However, having finished the paper, I'm kind of disappointed there were no experiments done to expand on this point. The authors only mention it in passing in the conclusion. I'm also a bit skeptical about when we would want to use this algorithm. For me, it was harder to understand than the algorithm we discussed in class. One of the main advantages the authors cite is that they can avoid computation of dominator sets/frontiers. I don't think I buy this argument; while it is true that this algorithm avoids it, I imagine that any compiler will end up computing the dominance relation at some point -- any loop optimization would probably need it for the purpose of identifying loops? So I imagine that any cost of computing the dominator relation (which can't be that high) would be amortized across the many analyses/optimizations that use it. The other advantage is eliminating overhead, which I can see vaguely as an argument but again, I don't really buy it without some sort of experiment to determine if this overhead is actually critical. I think the algorithm was easy enough to understand, but I wish the tables were also easy to understand, especially Table 1. It wasn't clear to me what each column meant -- after reading the section a few times, I think "Before SSA Constr" means the default non-SSA LLVM code, "After SSA Constr" means the code generated by LLVM's SSA-converting pass, and "Marker" is the novel algorithm. The table is also just cluttered. Also I'm not sure what "Time" and "IR Time" are in Table 3. Discussion: In the general case, what are some reasons compiler users/developers might prefer either the "traditional" approach (AST -> non-SSA CFG -> SSA CFG -> optimized SSA CFG), or the one described in the paper (AST -> optimized SSA CFG)? To me, it seems there is quite a large gap between an AST and an optimized SSA CFG, so going there in one step feels like it can be full of non-obvious bugs. I wonder if the slight performance benefits are worth the cost in developing/maintaining it. And a user definitely doesn't want a compiler that's prone to bugs. |
-
I found this paper interesting because their approach seemed to operate very differently from how we've been approaching compilers in this class, skipping the steps of constructing a CFG and dominance frontiers in favor of directly translating an AST to SSA form. It seems like the main idea was to use information encoded in the AST as a substitute for information encoded in a CFG and dominance frontiers; one example is that jumps come from language constructs with specific semantics, so we can infer when to declare a block as sealed. I wish that the paper had actually compared overall compile times between the on-the-fly-optimized Marker algorithm and a tuned version of the Cytron et al. algorithm; the only compile-time comparison I saw was between the optimized and unoptimized Marker algorithm. One question I had while reading: although this approach to constructing the SSA form of a program skips CFG and dominance-frontier construction, could those structures still be useful for other types of optimizations or program analysis, either by making optimizations/analyses possible or just by making them easier to reason about? I think Cytron et al.'s algorithm is still easier to reason about, since there feels to be a stronger separation of concerns between the different program forms.
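As a concrete illustration of that "jumps come from language constructs" point, here is a hypothetical sketch of how a front end lowering a `while` statement would know exactly when each block can be sealed, because the construct itself is the only source of the jumps involved. The helpers (`Block`, `new_block`, `add_edge`, `seal_block`) are minimal stand-ins of my own, not the paper's API, and `seal_block` here only flips a flag rather than filling in incomplete phis.

```python
# Hypothetical front-end sketch: the while construct alone determines all edges
# into its header, body, and exit, so each block can be sealed at a known point
# without consulting a separately built CFG.

class Block:
    def __init__(self, name):
        self.name, self.preds, self.sealed = name, [], False

def new_block(name):    return Block(name)
def add_edge(src, dst): dst.preds.append(src)
def seal_block(block):  block.sealed = True   # real version also fills incomplete phis

def lower_while(current_block):
    header = new_block("while.header")
    body   = new_block("while.body")
    exit_  = new_block("while.exit")

    add_edge(current_block, header)
    # header is NOT sealed yet: the back edge from the body end is still to come.

    # ... lower the loop condition into `header`, reading variables lazily ...
    add_edge(header, body)
    add_edge(header, exit_)
    seal_block(body)              # only the header ever jumps into the body

    # ... lower the loop body into `body`, possibly creating more blocks ...
    body_end = body
    add_edge(body_end, header)    # emit the back edge
    seal_block(header)            # every predecessor of the header now exists
    seal_block(exit_)             # likewise for the exit block
    return exit_

exit_block = lower_while(new_block("entry"))
print(exit_block.sealed, [p.name for p in exit_block.preds])  # True ['while.header']
```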
-
I found this paper to be quite interesting overall; skipping a couple of passes in a compiler does seem like it could be useful, but that would depend on how many passes are required by modern compilers. My impression is it would only be truly useful if it was substantially faster than currently-used algorithms, because it doesn't seem to provide anything extra for analyses or reasoning. I think this paper would have heavily benefited from a bigger evaluation section. Others have pointed out that there is a lack of evaluation on compilation speed, which I find a little disappointing: what other benefits are there to having such a construction algorithm if not the fact that it's faster, and if it is faster, wouldn't you want to show it? Time comparisons seem to be a bad measurement to exclude if the claimed benefits of this paper include reducing overhead. Based on the evaluation data they do have, it seems that you could argue this algorithm is not worse than Cytron et al.'s, but I find it hard to be convinced that it's necessarily faster. Overall, I do like the idea of cutting out a non-SSA CFG, because I see little benefit in constructing it directly if an SSA CFG is more useful in every way. Cutting out dominators, on the other hand, I'm not so sure about. (Discussion question) Are there properties of dominators that we might care about that are not encoded in an SSA CFG? I'm also having a hard time finding an industry-level compiler that uses this algorithm, and I wonder why this might be the case. Does anyone know if any compiler actually uses it? I can't tell if the algorithm is too new to be considered or if it is not that useful over other methods. But I also hear in general that LLVM-based compilers are often quite slow, so maybe there is some merit to doing optimizations on-the-fly and skipping compiler passes when you can.
-
Critique: I thought this paper was really awesome. The highlight of the paper for me was their discussion of strongly connected components and how they used these to replace phi nodes and reduce their number. Intuitively, this makes sense: if you have an "inner" subset of an SCC, in which all phi nodes only reference other phi nodes in this "inner" set, and you have exactly one other phi node in the SCC that isn't part of "inner", then all information for the "inner" nodes must come from the one outer node. Really cool! I also think it was cool how they derived all these strong conditions in Section 4. First, it was neat how they got "pruned SSA form" for free (which might seem trivial but probably can't be said about our naive construction from class). Questions + Confusion: There were some things I was confused about. Out of the two drawbacks they mentioned in the introduction, the second one (how Cytron et al. needed several other algorithms to construct minimal SSA form) was a bit unclear to me. More importantly, I was also confused by their sentence in Section 3, "This implies a definition of minimality that is independent of the source program and more strict than the definition by Cytron et al. [10]." Even after reading their proofs in Section 4, I wasn't sure how their definition of minimality was stronger than the existing one. They seem to have set up the definition of "necessary" phi nodes, but AFAIK didn't compare these definitions. Maybe I missed it?
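To convince myself of that inner/outer argument, I sketched it in Python below. This is my own reading of Section 3.2, not the paper's code: within one SCC of phi functions, collect the operands that come from outside the SCC, and if there is exactly one such value, every phi in the SCC is redundant and can be replaced by it. Computing the SCCs themselves (e.g. with Tarjan's algorithm) is assumed to happen elsewhere and is not shown.

```python
# Rough sketch of the SCC-based redundancy argument (my reading, not the paper's
# exact algorithm). Phis whose operands all stay inside the SCC are "inner";
# operands from outside are collected. One outside value => whole SCC collapses.

class Phi:
    def __init__(self, name):
        self.name, self.operands = name, []

def process_scc(scc, replace):
    if len(scc) == 1:
        return                      # single phis are handled by the trivial-phi check
    inner, outer_ops = [], set()
    for phi in scc:
        is_inner = True
        for op in phi.operands:
            if op not in scc:
                outer_ops.add(op)
                is_inner = False
        if is_inner:
            inner.append(phi)
    if len(outer_ops) == 1:
        value = outer_ops.pop()
        for phi in scc:
            replace(phi, value)     # the whole SCC collapses to the one outside value
    elif len(outer_ops) > 1:
        # Only the inner phis can still be redundant; recursing on their SCCs
        # is assumed to happen in the caller and is omitted here.
        pass

# Tiny worked example: x1 = phi(x0, x2); x2 = phi(x1, x1) -- one outside value x0.
x0 = "x0"
x1, x2 = Phi("x1"), Phi("x2")
x1.operands = [x0, x2]
x2.operands = [x1, x1]
process_scc({x1, x2}, replace=lambda phi, v: print(f"{phi.name} -> {v}"))
# prints (in some order): x1 -> x0, x2 -> x0
```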
-
Critique: At the end of section 1 (introduction), the authors mention that their SSA algorithm is the only algorithm to construct a minimal and pruned SSA program on reducible CFGs without relying on other analyses. However, in section 3 they then describe multiple different optimizations which, to me, seem like non-SSA analyses (i.e. common subexpression elimination, copy propagation, etc.). I'm a bit confused about what, then, distinguishes their algorithm from the Cytron algorithm; what exactly is meant by their algorithm "not relying on other analyses"? Question: I'm still a bit confused about the undefined value, which is assigned in the case where the phi node contains only itself as an operand (p. 5). Why is this phi node necessary? Why can't we just drop it?
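My best guess at the answer, sketched below (this paraphrases the trivial-phi removal routine as I remember it; the class is a stand-in, not the paper's code): a phi whose only operand is itself means no real definition reaches that read, so the phi is replaced by an explicit undefined value rather than silently dropped, because any users of the phi still need some operand to point at.

```python
# Sketch of the trivial-phi check. A phi is trivial if it references at most one
# value besides itself; if it references ONLY itself, no real definition reaches
# it, so it becomes an explicit Undef marker that its users can refer to.

UNDEF = "undef"

class Phi:
    def __init__(self, name):
        self.name, self.operands, self.users = name, [], []

def try_remove_trivial_phi(phi):
    same = None
    for op in phi.operands:
        if op is same or op is phi:
            continue                  # self-reference or repeat of the same value
        if same is not None:
            return phi                # merges two distinct values: not trivial
        same = op
    if same is None:
        same = UNDEF                  # the phi referenced only itself
    for user in phi.users:            # users assumed to be phis in this sketch
        user.operands = [same if op is phi else op for op in user.operands]
        if isinstance(user, Phi) and user is not phi:
            try_remove_trivial_phi(user)   # users may have become trivial too
    return same

# The self-only case from page 5: a phi whose sole operand is itself.
p = Phi("p")
p.operands = [p]
print(try_remove_trivial_phi(p))   # -> 'undef'
```

So the phi itself does disappear; what survives is an undefined-value marker standing in for a read that no definition dominates.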
-
I found this paper really interesting because it takes a fresh approach to SSA construction that avoids some of the traditional inefficiencies we talked about in class. The main idea is to construct SSA form directly from an AST or bytecode, rather than first building a CFG in non-SSA form. That immediately seemed like a smart idea because it skips unnecessary intermediate transformations, which can be expensive in compilers that are already SSA-based. The contrast with Cytron et al.’s classic SSA construction algorithm was particularly insightful. Cytron’s approach is widely used and ensures minimal φ-function placement, but it comes with a lot of overhead—requiring dominance frontiers and liveness analysis. This paper’s method flips that by working backward in a lazy manner: instead of precomputing where to insert φ-functions, it only places them when a variable’s definition is needed. That makes the process more dynamic and, in a way, feels more natural than the eager approach Cytron’s algorithm takes. One thing that clicked for me pretty easily was the way the authors handle variable definitions. The local and global value numbering parts made a lot of sense—local value numbering works fine within a single block, and if a definition isn’t found, they search backward through predecessors to determine where to place φ-functions. The lazy placement of φ-functions was a neat trick, especially with the memoization step to avoid redundant computations. That said, some parts of the paper took a bit longer to sink in. For example, the handling of incomplete CFGs felt a little abstract at first. The idea of “sealing” blocks—where a block is marked as final so no new predecessors can be added—wasn’t something I had seen before in SSA construction. It makes sense after thinking about it, though: when processing an AST directly, some control flow edges don’t exist yet, so you need a way to handle missing predecessors and later fill in the blanks when the structure is complete. Another part that stood out was the emphasis on minimizing unnecessary φ-functions. The algorithm doesn’t just avoid placing them blindly but also actively removes redundant ones, even in irreducible control flow graphs (which Cytron’s algorithm doesn’t do as cleanly). I also liked how the paper connected SSA form to functional programming ideas like CPS. The connection between φ-functions and function parameters in CPS wasn’t something I had thought about before, but it makes sense that they’re just two ways of encoding control flow dependencies. One thing I kept wondering was how this method compares to more modern compiler optimizations. The benchmarks showed that it’s competitive with LLVM’s implementation of Cytron’s algorithm, but I’d be curious to see how it scales in large, industry-scale compilers. Would just-in-time compilers benefit from this, or are there cases where the eager approach might still be better? Also, since this algorithm relies on backward search for definitions, I wonder if it ever leads to worst-case performance scenarios in deeply nested control structures. |
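The phi/CPS connection clicked for me with a small, purely illustrative Python analogy (nothing here is from the paper): treat the join block as a local function and the phi as its parameter, with each predecessor edge passing its own operand as the argument.

```python
# Illustrative analogy only: a phi at a join block plays the same role as the
# parameter of the join's continuation; each predecessor "edge" passes its value.

def branchy(cond):
    # SSA view:                  CPS / block-argument view:
    #   then: x1 = 1                call join(1)
    #   else: x2 = 2                call join(2)
    #   join: x3 = phi(x1, x2)      def join(x3): ...
    def join(x3):                  # the phi becomes the parameter x3
        return x3 * 10
    if cond:
        return join(1)             # the then->join edge passes its operand
    else:
        return join(2)             # the else->join edge passes its operand

print(branchy(True), branchy(False))   # 10 20
```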
-
This paper was a good read, as it introduced a way to create SSA form that seems lightweight yet intuitive. Instead of the algorithms we described in class, which rely on having already created a dominance graph or which create many extraneous definitions, this one simply uses local and global value numbering, lazily working backwards from reads to find all writes that could have been the most recent write and using a phi node to switch between those values. I like the introduction of LVN and GVN into this because it easily handles situations where there would be multiple writes on different branches but all containing the same value. This prevents additional phi nodes from being created in those cases in a way that a naive implementation wouldn't. I am a bit confused regarding the lack of a non-SSA CFG. The paper touts that their algorithm doesn't require creating a non-SSA CFG, but their algorithm requires knowing the predecessors of a block, and if you have those, then you might as well have constructed a non-SSA CFG (or at least, the transpose of one). Is there some level of nuance here that I missed? Discussion question:
-
Overall, I found this paper to be interesting. It presents an efficient and straightforward algorithm for SSA construction, one that works directly from the source language without passing through a non-SSA CFG (unlike Cytron's algorithm) and that is well suited to performing several standard optimizations. I found that the authors substantiated their claim well by providing proofs that the SSA construction algorithm constructs pruned SSA form for all programs and minimal SSA form for programs with reducible control flow. For the on-the-fly optimizations section, I think it would be better if the authors were more explicit about the caching technique that is used to speed up the triviality checks. The caching mechanism assumes that if the two witnesses remain distinct, the phi function remains non-trivial; however, it does not address the possibility that optimizations elsewhere in the program could later make these operands identical and thus make the phi function trivial. With this, the mechanism might require invalidating and recomputing witnesses under such circumstances, which is not explicitly discussed. Also, with regard to the variant that only optimizes on the unique operand and does not guarantee minimal SSA form for all programs, I think a brief discussion of the specific classes of programs where this simplification is particularly effective or problematic could be useful, or of how often this approach fails to produce minimal SSA form in practice. My discussion questions relate more to the design of SSA construction algorithms. In the paper, the authors state that one reason the Cytron algorithm is still very popular is that it guarantees a form of minimality on the number of placed phi functions. They also state that there is a range of construction algorithms which aim for simplicity instead, which is essentially the goal of their algorithm. I was wondering what the trade-offs between simplicity and minimality in SSA construction algorithms are. How can compilers dynamically decide when to prioritize simplicity over minimality, and are there specific scenarios where the cost of achieving one outweighs the benefits of the other?
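To make that invalidation concern concrete, here is a hedged sketch of how I picture the witness caching working, based only on the description above (the representation is mine, not the paper's): cache two distinct operands as proof of non-triviality, and fall back to a full re-scan whenever the cached pair no longer appears distinct in the current operand list, which is exactly the kind of re-check that later operand rewrites would require.

```python
# Hedged sketch of witness caching for triviality checks (my own representation).
# Two distinct cached operands prove non-triviality; if later rewrites invalidate
# the cached pair, the full operand scan runs again.

class Phi:
    def __init__(self, operands):
        self.operands = operands
        self.witnesses = None          # cached pair proving non-triviality

def is_nontrivial(phi):
    w = phi.witnesses
    if w is not None:
        a, b = w
        if a is not b and a in phi.operands and b in phi.operands:
            return True                # cached proof still valid: skip the scan
    seen = []
    for op in phi.operands:
        if op is phi:
            continue                   # ignore self-references
        if not any(op is s for s in seen):
            seen.append(op)
        if len(seen) == 2:
            phi.witnesses = (seen[0], seen[1])
            return True
    phi.witnesses = None
    return False                       # at most one value besides the phi itself

x, y = object(), object()
p = Phi([x, y]); p.operands.append(p)
print(is_nontrivial(p))                # True; witnesses (x, y) cached
p.operands = [x, x, p]                 # an optimization later made operands identical
print(is_nontrivial(p))                # False; stale cache detected, full re-scan
```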
-
Critique: I found the discussion on on-the-fly optimizations to be pretty interesting, as it seems like this is one of the areas where the authors' implementation offers an advantage over Cytron's formulation. However, like many others, I am not fully convinced by the performance evaluation of this algorithm since they only briefly state the non-optimized algorithm is "as fast as Cytron et al.'s algorithm in practice," without any formal comparison. They mention that using on-the-fly optimizations could further improve this and that it offers a few seconds of improvement in total compilation time, but it's hard to tell whether this slight improvement is worth the more complicated algorithm design. I imagined that removing extra passes and exchanging them for this more "integrated" algorithm should come with a significant performance improvement to make the less clean design worth it. Discussion Question: The authors describe their algorithm as simple and efficient. While they provide a complexity and minimality argument to argue for efficiency, I'm not entirely convinced about its simplicity. It does avoid auxiliary computations like finding dominance frontiers, but it seems they still recursively trace through program flow in a way that could be seen as indirectly performing the same computations. In practice, what is the "simplicity" advantage of this SSA construction algorithm? How well does this algorithm align with the practical concerns such as ease of implementation and maintainability? |
-
Critique. This was a great paper, indeed presenting a fascinating insight into how to work with SSA, and in fact combining principles from across the several topics that we have covered throughout this course. While we have principally been concerned with working from a basic block -> CFG -> analysis approach, this one seems to skip several steps and focus more on directly reaching SSA. At the same time though, the intelligent usage of LVN, along with the introduction of GVN, makes for a fascinating parallel. This definitely further opened my eyes to the power of SSA. As a final point, I think the paper did a great job of clearly outlining both the algorithm and any relevant information needed for the process. Discussion Question. A big part of benchmarking previous algorithms (i.e., Cytron et al.) against one such as this seems to revolve around a point I brought up earlier – distinguishing between thinking of compilers in terms of SSA, or in terms of the more primitive constructs from earlier. Setting aside efficiency, which poses the nicer mental model?
-
This was one of the most mind-blowing papers I've read. I was super impressed by how they computed phi nodes lazily etc. and optimized the result to be provably minimal. I thought their presentation was extremely lucid, and most of my confusions usually stemmed from me misreading or glossing over a particular core definition rather than any failure of content or style. I still don't fully understand the proof for Lemma 1 because they don't establish the nontriviality of
-
Like others have mentioned, it would be nice if there was some comparison of the interpretability/maintainability of their algorithm vs. Cytron's. It seems unclear from section 2 when the CFG (in SSA form) is actually constructed -- the pseudocode presented in sections 2.1 & 2.2 seems to assume that a half-built CFG is already present, since the functions they present take blocks as arguments and refer to the block's predecessors. (After reading some of the other comments here, I realized that the SSA CFG is built on the fly.) However, one might argue that 1.) the fact that their algorithm minimizes the no. of φ functions (since trivial ones are optimized away and the authors bake in various optimizations that only require local analysis, like CSE, per section 3.1), and 2.) the fact that their definition of minimality is more strict than Cytron's (section 3.2), together suggest that their algorithm produces more "readable" SSA code with fewer φ functions. Discussion question: how does one balance the trade-off between the interpretability of an algorithm that produces optimized code, versus the interpretability of the code that was generated? As the authors point out in section 1, Cytron's algorithm produces minimal SSA code, but the fact that it requires computing dominator trees/frontiers and performing liveness analysis over a non-SSA CFG makes the algorithm hard to use within SSA-centric compiler pipelines.
-
Discussion Leads: @InnovativeInventor @devv64 @neel-patel-1