Lesson 12: Dynamic Compilers #458

sampsyo · 2025-01-21T20:33:10Z

sampsyo
Jan 21, 2025
Maintainer

Here's the thread for the dynamic compilers task, which involves doing some speculative transformations on Bril IR!

UnsignedByte · 2025-04-24T04:41:21Z

UnsignedByte
Apr 24, 2025

Tracing

For tracing I modified the brilirs interpreter (just by copying over the source) to add tracing at every command. Tracing begins from the top of main, though this would be pretty trivial to change with my current implementation. This interpreter is shortened so that things like print do not actually write to any output stream. The trace I implemented enters functions and has a limited max length (configurable, but defaulted to 100 traced instructions). Traces automatically abort early if they see any instruction that may have "side effects", meaning any instruction outside of a specific pure set I defined (same as the ones used in LVN) would automatically abort the trace, which eliminates the issues caused by memory instructions, print, etc.

The tracer runs on every executed instruction, and automatically converts branches to guards and handles function scopes and returns. I used a pretty messy method of doing this - I keep a stack of all the function names, and simply prepend the stack to all names inside. Returns and function arguments are generated by creating id assignments between the variable names, which will later be removed by LVN and DCE.

All guards lead to a .__trace_failed label, which occurs right after the speculative block and essentially continues code from the top of main. Additionally, I create a .__trace_succeeded label to jump to after committing, which goes to the first instruction that was not added to the trace. With all this, the simple example we were shown in the lecture videos becomes as followws:

@main(x: int) {
  speculate;
  one: int = const 1;
  hundred: int = const 100;
  y: int = add x one;
  cond: bool = lt x hundred;
  guard cond .__trace_failed;    # Make sure x < 100
  __trace_f_r: int = sub y one;  # Inlined @f
  z: int = id __trace_f_r;
  commit;
  jmp .__trace_succeeded;        # Jump on success
.__trace_failed:
  one: int = const 1;            # Redo operations on failure
  hundred: int = const 100;
  y: int = add x one;
  cond: bool = lt x hundred;
  br cond .then .else;
.then:
  z: int = call @f y;
  jmp .exit;
.else:
  z: int = call @g y;
  jmp .exit;
.exit:
.__trace_succeeded:
  print z;                       # Print was not traced
}
@f(a: int): int {
  one: int = const 1;
  r: int = sub a one;
  ret r;
}

Testing

My implementation succeeds on all the benchmarks in the bril repository. I also tested specific cases to make sure the speculation works whether or not the speculated code succeeds, by modifying the turnt script to split the {args} array into two, using the first half for speculation and the second for actual interpretation.

Summary

I think my main takeaway from this was mostly how messy implementing traces felt. Obviously this was not helped by the fact that brilirs wasn't made to be a library, so there was a lot of copying code necessary to actually modify the interpreter. I think I spent more time handling all the "early exit" possibilities to leave the interpreter once the trace was done, as the interpreter was naturally recursive and already returned Results so it was a bit messy to propogate multiple possible situations up (error, done tracing, return value). Figuring out where to end the trace and jump to was also interesting, especially as traces could technically end in the middle of a function. I managed to avoid this issue as all local function variables were prefixed, but one downside of this is obviously that if a trace ends halfway through a function, the half that is computed is kept in the trace, and so work is redone when the trace commits and jumps straight back to the function call. This could be removed by later (global) DCE passes though, as all variables generated in the function body are never used anywhere in the program.

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Super cool! I like your idea of mangling the names of locals to provide "automatic inlining" during tracing while still keeping the various scopes separate. (And I like that this would work even in the presence of (mutual) recursion.)

I honestly haven't heard of anyone else hacking up brilirs for this task before, so it does stand to reason that it would turn out somewhat messy. Glad it worked out in the end…

mt-xing · 2025-04-24T18:00:12Z

mt-xing
Apr 24, 2025

Code: https://github.com/mt-xing/cs6120/tree/main/l11

This was another unambitious week as I get increasingly swamped by end-of-semester stuff.

My tracing interepreter modifies brili to trace exactly one path starting at the top of main up until 64 instructions have been traced, or any instruction with potential side effects, including function calls. I opted not to trace into function calls because I didn't have an easy way to deal with collisions of variable names, nor an easy way to reconfigure the stack if my trace ended inside of a nested function due to reaching the instruction limit or a side-effecting function. The trace, along with additional debug information such as the exact index of the final instruction and also which of any branches was taken are all printed to standard error for my code to parse.

My code will invoke the modified interpreter with arguments it is given and read the trace. It then reconstructs a new program, with the entire path of the trace as the first thing executed, wrapped in a speculate block. If a guard fails, it jumps to a label at the bottom of the new program, with the full text of the original function pasted there. Each guard is composed of the same condition as a branch encountered in the original trace. If the branch in the trace was false, an extra instruction is emitted to first not the condition of the branch variable before passing it to guard. If the trace terminates because the maximum number of instructions was traced, then the speculation is committed before a jump instruction is run pointing to a new label that is inserted before the next instruction to run in the original program.

I tested my implementation against the entirety of the bril core benchmarks for correctness. On average, my implementation causes an additional 8.036 dynamic instructions to be executed, although the minimum was a decrease of 7 dynamic instructions - so there is at least one program for which my "optimization" actually optimized something. We'll just ignore the max of 289 additional dynamic instructions, and the median of +3.

I also hand-crafted test cases, deliberately causing the trace to take a different path then when I execute the final code. These too exhibited the same behavior as the reference interpreter on the original program, although they do not improve the performance at all, for obvious reasons.

The hardest part of this assignment was figuring out conceptually what information exactly I needed from my tracing interpreter to be able to reconstruct not just the trace itself but also all the jumps and labels needed to resume execution when the trace either ends or needs to abort. The implementation ended up being relatively easy once I conceptually figured out what I needed, although admittedly my extremely unambitious implementation probably helped a lot.

Still, this is a working tracing interpreter, and there's even a non-zero number of programs that it truly does optimize, so I'll consider that a successful implementation.

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Very cool! Even if you don't think this was very ambitious, you got the whole thing working end-to-end and even managed to make a convincing case that it is correct. 😃 I say that counts as a great outcome!!

parthsarkar17 · 2025-04-24T18:27:30Z

parthsarkar17
Apr 24, 2025

Code

Here and here are my modifications to the Bril typescript interpreter, and here is my trace-injector code.

How it works

Getting the Trace

I modified the Typescript interpreter to write instructions (in .jsonified string format) to stdout, with commas and brackets inserted in the right spot. Then, in my own binary, I opened a channel to read these lines as input, concatenated them, and converted them to an OCaml json representation. From there, I got a list of Bril.Instr.t values representing the trace. For a given Bril program, the trace I get represents the entire program execution for the the default arguments (i.e. values following ARGS). I accept both the program file path and default arguments as parameters to my function get_trace.

Filtering the Trace

Since I was a little short on time this week, I made the simplifying assumption that all of my input programs would be standalone, call-less main functions. That way, I wouldn't have to deal with the added complexity of call or ret instructions. So, to transform the trace, I:
1. Got rid of jmp instructions
2. Transformed brinstructions into guard instructions. I did this by keeping a map from each label to the first instruction after the label. Then, when I encountered a br G L1 L2 in the trace, I would check which of L1 and L2 the instruction after this branch corresponded to. If the true branch was taken, I would simply replace br G L1 L2 with guard G __b0. If the false branch was taken, I would create another variable storing not G, and use this variable in my guard instruction.
3. I would pad this new trace with speculate on top and commit on bottom. Then, I would append a jmp instruction to a new dummy label at the bottom of the program, so that, if the trace executes fully, then no other part of the program would be executed. Finally, I would append the entire, original body of main after this trace, and then tag on the dummy label I mentioned before.
4. For every guard instruction, the default label if it failed would just be __b0, which is my code's default name for the first block to execute in the original program (before trace-insertion). Basically, if the trace failed, we could always just bail out and start doing what the program originally intended.

I then replaced the old main with the new, trace-inserted main

Testing + Results

I tested correctness on two designs that met my assumptions, loopfact.bril and collatz.bril. For each design, I used two inputs, one "default" input on which I benchmarked performance, and a second, different input to reduce overfitting.

For loopfact.bril, on the default argument of 8, I observed a total dynamic instruction count of 113 for my trace-inserted code, compared to 116 observed for the baseline. On the random entry (10), I observed an instruction count of 250 for my code, compared to the 142 for baseline. This was expected! At least the arguments I optimized for got a little better, haha.

For collatz.bril, on the default arguments, I observed 164 and 169 respectively. For the non-default argument I tried (6), I observed 102 and 89 respectively. Again, modest improvements for the case I specifically optimized for, but otherwise, significantly worse performance.

All in all, this was a really cool exercise! Given more time, I would have liked to improve the trace code and support interprocedural traces. That way, my code would hopefully be able to bail out of the trace earlier, saving me some of the performance decreases I just mentioned.

Hardest Part

This is a little dull, but I was having some trouble getting the trace to be correctly emitted by the interpreter. I kept running into JSON formatting errors, even though I thought I was emitting stuff correctly. It always turned out to be something minor, and I eventually got it working, but it took a lot of printing and debugging. Other than that, I spent a while thinking about interprocedural tracing, which got a little convoluted. Eventually, I got a good idea of what I thought I should do, but ultimately decided to stick with the simplification for this week.

Star

I think my work deserves a star!

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Awesome; nice work here! Even restricting yourself to single-function call-free code is a perfectly reasonable way to get this to work. 😃

aw578 · 2025-04-25T01:03:11Z

aw578
Apr 25, 2025

Code

For this week's assignment, I modified the interpreter to add tracing by printing out every instruction and label read, starting at the top of main. For branch statements, I also recorded which path the branch took. I then took the trace up to the first instruction with a side effect or backedge. When I encountered a branch, I added a not statement (depending on which branch was taken) and then a guard. Failed executions went to a label I added at the top of the original main statement, and successful executions went to a label I inserted before the original location of the last instruction in the trace. The one edge case I found was that I needed to ignore the last instruction if it was a branch.

I tested the code against the core benchmarks, with both the default inputs and the default inputs with 1 subtracted from every numeric input. Unfortunately, stopping the code at the first backedge kinda meant that I was unable to really improve performance, as the number of instructions executed for every program besides pythagorean_triple increased.

I don't think anything stood out as particularly difficult this week, but there were a lot of errors when trying to match the instructions output by the interpreter to the original instructions in the Bril program. I used Copilot to generate the initial code from a description and sample trace, then passed in the documentation for the Bril speculate extension. I still had to fix a few issues and figure out all the edge cases myself, but it did fine for a first draft. I think I deserve a Michelin star this week for fully supporting and testing against the core benchmarks!

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Cool; that sounds good, @aw578! You're right that making this correct on all the benchmarks is a worthy achievement. 😃

katherinewu312 · 2025-04-25T01:55:22Z

katherinewu312
Apr 25, 2025

Group (@ngernest, @katherinewu312, @samuelbreckenridge)

Code

Our modified reference interpreter records traces from the beginning of the main function for a configurable number
of dynamic instructions (currently 1000). Using the resulting trace as output, our optimizer (a separate Python script) parses the trace to identify the sequence of instructions between two backedges that occurs the most frequently in the main function and chooses this as the "hot path" to optimize for. We kept our analysis intraprocedural and just bailed out of tracking paths on function calls / effect operations e.g. print. Once the hot path is identified, we prepend a speculation block to the original sequence of instructions that eliminates jumps and converts branches to guard conditions that if failed will cause execution to jump to a hotpathfailed label where the instructions will execute normally not on the "fast path". At the end of the speculation block we commit and jump to a hotpathsuccess block. The most difficult part of this implementation was probably determining what information needed to be recorded by the trace versus could be inferred by the optimizer. Initially, we thought we could do all the work in the TypeScript interpreter, but we quickly realized this was not going to be the best design since we want to enforce some separation of concerns between actually interpreting
instructions and emitting the modified trace. We settled on a relatively "dumb" interpreter and did most of the work of identifying hot paths and constructing the new program in the optimizer, but we had to iterate a bit to find what worked.

Here is essentially a high-level overview of how we stitch the trace back into the program:

@main(num: int) {
    # code 1...
    # hottest trace...
    # code 2...
}

@main(num: int) {
    # code 1...
    speculate;
    # hottest trace...
    # something about branching to .hotpathfail
    commit;
    jmp .hotpathsuccess;
    .hotpathfail:
    # hottest trace...
    .hotpathsuccess:
    # code 2...
}

We tested our implemention on several core bril benchmarks. Specifically, we tested on br.bril, factors.bril, perfect.bril, and while.bril, all of which but one contained loops. First, we used brench to test correctness on one input, i.e. when the default input and the input to reduce overfitting matched. Despite an increase in dynamic instruction count between the baseline and speculative optimization pipelines, the tests all pass. We then tested on multiple inputs, using one input to generate the trace and optimize the program and the other to check correctness. We used 3 as our default input for all benchmarks. Specifically, here are our results, br.bril is left out since it doesn't take in any arguments:

factors.bril:

default total_dyn_inst: 24

input	total_dyn_inst
10	64
15	64
20	74

perfect.bril:

default total_dyn_inst: 13

input	total_dyn_inst
10	46
15	46
20	58

while.bril:

default total_dyn_inst: 22

input	total_dyn_inst
10	119
15	174
20	229

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Awesome; nice work, y'all! Thanks for the illustrative example of how the "stitching" worked for you—and it's interesting to see how this affected the "running times" for a few of the benchmarks individually.

zihan0822 · 2025-04-25T02:08:44Z

zihan0822
Apr 25, 2025

Code

bril fork

Tracing

I modified the typescript version of bril interpreter. Tracing starts when the number of times a back edge has been executed passes a pre-determined threshold (in my case: 5) and ends when this back edge is reached again. My tracing is inter-procedural. If we encounter call during tracing, the invoked function is inlined in the trace. We add some extra register copy instructions to pass the parameters and retrieve the return value. To avoid the instructions in the callee function shadows the variable defined in the caller function, we maintain a stack of function name we have entered during tracing, we prepend all variables defined in the invoked function with its function name. We get rid of jmp and replace br with guard. For false guards, an extra not instr will be added. I have not introduced an extra fail label for guard, instead when guard fails it will directly jump to the label where the trace starts. I kind of take advantage of how brili is implemented to make this work, guard instruction will be turned into a jump action to the original label. All br and jmp will be intercepted to see if there is a trace of the label it plans to jump to, if so we will redirect that jump action to our trace.

I optimize the collected trace with lvn, dce and const prop passes. I rewrite those passes in typescripts. The trace is wrapped with a speculate and commit before it gets appended to the instruction buffer. My current impl won't trace nested loop, only the inner loop that gets hot first is traced.

Testing

I handcrafted a program that runs a single loop and whether its branching behavior changes after the trace is collected (fifth time the backedge is reached) is controlled by the input. I am able to see my tracing jit behaves very differently for different inputs. I also tested my impl to check it won't break any code in benchmarks/core and tried to compare it with baseline brili. I did not run any extra optimization passes on the input program. It's pretty sad that for almost every benchmarks my version of tracing jit has worse performance, except for euclid.bril, which gets a 3.5x speed up. I don't have that much time this week to figure out what happens. I'm expecting tracing jit to give me better performance than this given that I have already optimized the trace. I might come back to this when i have time.
.

Conclusion

I think the most difficult part of this task is to think about what stuff we need to bookkeep in order to generate the correct trace. I think i deserve a Michelin star this week since i got a working tracing jit with trace optimization and test it against all core benchmarks, not perform that well tho.

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Super cool; extremely nice work, @zihan0822! This is great. If you ever have extra time to look, I'd be SUPER interested to know what it was about euclid that made the tracing thing so effective… I wonder if some kind of constant-folding was able to kick in, for example.

scober · 2025-04-25T02:17:35Z

scober
Apr 25, 2025

code

I tried to do the simplest version of tracing I could. For me, that meant inserting the full execution of the program as a speculative trace at the top of main and then bailing to the normal program if any guards failed. In some ways, that is maybe too simple because one way to implement that is to just guard on the inputs and recreate any observed prints with constants (because bril is fully deterministic AFAIK). I didn't want to do that since I felt like it was against the spirit of the assignment so I tried to do some "real" tracing on the honor system.

The trickiest part ended up being handling function calls. I realized that tracing in an interpreter you can use the interpreter to maintain proper lexical scopes for all of your variables, but ahead of time tracing requires some trickery to handle calling functions with colliding variable names (recursive functions, for example). I ended up basically prepending the traced call stack to each variable name to avoid collisions.

Testing wise, I did not work very hard. I hand-tested for correctness and performance on a handful of bril programs. My tracing optimization preserved correctness on the programs I tested it for and it made everything slower. I guess the good news is that it didn't make everything much slower? Because I guard on the inputs at the top of main, when the trace fails it fails early. And the slowdowns for executions where the trace completes are due to extra instructions I add to guard on the program inputs and to guard on false values in the trace body. So the result is that the slowdown is generally a few 10s of dynamic instructions regardless of the original runtime.

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Great! That's a good insight about how this task could degenerate to something especially uninteresting: i.e., tracing just the print effects and replaying those when the input matches. It's not wholly cheating, but it does take on a different flavor.

FWIW, prepending the entire call stack to the variable name is also where @UnsignedByte ended up. It occurs to me that the static approach to "maintain proper lexical scopes for all of your variables" could run into trouble with tracing if the call graph has cycles (i.e., there is recursion), so this could maybe be the best you can do if you want to do interprocedural tracing.

Annacaro22 · 2025-04-25T03:09:02Z

Annacaro22
Apr 25, 2025

Code: here and here

I thought this task was not too bad! I admittedly was not the most ambitious with it, but I found the speculative execution pretty easy to understand, and most of my grief came from the finicky details of having to move instructions around. I only did one speculation, for the first branch found in the first function; I tried for a bit originally to do nested speculations, but got quite muddled in the details of moving basic blocks inside of other basic blocks, and eventually just scaled it back to the one speculation. That much was quite feasible to do.

For testing, I ran it against a few benchmarks to check for correctness (and found a couple bugs-- I was infinite looping at first because I also copied over labels into the speculation code, so the control flow coming from non-speculation code was spontaneously jumping into speculation code when it wasn't supposed to be-- oops! deleting the labels inside the speculation code was enough to fix this). I also used the total_dyn_inst flag to check for performance and saw an increase of ~4 new instructions (makes sense, since I statically added 4 new instructions), and I also checked for clock time, and found a very slight decrease in my clock time for my predicted path (e.g. 0.11 secs to 0.08 secs for predicted input in check-primes), but an increase on non-predicted paths (e.g. 0.08 to 0.09, 0.24 to 0.27, on various size non-predicted inputs to the check-primes benchmark in core). This matches expectation; I'm doing a bit more unnecessary work on non-predicted inputs, but get a performance increase (slight, but would likely be bigger if I did more speculations) on predicted paths.

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Great; this sounds awesome! And that's a useful point about the potential hazard of including labels in the trace—this is easy to do with Bril's specific data structure where labels are "just" instructions, meaning that you need to do something active to skip them when emitting the trace.

gabizon103 · 2025-04-25T06:08:28Z

gabizon103
Apr 25, 2025

code

I implemented a pretty simple form of tracing. I modified the interpreter to print a JSON containing every instruction executed during a run of the program. Then, I pass that to my program which prunes it down to the first 20 instructions from the start of main (this can be any number, I just chose 20). I also create CFGs for all the functions in the input program. Then, I further prune the trace by removing calls (the trace is nice because you kinda get inlining "for free") and empty returns. To make the inlining work, whenever I encounter a call I push the destination to a stack and then pop off the stack when I hit the corresponding return, and replace the return with an instruction like <dest> = id <return value>. I also replace all branches with a guard instead. Then, I put a speculate instruction at the beginning, a commit at the end, and then a jump to a trace_success label after the commit. And that's the trace, which I insert at the top of main.

I also needed to find how to stitch this trace back into the rest of the program, which was a bit tricky. I had a function to compute where this point is, more specifically where we should jump after a successful trace. Jumping out of a trace where a guard fails is easy, you just jump back to whatever the start of main used to be. To find the successful trace jump target, I took the original trace and turned it into basic blocks. Then, the number of blocks and the number of instructions in the last block tells me where in the original program I should split up the old block and jump to. This was a little finicky, so maybe there's a better way, but it at least worked for the hand-written things I tested it on.

I didn't do terribly thorough testing since it was a busy week for me. I tested it on some hand-written programs that aren't super contrived, which I thought was pretty ok. For example, this program

@main(x: int):int {
  one: int = const 1;
  y:int = add x one;
  z: int = const 0;
  hunnit:int = const 100;
  cond: bool = lt x hunnit;
  br cond .true .false;


  .true:
  one: int = const 1;
  two: int = const 1;
  three: int = const 1;
  four: int = const 1;
  five: int = const 1;
  six: int = const 1;
  sev: int = const 1;
  eight: int = const 1;
  nine: int = const 1;
  ten: int = const 1;
  elevent: int = const 1;
  twelv: int = const 1;
  thrteen: int = const 1;
  frteen: int = const 1;
  fifteen: int = const 1;
  z:int = call @f y;
  jmp .return;

  .false:
  z:int = call @g y;
  jmp .return;

  .return:
  print z;
  ret;
}

@f(a: int):int {
  one: int = const 1;
  s: int = sub a one;
  ret s;
}

@g(a: int):int {
  one: int = const 1;
  s: int = add a one;
  ret s;
}

gets transformed to this:

@main(x: int): int {
.main:
  speculate;
  one: int = const 1;
  y: int = add x one;
  z: int = const 0;
  hunnit: int = const 100;
  cond: bool = lt x hunnit;
  guard cond .quit_trace;
  one: int = const 1;
  two: int = const 1;
  three: int = const 1;
  four: int = const 1;
  five: int = const 1;
  six: int = const 1;
  sev: int = const 1;
  eight: int = const 1;
  nine: int = const 1;
  ten: int = const 1;
  elevent: int = const 1;
  twelv: int = const 1;
  thrteen: int = const 1;
  frteen: int = const 1;
  commit;
  jmp .trace_success;
.true:
  one: int = const 1;
  two: int = const 1;
  three: int = const 1;
  four: int = const 1;
  five: int = const 1;
  six: int = const 1;
  sev: int = const 1;
  eight: int = const 1;
  nine: int = const 1;
  ten: int = const 1;
  elevent: int = const 1;
  twelv: int = const 1;
  thrteen: int = const 1;
  frteen: int = const 1;
  fifteen: int = const 1;
  z: int = call @f y;
  jmp .return;
.false:
  z: int = call @g y;
  jmp .return;
.return:
  print z;
  ret;
.trace_success:
  fifteen: int = const 1;
  z: int = call @f y;
  jmp .return;
.quit_trace:
  one: int = const 1;
  y: int = add x one;
  z: int = const 0;
  hunnit: int = const 100;
  cond: bool = lt x hunnit;
  br cond .true .false;
}
@f(a: int): int {
.f:
  one: int = const 1;
  s: int = sub a one;
  ret s;
}
@g(a: int): int {
.g:
  one: int = const 1;
  s: int = add a one;
  ret s;
}

I added all those random const instructions just so the trace wouldn't just be the entire program. Overall I thought this went pretty well. It made me think about all the changes I'd make to this infrastructure if I set out to write a JIT compiler. I think mostly I would want to add a lot more instrumentation to the interpreter, so that you could make smarter choices about when to start/stop tracing.

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Great; all sounds good! And this illustrative program is helpful to see how the stitching worked.

mariasoroka · 2025-04-25T16:28:41Z

mariasoroka
Apr 25, 2025

Here is the code: link

I am recording a trace for a specified function (-f parameter for brili) until the first call instruction. The trace is stored in a file. Stitching of the trace is implemented in stitch.py. I test my implementation on three benchmarks: 'birthday', 'gcd', and 'sum-sq-diff'. I trace the probability, main, and squareOfSum functions, respectively. I record the trace for the first set of arguments specified in test.py and run the assembled code on the complete set of arguments.

Here is the comparison of performance.

For the first set of arguments, the modified code performs a bit slower because of the additional guard, commit, and speculate instructions. Also, to make guards work correctly, I had to include not instructions. Now, my code looks like that in places where the branching condition did not hold: v5 = not v5; guard v5 .default; v5 = not v5;.

For the rest of the arguments, the modified code performs slower because it cannot reuse the trace.

I routinely use Copilot when coding. I used it this time as well. I did not use any other GenAI tools.

3 replies

sampsyo Apr 29, 2025
Maintainer Author

Looks good, and thanks for including some plots of performance! What is the y-axis on these plots? Is that the input value to the program? (I'm guessing that the colors indicate tracing enabled vs. tracing disabled.)

mariasoroka Apr 29, 2025

Right, I completely forgot. The y-axis is the number of instructions (brili -p). The green color is performance before tracing and the yellow color is after tracing. Inputs are just some arguments I came up with. For gcd, for example, they are 10 and 20, 15 and 25, 9 and 42, 12 and 18.

sampsyo Apr 30, 2025
Maintainer Author

Ah, that makes sense! Thanks!

neel-patel-1 · 2025-04-26T02:17:35Z

neel-patel-1
Apr 26, 2025

Offline Trace-Based Optimizer

This task was a collaboration between @arnavm30 and @neel-patel-1. We implemented an ahead-of-time, interprocedural, trace-based optimizer which uses intruction counts for hotness detection.

Hotness detection && Trace Collection: We modify the baseline brili interpreter to implement hot code detection and trace collection. For hot code detection, we add a counter to the Opobject to count the number of times an instruction is executed. When a hotness_threshold (2 in our implementation) is reached, trace-brili will begin emitting instructions to a trace file.
Trace Optimization and Insertion: Traces are optimized in a multi-stage pipeline. First traces are passed through inline.py to remove call and ret instructions, substitute arguments with the caller's variable names and handle any naming conflicts within the callee. Next, they are passed through lvn.py for local value numbering and tdce.py for dead code elimination. To prepare traces for insertion back into the original program, optimize_and_insert_trace.py (1) wraps the optimized trace with a pair of speculate and commit instructions, (2) replaces each branch with a guard instruction, and (3) uses trace-brili's <start_label><start_label_offset><end_label><end_label_offset> trace naming convention to find the code location to insert the optimized trace.
Limitations: Our implementation does not properly handle recursion. Our test cases and evaluation was limited to non-recursive programs.

One interesting aspect of our implementation that highlights a challenge with trace insertion in an ahead-of-time, trace-based optimizer is shown in the below example when the optimizer is applied to code containing a loop. Before optimization hot_loop.bril's loop body executes five times:

@main(x: int) {
  one: int = const 1;
  five: int = const 5;

  it: int = id one;
  accum: int = const 0;

.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;

.for.body:
  it: int = add it one;
  accum: int = add accum it;

  jmp .for.header;

.for.done:
  print accum;
  ret;
}

on the second iteration, the loop header and body become hot, so trace-brili begins collecting instructions. The trace contains only the hot instructions from the executed instruction sequence, so is missing the first iteration of the loop:

.for.header:
.for.body:
.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;
.for.body:
  it: int = add it one;
  accum: int = add accum it;
  jmp .for.header;
.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;
.for.body:
  it: int = add it one;
  accum: int = add accum it;
  jmp .for.header;

Such a trace would cause the program to produce incorrect results. To address this, we pass traces through fill_labels.py before subsequent optimizations. fill_labels.py can detect whether a trace contains a loop and fill in any missing iterations. fill_labels.py was generated by ChatGPT's o3 model. In this case it was helpful to have a model that could quickly generate a script that could fill in missing loop iterations based on a pattern in the label names.:

.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;
.for.body:
  it: int = add it one;
  accum: int = add accum it;
  jmp .for.header;
.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;
.for.body:
  it: int = add it one;
  accum: int = add accum it;
  jmp .for.header;
.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;
.for.body:
  it: int = add it one;
  accum: int = add accum it;
  jmp .for.header;
.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;
.for.body:
  it: int = add it one;
  accum: int = add accum it;
  jmp .for.header;
.for.header:
  cond: bool = lt it five;
  br cond .for.body .for.done;

We run our optimizer on three of our own examples and one core bril benchmark.

hot_loop tests our optimizers ability to detect hot loop headers/bodies and generate correct, optimized traces
assign_and_print repeatedly assigns to a variable and prints immediately. Since traces are emitted when operations with side-effects (prints) are invoked, this example generates many small traces.
loopfact computes the factorial of a number using a loop. This example is similar to hot_loop, but with a bigger body.

Program	Dyn Inst Count (Unoptimized)	Dyn Inst Count (w/ Trace-based Opts )
`hot_loop`	30	26
`assign_and_print`	60	150
`loopfact`	116	53

2 replies

sampsyo Apr 29, 2025
Maintainer Author

Really cool! It's awesome that you were able to do some actual hotness detection to decide where to initiate a trace.

Makes sense that you didn't address recursive programs when tracing interprocedurally; that does add another layer of complexity. Just in case you're curious, @UnsignedByte (the first post in this thread) describes one potential way that you can deal with arbitrary call graphs.

fill_labels.py can detect whether a trace contains a loop and fill in any missing iterations.

Huh, that sounds really interesting… I'd be interested to know more. How does it do this? What constitutes a "loop" in a trace—is it the second occurrence of a given label? And what does it mean to generate additional iterations—just copying and pasting code?

neel-patel-1 May 2, 2025

Thanks! This was an interesting assignment.

Our goal was to be able to reinsert optimized traces into the program and implement some form of hotness detection. Our hotness detection procedure -- which used an instructions' execution count to decide hotness -- ended up complicating trace reinsertion, since our traces would leave out some iterations of hot loops. We ended up GPT'ing a script to fill in the missing iterations using the labels emitted in the trace, since trace-brili always emits labels.

Although our implementation works on the examples we presented, we would probably be able to break this with a more complex example (e.g., loops with control flow). The fill_labels script would not be able to accurately "fill in" instructions between empty label pairs -- currently fill_labels only fills in instructions between an empty pair of labels if every other occurrence of those labels has the same set of instructions in between.

One fix would be to disable hotness detection and just start tracing at the beginning of main, which is what we did originally.

We might also be able to keep our current hotness detection and modify trace-brili to prepend traces with the cold instructions that turn hot later on (e.g., due to being in a loop). Then we wouldn't need fill_labels at all.

Cool to see that @UnsignedByte was able to get recursion working. We were thinking of either trying to get recursion or hotness detection working and ended up going with the latter.

lisarli · 2025-04-27T03:45:32Z

lisarli
Apr 27, 2025

In collaboration with @bryantpark04 and @dhan0779
source

Constructing Traces: We construct traces by modifying brili to log up to 20 instructions, starting from the beginning of main and including all function calls, to an output file. We also track the label and offset of the final traced instruction, which informs where the trace should exit to once we stitch it into the original program. While tracing, we also log whether branches are taken or not taken. To stitch the trace into the program, we wrap the trace in speculate and commit, and we insert two labels .trace_success and .guard_failed. We handle function calls by maintaining a rough call stack to rename arguments and assign return values, and we remove jumps and replace branches with guards.

Performance: We checked the correctness of our stitched programs by ensuring the outputs matched the original program for various inputs. To evaluate performance, we used a handwritten program jumps.bril and two core benchmarks, reverse.bril and perfect.bril. jumps.bril is a short program intended to highlight the advantage of using the hot path to eliminate jumps and branches, and we observed a reduction of 4 dynamic instructions between the baseline and trace-optimized versions for hot path inputs. For other inputs, we observed an increase of 6 dynamic instructions over the baseline. For reverse.bril, we actually observed an increase of one instruction even for the hot path inputs, which could be associated with the extra instruction for negating branch conditions. The increase is greater for non-optimized inputs (around 17 instructions). We observed a similar pattern for perfect.bril, with an increase of 2 instructions on the hot path and 16 on the non-optimized path.

Conclusions: This assignment was pretty enjoyable, as it was cool to work with JIT concepts even in an AOT manner. The main challenge we faced was determining how to track where to stitch the trace back into the original program, particularly how to track the location in which to insert the .trace_success label, and we also had a couple of bugs at first with handling function calls and overwriting variable names.

1 reply

sampsyo Apr 29, 2025
Maintainer Author

Great; all sounds good! You're absolutely right that doing all the "plumbing" to figure out where a trace belongs in the original program is a tricky part. Glad it worked out in the end!

ananyagoenka · 2025-04-30T22:28:47Z

ananyagoenka
Apr 30, 2025

Source Code

I extended the reference Bril interpreter (brili.ts) to support a runtime trace (--trace N) and an injection tool (apply_trace.ts) that turns the first N dynamic operations of the main loop into a speculative “fast path.” At runtime we:
1. Trace the first N instructions (pure integer ops + branches → guards) into trace.json.
2. Run apply_trace.ts to wrap those N ops in

speculate;
  …fast-path ops… 
commit;
jmp loop;
.trace_fail:
  …original loop body…

On any guard failure we abort back to .trace_fail and replay the original code; on success we commit and jump back up.

I ran it on a handful of core benchmarks—Armstrong numbers, factorial, power, and so on—both with and without tracing applied. For example, on armstrong.json (input 407) the plain interpreter did 133 dynamic instructions, and the traced version (with my unoptimized 4-iteration trace) did 149; on fact.json (input 10) I saw 119 instructions untraced vs. about 120 traced. I also confirmed correctness by comparing program outputs (true / 3628800) on multiple inputs.

Keeping the trace buffers, the tracing flag, and the speculative state (specparent) correctly threaded through recursive calls was tricky. My first attempt reset or dropped trace state on each call. The final solution simply carries the single tracing boolean and the growing traceInstrs array in the shared State object (and only forks it when you actually enter a speculate), so nested calls no longer break the recording.

This is admittedly a pretty rudimentary implementation.
• No heap rollback: speculative heap stores stick even on abort. A next step is snapshotting or logging writes during speculation and undoing them on failure.
• No trace optimization: I currently inject the raw “first N” ops; running a tiny LVN/DCE pass over that slice before wrapping in commit would remove dead arithmetic.
• Single-loop focus: traces always start at the first backedge in main. It’d be more powerful to pick the hottest loop header dynamically, or even do interprocedural tracing.

1 reply

sampsyo May 8, 2025
Maintainer Author

This sounds great; nice work! And all those limitations seem very much within the normal bounds. :)

Jonahcb · 2025-05-01T01:58:50Z

Jonahcb
May 1, 2025

Code

I modified the Bril interpreter to record each instruction that is evaluated (this skips over labels). I hardcoded the interpreter to stop tracing after recording 20 instructions. The interpreter starts recording immediately. When a "br" instruction is evaluated in the interpreter, I check whether the "cond" is true or false. If it is false, I insert an artificial instruction in my trace to flip it. Then I substitute a guard to a "hotpathfailed" label on "cond" for the "br" instruction. Tracing stops at the first "ret" or "call".

My optimizer takes in the original program and the trace and forms a new program by inserting a "speculate", then the entire trace, then "commit", then a "hotpathfailed" label, and then the original program. I didn't actually run any optimizations on the trace (although I left a stub file to do that) because my original code for optimizations is in Python, and I did this assignment in C++.

For testing, I ran my tracing optimizer on a handful of benchmarks. I quickly realized that it is ineffective for the majority of benchmarks because the benchmarks usually have a single instruction in 'main' before calling another function that does the heavy-lifting (but my tracing will stop itself at that first function call). I found a few good benchmarks though. 'Fizz-buzz' is perfect because the entire program is crammed into 'main'. My optimizer increased the number of dynamic instructions by 23 instructions. But this is without running any optimizations on the trace so it makes sense that it's pretty bad.

I found this assignment interesting. I liked that I had to think about how much logic to put into the interpreter and how much to put into my optimizer. My gut-feeling was that all logic should be in the optimizer file, but this wasn't possible, which is interesting. This assignment also made me realize the importance of understanding how compilers work because I couldn't do very much without understanding how the interpreter works (and I don't so that is why my implementation is very basic).

I used ChatGPT for documentation (because I'm lazy, but I want good documentation when I come back to my code). I did not use AI for anything else because I want to learn the material and learn C++.

1 reply

sampsyo May 8, 2025
Maintainer Author

Cool! You make a good point about tracing-without-calls being limited by the small number of benchmarks with substantial work in their main. In the future, we could alleviate this by doing a lot of inlining…

gerardogtn · 2025-05-03T03:12:10Z

gerardogtn
May 3, 2025

Code
Samples

For this assignment I worked on using jit compilation for loop unrolling. The main idea was to detect hot loops in a program, flatten them into a series of instructions and duplicate them four times, if at the end of the speculative execution doesn't hold then we just jump back into the original execution of the program. For simple loops (no nesting, no branching inside the loop) that execute a significant number of times (tens or hundreds of times) we can get significant performance wins (10% on computing the 500th fibonacci number0 but on the majority of programs the difference is negligible or even slightly negative.

In more detail, I modified the brili interpreter to have a tracing mode that keeps track of the frequency of execution of different basic blocks; once it detects a basic block as being hot (executed 20 times) it enters tracing mode and tracks the blocks that form the hot path. At the end of the execution all the hot loops are printed in json format. A turnt command makes stores all hot paths in a program in a .trace file.

Once we have access to traces, we can call a Kotlin program that stitches the traces and the original program into an unrolled program. To do so we decide what the best loop to unroll for each bril function in the source with a couple of heuristics: (1) avoid loops that have side effects (2) prefer shorter loops over longer loops (3) avoid loops that have too many instructions (less than 20 instructions). The main idea of (1) was to avoid speculating on instructions that have side effects and of (2) and (3) to avoid having a large program due to unrolling large blocks. I worked on this part by first writing a version by hand of what I would expect my optimization to look like (you can find it here) and then writing my solution until I approached a better and better approximation of what I wanted. To decide how many times to unroll a program I just selected four as my hyperparameter of choice and didn't play around with tuning the best number of loop unrolls.

How well does this work? Well for simple loops that execute a large number of times like fibonacci. I saw a decrease of around 15% dynamic instructions (3,509 vs 3,032). But for the majority of benchmarks in the core directory there was no benefit or even a negative impact to performance. Here's a summary of results:

Program	Original	Unrolled	Rate (Larger is better_
fibonacci 500	3,509	3,032	1.158
primes between	574,100	571,440	1.001
fizz buzz	3,652	3,459	1.060
check primes	8,468	8,420	1.006
ackermann	1,464,231	1464232	1.000
armstrong	137	261	0.525

I'm super late with this assignment but had a lot of fun working on unrolling and stitching code.

1 reply

sampsyo May 8, 2025
Maintainer Author

Really great work!! To be perfectly honest, it's pretty hard to get consistent instruction-count performance wins in this "pure tracing" setting, so it's pretty cool that this yielded positive results even for a handful of benchmarks. :) Nice end-to-end engineering work here!

smd21 · 2025-05-07T00:56:43Z

smd21
May 7, 2025

code
This assignment took me a while due to tech week for a performance I was in last week (tech week = i basically was living in risley since all my time after class was spent at rehearsals). I was planning to implement some optimizations (like LVN) but unfortunately did not have time to translate the code into Rust (it was originally in Typescript).

Tracing
I start tracing programs from the first instruction, then stop at the first backedge. I do not add jumps. Branches are translated into guard instructions. To do this, if the condition cond is false then I add a new variable notcond that negates cond and have this be what the guard is conditioned on. The label that is not taken becomes where the guard jumps to if it fails (I am now realizing this will break on function calls, so I'll fix that shortly). For function calls with n arguments, I first insert n id instructions that "translate" the values passed into the function to the names the function is expecting. Then I continue tracing the function as normal. If the call assigns to some return variable dest, then I push dest onto the top of an "awaiting_return_values" stack. When the function call finishes, an instruction assigning dest to the return value is inserted into the trace. This is failing recursive functions (probably due to an issue with variable names not being unique) but seems to be working on non-recursive cases.

Once I am done tracing, I write the list of traced instructions to std.error, which is sent to a file via piping. (the bash scripting took me a while to figure out since I'm pretty bad at it)

optimizations + stitching the trace back in
To stitch the trace back in, I decided to add a speculate before every guard and a commit before the next guard. This basically ensures that we get as far as possible in the traced + optimized code before we have to jump back to the rest of the program. I didn't have time to implement any optimizations, but was going to try and translate my LVN and integrate that at some point during/after finals.

testing
I tested on a few different benchmarks from the core directory. Since mine doesn't really work on recursive functions, this limited what I was able to use. I ran my implementation on sum-divisors, perfect, factors, and quadratic. For recursive testing, I ran on fibonnaci and fact. Since I don't do any optimizations (I just add a bunch of speculate and commits) I increase the overall dynamic instruction count lol.

Testing was pretty challenging since if something was broken it was really hard to figure out what was going wrong and how to fix it. I spent a lot of time just combing through my output file and doing the program by hand to see why things weren't behaving properly.

Even though I didn't accomplish everything I had hoped, I was still able to get a much better understanding of the bril compiler and on making dynamic compilers in general

1 reply

sampsyo May 8, 2025
Maintainer Author

Cool cool; this sounds great! TBH ruling out recursive programs is a very reasonable assumption here. Nice work getting this to work at all, really!

calciiium · 2025-05-09T04:22:54Z

calciiium
May 9, 2025

code
I implemented a trace-based speculative optimizer for Bril. I modified brili.ts so that it will trace first 20 instructions (or up to call or ret commands) it encouters in the main function, and for branching instruction, I also trace whether the condition is met or not. Then in Python I read in those json instructions, and convert the bril program to the following format:

@main {
	speculate;
	// traced instructions with 'guard' commands
	// if 'br' is traced to the else branch, 
	// create a new bool variable that computes
	// the negation of the branch condition.
	// If guard fails, jmp to .main__b0.
	commit;
	jmp ...
.main__b0
	...
	all original code
	...
}

I ran several different inputs on several different programs from core benchmarks. For example, for perfect.bril,

input	Traced dyn instr	original dyn instr
496 (traced)	236	233
523	241	224
40	73	70
25	73	56

There aren't that much of different, but we can still see a slight increase for traced case for other inputs.

1 reply

sampsyo May 18, 2025
Maintainer Author

Looks good overall.

InnovativeInventor · 2025-05-11T22:53:18Z

InnovativeInventor
May 11, 2025

I wrote a very simple tracing implementation that is effectively a branch predictor that gets to "cheat" by seeing what branches were taken during the profiled/traced execution (on the first run through a basic block). My code only support a handful of operations.

I did this by: (1) instrumenting the reference interpreter to spit out each instruction that happened and the branches taken (when a br was spotted), and (2) inserting jmps and commits such that I would only speculate for at most up to one basic block (and immediately end if I bump into a print). If no tracing information is provided for a branch, it will by default speculate that the condition for a br is True (e.g. for a while / for loop this is usually true).

I tested my code¹ against my BBS PRNG implementation, with varying traced inputs (during traced execution) and varying run-time inputs (during program execution). It seems to behave correctly for my BBS implementation.

https://github.com/InnovativeInventor/bril/blob/lesson-12/lessons/tracing.py ↩

1 reply

sampsyo May 18, 2025
Maintainer Author

Looks good overall, although it would be good to know whether the end-to-end transformation is correct for a large swath of programs.

KabirSamsi · 2025-05-13T15:57:06Z

KabirSamsi
May 13, 2025

Partner: @noschiff
Code

Sorry about how late this is! Unfortunately both of us have been too busy to give this as much focus as our past work.

Implementation

For this assignment, we wrote a straightforward tracer operation using our TypeScript Bril infrastructure. We are able to support a couple of different operations in Bril.

We began first by setting up the reference interpreter; subsequently, we implemented a trace that makes a pass through the program, and set up functions evalInstr and evalFunc, extended with tracing. To this effect, we turned on tracing only when viewing the main function in a given Bril program representation, properly verifies that the instruction's opcode is valid and that its argument count is consistent with its opcode.

We then have more holistic functions traceToSpec and optimize, which combine together to form our tracing pass. Iterating over our trace, we do a scan to check which branch was taken for br branching instructions, and log this as part of our speculation.
Subsequently with optimization, we re-insert the optimized Bril instructions into our program.

Testing

As with past assignments, we utilized the core directory's benchmarks. We seem to be able to preserve correctness for the more basic cases, but its limitation is being able to deal with function calls, i.e. references to the other function names itself in Bril.

1 reply

sampsyo May 18, 2025
Maintainer Author

Recorded—looks good at a high level, y'all.

Lesson 12: Dynamic Compilers #458

Uh oh!

sampsyo Jan 21, 2025 Maintainer

Replies: 19 comments · 22 replies

Uh oh!

Tracing

Testing

Summary

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Code

How it works

Getting the Trace

Filtering the Trace

Testing + Results

Hardest Part

Star

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Code

Tracing

Testing

Conclusion

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 30, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

sampsyo Apr 29, 2025 Maintainer Author

Uh oh!

Uh oh!

sampsyo May 8, 2025 Maintainer Author

Uh oh!

sampsyo
Jan 21, 2025
Maintainer

Replies: 19 comments 22 replies

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 30, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo Apr 29, 2025
Maintainer Author

sampsyo May 8, 2025
Maintainer Author