Try to write code with fewer branches? #126
Replies: 13 comments
-
Wouldn't that miss out on short-circuiting?
-
Indeed, that's the whole point -- if you have a condition where short-circuiting is not essential, you may be better off combining the two flags with a single non-short-circuiting operation. Of course this isn't an option if evaluating the second condition isn't always safe.
-
Also, the branch prediction table is a finite-size cache, right? So more branches mean more evictions.
-
Kind of, though there also seem to be ways to separately detect loops and "unimportant" branches. How these work is all very complicated... (Try reading about it in this paper: https://www.agner.org/optimize/microarchitecture.pdf)
-
For anyone curious what this looks like in practice:
-
My takeaway from that video is that the compiler is smarter than us 😢 ...and it seems that branchless is sometimes slower: when I reduced branches in one place, it wasn't a clear win. We could also consider using the likely/unlikely hints, but we will need very careful profiling first -- e.g. only where I know that 90% of the time a condition goes one way. Note: these don't emit hints to the CPU; they hint the compiler to rearrange the branches to aid branch prediction for us.
-
Some clarification: Clang outright ignores likely/unlikely during PGO, as it already collects branch weights. GCC seems to combine the hints with PGO data (no clue if that information is still up to date). MSVC also seems to collect branch weights during PGO. It seems that good PGO training is ever more important, linking back to #99.
-
Interesting. He leaves out Itanium, where the compiler is expected to use predication instead of branching for small code blocks. The predicate value needs to be known only at the end of the pipeline, before the instruction commits its effects; if it's false there's nothing to flush -- the instruction simply doesn't commit. https://www.cs.nmsu.edu/~rvinyard/itanium/predication.htm
-
The textbook I'm reading indicates that Itanium was a multi-billion-dollar mistake. Probably in part because it was relying too much on the compiler. |
-
Hmm, interesting. That was the whole idea - why make the chip do at runtime things that the compiler can do? |
-
The cost of branching is almost entirely down to mispredictions. A correctly predicted branch, when the comparison involves values in registers or small constants, has basically zero cost. However, if the branch is not predictable, then branchless is likely to be a win. As Guido says, one way to remove a branch is to combine two of them; all the major compilers will do this for you if they can. The code

```
if cond:
    A
else:
    B
```

would normally be laid out as:

```
    test cond
    branch-if-false to ELSE
    A
    jump to END
ELSE:
    B
END:
```
-
The two are connected. At least on x86 the convention is that, absent any other information, forward branches are predicted non-taken and backwards branches are taken. Further, if a condition really is likely then the compiler can shift the else block to a different page (improving effective code density).
Linus Torvalds has a good rant on CMOV and friends and why branchless code is often not all it's cracked up to be on modern CPUs. Modern branch predictors are extremely good, even for interpreters such as Python. See "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore", with Table 3 being of particular interest. From the paper, we observe that the MPKI rate (Mispredictions Per 1000 Instructions) for the Python 3 interpreter has dropped by the best part of an order of magnitude due to improvements in CPU branch predictors, to the point where you're looking at ~50 cycles of branch misprediction stall per 1,000 instructions.

In fact there are good arguments for more branchy code. A good example of this is devirtualization where, from PGO data, compilers will see if a function pointer call tends to go a certain way and, if so, rewrite the code to explicitly check the function pointer against a set of known targets so that they can be called directly rather than through a pointer.

What matters nowadays is cache and data layout. CPUs have the ILP to handle more work per cycle and are good at spotting trends. What they don't like is pointer chasing.
-
The more I read about processor architecture the more I realize that branch prediction is fickle and the penalty for a failed prediction is high (10-20 clock cycles). Even when branch prediction is effective, having too many branches close together can slow things down (I read something about branch prediction being slower on some i7 version if there is more than 1 branch per 16 bytes of instructions).
So perhaps it would behoove us to use idioms that combine multiple (easily-tested) conditions in one word before branching. Something like `a | b` (where it works, say) might actually be faster than `a || b`.
Then again, maybe the compiler (optimizer) also knows this and will do it for us?