-
Notifications
You must be signed in to change notification settings - Fork 5.2k
[clr-interp] Add a scheme for adding peephole optimizations for known sequences of IL opcodes #120827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… sequences of IL opcodes - This is needed to ensure that some of our SIMD tests finish in a vaguely reasonable timeframe - The peeps implmented here make that almost possible, although the behavior in CI hasn't yet been verified - Peeps implemented - stloc/ldloc ... This is a minor improvement, but will be needed to handle a future optimization around allowing the il stack to have constant values - box/unbox.any - Removes unnecessary boxing/unboxing - typeof(T)==typeof(Y) - This allows us to handle the type testing specialization behavior that is used heavily in the BCL - typeof(T).IsValueType - Used in the Unsafe.BitCast function Not yet implemented is giving the interpreter stack a concept of constant values, so that we can optimize brtrue/brfalse and friends into unconditional branches. With this set of changes the if (typeof(T)==typeof(int)) { ... } else if (typeof(T)==typeof(float)) pattern is *much* faster than before, but the not taken paths are still fully generated interpreter code, and there are still many branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds peephole optimization infrastructure to the interpreter to recognize and replace specific IL opcode sequences (store/load locals, type equality/value type checks, box/unbox.any) with more efficient forms to improve performance (notably for SIMD-related scenarios).
- Introduces OpcodePeep data structures and matching/apply logic.
- Implements specific peepholes: stloc/ldloc, typeof(T)==typeof(U), box/unbox.any, typeof(T).IsValueType.
- Extends intrinsic recognition for System.Type methods.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
File | Description |
---|---|
src/coreclr/interpreter/intrinsics.cpp | Adds intrinsic mapping for System.Type methods used by new peepholes. |
src/coreclr/interpreter/compiler.h | Declares peephole pattern structures and adds methods for identifying/applying optimizations. |
src/coreclr/interpreter/compiler.cpp | Defines peephole patterns and logic; integrates peephole application into code generation loop. |
Tagging subscribers to this area: @BrzVlad, @janvorli, @kg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance of this will probably get bad as we add more peeps, but it looks fine to me (EDIT: Minus the stuff Copilot found). Is there a specific reason why we're peephole optimizing the IL (and having to do redundant ResolveTokens and getCallInfos) instead of the interpreter opcodes? I can imagine either approach being the best one.
Maybe a better approach is to build the interpreter as a codegen backend of RyuJIT so that we get all the optimizations for free? |
Something along these lines has been actually the plan so far - stay away from optimizations in the interpreter and use RyuJIT to do them. @davidwrighton do you see this change as a temporary thing until we can do that in RyuJIT? |
This pattern matching is done by RyuJIT even with optimizations off. It is table stakes rather than an optimization. It is similar to must-expand intrinsics that are not an optimization either. The byte code generated by interpreter today is super inefficient in places. We need to do some work like this to make it more reasonable. We shouldn’t be introducing new IRs or passes - that would make it seem like something suited for RyuJIT.. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@kg I want to do this optimization at the IL level, since the interpreter compiler doesn't have a functional general purpose optimization framework, and so the way to avoid operations in it is to never emit the interpreter opcodes. We may want to do a few tweaks around mov instructions, and we have a small number of existing details where we can omit branch instructions in some cases after emission, but we can't in general look back more than 1 instruction or so. So its best to avoid emitting the opcodes in the first place. @janvorli I think this is a long term play to have this sort of optimization. Notably this is allowing us to get substantial wins in a few targeted scenarios, not a general purpose optimization system. I see us adding a few more details, such as a concept of constants/possible ranges on the IL stack, which will allow us to elide emitting various conversion, and branches. And I think we may also want to add a very simple inliner. (For the inliner, if we inline only functions which have no branches, and use a restricted set of instructions it's possible to get VERY big wins on performance without much complexity by making property accessors cheap, and having something which can inline the behavior of the various Unsafe.As/Cast/etc functions would be much cheaper than actually writing the giant stacks of intrinsics that we will otherwise require to get decent performance out of the BCL. This scheme is based on work Brian Smith did years ago when building the Compact Framework when building the JIT for it. That jit was actually very similar to this interpreter compiler, and the simple inliner produced extraordinary performance wins relative to its level of complexity. I have yet to discuss with @jkotas the value of this sort of inliner, but I think it would be reasonable.) @hez2010 Yeah, RyuJit would be a better optimizing backend than writing a new optimizer. There is NO desire to write a high quality optimizing backend here, and we will not support work to do that, but work to make normal C# logic and logic of the sort that is in important parts of the BCL compile into something reasonable is in scope. |
Sounds reasonable to me. We will want to do this only for optimized code (interpreter does not have concept of optimized code today) so that it does not impact managed debugging. |
Mono already has this concept AFAIK, so it might be useful to take some inspiration from there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you.
Not yet implemented is giving the interpreter stack a concept of constant values, so that we can optimize brtrue/brfalse and friends into unconditional branches. With this set of changes the if (typeof(T)==typeof(int)) { ... } else if (typeof(T)==typeof(float)) pattern is much faster than before, but the not taken paths are still fully generated interpreter code, and there are still many branches. That work will follow on in a future PR once we agree on the structure of this one