[TableGen][DecoderEmitter] Add option to emit type-specialized decodeToMCInst
#146593
Conversation
What's the motivation? |
Please see the PR description that I just added. I also have data to support this, will add soon (tabulating it ATM) |
Failure is in the unit test (TableGen/VarLenDecoder.td), I need to update it. |
Why can't the disassembler emitter figure out the bitwidths from the tablegen input? It already makes separate tables for each instruction size. |
Oh is it because RISC-V uses 3 bit widths, but only 2 types for DecodeInstruction? |
Can we store the type in the Instruction class in the .td files like the bitwidth instead of introducing a complex command line argument? |
Right, this is a POC at this point which shows that the proposed optimization works. I am open to changing the interface here as well. The command-line one was simple enough to not mess with the TableGen instruction class etc., but that is an option, though it feels more intrusive. The command-line option is moderately complex and localized to the decoder emitter. |
Repeating the type in each per-instruction record might be redundant (and we would need more verification as well, to check that for a given size all insts of that size have the C++ type specified and that it's consistent). One option is to add a new InstructionTypeAndSize class that records this information, and DecoderEmitter can use that if it's present, else fall back to templated code. Something like
and a particular backend can define a single record of type InstructionDecoderTypeAndSizes<> which the DecoderEmitter will use. This is essentially encoding the command line option as a record.
or more simply
|
Force-pushed from 30d0838 to 2d7d1dc
RISCV uses a common base class for each of the 3 instruction sizes. Other targets may be similar.
|
Right, but nonetheless, we would have the type specified per instruction instance and we would still need to validate, for example, that for all instructions with a particular size the type string is the same. To me that seems like unnecessary duplication of this information, plus additional verification to make sure it's consistent. Also, unlike the size in bytes, which is a core property of the instruction, the C++ type used to represent its bits in memory does not seem like a core property. Many backends seem to choose the same type (for example uint64_t) for all their 16/32/48/64-bit insts. Adoption-wise as well, sticking it in the per-inst record seems more invasive (for example, in our and several other downstream backends the core instruction records are auto-generated, so the adoption curve for this increases further). |
Requesting not a review per se, but opinions on the user interface for this optimization. The choices proposed are:
|
I'm probably going to change to uint64_t for RISC-V. The 48-bit instructions are only used by one vendor and are relatively recent additions. I think the duplication cost just wasn't considered when they were added. I agree adding to the Inst class might be too invasive. I still think it should be in .td files somehow. Needing to change a CMake file and replicating to GN and the other build systems when a new instruction width is added seems bad. |
Right, is option #3 above palatable? We essentially encode it as a standalone record that the DecoderEmitter will look for. |
Maybe it should be stored in the |
…6. NFC Insn is passed to decodeInstruction which is a template function based on the type of Insn. By using uint64_t we ensure only one version of decodeInstruction is created. This reduces the file size of RISCVDisassembler.cpp.o by ~25% in my local build. This should get even more size benefit than llvm#146593.
|
Force-pushed from 2d7d1dc to 04366ee
@topperc Please let me know if this new scheme looks OK. If yes, I'll migrate the rest of the targets (right now I just changed AArch64 and RISCV) to use this, and add some unit tests for a final review. |
I have not tested. My speculation is no binary size change, just a minor compile-time improvement from avoiding template specialization. I'll check and report back. |
Looks like templating adds a little bit to the code size. Building the RISCVDisassembler.cpp.o in a release config with/without this change results in the following:
So, 16 bytes less. Not significant though. |
Could just be a difference in the name mangling of the function name? Or are you checking the .text size? |
Yeah, your guess was right. I dumped the sizes with
That is, text sizes are the same but mangled names are different and that likely leads to larger object file sizes. |
Right, I am not sure what exactly might contribute to the compile-time cost. But you're right, if we want, we can specialize fieldFromInstruction as well once we have the types. Maybe in a follow-on PR. |
Have you been able to measure a compile time cost or is this hypothetical? |
Hypothetical at this point. But the original benefit of a ~20% reduction in size still applies (without requiring going to the largest type). |
Anyway, since the interface is fine, let me at least finish the rest of the PR (unit tests etc.) and adopt it for AMDGPU (which uses the uint32_t/uint64_t/DecoderUInt128 types). We can then see if we want to adopt it for other targets, either to generate non-templated code (if a target uses a single type) or to reduce code size (if it uses multiple types). Does that make sense? |
That makes sense. |
+1 for this.
Why can't they be different? |
For all targets today, the same C++ type is used as a payload for all insts of a given size. For example, all 32-bit insts might use uint32_t. So if the type is specified on each instance of the instruction, we may run into the case of two instructions of the same size specifying different types, which would then need extra validation. |
What if we duplicated decodeMCInst and decodeInstruction based on bitwidth but left them as template functions? That would remove the unused switch cases in each decodeMCInst.
I don't understand this. Can you elaborate? Do you mean something like the sketch below? That is, based on the actual size of the template type used, the call is automatically routed to the appropriate variant? And then there is no change in any .td files? Maybe just a bool CL option to opt in to this. Or this could also be just an if constexpr (...) in the decodeInstruction function to route to the appropriate non-templated decodeMCInst variant. |
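A minimal, self-contained sketch of that routing idea (the names decodeToMCInst32/decodeToMCInst48 and the plain int return type are placeholders, not the actual generated interface):

```cpp
#include <cstdint>

// Non-templated, per-width decodeToMCInst variants; each switch only carries
// the decoders for its own bitwidth (placeholder bodies).
static int decodeToMCInst32(unsigned Idx, uint32_t Insn) {
  switch (Idx) { /* decoders for 16/32-bit instructions only */ default: return 0; }
}
static int decodeToMCInst48(unsigned Idx, uint64_t Insn) {
  switch (Idx) { /* decoders for 48-bit instructions only */ default: return 0; }
}

// The templated entry point stays, but routes to the matching non-templated
// variant based on the size of InsnType.
template <typename InsnType>
int decodeInstruction(unsigned Idx, InsnType Insn) {
  if constexpr (sizeof(InsnType) <= sizeof(uint32_t))
    return decodeToMCInst32(Idx, Insn);
  else
    return decodeToMCInst48(Idx, Insn);
}
```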
Actually, this won't work, because we know that sizeof(CPPType) may not always match the inst size (for example, using uint64_t for 48-bit insts).
I meant make decodeMCInst16, decodeMCInst32, and decodeMCInst48 for RISC-V using just the bitwidth from the instructions, and have them called by decodeInstruction16, decodeInstruction32, and decodeInstruction48. Update RISCVDisassembler.cpp to call decodeInstruction16/32/48 instead of decodeInstruction (see the sketch below). That decreases the size of the switch in each of the decodeMCInst functions. You've already shown there isn't much sharing between bit widths, so the split decodeMCInst shouldn't be much larger than the single decodeMCInst function we have today on targets that use the same CPP type. But the cost is the duplication of decodeInstruction for targets that use the same CPP type for all instructions. On targets that use different CPP types, decodeInstruction was already duplicated by the templating. |
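A rough sketch of that per-bitwidth split (hypothetical signatures; int stands in for DecodeStatus and the decoder-table walking is elided):

```cpp
#include <cstdint>

// Per-bitwidth decodeMCInst variants, still templated on InsnType, but each
// one only contains the switch cases for its own instruction width.
template <typename InsnType>
static int decodeMCInst16(unsigned Idx, InsnType Insn) {
  switch (Idx) { /* decoders for 16-bit instructions only */ default: return 0; }
}
template <typename InsnType>
static int decodeMCInst32(unsigned Idx, InsnType Insn) {
  switch (Idx) { /* decoders for 32-bit instructions only */ default: return 0; }
}
template <typename InsnType>
static int decodeMCInst48(unsigned Idx, InsnType Insn) {
  switch (Idx) { /* decoders for 48-bit instructions only */ default: return 0; }
}

// Per-bitwidth entry points; the disassembler calls the one matching the
// encoding length it read, instead of a single decodeInstruction.
template <typename InsnType>
int decodeInstruction16(const uint8_t *Table, unsigned Idx, InsnType Insn) {
  return decodeMCInst16(Idx, Insn); // real code would walk Table to find Idx
}
template <typename InsnType>
int decodeInstruction32(const uint8_t *Table, unsigned Idx, InsnType Insn) {
  return decodeMCInst32(Idx, Insn);
}
template <typename InsnType>
int decodeInstruction48(const uint8_t *Table, unsigned Idx, InsnType Insn) {
  return decodeMCInst48(Idx, Insn);
}
```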
Force-pushed from 3bcc091 to d0c8b1d
Force-pushed from d0c8b1d to 1ecd40e
I have spent more time on this and refined the change a bit. Essentially, two things. (1) I made the interface a little more structured:
and (2) when this option is provided, we do not generate any templated code. So, all functions in the emitted code are non-templated, including
The reduction in .text size is expected. The reduction in data size is coincidental: because the decoder index is now assigned per CPP type, its value is generally smaller for the 2nd and 3rd CPP type, so the resulting ULEB128 encoding is smaller as well (for example, an index below 128 needs one ULEB128 byte while larger indices need two or more), resulting in a net reduction in the decoder table size. Please let me know how this looks. I still need to add unit tests for this. Once this is in, I think there are 2 additional improvements that can be made: a. Improve the robustness. Encode the bitwidth as the first entry in the decoder table, and in the appropriate overload of |
The .rodata reduction is cool. I suspected that might happen. I kind of don't like that we have to describe the cpp types to tablegen at all. It feels like something we should be able to infer from the instruction widths. Needing to go look elsewhere to describe the types feels like it's going to confuse someone when they want to add a new bit width. @s-barannikov what do you think? |
Couple of points:
|
Actually, |
Aren't those all part of the description for a searchable table? They're contained in the description of the table along with the other table properties. So it's at least connected and not some extra information off to the side somewhere. I think this goes back to my previous suggestion of it being part of the instruction class. I also don't like it being opt-in. Ideally we wouldn't have two different ways of creating decoders that we need to maintain. It would be nice if out of tree developers could get benefits without needing to stumble onto this new feature. I'll think more about this over the weekend. |
Here's one possibility (along the lines you suggested). If we do not want to pass in the C++ types, we have to keep generating templated code. But we can do the following:
So the existing code continues to work but may be a little fatter due to the switch. Backends can switch to using the |
I agree that ideally we should be able to infer the type from the instruction size. However,
+1 as well. |
The suggestion to use |
Can you do a field extract from a std::bitset without resorting to manually repacking the bits yourself? |
I believe so, something like:
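A sketch of one way to do it, using only std::bitset's own shift/mask operators and assuming the extracted field fits in 64 bits (fieldFromBitset is just an illustrative name):

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Extract Width bits starting at bit Start from a std::bitset<N>, without
// manually repacking individual bits.
template <std::size_t N>
uint64_t fieldFromBitset(const std::bitset<N> &Bits, unsigned Start, unsigned Width) {
  std::bitset<N> Mask = ~std::bitset<N>() >> (N - Width); // Width low bits set
  return ((Bits >> Start) & Mask).to_ullong();            // OK while Width <= 64
}

// e.g. std::bitset<48> Insn(0xABCDEF123456ULL);
//      uint64_t Opcode = fieldFromBitset(Insn, 0, 7);
```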
Also, it looks like the templated insertBits() is never actually instantiated for a non-integer type, so no need to support that with the |
I do kind of like this. We could maybe use |
So just to summarize, here's the delta being proposed from what is currently in the tree (a mini-design-RFC of sorts):
@topperc @s-barannikov @mshockwave PTAL and see if this seems reasonable. Should we put this up on discourse before proceeding? I am thinking this can be done in steps with AMDGPU being the first target, followed by other in-tree ones. |
Sent #147613 for the |
This change attempts to reduce the size of the disassembler code generated by DecoderEmitter.
Current state:
- The generated code has two key functions: `decodeInstruction`, which is the entry point into the generated code, and `decodeToMCInst`, which is invoked when a terminal state is reached in the decoding step after traversing the decoder table. Both functions are templated on `InsnType`, which is the type of the raw instruction bits supplied to `decodeInstruction`.
- Targets may call `decodeInstruction` with different types, leading to several template instantiations of this function in the final code. As an example, RISCV instantiates this function with the `uint32_t` type for decoding 16/32-bit instructions and `uint64_t` for decoding 48-bit instructions.
- In each instantiation of `decodeToMCInst` generated, the code handles all instruction sizes. The decoders emitted for different instruction sizes rarely have any intersection with each other. That means, in the RISCV case, the instantiation with `InsnType == uint32_t` has decoder code for 48-bit instructions that is never exercised, and conversely, the instantiation with `InsnType == uint64_t` has decoder code for 16/32-bit instructions that is never exercised.

With this change, targets can opt in to generating non-templated code in the DecoderEmitter. This is triggered by a new `DecoderOptions` list in the `InstrInfo` class in the TableGen file. `DecoderOptions` is a list of the C++ types with which the target intends to call the generated `decodeInstruction`, each paired with the one or more instruction bitwidths for which that C++ type will be used as the payload for the instruction bits. As an example, for the AMDGPU backend, the C++ type `uint64_t` is used as a payload for all 64-bit instructions and the type `DecoderUInt128` is used as the payload for all 96- and 128-bit instructions.

When targets opt in to generating non-templated code, the DecoderEmitter will generate several overloaded variants of `decodeInstruction` (and associated helper functions, including `decodeToMCInst`), and the overloaded variant of `decodeToMCInst` for a given C++ type will include only the decoders for the bitwidths associated with that type. This naturally fixes the above code duplication and results in a reduction in code size. As a side effect, the generated decoder tables may be smaller as well: since the decoder index is assigned per C++ type, the index values are smaller than before, leading to less space needed for their ULEB128 encoding.

Adopt this feature for the AMDGPU backend. In a release build, this results in a net 35% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. Actual numbers measured locally for a Linux x86_64 build using the clang-18.1.3 toolchain are:

For targets that do not use multiple instantiations of `decodeInstruction`, opting in to this feature may not result in code/data size improvements, but it may offer compile-time improvements by avoiding the use of templated code.
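For illustration, a rough sketch of the kind of overload set this produces for AMDGPU; the signatures are simplified and the `DecoderUInt128` stand-in here is not the real AMDGPU type, so this is not the exact emitted code:

```cpp
#include <cstdint>

// Stand-in for the AMDGPU wide-payload type; the real DecoderUInt128 lives in
// the AMDGPU disassembler.
struct DecoderUInt128 { uint64_t Lo = 0, Hi = 0; };

// One decodeToMCInst overload per C++ payload type named in DecoderOptions:
// the uint64_t overload carries only the 64-bit decoders, while the
// DecoderUInt128 overload carries only the 96/128-bit decoders.
int decodeToMCInst(unsigned Idx, uint64_t Insn);
int decodeToMCInst(unsigned Idx, const DecoderUInt128 &Insn);

// Matching non-templated decodeInstruction overloads replace the single
// templated entry point (real signatures carry MCInst, Address, etc.).
int decodeInstruction(const uint8_t *Table, uint64_t Insn);
int decodeInstruction(const uint8_t *Table, const DecoderUInt128 &Insn);
```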