[TableGen][DecoderEmitter] Add option to emit type-specialized decodeToMCInst #146593

Draft: jurahul wants to merge 5 commits into main from decoder_emitter_type_specialization

Conversation

@jurahul (Contributor) commented Jul 1, 2025

This change attempts to reduce the size of the disassembler code generated by DecoderEmitter.

Current state:

  1. Currently, the code generated by the decoder emitter consists of two key functions: decodeInstruction, which is the entry point into the generated code, and decodeToMCInst, which is invoked when a terminal state is reached after traversing the decoder table. Both functions are templated on InsnType, the type of the raw instruction bits supplied to decodeInstruction.
  2. Several backends call decodeInstruction with different types, leading to several template instantiations of this function in the final code. As an example, RISCV instantiates this function with uint32_t for decoding 16/32-bit instructions and with uint64_t for decoding 48-bit instructions.
  3. Since there is just one decodeToMCInst generated, it contains code that handles all instruction sizes, yet the decoders emitted for different instruction sizes rarely have any intersection with each other. That means, in the RISCV case, the instantiation with InsnType == uint32_t has decoder code for 48-bit instructions that is never exercised, and conversely the instantiation with InsnType == uint64_t has decoder code for 16/32-bit instructions that is never exercised (see the sketch below).
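
To illustrate item 3, here is a minimal self-contained toy (not the real generated API; the names and signatures are simplified assumptions) showing why every instantiation of a single templated decodeToMCInst carries the decoder cases for all bitwidths:

#include <cstdint>
#include <cstdio>

enum DecodeStatus { Fail, Success };

// One templated function: the switch holds decoders for ALL widths, so the
// uint32_t instantiation still contains the 48-bit case and vice versa.
template <typename InsnType>
DecodeStatus decodeToMCInst(unsigned Idx, InsnType Insn) {
  (void)Insn; // a real decoder would extract fields from Insn here
  switch (Idx) {
  case 0: /* 16/32-bit decoder */ return Success;
  case 1: /* 48-bit decoder: dead code when InsnType == uint32_t */ return Success;
  default: return Fail;
  }
}

int main() {
  std::printf("%d\n", decodeToMCInst<uint32_t>(0, 0x13u));   // 32-bit path
  std::printf("%d\n", decodeToMCInst<uint64_t>(1, 0x1fULL)); // 48-bit path
}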

With this change, targets can opt in to generating non-templated code in the DecoderEmitter. This is triggered by a new DecoderOptions list in the InstrInfo class in the target's TableGen files. DecoderOptions is a list of entries, each pairing a C++ type with which the target intends to call the generated decodeInstruction with the one or more instruction bitwidths for which that C++ type is used as the payload for the instruction bits. As an example, for the AMDGPU backend, the C++ type "uint64_t" is used as the payload for all 64-bit instructions and the type "DecoderUInt128" is used as the payload for all 96- and 128-bit instructions.

When targets opt in to generating non-templated code, the DecoderEmitter will generate several overloaded variants of decodeInstruction (and of associated helper functions, including decodeToMCInst), and the overloaded variant of decodeToMCInst for a given C++ type will include only the decoders for the bitwidths associated with that type. This naturally eliminates the code duplication described above and reduces code size. As a side effect, the generated decoder tables may be smaller as well: the decoder index is now assigned per C++ type, so the index values are smaller than before and their ULEB128 encodings need less space.
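
A rough, self-contained sketch of the shape of the overloaded output (hypothetical, simplified signatures; DecoderUInt128 here is a toy stand-in for AMDGPU's real type):

#include <cstdint>
#include <cstdio>

enum DecodeStatus { Fail, Success };
struct DecoderUInt128 { uint64_t Lo = 0, Hi = 0; }; // toy stand-in

// Each overload contains only the decoder cases for its own bitwidths.
DecodeStatus decodeToMCInst(unsigned Idx, const uint64_t &Insn) {
  (void)Insn;
  switch (Idx) {
  case 0: /* 64-bit decoders only */ return Success;
  default: return Fail;
  }
}

DecodeStatus decodeToMCInst(unsigned Idx, const DecoderUInt128 &Insn) {
  (void)Insn;
  switch (Idx) {
  case 0: /* 96/128-bit decoders only */ return Success;
  default: return Fail;
  }
}

int main() {
  uint64_t I64 = 0;
  DecoderUInt128 I128;
  std::printf("%d %d\n", decodeToMCInst(0, I64), decodeToMCInst(0, I128));
}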

Adopt this feature for the AMDGPU backend. In a release build, this results in a net 35% reduction in the .text size of libLLVMAMDGPUDisassembler.so and a 5% reduction in the .rodata size. Actual numbers measured locally for a Linux x86_64 build using clang-18.1.3 toolchain are:

.text 378780 -> 244684, i.e., a 35% reduction in size
.rodata 352648 -> 334960, i.e., a 5% reduction in size

For targets that do not instantiate decodeInstruction with multiple types, opting in to this feature may not improve code/data size, but it can still improve compile time by avoiding templated code.

@topperc (Collaborator) commented Jul 1, 2025

What's the motivation?

@jurahul (Contributor, Author) commented Jul 1, 2025

Please see the PR description that I just added. I also have data to support this, will add soon (tabulating it ATM)

@jurahul (Contributor, Author) commented Jul 1, 2025

The failure is in the unit test (TableGen/VarLenDecoder.td); I need to update it.

@topperc (Collaborator) commented Jul 1, 2025

Why can't the disassembler emitter figure out the bitwidths from the tablegen input? It already makes separate tables for each instruction size.

@topperc (Collaborator) commented Jul 1, 2025

Why can't the disassembler emitter figure out the bitwidths from the tablegen input? It already makes separate tables for each instruction size.

Oh is it because RISC-V uses 3 bit widths, but only 2 types for DecodeInstruction?

@topperc (Collaborator) commented Jul 1, 2025

Can we store the type in the Instruction class in the .td files like the bitwidth instead of introducing a complex command line argument?

@jurahul (Contributor, Author) commented Jul 1, 2025

Right, this is a POC at this point which shows that the proposed optimization works. I am open to changing the interface here as well. The command-line option was simple enough to avoid messing with the TableGen Instruction class etc., but that is an option too, though it feels more intrusive. The command-line option is moderately complex but localized to the decoder emitter.

@jurahul (Contributor, Author) commented Jul 1, 2025

Repeating the type per instruction record might be redundant (and we would need more verification to check that, for a given size, all instructions of that size have the C++ type specified and that it is consistent). One option is to add a new InstructionTypeAndSize class that records this information, and DecoderEmitter can use it if it's present, else fall back to templated code. Something like

class InstructionDecoderTypeAndSize<string CPPType, list<int> Bitwidths> {
}

class InstructionDecoderTypeAndSizes<list<InstructionDecoderTypeAndSize> Options> {
}

and a particular backend can define a single record of type InstructionDecoderTypeAndSizes<> which the DecoderEmitter will use. This is essentially encoding the command line option as a record.

// RISCV.td
// Opt-in to non-templated decoder code.
def : InstructionDecoderTypeAndSizes<[
                InstructionDecoderTypeAndSize<"uint64_t", [48]>,
                InstructionDecoderTypeAndSize<"uint32_t", [16,32]>]>;

or more simply

class InstructionDecoderTypeAndSizes<list<string> CPPTypes, list<list<int>> Bitwidths> {
}

def : InstructionDecoderTypeAndSizes<
           [ "uint32_t", uint64_t"],
           [ [16,32],    [64]     ]>;

@jurahul force-pushed the decoder_emitter_type_specialization branch from 30d0838 to 2d7d1dc on July 1, 2025 22:19
@topperc (Collaborator) commented Jul 1, 2025

RISCV uses a common base class for each of the 3 instruction sizes. Other targets may be similar.

class RVInst<dag outs, dag ins, string opcodestr, string argstr,                 
             list<dag> pattern, InstFormat format>                               
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<32> Inst;                                                           
  // SoftFail is a field the disassembler can use to provide a way for           
  // instructions to not match without killing the whole decode process. It is   
  // mainly used for ARM, but Tablegen expects this field to exist or it fails   
  // to build the decode table.                                                  
  field bits<32> SoftFail = 0;                                                   
  let Size = 4;                                                                  
}                                                                                
                                                                                 
class RVInst48<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<48> Inst;                                                           
  field bits<48> SoftFail = 0;                                                   
  let Size = 6;                                                                  
}                                                                                
                                                                                 
class RVInst64<dag outs, dag ins, string opcodestr, string argstr,               
               list<dag> pattern, InstFormat format>                             
    : RVInstCommon<outs, ins, opcodestr, argstr, pattern, format> {              
  field bits<64> Inst;                                                           
  field bits<64> SoftFail = 0;                                                   
  let Size = 8;                                                                  
}

@jurahul (Contributor, Author) commented Jul 1, 2025

Right, but nonetheless, we would have the type specified per instruction instance and would still need to validate, for example, that all instructions of a particular size use the same type string. To me that seems like unnecessary duplication of this information, plus extra verification to make sure it's consistent. Also, unlike the size in bytes, which is a core property of the instruction, the C++ type used to represent its bits in memory does not seem like a core property; many backends seem to choose the same type (for example uint64_t) for all their 16/32/48/64-bit instructions. Adoption-wise as well, sticking it in the per-instruction record seems more invasive (for example, in our and several other downstream backends the core instruction records are auto-generated, so the adoption curve increases further).

@jurahul (Contributor, Author) commented Jul 2, 2025

Requesting not a review per se but an opinion on the user interface for this optimization. The proposed choices are:

  • Command-line option, as in this PR. +: non-intrusive in terms of .td files; -: it needs to be parsed, and parsing can be flaky.
  • Per-instruction record carries the C++ type (@topperc's suggestion). +: no command-line parsing flakiness; -: (IMO) too invasive; I see difficulty or increased complexity in adoption for the auto-generated instruction defs used in several downstream backends, and it needs additional validation for consistency across all instructions of a given size.
  • Option embedded as a new singleton record in the .td file. +: no command-line parsing flakiness; less intrusive than the option above (the single def is standalone, not attached to anything else); no consistency checks. -: ?

@topperc (Collaborator) commented Jul 2, 2025

Many backends seems to choose the same type (for example uint64_t) for all their 16/32/48/64 bit insts.

I'm probably going to change to uint64_t for RISC-V. The 48-bit instructions are only used by one vendor and are relatively recent additions. I think the duplication cost just wasn't considered when they were added.

I agree adding to the Inst class might be too invasive. I still think it should be in .td files somehow. Needing to change a CMake file and replicating to GN and the other build systems when a new instruction width is added seems bad.

@jurahul (Contributor, Author) commented Jul 2, 2025

Right, is the option #3 above palatable? We essentially encode it as a standalone record that the DecoderEmitter will look for.

@topperc (Collaborator) commented Jul 2, 2025

Maybe it should be stored in the InstrInfo class like isLittleEndianEncoding or the Target class?

topperc added a commit to topperc/llvm-project that referenced this pull request Jul 2, 2025
…6. NFC

Insn is passed to decodeInstruction which is a template function
based on the type of Insn. By using uint64_t we ensure only one
version of decodeInstruction is created. This reduces the file size
of RISCVDisassembler.cpp.o by ~25% in my local build.

This should get even more size benefit than llvm#146593.
@jurahul (Contributor, Author) commented Jul 2, 2025

InstrInfo seems reasonable. Let me rework the PR to add that

@jurahul force-pushed the decoder_emitter_type_specialization branch from 2d7d1dc to 04366ee on July 2, 2025 14:38
@jurahul (Contributor, Author) commented Jul 2, 2025

@topperc Please let me know if this new scheme looks ok. If yes, I'll migrate the rest of the targets (right now I just changed AArch64 and RISCV) to use this, and add some unit tests for a final review.

@topperc (Collaborator) commented Jul 2, 2025

@topperc Please let me know if this new scheme looks ok. If yes, I'll migrate the rest of the targets (Right now I just changed AARCH64 and RISCV) to use this, and add some unit tests for a final review.

Does this have any binary size effect on RISCV after #146619?

@jurahul (Contributor, Author) commented Jul 2, 2025

@topperc Please let me know if this new scheme looks ok. If yes, I'll migrate the rest of the targets (Right now I just changed AARCH64 and RISCV) to use this, and add some unit tests for a final review.

Does this have any binary size effect on RISCV after #146619?

I have not tested. My speculation is, no binary size change but just minor compile time improvement by avoiding template specialization. I'll check and report back.

@jurahul (Contributor, Author) commented Jul 2, 2025

Looks like templating adds a little bit to the code size. Building the RISCVDisassembler.cpp.o in a release config with/without this change results in the following:

Old:
196112 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o 
New:
196096 ./build/lib/Target/RISCV/Disassembler/CMakeFiles/LLVMRISCVDisassembler.dir/RISCVDisassembler.cpp.o

So, 16 bytes less. Not significant though.

@topperc (Collaborator) commented Jul 2, 2025

Could just be a difference in the name mangling of the function name? Or are you checking the .text size?

@jurahul (Contributor, Author) commented Jul 2, 2025

Yeah, your guess was right. I dumped the sizes with size -A and I see:

New:

.text._ZN12_GLOBAL__N_114decodeToMCInstEjN4llvm14MCDisassembler12DecodeStatusEmRNS0_6MCInstEmPKS1_Rb     27023

.text._ZN12_GLOBAL__N_117decodeInstructionEPKhRN4llvm6MCInstEmmPKNS2_14MCDisassemblerERKNS2_15MCSubtargetInfoE        9238 

Old:                                                                                                                                                                                                                                                  

.text._ZN12_GLOBAL__N_114decodeToMCInstImEEN4llvm14MCDisassembler12DecodeStatusEjS3_T_RNS1_6MCInstEmPKS2_Rb                             27023
 .text._ZN12_GLOBAL__N_117decodeInstructionImEEN4llvm14MCDisassembler12DecodeStatusEPKhRNS1_6MCInstET_mPKS2_RKNS1_15MCSubtargetInfoE      9238

That is, text sizes are the same but mangled names are different and that likely leads to larger object file sizes.

@topperc reopened this Jul 2, 2025
@jurahul (Contributor, Author) commented Jul 2, 2025

@topperc My question is still unanswered. WDYT of this new interface to opt in to this optimization?

Interface seems fine. Do you plan to opt-in all targets or just the ones that benefit?

Thanks. Maybe all targets? If nothing else, they probably benefit from compile time improvements due to non-templated code.

It's not many functions and not many callers per file, right? If there's a big compile time cost for templates, fieldFromInstruction feels like it would be a bigger issue than decodeToMCInst.

Right, I am not sure what exactly contributes to the compile time cost. But you're right, if we want, we can specialize fieldFromInstruction as well once we have the types. Maybe in a follow-on PR.

@topperc (Collaborator) commented Jul 2, 2025

Have you been able to measure a compile time cost or is this hypothetical?

@jurahul (Contributor, Author) commented Jul 2, 2025

Hypothetical at this point. But the original benefit of a ~20% reduction in size still applies (without requiring a switch to the largest type).

@jurahul (Contributor, Author) commented Jul 3, 2025

Anyway, since the interface is fine, let me at least finish the rest of the PR (unit tests etc.) and adopt it for AMDGPU (which uses the uint32_t, uint64_t, and DecoderUInt128 types). We can then see if we want to adopt it for other targets (either to generate non-templated code if a target uses a single type, or to reduce code size if it uses multiple types). Does that make sense?

@topperc (Collaborator) commented Jul 3, 2025

Anyway, since the interface is fine, let me at least finish the rest of the PR (unit tests etc) and adopt it for AMDGPU (which uses uint32/64/and DecoderUInt128) types. We can then see if we want to adopt for other targets (either to generate non-templated code if it uses a single type) or to reduce code size (if it uses multiple types). Does that make sense?

That makes sense.

@s-barannikov (Contributor) commented:

Can we store the type in the Instruction class in the .td files like the bitwidth instead of introducing a complex command line argument?

+1 for this.

we will still need to validate for example that for all instructions with a particular size, the type string is same

Why can't they be different?

@jurahul (Contributor, Author) commented Jul 3, 2025

Can we store the type in the Instruction class in the .td files like the bitwidth instead of introducing a complex command line argument?

+1 for this.

we will still need to validate for example that for all instructions with a particular size, the type string is same

Why can't they be different?

For all targets today, the same C++ type is used as the payload for all instructions of a given size. For example, all 32-bit instructions might use uint32_t. So if the type is specified on each instruction instance, we may run into the case of Inst1 { Size = 4, CPPType = "uint32_t" } and Inst2 { Size = 4, CPPType = "uint64_t" }. I am assuming we do not want to support something like this (we could, but then all 32-bit decoders would be replicated into the 64-bit InsnType code, which is the very thing this PR avoids), so we would need an additional sanity check that all instructions of a given size use the same CPPType for their payload. With the interface being proposed, we avoid that, because this information now lives in the InstrInfo class and is not replicated per instruction instance.

@topperc (Collaborator) commented Jul 3, 2025

What if we duplicated decodeMCInst and decodeInstruction based on bitwidth but left them as template functions. That would remove the unused switch cases in each decodeMCInst.
Unfortunately, we'd have to update targets to call the correct decodeInstruction with the bitwidth in the name.

@jurahul (Contributor, Author) commented Jul 3, 2025

What if we duplicated decodeMCInst and decodeInstruction based on bitwidth but left them as template functions. That would remove the unused switch cases in each decodeMCInst. Unfortunately, we'd have to update targets to call the correct decodeInstruction with the bitwidth in the name.

I don't understand this. Can you elaborate? Do you mean something like:

template <typename InsnType>
std::enable_if_t<sizeof(InsnType) * 8 == 32, DecodeStatus> /// use the proper syntax here
decodeToMCInst(...) {
  // just 32-bit decoder cases
}

template <typename InsnType>
std::enable_if_t<sizeof(InsnType) * 8 == 64, DecodeStatus> /// use the proper syntax here
decodeToMCInst(...) {
  // just 64-bit decoder cases
}

? So based on the actual size of the template type used, it's automatically routed to the appropriate variant? And then there is no change in any .td files? Maybe just a bool command-line option to opt in to this.

Or this could also be just an if constexpr (...) in the decodeInstruction function to route to the appropriate variant of a non-templated decodeToMCInst.
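
A minimal, compilable illustration of that if constexpr routing (toy types and return values; not the generated API):

#include <cstdint>
#include <cstdio>

// Non-templated per-width helpers; each would hold only its own decoder cases.
static int decodeToMCInst32(uint32_t Insn) { (void)Insn; return 32; }
static int decodeToMCInst64(uint64_t Insn) { (void)Insn; return 64; }

template <typename InsnType>
int decodeToMCInst(InsnType Insn) {
  // Route at compile time based on the size of the instantiating type.
  if constexpr (sizeof(InsnType) == sizeof(uint32_t))
    return decodeToMCInst32(Insn);
  else
    return decodeToMCInst64(Insn);
}

int main() {
  std::printf("%d %d\n", decodeToMCInst<uint32_t>(0), decodeToMCInst<uint64_t>(0));
}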

@jurahul (Contributor, Author) commented Jul 3, 2025

Actually, this won't work because we know that sizeof(CPPType) may not always match the instruction size (for example, uint64_t used for 48-bit instructions).

@topperc (Collaborator) commented Jul 3, 2025

I meant make decodeMCInst16, decodeMCInst32, and decodeMCInst48 for RISC-V using just the bitwidth from the instructions. And have them called by decodeInstruction16, decodeInstruction32, and decodeInstruction48. Update RISCVDisassembler.cpp to call decodeInstruction16/32/48 instead of decodeInstruction.

That decreases the size of the switch in each of the decodeMCInst functions. You've already shown there isn't much sharing between bit widths, so the split decodeMCInst functions shouldn't be much larger in total than the single decodeMCInst function we have today on targets that use the same C++ type. But the cost is the duplication of decodeInstruction for targets that use the same C++ type for all instructions. On targets that use different C++ types, decodeInstruction was already duplicated by the templating.

@jurahul force-pushed the decoder_emitter_type_specialization branch from 3bcc091 to d0c8b1d on July 4, 2025 00:05
@jurahul force-pushed the decoder_emitter_type_specialization branch from d0c8b1d to 1ecd40e on July 4, 2025 02:53
@jurahul (Contributor, Author) commented Jul 4, 2025

I have spent more time on this and refined the change a bit. Essentially, two things.

(1) I made the interface a little more structured:

// InstrDecoderOption - This class is used to provide some options to the
// TableGen DecoderEmitter backend.
class InstrDecoderOption<string ty, list<int> bws> {
  string CPPType = ty;       // C++ type for generating non-templated code.
  list<int> Bitwidths = bws; // List of bitwidths supported by the above type.

  assert !not(!empty(CPPType)), "CPP type cannot be empty";
  assert !not(!empty(Bitwidths)), "Bitwidths cannot be empty";
}

class InstrInfo {
...
  list<InstrDecoderOption> DecoderOptions = [];
}

def AMDGPUInstrInfo : InstrInfo {
  let guessInstructionProperties = 1;

  // Opt-in into non-templated code for instruction decoder.
  let DecoderOptions = [
    InstrDecoderOption<"uint32_t", [32]>,
    InstrDecoderOption<"uint64_t", [64]>,
    InstrDecoderOption<"DecoderUInt128", [96, 128]>,
  ];
}

and

(2) When this option is provided, we do not generate any templated code, so all functions in the emitted code are non-templated, including decodeInstruction. @topperc this is similar to what you suggested, except without the 16/32 suffix; these are just overloaded functions. To avoid coincidental errors, I made the Insn parameter a reference so that we don't inadvertently call the wrong one. I then measured the size of the AMDGPU disassembler code using the size -A build/lib/libLLVMAMDGPUDisassembler.so command. With that I see the following delta:

.text 378780 -> 244684, i.e., a 35% reduction in size
.rodata 352648 -> 334960, i.e., a 5% reduction in size

The reduction in .text size is expected. The reduction in data size is coincidental: because the decoder index is now assigned per C++ type, its value is generally smaller for the 2nd and 3rd C++ types, so the resulting ULEB128 encoding is smaller as well, giving a net reduction in the decoder table size.
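
(For reference, ULEB128 stores 7 value bits per byte, so a decoder index below 128 encodes in one byte, below 16384 in two bytes, and so on. A quick standalone check:)

#include <cstdint>
#include <cstdio>

// Bytes needed to ULEB128-encode a value: one byte per 7 bits of payload.
static unsigned ulebSize(uint64_t V) {
  unsigned Bytes = 0;
  do {
    ++Bytes;
    V >>= 7;
  } while (V != 0);
  return Bytes;
}

int main() {
  std::printf("%u %u %u\n", ulebSize(100), ulebSize(300), ulebSize(20000)); // 1 2 3
}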

Please let me know how this looks. I still need to add unit tests for this. Once this is in, I think there are 2 additional improvements here that can be made:

a. Improve the robustness. Encode the bitwidth as the first entry in the decoder table, and in the appropriate overload of decodeInstruction, fail immediately if it's not one of the supported ones for that variant.
b. If it's beneficial, change the NumToSkip encoding per C++ type. Because the code is now specialized per type, we can support the case where, say, all 32-bit tables are small enough to use 8 bits while only the 64-bit tables need 16 bits. I am not sure if this will pan out, but this per-size/per-C++-type specialization opens up the possibility.

@topperc (Collaborator) commented Jul 4, 2025

The .rodata reduction is cool. I suspected that might happen.

I kind of don't like that we have to describe the cpp types to tablegen at all. It feels like something we should be able to infer from the instruction widths. Needing to go look elsewhere to describe the types feels like it's going to confuse someone when they want to add a new bit width.

@s-barannikov what do you think?

@jurahul (Contributor, Author) commented Jul 4, 2025

Couple of points:

  1. Note that there is some precedent of describing CPP names or types in MLIR TableGen.
  2. Adding a new bitwidth to an existing target should be a rare event, so this should affect a relatively small number of developers.
  3. This is an opt-in, so by default folks will get the current behavior.

@jurahul (Contributor, Author) commented Jul 4, 2025

Actually, git grep CppTypeName shows a lot of hits in LLVM, so there is precedent on LLVM side as well.

@topperc (Collaborator) commented Jul 4, 2025

Actually, git grep CppTypeName shows a lot of hits in LLVM, so there is precedent on LLVM side as well.

Aren't those all part of the description for a searchable table? They're contained in the description of the table along with the other table properties. So it's at least connected and not some extra information off to the side somewhere. I think this goes back to my previous suggestion of it being part of the instruction class.

I also don't like it being opt-in. Ideally we wouldn't have two different ways of creating decoders that we need to maintain. It would be nice if out-of-tree developers could get the benefits without needing to stumble onto this new feature.

I'll think more about this over the weekend.

@jurahul (Contributor, Author) commented Jul 4, 2025

Here's one possibility (along the lines you suggested). If we do not want to pass in the C++ types, we have to keep generating templated code. But we can do the following:

  1. generate one variant of decodeToMCInst and decodeInstruction for each size. So decodeToMCInst<Sz>, decodeInstruction<Sz> for each size Sz.
  2. Encode the bitwidth/byte size of the instruction as the first byte of the decoder table.
  3. Generate a top-level decodeInstruction which looks like:
template <typename InsnType>
DecodeStatus decodeInstruction16(const uint8_t *DecoderTable) {
  if (*DecoderTable++ * 8 != 16)
    return MCDisassembler::Fail;
  // decoder table traversal
}

template <typename InsnType>
DecodeStatus decodeInstruction(const uint8_t *DecoderTable) {
  const uint8_t Bitwidth = DecoderTable[0] * 8;
  switch (Bitwidth) {
    case 16: return decodeInstruction16(...);
    case 32: return decodeInstruction32(...);
    default: return MCDisassembler::Fail;
  }
}

So the existing code continues to work but may be a little fatter due to the switch. Backends can switch to using the decodeInstruction16 etc. functions, and then they will see the code size reduction. However, that means they have to actively "opt in" by changing the API they call. I am not sure that's any better than what I have.

@s-barannikov (Contributor) commented:

I kind of don't like that we have to describe the cpp types to tablegen at all. It feels like something we should be able to infer from the instruction widths.

I agree that ideally we should be able to infer the type from the instruction size. However, InsnType can be a custom type like AMDGPU's DecoderUInt128. We will need to be able to synthesize such types or provide a template supporting arbitrary widths (a wrapper over std::bitset?).

I also don't like it being opt-in. Ideally we wouldn't have two different ways of creating decoders that we need to maintain. It would be nice if out of tree developers could get benefits without needing to stumble onto this new feature.

+1 as well.

@jurahul (Contributor, Author) commented Jul 8, 2025

The suggestion to use std::bitset<> seems reasonable. So we will use uint8_t/uint16_t/uint32_t/uint64_t when possible, else fall back to std::bitset<N>. @topperc WDYT? Given that this will be a backward-incompatible change, we still need a single bool option in InstrInfo, and once we have migrated all backends, we can plan to deprecate the templated code at some point in the future (giving enough time for downstream backends to migrate). Maybe that can be discussed at a later point.

@topperc (Collaborator) commented Jul 8, 2025

Can you do a field extract from a std::bitset without resorting to manually repacking the bits yourself?

@jurahul (Contributor, Author) commented Jul 8, 2025

I believe so, something like:

#include <bitset>
#include <cassert>
#include <cstdint>

template <size_t N>
uint64_t fieldFromInstruction(const std::bitset<N> &Insn, unsigned StartBit,
                              unsigned Size) {
  assert(StartBit + Size <= N && "Instruction field out of bounds!");
  assert(Size <= 64 && "Field too large to return in a uint64_t!");
  if (Size == N)
    return Insn.to_ullong();
  // Build the mask with bitset ops to avoid undefined behavior when Size == 64.
  std::bitset<N> Mask;
  Mask.set();
  Mask >>= (N - Size);
  return ((Insn >> StartBit) & Mask).to_ullong();
}
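
A quick standalone check of that sketch (hypothetical field values; assumes the fieldFromInstruction above is in scope):

#include <cstdio>

int main() {
  std::bitset<96> Insn;
  Insn |= std::bitset<96>(0xABCDULL) << 40; // plant a 16-bit field at bit 40
  std::printf("0x%llx\n",
              (unsigned long long)fieldFromInstruction(Insn, 40, 16)); // prints 0xabcd
}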

Also, it looks like the templated insertBits() is never actually instantiated for a non-integer type, so there is no need to support that with the bitset type. I will send a cleanup PR to change insertBits() to only support integer types.

@topperc (Collaborator) commented Jul 8, 2025

I do kind of like this. We could maybe use decodeInstruction as the name if the target only has one bit width. If there are multiple bit widths, require the target to call decodeInstruction16, decodeInstruction32, etc. That should be a pretty easy migration for out-of-tree targets with multiple bit widths.

@jurahul (Contributor, Author) commented Jul 8, 2025

So just to summarize, here's the delta being proposed from what is currently in the tree (a mini-design-RFC of sorts):

  1. Auto-deduce the C++ type based on the instruction bit width. The logic will be simple: use APInt for VarLenInsts, use uint8_t/uint16_t/uint32_t/uint64_t for 8/16/32/64-bit-wide instructions, else use std::bitset<N> (see the sketch below this list).
  2. For each bit width, generate a decodeInstruction<N> function using the auto-deduced type from (1) above. For targets with a single instruction bit width, the <N> suffix will be dropped.
  3. This entails changes to the disassembler code to use the appropriate auto-deduced C++ type for the instruction payload and to call the appropriate (optionally suffixed) function.
  4. Change the decoder table to encode the instruction bit width as the first entry in the table, as a quick sanity check: decoding fails if a decodeInstruction<N> function is called with a decoder table of mismatching instruction width. This is not applicable to variable-length instructions.
  5. We still want to keep supporting the current templated codegen for backward compatibility and for gradual migration of targets to the new scheme. So for now we will add a bit to InstrInfo to opt in to the new scheme and migrate one target at a time.
  6. After upstream targets are migrated, post a PSA on Discourse asking downstream backends to migrate within some timeframe, and then, at an appropriate point in the future, drop support for generating the templated code and delete the opt-in bit in InstrInfo.
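
A minimal sketch of the type deduction described in item 1 (a hypothetical helper, not the emitter's actual code; under this literal reading a 48-bit instruction would get std::bitset<48>):

#include <cstdio>
#include <string>

// Hypothetical helper mirroring item 1 above.
static std::string deduceInsnCPPType(unsigned BitWidth, bool IsVarLenInst) {
  if (IsVarLenInst)
    return "APInt";
  switch (BitWidth) {
  case 8:  return "uint8_t";
  case 16: return "uint16_t";
  case 32: return "uint32_t";
  case 64: return "uint64_t";
  default: return "std::bitset<" + std::to_string(BitWidth) + ">";
  }
}

int main() {
  std::printf("%s %s\n", deduceInsnCPPType(32, false).c_str(),
              deduceInsnCPPType(96, false).c_str()); // uint32_t std::bitset<96>
}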

@topperc @s-barannikov @mshockwave PTAL and see if this seems reasonable. Should we put this up on Discourse before proceeding? I am thinking this can be done in steps, with AMDGPU being the first target, followed by the other in-tree ones.

@jurahul requested a review from topperc July 8, 2025 23:36
@jurahul (Contributor, Author) commented Jul 9, 2025

Sent #147613 to limit insertBits() to integer types.
