The name Apolloclipse derives from Apollo, eclipse (CTRL^F "eclipse") and apocalypse.
An x86-64 bilateral instruction tokenizer implemented in C. It supports the following processor extensions: AES, AVX, AVX2, AVX512, FMA, MMX, SSE, SSE2, SSE3, SSE4, x87 (FPU), VMX. To ease testing, a disassembler that transforms tokens into compilable assembly (for the NASM assembler) has also been implemented.
The library performs bilateral conversion (machine code <=> tokens) between x86-64 machine code instructions and tokens (`AVL_instruction_t`, see token).
The token (`AVL_instruction_t`) holds all the available information about the x86-64 processor instruction it represents, in 32 bytes of data.
The AVL_instruction_t prototype is:
typedef struct
{
    uint32_t        i_flags;
    AVL_mnemonic_t  i_mnemonic;
    uint8_t         i_opcode[3];
    uint8_t         i_vp[3];
    uint8_t         i_mod_rm;
    uint8_t         i_sib;
    uint32_t        i_disp;
    uint8_t         i_size;
    AVL_reg_t       i_reg1;
    AVL_reg_t       i_reg2;
    AVL_reg_t       i_reg3;
    uint64_t        i_imm;
} AVL_instruction_t;
- The field `i_flags` has its own subsection here.
- The field `i_mnemonic` is an enum in which each value represents a unique mnemonic (e.g. `MOV`, `VPCMPUB`, `VPTERNLOGQ`, ...). The value `0` is reserved: an `i_mnemonic` with a value `== 0` represents an invalid instruction. The enum `AVL_mnemonic_t` can be found at the root of the tokenizer in the file `includes/user/AVL_mnemonic.h`.
- The field `i_opcode` holds the opcode map and the index used within it. It has its own subsection here.
- The field `i_vp` contains the raw data of the `VEX` prefix (2 or 3 bytes) or the last 3 bytes of the `EVEX` prefix. It has its own subsection here.
- The field `i_mod_rm` contains the raw data of the `ModR/M` byte of the instruction.
- The field `i_sib` contains the raw data of the `SIB` byte of the instruction. Addressing is not resolved by the tokenizer (see operands); however, you can easily resolve it using the `ModR/M`, the `SIB` and the `displacement`. A complete example is provided in the source code of the disassembler. Whether the `SIB` byte is used is defined by the `ModR/M`.
- The field `i_disp` contains the raw data of the `displacement` used in addressing (e.g. `mov rax, [rdi + 0x69420]`). The displacement size can be either 8 or 32 bits; it is defined by the `ModR/M` or the `SIB` byte.
- The field `i_size` holds the size of the instruction (prefixes, opcode, suffixes plus immediate data). An x86 instruction is from 1 to 15 bytes long.
- The fields `i_reg1`, `i_reg2`, `i_reg3` represent the operands used by the instruction. An operand can be either a register or memory. Further information, such as the state of each operand (read-only, read and write, or write-only) during the instruction execution, can be found here.
- The field `i_imm` contains the raw `immediate data` of the instruction. Some instructions encode an additional operand (like `ModR/M.rm`, `ModR/M.reg`, ...) in the upper 4 bits of the immediate byte (e.g. `VBLENDVPS`).
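Since the tokenizer leaves addressing unresolved, here is a minimal sketch of how the raw `ModR/M` and `SIB` bytes could tell whether a `SIB` byte is present and how wide the displacement is, following the standard x86-64 encoding rules. The type `sketch_addr_t` and the function `sketch_resolve_addr` are hypothetical names used for illustration only; they are not part of the library, and the disassembler source remains the reference example.

```c
#include <stdint.h>

/* Hypothetical result type, not part of the library. */
typedef struct
{
    int has_sib;      /* 1 if a SIB byte follows the ModR/M byte */
    int disp_bits;    /* displacement width: 0, 8 or 32 */
    int rip_relative; /* 1 for the [rip + disp32] form */
} sketch_addr_t;

/* Hypothetical helper: classify the addressing form from the raw bytes.
   `sib` is only looked at when the ModR/M byte announces a SIB byte. */
static sketch_addr_t sketch_resolve_addr(uint8_t modrm, uint8_t sib)
{
    sketch_addr_t out = {0, 0, 0};
    uint8_t mod = modrm >> 6;
    uint8_t rm = modrm & 0x7;

    if (mod == 3)                 /* register operand: no memory access */
        return out;
    out.has_sib = (rm == 4);      /* rm == 0b100 selects the SIB byte */
    if (mod == 1)
        out.disp_bits = 8;
    else if (mod == 2)
        out.disp_bits = 32;
    else if (rm == 5)             /* mod == 0, rm == 0b101: [rip + disp32] */
    {
        out.disp_bits = 32;
        out.rip_relative = 1;
    }
    else if (out.has_sib && (sib & 0x7) == 5)
        out.disp_bits = 32;       /* mod == 0, SIB.base == 0b101: no base, disp32 */
    return out;
}
```

For instance, the `ModR/M` byte `0x87` (as in `mov rax, [rdi + 0x69420]`) yields no `SIB` byte and a 32-bit displacement.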
The flags (`AVL_instruction_t.i_flags`) encode several pieces of information about the instruction, such as prefixes, suffixes (immediate data), operand size, operand modifiers and CPU flags.
| Byte / Bit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | lp_lock | lp_rpx | lp_rpnx | lp_fs | lp_gs | lp_nbr | lp_br | lp_opsz |
| 2 | lp_adsz | REX.B | REX.X | REX.R | REX.W | has_imm | is_evex | is_mdrm |
| 3 | om1_r | om1_w | om2_r | om2_w | om3_r | om3_w | fl_car | fl_par |
| 4 | fl_adj | fl_zero | fl_sign | fl_ovfw | OP_SZ | OP_SZ | OP_SZ | reserved |
- `lp_lock`: The instruction has the `LOCK` legacy prefix (executes certain read-modify-write instructions atomically).
  The `AVL_HAS_LP_LOCK_PFX(flags)` macro checks whether it is set.
- `lp_rpx`: Either the instruction has the `REPNE/Z` legacy prefix (repeat string handling) or the `0xF2` prefix is used for indexing within the opcode tables.
  The `AVL_HAS_LP_REPNX_PFX(flags)` macro checks whether it is set.
- `lp_rpnx`: Either the instruction has the `REPE/Z` legacy prefix (repeat string handling) or the `0xF3` prefix is used for indexing within the opcode tables.
  The `AVL_HAS_LP_REPX_PFX(flags)` macro checks whether it is set.
- `lp_fs`: The instruction has the `FS segment override` legacy prefix (use the FS segment instead of the default segment while addressing).
  The `AVL_HAS_LP_FS_PFX(flags)` macro checks whether it is set.
- `lp_gs`: The instruction has the `GS segment override` legacy prefix (use the GS segment instead of the default segment while addressing).
  The `AVL_HAS_LP_GS_PFX(flags)` macro checks whether it is set.
- `lp_nbr`: The instruction has the `branch not taken` legacy prefix (used to lessen the impact of branch misprediction (>= Pentium 4)). Weak hint.
  The `AVL_HAS_LP_NOBRANCH_PFX(flags)` macro checks whether it is set.
- `lp_br`: The instruction has the `branch taken` legacy prefix (used to lessen the impact of branch misprediction (>= Pentium 4)). Strong hint.
  The `AVL_HAS_LP_BRANCH_PFX(flags)` macro checks whether it is set.
- `lp_opsz`: Either the instruction has its operand size overridden or the `0x66` prefix is used for indexing within the opcode tables.
  The `AVL_HAS_LP_OPSZ_PFX(flags)` macro checks whether it is set.
- `lp_adsz`: The instruction overrides the address size to 32 bits.
  The `AVL_HAS_LP_ADDRSZ_PFX(flags)` macro checks whether it is set.
- `REX.B`: The instruction has the `REX` prefix with the bit `B` set. The value is used as an extension of the `ModR/M.rm` or the `SIB.base` field.
  The `AVL_HAS_REXB_PFX(flags)` macro checks whether it is set.
- `REX.X`: The instruction has the `REX` prefix with the bit `X` set. The value is used as an extension of the `SIB.index` field.
  The `AVL_HAS_REXX_PFX(flags)` macro checks whether it is set.
- `REX.R`: The instruction has the `REX` prefix with the bit `R` set. The value is used as an extension of the `ModR/M.reg` field.
  The `AVL_HAS_REXR_PFX(flags)` macro checks whether it is set.
- `REX.W`: The instruction has the `REX` prefix with the bit `W` set. Overrides the operand size.
  The `AVL_HAS_REXW_PFX(flags)` macro checks whether it is set.
- `has_imm`: The instruction has an `immediate value` in the field `AVL_instruction_t.i_imm`.
  The `AVL_HAS_OP_IMM_PFX(flags)` macro checks whether it is set.
- `is_evex`: The instruction has the `EVEX` prefix. The `AVL_instruction_t.i_vp` field holds the last 3 bytes of the `EVEX` prefix.
  The `AVL_HAS_OP_EVEX_PFX(flags)` macro checks whether it is set.
- `is_mdrm`: The instruction has a `ModR/M` byte. The `AVL_instruction_t.i_mod_rm` field holds its value.
  The `AVL_HAS_OP_MODRM_PFX(flags)` macro checks whether it is set.
- `om1_r`: The instruction, during execution, reads the operand held by `AVL_instruction_t.i_reg1`.
  The `AVL_OM1_IS_READ(flags)` macro checks whether it is set.
- `om1_w`: The instruction, during execution, might modify the operand held by `AVL_instruction_t.i_reg1`.
  The `AVL_OM1_IS_WRITE(flags)` macro checks whether it is set.
- `om2_r`: The instruction, during execution, reads the operand held by `AVL_instruction_t.i_reg2`.
  The `AVL_OM2_IS_READ(flags)` macro checks whether it is set.
- `om2_w`: The instruction, during execution, might modify the operand held by `AVL_instruction_t.i_reg2`.
  The `AVL_OM2_IS_WRITE(flags)` macro checks whether it is set.
- `om3_r`: The instruction, during execution, reads the operand held by `AVL_instruction_t.i_reg3`.
  The `AVL_OM3_IS_READ(flags)` macro checks whether it is set.
- `om3_w`: The instruction, during execution, might modify the operand held by `AVL_instruction_t.i_reg3`.
  The `AVL_OM3_IS_WRITE(flags)` macro checks whether it is set.
- `fl_car`: The instruction, on execution, might modify the `carry flag` status.
  The `AVL_HAS_AF_CARRY(flags)` macro checks whether it is set.
- `fl_par`: The instruction, on execution, might modify the `parity flag` status.
  The `AVL_HAS_AF_PARITY(flags)` macro checks whether it is set.
- `fl_adj`: The instruction, on execution, might modify the `adjust flag` status.
  The `AVL_HAS_AF_ADJUST(flags)` macro checks whether it is set.
- `fl_zero`: The instruction, on execution, might modify the `zero flag` status.
  The `AVL_HAS_AF_ZERO(flags)` macro checks whether it is set.
- `fl_sign`: The instruction, on execution, might modify the `sign flag` status.
  The `AVL_HAS_AF_SIGN(flags)` macro checks whether it is set.
- `fl_ovfw`: The instruction, on execution, might modify the `overflow flag` status.
  The `AVL_HAS_AF_OVERFLOW(flags)` macro checks whether it is set.
- `OP_SZ`: Unlike the previous flags, this is not a 1-bit flag: the `operand size` is encoded in 3 bits. The operand size can be either `AVL_OPSZ_BYTE` (1 byte), `AVL_OPSZ_WORD` (2 bytes), `AVL_OPSZ_DWORD` (4 bytes), `AVL_OPSZ_QWORD` (8 bytes), `AVL_OPSZ_DQWORD` (16 bytes), `AVL_OPSZ_QQWORD` (32 bytes) or `AVL_OPSZ_DQQWORD` (64 bytes).
  The `AVL_GET_OPERAND_SZ(flags)` macro retrieves the operand size.
  Furthermore, the following macros check the operand size against a given value: `AVL_OPSZ_IS_BYTE(flags)`, `AVL_OPSZ_IS_WORD(flags)`, `AVL_OPSZ_IS_DWORD(flags)`, `AVL_OPSZ_IS_QWORD(flags)`, `AVL_OPSZ_IS_DQWORD(flags)`, `AVL_OPSZ_IS_QQWORD(flags)`, `AVL_OPSZ_IS_DQQWORD(flags)`.
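To illustrate how 1-bit flags and a 3-bit operand-size field can share one 32-bit word, here is a minimal sketch. Everything in it is an assumption made for illustration: the `SK_`/`sk_` names are hypothetical, the bit positions are derived from the table above assuming that bit 1 of byte 1 is the least significant bit of `i_flags`, and the size code is assumed to be the base-2 logarithm of the size in bytes. The real encodings are the ones behind the `AVL_*` macros and may differ.

```c
#include <stdint.h>

/* Hypothetical bit positions (assumption: "byte 1, bit 1" is the LSB). */
#define SK_FLAG_LOCK  (UINT32_C(1) << 0)   /* byte 1, bit 1: lp_lock */
#define SK_FLAG_REXW  (UINT32_C(1) << 12)  /* byte 2, bit 5: REX.W   */
#define SK_OPSZ_SHIFT 28                   /* byte 4, bits 5..7: OP_SZ */
#define SK_OPSZ_MASK  (UINT32_C(7) << SK_OPSZ_SHIFT)

/* Test a 1-bit flag. */
static inline int sk_has_flag(uint32_t flags, uint32_t mask)
{
    return (flags & mask) != 0;
}

/* Store a 3-bit operand-size code without touching the other flags. */
static inline uint32_t sk_set_opsz(uint32_t flags, unsigned code)
{
    return (flags & ~SK_OPSZ_MASK) | ((uint32_t)(code & 7u) << SK_OPSZ_SHIFT);
}

/* Read the 3-bit operand-size code back. */
static inline unsigned sk_get_opsz(uint32_t flags)
{
    return (unsigned)((flags & SK_OPSZ_MASK) >> SK_OPSZ_SHIFT);
}

/* Assumed encoding: code = log2(size in bytes), so 0 -> 1 byte ... 6 -> 64 bytes. */
static inline unsigned sk_opsz_bytes(unsigned code)
{
    return 1u << code;
}
```

With this assumed encoding the seven documented sizes (1 to 64 bytes) fit in the codes 0 to 6, which is why 3 bits are enough.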
The opcode (`AVL_instruction_t.i_opcode`) is composed of 3 bytes of data. The first 2 bytes hold the opcode map and the last one holds the index within the map.
The first 2 bytes can be either:
- `[0x00][0x00]` for the unprefixed opcode map.
- `[0x0F][0x00]` for the two-byte opcode map.
- `[0x0F][0x38]` for the three-byte 0x38 opcode map.
- `[0x0F][0x3A]` for the three-byte 0x3A opcode map.

If the map is unprefixed and the index is in the range 0xD8 <= INDEX <= 0xDF, the instruction is escaped to the x87 opcode maps.
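The map selection described above can be sketched as a small classifier over the 3 opcode bytes. The enum `sk_map_t` and the function `sk_classify_opcode` are hypothetical names, not library API; the sketch only restates the byte values listed above.

```c
#include <stdint.h>

/* Hypothetical map identifiers, for illustration only. */
typedef enum
{
    SK_MAP_UNPREFIXED, /* [0x00][0x00] */
    SK_MAP_X87,        /* unprefixed map, index in 0xD8..0xDF */
    SK_MAP_0F,         /* [0x0F][0x00] */
    SK_MAP_0F38,       /* [0x0F][0x38] */
    SK_MAP_0F3A,       /* [0x0F][0x3A] */
    SK_MAP_UNKNOWN
} sk_map_t;

/* Classify an i_opcode-like array: bytes 0..1 select the map,
   byte 2 is the index within the map. */
static sk_map_t sk_classify_opcode(const uint8_t opcode[3])
{
    if (opcode[0] == 0x00 && opcode[1] == 0x00)
    {
        if (opcode[2] >= 0xD8 && opcode[2] <= 0xDF)
            return SK_MAP_X87;
        return SK_MAP_UNPREFIXED;
    }
    if (opcode[0] != 0x0F)
        return SK_MAP_UNKNOWN;
    switch (opcode[1])
    {
        case 0x00: return SK_MAP_0F;
        case 0x38: return SK_MAP_0F38;
        case 0x3A: return SK_MAP_0F3A;
        default:   return SK_MAP_UNKNOWN;
    }
}
```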
If the instruction has a VEX prefix, its raw data can be found in the `AVL_instruction_t.i_vp` (3 bytes) field.
If the instruction has an EVEX prefix, the raw data of its last 3 bytes can also be found in the `AVL_instruction_t.i_vp` field.
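As an illustration of what those raw bytes contain, here is a minimal sketch that decodes a 3-byte VEX prefix with shifts and masks, which avoids depending on the compiler's bit-field allocation order. The type `sk_vex3_t` and the function `sk_decode_vex3` are hypothetical names, not library API. The field positions follow the architectural VEX layout; the sketch returns `R`, `X`, `B` and `vvvv` un-inverted (they are stored inverted in the encoding), which is a choice of this example and not necessarily what the library does.

```c
#include <stdint.h>

/* Hypothetical decoded view of a 3-byte VEX prefix. */
typedef struct
{
    uint8_t r, x, b; /* REX-like extension bits, un-inverted */
    uint8_t map;     /* opcode map selector (5 bits) */
    uint8_t w;       /* VEX.W */
    uint8_t vvvv;    /* extra register operand, un-inverted */
    uint8_t l;       /* vector length: 0 = 128 bits, 1 = 256 bits */
    uint8_t pp;      /* implied prefix: 0 none, 1 0x66, 2 0xF3, 3 0xF2 */
} sk_vex3_t;

/* raw[0] is the 0xC4 header, raw[1] and raw[2] are the payload bytes. */
static sk_vex3_t sk_decode_vex3(const uint8_t raw[3])
{
    sk_vex3_t v;

    v.r = ((raw[1] >> 7) & 1) ^ 1;
    v.x = ((raw[1] >> 6) & 1) ^ 1;
    v.b = ((raw[1] >> 5) & 1) ^ 1;
    v.map = raw[1] & 0x1F;
    v.w = (raw[2] >> 7) & 1;
    v.vvvv = (~(raw[2] >> 3)) & 0xF;
    v.l = (raw[2] >> 2) & 1;
    v.pp = raw[2] & 0x3;
    return v;
}
```

For example, the prefix bytes `C4 E2 79` decode to map 2 (the 0x0F 0x38 map), an implied `0x66` prefix, a 128-bit vector length and no register extension bits.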
Furthermore, some types have been implemented to give easy access to the VEX/EVEX elements.
The `i_vp` field is polymorphic: it can be cast to `AVL_vex_t`, `AVL_vex2_t` or `AVL_evex_t` depending on the nature of the instruction. I recommend using this sequence to determine which one to use:
#include <AVL_disassembler.h>

AVL_instruction_t* inst;
/* ... */
if (AVL_HAS_OP_EVEX_PFX(inst->i_flags))
{
    // has EVEX prefix
}
else if (AVL_ISVEX3_PFX(inst))
{
    // has VEX 3 bytes prefix
}
else if (AVL_ISVEX2_PFX(inst))
{
    // has VEX 2 bytes prefix
}
The prototypes of the VEX/EVEX prefix types:
/// Vector EXtension (VEX) 3-bytes prefix.
typedef struct
{
    union
    {
        struct
        {
            uint8_t vx_header;   // Mandatory VEX 3-bytes prefix, always 0xC4.
            uint8_t vx_opmap:5;  // Opcode Map Prefix(es).
            uint8_t vx_rexb:1;   // VEX REX.B bit.
            uint8_t vx_rexx:1;   // VEX REX.X bit.
            uint8_t vx_rexr:1;   // VEX REX.R bit.
            uint8_t vx_prefix:2; // Instruction prefix.
            uint8_t vx_vlen:1;   // Vector Operand Size, either 128-bits or 256-bits.
            uint8_t vx_vvvv:4;   // Additional Instruction Argument.
            uint8_t vx_rexw:1;   // VEX REX.W bit.
        };
        uint8_t v_rawdat[3];
    };
} AVL_vex_t;
/// Vector EXtension (VEX) 2-bytes prefix.
typedef struct
{
    union
    {
        struct
        {
            uint8_t vx2_header;   // Mandatory VEX 2-bytes prefix, always 0xC5.
            uint8_t vx2_prefix:2; // Instruction prefix.
            uint8_t vx2_vlen:1;   // Vector Operand Size, either 128-bits or 256-bits.
            uint8_t vx2_vvvv:4;   // Additional Instruction Argument.
            uint8_t vx2_rexr:1;   // VEX REX.R bit.
        };
        uint8_t vx2_rawdat[3];
    };
} AVL_vex2_t;
/// Enhanced Vector EXtension (EVEX) prefix.
typedef struct
{
    union
    {
        struct
        {
            uint8_t evx_opmap:2;  // Opcode Map Prefix(es).
            uint8_t __evx_ZeRo:2; // Reserved, always 0b00.
            uint8_t evx_rexr2:1;  // Extends EVEX REX.R (R').
            uint8_t evx_rexb:1;   // EVEX REX.B bit.
            uint8_t evx_rexx:1;   // EVEX REX.X bit.
            uint8_t evx_rexr:1;   // EVEX REX.R bit.
            uint8_t evx_prefix:2; // Instruction prefix.
            uint8_t __evx_ZeRo:1; // Reserved, always 0b1.
            uint8_t evx_vvvv:4;   // Additional Instruction Argument.
            uint8_t evx_rexw:1;   // EVEX REX.W bit.
            uint8_t evx_mask:3;   // Operand Mask Register.
            uint8_t evx_v:1;      // Expands EVEX.VVVV.
            uint8_t evx_brcst:1;  // Source Broadcast, Rounding Control or Suppress Exceptions.
            uint8_t evx_vlen:1;   // If == 1, operand size is 256-bits, else 128-bits.
            uint8_t evx_vlen2:1;  // If == 1, operand size is 512-bits (overrides EVEX.L (evx_vlen)).
            uint8_t evx_zero:1;   // Specify merging mode (merge or zero).
        };
        uint8_t evx_rawdat[3];
    };
} AVL_evex_t;
The operands are represented by the 3 fields `i_reg[X]` of the `AVL_instruction_t` type.
An operand can be either a register or memory.
This is a brief overview of the available operands; the full list can be found in `includes/user/AVL_register.h`:
- Memory: `AVL_OP_MEM8` to `AVL_OP_MEM512`.
- General Purpose Registers: `AVL_OP_AL` to `AVL_OP_R15`.
- Segment Registers: `AVL_OP_ES` to `AVL_OP_GS`.
- Control Registers: `AVL_OP_CR0` to `AVL_OP_CR15`.
- Debug Registers: `AVL_OP_DR0` to `AVL_OP_DR15`.
- Stack (FPU) "Registers": `AVL_OP_ST0` to `AVL_OP_ST7`.
- MMX Registers: `AVL_OP_MMX0` to `AVL_OP_MMX7`.
- XMM Registers: `AVL_OP_XMM0` to `AVL_OP_XMM31`.
- YMM Registers: `AVL_OP_YMM0` to `AVL_OP_YMM31`.
- ZMM Registers: `AVL_OP_ZMM0` to `AVL_OP_ZMM31`.
- K Registers: `AVL_OP_K0` to `AVL_OP_K7`.
Also, information about the state of each operand during execution is available through the macros:
- `AVL_OM[X]_IS_READ`: On execution, the operand `i_reg[X]` is read.
- `AVL_OM[X]_IS_WRITE`: On execution, the operand `i_reg[X]` is written.
Some utilities have been implemented to "play with the instructions":
- `void AVL_disassemble_instructions(AVL_instruction_t* dest, uint64_t destlen, const uint8_t** text);`
  Tokenizes `destlen` instructions into `dest` from a pointer to an x86-64 machine code address (`*text`). Note that, on each call, the address pointed to by `text` is advanced to the beginning of the next instruction.
- `void AVL_assemble_instructions(uint8_t* dest, AVL_instruction_t src[], uint64_t amount);`
  Converts `amount` tokens (`src`) into x86-64 machine code; the result is written to the `dest` address.
- `uint64_t AVL_inst_iszeroed(AVL_instruction_t* const target);`
  Performs a zero check on the token pointed to by `target`. If all the pointed data is equal to 0, returns non-zero.
- `uint64_t AVL_inst_getlen(AVL_instruction_t insts[], uint64_t limit);`
  Returns the length of the first sequence of non-zeroed tokens in `insts`, with an upper bound of `limit` iterations.
- `AVL_instruction_t* AVL_inst_find(AVL_instruction_t insts[], AVL_mnemonic_t key, uint64_t insts_len);`
  Searches the array `insts` for a token whose mnemonic matches `key`, with an upper bound of `insts_len` tokens.
- `typedef uint64_t (*const AVL_condition_t)(AVL_instruction_t* const);`
  `AVL_instruction_t* AVL_inst_findif(AVL_instruction_t insts[], uint64_t insts_len, AVL_condition_t cond);`
  Searches `insts` for a token matching the condition `cond`, with an upper bound of `insts_len` tokens.
- `void AVL_inst_insert(AVL_instruction_t* const dest, uint64_t destlen, AVL_instruction_t* const src, uint64_t srclen);`
  Inserts the `srclen` tokens of `src` after the `dest` address, which is followed by at least `destlen` tokens.
- `void AVL_inst_erase(AVL_instruction_t* const target, uint64_t amount, uint64_t targetlen);`
  Erases `amount` tokens at the `target` address, which is followed by at least `targetlen` tokens.
- `void AVL_inst_swap(AVL_instruction_t* const l, AVL_instruction_t* const r);`
  Swaps the data between `l` and `r`.
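To make the "zeroed token" convention concrete, here is a minimal sketch of how a zero check and a length scan could work on an array of 32-byte tokens. It is not the library's implementation: `sk_instruction_t` is a stand-in that copies the documented field layout, the typedefs `sk_mnemonic_t` and `sk_reg_t` are assumed sizes for `AVL_mnemonic_t` and `AVL_reg_t`, and the scan assumes the sequence starts at index 0.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t sk_mnemonic_t; /* assumed size of AVL_mnemonic_t */
typedef uint8_t  sk_reg_t;      /* assumed size of AVL_reg_t */

/* Stand-in token with the documented field order (32 bytes, no padding
   under the assumed sizes above). */
typedef struct
{
    uint32_t      i_flags;
    sk_mnemonic_t i_mnemonic;
    uint8_t       i_opcode[3];
    uint8_t       i_vp[3];
    uint8_t       i_mod_rm;
    uint8_t       i_sib;
    uint32_t      i_disp;
    uint8_t       i_size;
    sk_reg_t      i_reg1;
    sk_reg_t      i_reg2;
    sk_reg_t      i_reg3;
    uint64_t      i_imm;
} sk_instruction_t;

/* Non-zero if every byte of the token is 0. */
static uint64_t sk_inst_iszeroed(const sk_instruction_t* target)
{
    static const sk_instruction_t zero = {0};

    return memcmp(target, &zero, sizeof(zero)) == 0;
}

/* Count tokens from index 0 until a zeroed token or `limit` is reached. */
static uint64_t sk_inst_getlen(const sk_instruction_t insts[], uint64_t limit)
{
    uint64_t len = 0;

    while (len < limit && !sk_inst_iszeroed(&insts[len]))
        len++;
    return len;
}
```

A zeroed token therefore acts as a terminator, much like the NUL byte of a C string, while `limit` protects against arrays that lack one.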
- The `AVL_HAS_OP_VEX_PFX(inst)` macro checks whether the instruction has a `VEX` prefix. Note: this macro should always be preceded by an `AVL_HAS_OP_EVEX_PFX(flags)` check: since both `EVEX` and `VEX` fill the `i_vp` field, this macro returns true in both cases.
  Another solution is the `(AVL_ISVEX2_PFX(inst) || AVL_ISVEX3_PFX(inst)) != 0` expression, which is not true when the instruction has an `EVEX` prefix.
- The macros `AVL_ISVEX2_PFX(inst)` and `AVL_ISVEX3_PFX(inst)` check, respectively, whether the instruction has a 2-byte or a 3-byte `VEX` prefix.
- The `AVL_GET_EVEX_VVVV(evex)` macro gets the extended value of the `EVEX.VVVV` field.
- The `AVL_GET_MODRM_MOD(modrm)` macro gets the value of the `ModR/M.mod` field.
- The `AVL_GET_MODRM_RM(inst)` macro gets the extended value of the `ModR/M.rm` field.
- The `AVL_GET_MODRM_REG(inst)` macro gets the extended value of the `ModR/M.reg` field.
- The `AVL_GET_SIB_SCALE(sib)` macro gets the value of the `SIB.scale` field.
- The `AVL_GET_SIB_BASE(inst)` macro gets the extended value of the `SIB.base` field.
- The `AVL_GET_SIB_INDEX(inst)` macro gets the extended value of the `SIB.index` field.
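To show what an "extended value" means here, the following minimal sketch extracts the `ModR/M` and `SIB` fields and widens the 3-bit register fields to 4 bits with the matching `REX` bit, following the standard x86-64 encoding. The `sk_` functions are hypothetical and take the `REX` bits as explicit arguments; the library macros take the instruction itself and may be implemented differently.

```c
#include <stdint.h>

/* ModR/M layout: mod (bits 7..6), reg (bits 5..3), rm (bits 2..0). */
static inline uint8_t sk_modrm_mod(uint8_t modrm)
{
    return modrm >> 6;
}

/* reg extended by REX.R to a 4-bit register number. */
static inline uint8_t sk_modrm_reg(uint8_t modrm, int rex_r)
{
    return (uint8_t)(((modrm >> 3) & 0x7) | (rex_r ? 0x8 : 0x0));
}

/* rm extended by REX.B to a 4-bit register number. */
static inline uint8_t sk_modrm_rm(uint8_t modrm, int rex_b)
{
    return (uint8_t)((modrm & 0x7) | (rex_b ? 0x8 : 0x0));
}

/* SIB layout: scale (bits 7..6), index (bits 5..3), base (bits 2..0).
   The raw scale field s means a multiplication factor of 1 << s. */
static inline uint8_t sk_sib_scale(uint8_t sib)
{
    return sib >> 6;
}

/* index extended by REX.X. */
static inline uint8_t sk_sib_index(uint8_t sib, int rex_x)
{
    return (uint8_t)(((sib >> 3) & 0x7) | (rex_x ? 0x8 : 0x0));
}

/* base extended by REX.B. */
static inline uint8_t sk_sib_base(uint8_t sib, int rex_b)
{
    return (uint8_t)((sib & 0x7) | (rex_b ? 0x8 : 0x0));
}
```

For example, the `SIB` byte `0xAC` with `REX.X` and `REX.B` set describes index register 13 and base register 12 with a scale factor of 4, i.e. `[r12 + r13*4]`.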
Some instructions such as `Jcc`, `JMP`, `CALL` and `RET` modify the `rip` pointer with different bounds, which are specified by their specializations. These macros identify those bounds:
- `AVL_IS_JCC_SHORT(inst)`: Is `short` conditional jump.
- `AVL_IS_JCC_LONG(inst)`: Is `long` conditional jump.
- `AVL_IS_JMP_SHORT(inst)`: Is `short` jump.
- `AVL_IS_JMP_NEAR(inst)`: Is `near` jump.
- `AVL_IS_JMP_FAR(inst)`: Is `far` jump.
- `AVL_IS_CALL_NEAR(inst)`: Is `near` call.
- `AVL_IS_CALL_FAR(inst)`: Is `far` call.
- `AVL_IS_RET_NEAR(inst)`: Is `near` return.
- `AVL_IS_RET_FAR(inst)`: Is `far` return.
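As an illustration of how such specializations can be told apart, here is a minimal sketch that classifies the direct (relative or plain return) forms from the 3 opcode bytes, using the `[map][map][index]` layout described for `i_opcode`. The names `sk_branch_t` and `sk_classify_branch` are hypothetical, the opcode values are the architectural ones, and the indirect forms encoded through `0xFF` with a `ModR/M.reg` selector (including far jumps and calls) are deliberately left out. This is not how the library macros are necessarily implemented.

```c
#include <stdint.h>

/* Hypothetical classification result, for illustration only. */
typedef enum
{
    SK_BR_NONE,
    SK_BR_JCC_SHORT, /* 0x70..0x7F, rel8 */
    SK_BR_JCC_LONG,  /* 0x0F 0x80..0x8F, rel32 */
    SK_BR_JMP_SHORT, /* 0xEB, rel8 */
    SK_BR_JMP_NEAR,  /* 0xE9, rel32 */
    SK_BR_CALL_NEAR, /* 0xE8, rel32 */
    SK_BR_RET_NEAR,  /* 0xC3, 0xC2 imm16 */
    SK_BR_RET_FAR    /* 0xCB, 0xCA imm16 */
} sk_branch_t;

/* opcode[0..1] select the map, opcode[2] is the index within the map. */
static sk_branch_t sk_classify_branch(const uint8_t opcode[3])
{
    uint8_t idx = opcode[2];

    if (opcode[0] == 0x00 && opcode[1] == 0x00)
    {
        if (idx >= 0x70 && idx <= 0x7F)
            return SK_BR_JCC_SHORT;
        switch (idx)
        {
            case 0xEB: return SK_BR_JMP_SHORT;
            case 0xE9: return SK_BR_JMP_NEAR;
            case 0xE8: return SK_BR_CALL_NEAR;
            case 0xC2:
            case 0xC3: return SK_BR_RET_NEAR;
            case 0xCA:
            case 0xCB: return SK_BR_RET_FAR;
            default:   return SK_BR_NONE;
        }
    }
    if (opcode[0] == 0x0F && opcode[1] == 0x00 && idx >= 0x80 && idx <= 0x8F)
        return SK_BR_JCC_LONG;
    return SK_BR_NONE;
}
```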
To perform tests, a function that disassembles tokens into NASM syntax has been implemented. By "NASM syntax" I mean that every instruction can be displayed as compilable assembly code. It may also serve as an example of how to use the tokens; the file is `srcs/tests/fprint_instruction.c`. The prototype of the disassembler is:
// file: includes/dev/tests.h
void fprint_instruction(FILE* where, AVL_instruction_t* const target);
The current main takes a file as argument (the file must contain x86-64 machine code) and disassembles its instructions to stdout in NASM-compilable format.
Here is a sample of the first lines of the code generated by the disassembler when disassembling the object file that results from compiling `srcs/tests/samples/avx512.S` (disassembler output: `${TESTDIR}/avx512.log.S` after running the automated tests):
vpaddb xmm31 {k1}, xmm30, xmm29
vpaddb xmm4 {k1} {z}, xmm14, xmm1
vpaddb xmm4 {k1}, xmm14, [r12]
vpaddb xmm4 {k1} {z}, xmm14, [r12]
Here are some non-processor-extension instructions also generated by the disassembler (from `${TESTDIR}/basic.log.S` after running the automated tests):
imul r8b
imul BYTE [r8]
imul r8w
imul WORD [r8]
imul r8d
imul DWORD [r8]
imul r8
imul QWORD [r8]
imul r8w, r9w
imul r8w, WORD [r9]
imul r8d, r9d
imul r8d, DWORD [r9]
imul r8, r9
imul r8, QWORD [r9]
imul r8w, r9w, 0x8
imul r8w, WORD [r9], 0x8
imul r8d, r9d, 0x8
imul r8d, DWORD [r9], 0x8
imul r8, r9, 0x8
imul r8, QWORD [r9], 0x8
imul r8w, r9w, 0x6969
imul r8w, WORD [r9], 0x6969
imul r8d, r9d, 0x69696969
imul r8d, DWORD [r9], 0x69696969
imul r8, r9, 0x69696969
imul r8, QWORD [r9], 0x69696969
Yesss, looks like fresh compilable assembly code ... :)
The tests are performed through the script `tester.sh`. All the instructions (and all their specializations) of the default set plus the processor extensions previously listed are tested. The files containing all the instructions can be found in the `srcs/tests/samples/` directory.
For each test file, the script first compiles the file, then extracts the `.text` section into a temporary file which is used as the disassembler input. The disassembler outputs compilable NASM assembly, which is then assembled with NASM. Finally, the script checks the diff (through objdump) between the object file built from the disassembler output and the object file compiled at the beginning.
All the log files are preserved. Take a look at the script if you wanna see these files.