Description
See also discussion at https://discourse.llvm.org/t/are-the-latencies-of-vextractf128-correct-for-zen2-3-in-mca/86422
LLVM MCA relies on LLVM's scheduler models to predict cycle counts. This is the predicted timeline graph for a small snippet on Zen3:
[0,0] DeeeeeeeeER . . vmovapd (%rdi), %ymm0
[0,1] D=eeeeeeeeeeER . . vsubpd (%rsi), %ymm0, %ymm0
[0,2] D===========eeeER . vmulpd %ymm0, %ymm0, %ymm0
[0,3] D==============eeeeER vextractf128 $1, %ymm0, %xmm1
[0,4] D==============eE---R vmovhlps %xmm0, %xmm0, %xmm2
As you can see, vextractf128 is predicted to have 4 cycles of latency. This, however, is inconsistent with both Agner Fog's latency tables (which list 3 cycles) and my own measurements with llvm-exegesis:
./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3 --benchmark-repeat-count=100000 -min-instructions=1000 --repetition-mode=loop
---
mode: latency
key:
instructions:
- 'VEXTRACTF128rri XMM0 YMM0 i_0x1'
config: ''
register_initial_values:
- 'YMM0=0x0'
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
min_instructions: 1000
measurements:
- { key: latency, value: 3.15, per_snippet_value: 3.15, validation_counters: {} }
error: ''
info: Repeating a single explicitly serial instruction
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F04244883C42049B80200000000000000662E0F1F840000000000C4E37D19C001C4E37D19C0014983C0FF75EEC3
...
Confusingly, AMD's official instruction latency table for Zen3 (Family_19h_Instruction_Latencies_version_1-00.xlsx, AMD Publication No. 56665, Revision 3.00, November 2020) lists vextractf128 as having 4 cycles of latency. Perhaps I am misinterpreting my measurement results, but I cannot see how that figure could be correct. My confidence in the accuracy of the official latency table is further eroded by the fact that the two vextractf128 variants are both listed with empty operand fields.
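
For anyone who wants to double-check how the 4-cycle figure is read off the timeline, here is a minimal Python sketch. It assumes that each lowercase e in a timeline row marks one cycle in which the instruction is executing, which is how the llvm-mca documentation describes the timeline view. The regular expression and the names ROW and modeled_latencies are illustrative only and not part of any LLVM tool; the measured value is copied from the llvm-exegesis output above.

```python
import re

# Timeline rows copied verbatim from the llvm-mca output in this report.
TIMELINE = """\
[0,0] DeeeeeeeeER . . vmovapd (%rdi), %ymm0
[0,1] D=eeeeeeeeeeER . . vsubpd (%rsi), %ymm0, %ymm0
[0,2] D===========eeeER . vmulpd %ymm0, %ymm0, %ymm0
[0,3] D==============eeeeER vextractf128 $1, %ymm0, %xmm1
[0,4] D==============eE---R vmovhlps %xmm0, %xmm0, %xmm2
"""

# D = dispatched, '=' = waiting in the scheduler, 'e' = executing,
# E = executed, '-' = waiting to retire, R = retired.
# Group 1 captures the run of 'e' characters, group 2 the mnemonic.
ROW = re.compile(
    r"^\[\d+,\d+\]\s+[. ]*D=*(e+)E-*R[. ]*\s+(\S+)",
    re.MULTILINE,
)

def modeled_latencies(timeline):
    """Return (mnemonic, executing-cycle count) for each timeline row."""
    return [(m.group(2), len(m.group(1))) for m in ROW.finditer(timeline)]

# Latency reported by llvm-exegesis for VEXTRACTF128rri on znver3 (see above).
MEASURED_VEXTRACTF128 = 3.15

if __name__ == "__main__":
    for mnemonic, cycles in modeled_latencies(TIMELINE):
        print(f"{mnemonic:14s} modeled executing cycles: {cycles}")
    modeled = dict(modeled_latencies(TIMELINE))["vextractf128"]
    print(f"vextractf128: modeled {modeled}, measured {MEASURED_VEXTRACTF128}")
```

Under that assumption, the vextractf128 row yields 4 executing cycles, against a measured latency of 3.15.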