Zen3 scheduler model for the latency of VEXTRACTF128rri is probably incorrect #146564

Open
@TiborGY

See also discussion at https://discourse.llvm.org/t/are-the-latencies-of-vextractf128-correct-for-zen2-3-in-mca/86422

LLVM MCA relies on LLVM's scheduler models to predict cycle counts. This is the predicted timeline graph for a small snippet on Zen3:

[0,0]     DeeeeeeeeER    .    .   vmovapd       (%rdi), %ymm0
[0,1]     D=eeeeeeeeeeER .    .   vsubpd        (%rsi), %ymm0, %ymm0
[0,2]     D===========eeeER   .   vmulpd        %ymm0, %ymm0, %ymm0
[0,3]     D==============eeeeER   vextractf128  $1, %ymm0, %xmm1
[0,4]     D==============eE---R   vmovhlps      %xmm0, %xmm0, %xmm2

As you can see, vextractf128 is predicted to have 4 cycles of latency. This, however, is inconsistent with both Agner Fog's latency tables (which list 3 cycles) and my own measurements with llvm-exegesis.
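
For readers less familiar with llvm-mca timelines, the 4-cycle figure can be read off by counting the lowercase `e` (executing) characters in each row. The following is a minimal Python sketch of that reading, assuming the number of `e` cycles corresponds to the modelled latency. The names `TIMELINE`, `ROW` and `executing_cycles` are hypothetical helpers for illustration and are not part of any LLVM tool.

```python
import re

# Timeline rows copied from the llvm-mca output quoted above.
TIMELINE = """\
[0,0]     DeeeeeeeeER    .    .   vmovapd       (%rdi), %ymm0
[0,1]     D=eeeeeeeeeeER .    .   vsubpd        (%rsi), %ymm0, %ymm0
[0,2]     D===========eeeER   .   vmulpd        %ymm0, %ymm0, %ymm0
[0,3]     D==============eeeeER   vextractf128  $1, %ymm0, %xmm1
[0,4]     D==============eE---R   vmovhlps      %xmm0, %xmm0, %xmm2
"""

# Timeline legend: D = dispatched, '=' = waiting in the scheduler,
# e = executing, E = executed, '-' = waiting to retire, R = retired.
# Group 1 captures the run of 'e' cycles, group 2 the mnemonic.
ROW = re.compile(r"D=*(e+)E-*R[\s.]*(\S+)")


def executing_cycles(timeline: str) -> dict[str, int]:
    """Map each mnemonic to the number of 'e' cycles in its timeline row."""
    cycles = {}
    for line in timeline.splitlines():
        match = ROW.search(line)
        if match:
            cycles[match.group(2)] = len(match.group(1))
    return cycles


if __name__ == "__main__":
    for mnemonic, count in executing_cycles(TIMELINE).items():
        print(f"{mnemonic:<14}{count}")
```

In this snippet vextractf128 shows four `e` cycles, versus three for vmulpd and one for vmovhlps.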

./llvm-exegesis -mode=latency -opcode-name=VEXTRACTF128rri -mcpu=znver3 --benchmark-repeat-count=100000 -min-instructions=1000  --repetition-mode=loop
---
mode:            latency
key:
  instructions:
    - 'VEXTRACTF128rri XMM0 YMM0 i_0x1'
  config:          ''
  register_initial_values:
    - 'YMM0=0x0'
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
min_instructions: 1000
measurements:
  - { key: latency, value: 3.15, per_snippet_value: 3.15, validation_counters: {} }
error:           ''
info:            Repeating a single explicitly serial instruction
assembled_snippet: 4883EC20C7042400000000C744240400000000C744240800000000C744240C00000000C744241000000000C744241400000000C744241800000000C744241C00000000C5FE6F04244883C42049B80200000000000000662E0F1F840000000000C4E37D19C001C4E37D19C0014983C0FF75EEC3
...
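
To make the comparison explicit, here is a small Python sketch that pulls the measured value out of the report line above and sets it against the two candidate figures. It is only an illustration: the constants are the numbers quoted in this issue, the name `measured_latency` is hypothetical, and rounding to the nearest whole cycle is an assumption about how to interpret the measurement.

```python
import re

# Measurement line copied from the llvm-exegesis report above.
REPORT_LINE = (
    "  - { key: latency, value: 3.15, per_snippet_value: 3.15, "
    "validation_counters: {} }"
)

MODELLED_CYCLES = 4   # latency shown in the llvm-mca timeline for znver3
AGNER_FOG_CYCLES = 3  # figure quoted in this issue from Agner Fog's tables


def measured_latency(line: str) -> float:
    """Extract the 'value' field of a latency measurement line."""
    match = re.search(r"key:\s*latency,\s*value:\s*([0-9.]+)", line)
    if match is None:
        raise ValueError("no latency measurement found in line")
    return float(match.group(1))


if __name__ == "__main__":
    measured = measured_latency(REPORT_LINE)
    nearest = round(measured)  # assumption: compare at whole-cycle granularity
    print(f"measured:        {measured} (nearest whole cycle: {nearest})")
    print(f"matches model:   {nearest == MODELLED_CYCLES}")
    print(f"matches A. Fog:  {nearest == AGNER_FOG_CYCLES}")
```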

Confusingly, AMD's official instruction latency table for Zen3 (Family_19h_Instruction_Latencies_version_1-00.xlsx, AMD Publication No. 56665 Revision 3.00 November 2020) lists vextractf128 as having 4 cycles of latency. Perhaps I am misinterpreting my measurement results, but I cannot see how that figure could be correct. My confidence in the accuracy of the official latency table is further eroded by the fact that the two vextractf128 variants are both listed with empty operand fields.
