Commit 1201f22

sean-jc authored and bonzini committed
KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init

Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initialization to
avoid the overhead of CPUID during runtime.  The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves do not depend on XCR0 or XSS values, i.e.
are constant for a given CPU, and thus can be cached during module load.

On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values.  The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.

As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:

  SKX 11650
  ICX 22350
  EMR 28850

Using cached values, the latency drops to:

  SKX 6850
  ICX 9000
  EMR 7900

The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR.  The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leaves is significantly slower).

  SKX:
    CPUID.0xD.2  = 348 cycles
    CPUID.0xD.3  = 400 cycles
    CPUID.0xD.4  = 276 cycles
    CPUID.0xD.5  = 236 cycles
    <other sub-leaves are similar>

  EMR:
    CPUID.0xD.2  = 1138 cycles
    CPUID.0xD.3  = 1362 cycles
    CPUID.0xD.4  = 1068 cycles
    CPUID.0xD.5  = 910 cycles
    CPUID.0xD.6  = 914 cycles
    CPUID.0xD.7  = 1350 cycles
    CPUID.0xD.8  = 734 cycles
    CPUID.0xD.9  = 766 cycles
    CPUID.0xD.10 = 732 cycles
    CPUID.0xD.11 = 718 cycles
    CPUID.0xD.12 = 734 cycles
    CPUID.0xD.13 = 1700 cycles
    CPUID.0xD.14 = 1126 cycles
    CPUID.0xD.15 = 898 cycles
    CPUID.0xD.16 = 716 cycles
    CPUID.0xD.17 = 748 cycles
    CPUID.0xD.18 = 776 cycles

Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandatory
intercept on both Intel and AMD.  E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2.  That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.

Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

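
For readers who want a rough feel for per-sub-leaf costs like the ones listed in the commit message, the following is a minimal userspace sketch, not the method used for the commit's numbers. It assumes an x86 CPU and a GCC- or Clang-compatible compiler; the helper names `cpuid_count_raw()` and `cpuid_0xd_avg_ticks()` are hypothetical. It returns an average TSC delta, which is not necessarily the same unit as core cycles, and inside a VM every CPUID exits to the hypervisor, so results will differ from bare-metal figures.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>	/* __rdtsc() */

/* Hypothetical helper: execute CPUID for the given leaf and sub-leaf. */
static inline void cpuid_count_raw(uint32_t leaf, uint32_t subleaf,
				   uint32_t *a, uint32_t *b,
				   uint32_t *c, uint32_t *d)
{
	__asm__ __volatile__("cpuid"
			     : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
			     : "a"(leaf), "c"(subleaf));
}
#endif

/* Keeps the CPUID outputs observably used. */
static volatile uint32_t cpuid_sink;

/*
 * Average TSC ticks per CPUID.0xD.<subleaf> over 'iters' back-to-back
 * executions.  Returns 0 if 'iters' is 0 or the build target is not x86.
 */
uint64_t cpuid_0xd_avg_ticks(uint32_t subleaf, uint32_t iters)
{
#if defined(__x86_64__) || defined(__i386__)
	uint32_t eax, ebx, ecx, edx, i;
	uint64_t start, end;

	if (iters == 0)
		return 0;

	start = __rdtsc();
	for (i = 0; i < iters; i++) {
		cpuid_count_raw(0xD, subleaf, &eax, &ebx, &ecx, &edx);
		cpuid_sink = eax ^ ebx ^ ecx ^ edx;
	}
	end = __rdtsc();

	return (end - start) / iters;
#else
	(void)subleaf;
	(void)iters;
	return 0;
#endif
}
```

Calling `cpuid_0xd_avg_ticks(2, 100000)` would give an averaged figure for the AVX sub-leaf; pinning the thread to one CPU and repeating the run several times should reduce noise.
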
1 parent 3154bdd commit 1201f22

3 files changed: +29 -5 lines changed

arch/x86/kvm/cpuid.c

Lines changed: 26 additions & 5 deletions
@@ -36,6 +36,26 @@
 u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
 EXPORT_SYMBOL_GPL(kvm_cpu_caps);
 
+struct cpuid_xstate_sizes {
+	u32 eax;
+	u32 ebx;
+	u32 ecx;
+};
+
+static struct cpuid_xstate_sizes xstate_sizes[XFEATURE_MAX] __ro_after_init;
+
+void __init kvm_init_xstate_sizes(void)
+{
+	u32 ign;
+	int i;
+
+	for (i = XFEATURE_YMM; i < ARRAY_SIZE(xstate_sizes); i++) {
+		struct cpuid_xstate_sizes *xs = &xstate_sizes[i];
+
+		cpuid_count(0xD, i, &xs->eax, &xs->ebx, &xs->ecx, &ign);
+	}
+}
+
 u32 xstate_required_size(u64 xstate_bv, bool compacted)
 {
 	int feature_bit = 0;
@@ -44,14 +64,15 @@ u32 xstate_required_size(u64 xstate_bv, bool compacted)
 	xstate_bv &= XFEATURE_MASK_EXTEND;
 	while (xstate_bv) {
 		if (xstate_bv & 0x1) {
-			u32 eax, ebx, ecx, edx, offset;
-			cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
+			struct cpuid_xstate_sizes *xs = &xstate_sizes[feature_bit];
+			u32 offset;
+
 			/* ECX[1]: 64B alignment in compacted form */
 			if (compacted)
-				offset = (ecx & 0x2) ? ALIGN(ret, 64) : ret;
+				offset = (xs->ecx & 0x2) ? ALIGN(ret, 64) : ret;
 			else
-				offset = ebx;
-			ret = max(ret, offset + eax);
+				offset = xs->ebx;
+			ret = max(ret, offset + xs->eax);
 		}
 
 		xstate_bv >>= 1;
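
To make the size computation in this hunk easier to follow outside the kernel, here is a small userspace model of the cached lookup. It is only a sketch under stated assumptions: all `model_*` names are hypothetical, the table entries are hard-coded illustrative values rather than values read from CPUID, and the starting size of 576 bytes (512-byte legacy area plus 64-byte XSAVE header) is an assumption, since the initializer of `ret` is not visible in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_XFEATURE_MAX	19	/* hypothetical table size */
#define MODEL_LEGACY_AND_HDR	576u	/* assumed: 512B legacy + 64B header */
#define MODEL_ALIGN64(x)	(((x) + 63u) & ~63u)

/* Mirrors the cached CPUID.0xD.<n> outputs: size, offset, flags. */
struct model_xstate_sizes {
	uint32_t eax;	/* size of the component in bytes */
	uint32_t ebx;	/* offset in the standard (non-compacted) format */
	uint32_t ecx;	/* bit 1: 64B alignment in compacted format */
};

static struct model_xstate_sizes model_sizes[MODEL_XFEATURE_MAX];

/* Stand-in for the one-time snapshot; values are illustrative only. */
void model_init_xstate_sizes(void)
{
	model_sizes[2]  = (struct model_xstate_sizes){ 256,  576, 0x0 };
	model_sizes[9]  = (struct model_xstate_sizes){   8, 2688, 0x0 };
	model_sizes[17] = (struct model_xstate_sizes){  64, 2752, 0x2 };
}

/* Same loop shape as the patched function, reading only the table. */
uint32_t model_required_size(uint64_t xstate_bv, bool compacted)
{
	uint32_t ret = MODEL_LEGACY_AND_HDR;
	int feature_bit = 0;

	xstate_bv &= ~0x3ull;	/* bits 0-1 live in the legacy area */
	while (xstate_bv) {
		if (xstate_bv & 0x1) {
			const struct model_xstate_sizes *xs;
			uint32_t offset, end;

			assert(feature_bit < MODEL_XFEATURE_MAX);
			xs = &model_sizes[feature_bit];

			if (compacted)
				offset = (xs->ecx & 0x2) ? MODEL_ALIGN64(ret) : ret;
			else
				offset = xs->ebx;

			end = offset + xs->eax;
			if (end > ret)
				ret = end;
		}
		xstate_bv >>= 1;
		feature_bit++;
	}
	return ret;
}
```

With the illustrative table above and a mask of bits 2, 9, and 17, the standard format yields 2816 bytes (the end of the last component), while the compacted format yields 960 bytes: components are packed back to back, and only the entry with ECX bit 1 set is rounded up to a 64-byte boundary.
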

arch/x86/kvm/cpuid.h

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
 bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 	       u32 *ecx, u32 *edx, bool exact_only);
 
+void __init kvm_init_xstate_sizes(void);
 u32 xstate_required_size(u64 xstate_bv, bool compacted);
 
 int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);

arch/x86/kvm/x86.c

Lines changed: 2 additions & 0 deletions
@@ -13997,6 +13997,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
 
 static int __init kvm_x86_init(void)
 {
+	kvm_init_xstate_sizes();
+
 	kvm_mmu_x86_module_init();
 	mitigate_smt_rsb &= boot_cpu_has_bug(X86_BUG_SMT_RSB) && cpu_smt_possible();
 	return 0;
