
Commit edb0e8f

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:

 "ARM:

  - Nested virtualization support for VGICv3, giving the nested
    hypervisor control of the VGIC hardware when running an L2 VM

  - Removal of 'late' nested virtualization feature register masking,
    making the supported feature set directly visible to userspace

  - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage
    of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers

  - Paravirtual interface for discovering the set of CPU
    implementations where a VM may run, addressing a longstanding issue
    of guest CPU errata awareness in big-little systems and
    cross-implementation VM migration

  - Userspace control of the registers responsible for identifying a
    particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1),
    allowing VMs to be migrated cross-implementation

  - pKVM updates, including support for tracking stage-2 page table
    allocations in the protected hypervisor in the 'SecPageTable' stat

  - Fixes to vPMU, ensuring that userspace updates to the vPMU after
    KVM_RUN are reflected into the backing perf events

  LoongArch:

  - Remove unnecessary header include path

  - Assume constant PGD during VM context switch

  - Add perf events support for guest VM

  RISC-V:

  - Disable the kernel perf counter during configure

  - KVM selftests improvements for PMU

  - Fix warning at the time of KVM module removal

  x86:

  - Add support for aging of SPTEs without holding mmu_lock. Not taking
    mmu_lock allows multiple aging actions to run in parallel, and more
    importantly avoids stalling vCPUs. This includes an implementation
    of per-rmap-entry locking; aging the gfn is done with only a
    per-rmap single-bit spinlock taken, whereas locking an rmap for
    write requires taking both the per-rmap spinlock and the mmu_lock.

    Note that this decreases slightly the accuracy of accessed-page
    information, because changes to the SPTE outside aging might not
    use atomic operations even if they could race against a clear of
    the Accessed bit. This is deliberate because KVM and mm/ tolerate
    false positives/negatives for accessed information, and testing has
    shown that reducing the latency of aging is far more beneficial to
    overall system performance than providing "perfect" young/old
    information.

  - Defer runtime CPUID updates until KVM emulates a CPUID instruction,
    to coalesce updates when multiple pieces of vCPU state are
    changing, e.g. as part of a nested transition

  - Fix a variety of nested emulation bugs, and add VMX support for
    synthesizing nested VM-Exit on interception (instead of injecting
    #UD into L2)

  - Drop "support" for async page faults for protected guests that do
    not set SEND_ALWAYS (i.e. that only want async page faults at CPL3)

  - Bring a bit of sanity to x86's VM teardown code, which has
    accumulated a lot of cruft over the years. Particularly, destroy
    vCPUs before the MMU, despite the latter being a VM-wide operation

  - Add common secure TSC infrastructure for use within SNP and in the
    future TDX

  - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not
    make sense to use the capability if the relevant registers are not
    available for reading or writing

  - Don't take kvm->lock when iterating over vCPUs in the suspend
    notifier to fix a largely theoretical deadlock

  - Use the vCPU's actual Xen PV clock information when starting the
    Xen timer, as the cached state in arch.hv_clock can be stale/bogus

  - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across
    different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as
    KVM's suspend notifier only accounts for kvmclock, and there's no
    evidence that the flag is actually supported by Xen guests

  - Clean up the per-vCPU "cache" of its reference pvclock, and instead
    only track the vCPU's TSC scaling (multiplier+shift) metadata
    (which is moderately expensive to compute, and rarely changes for
    modern setups)

  - Don't write to the Xen hypercall page on MSR writes that are
    initiated by the host (userspace or KVM) to fix a class of bugs
    where KVM can write to guest memory at unexpected times, e.g.
    during vCPU creation if userspace has set the Xen hypercall MSR
    index to collide with an MSR that KVM emulates

  - Restrict the Xen hypercall MSR index to the unofficial synthetic
    range to reduce the set of possible collisions with MSRs that are
    emulated by KVM (collisions can still happen as KVM emulates
    Hyper-V MSRs, which also reside in the synthetic range)

  - Clean up and optimize KVM's handling of Xen MSR writes and
    xen_hvm_config

  - Update Xen TSC leaves during CPUID emulation instead of modifying
    the CPUID entries when updating PV clocks; there is no guarantee PV
    clocks will be updated between TSC frequency changes and CPUID
    emulation, and guest reads of the TSC leaves should be rare, i.e.
    are not a hot path

  x86 (Intel):

  - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and
    thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1

  - Pass XFD_ERR as the payload when injecting #NM, as a preparatory
    step for upcoming FRED virtualization support

  - Decouple the EPT entry RWX protection bit macros from the EPT
    Violation bits, both as a general cleanup and in anticipation of
    adding support for emulating Mode-Based Execution Control (MBEC)

  - Reject KVM_RUN if userspace manages to gain control and stuff
    invalid guest state while KVM is in the middle of emulating nested
    VM-Enter

  - Add a macro to handle KVM's sanity checks on entry/exit VMCS
    control pairs in anticipation of adding sanity checks for secondary
    exit controls (the primary field is out of bits)

  x86 (AMD):

  - Ensure the PSP driver is initialized when both the PSP and KVM
    modules are built-in (the initcall framework doesn't handle
    dependencies)

  - Use long-term pins when registering encrypted memory regions, so
    that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and
    don't lead to excessive fragmentation

  - Add macros and helpers for setting GHCB return/error codes

  - Add support for Idle HLT interception, which elides interception if
    the vCPU has a pending, unmasked virtual IRQ when HLT is executed

  - Fix a bug in INVPCID emulation where KVM fails to check for a
    non-canonical address

  - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is
    invalid, e.g. because the vCPU was "destroyed" via SNP's AP
    Creation hypercall

  - Reject SNP AP Creation if the requested SEV features for the vCPU
    don't match the VM's configured set of features

  Selftests:

  - Fix again the Intel PMU counters test; add a data load and do
    CLFLUSH{OPT} on the data instead of executing code. The theory is
    that modern Intel CPUs have learned new code prefetching tricks
    that bypass the PMU counters

  - Fix a flaw in the Intel PMU counters test where it asserts that an
    event is counting correctly without actually knowing what the event
    counts on the underlying hardware

  - Fix a variety of flaws, bugs, and false failures/passes in
    dirty_log_test, and improve its coverage by collecting all dirty
    entries on each iteration

  - Fix a few minor bugs related to handling of stats FDs

  - Add infrastructure to make vCPU and VM stats FDs available to tests
    by default (open the FDs during VM/vCPU creation)

  - Relax an assertion on the number of HLT exits in the xAPIC IPI test
    when running on a CPU that supports AMD's Idle HLT (which elides
    interception of HLT if a virtual IRQ is pending and unmasked)"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits)
  RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed
  RISC-V: KVM: Teardown riscv specific bits after kvm_exit
  LoongArch: KVM: Register perf callbacks for guest
  LoongArch: KVM: Implement arch-specific functions for guest perf
  LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel()
  LoongArch: KVM: Remove PGD saving during VM context switch
  LoongArch: KVM: Remove unnecessary header include path
  KVM: arm64: Tear down vGIC on failed vCPU creation
  KVM: arm64: PMU: Reload when resetting
  KVM: arm64: PMU: Reload when user modifies registers
  KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs
  KVM: arm64: PMU: Assume PMU presence in pmu-emul.c
  KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR}
  KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu
  KVM: arm64: Factor out pKVM hyp vcpu creation to separate function
  KVM: arm64: Initialize HCRX_EL2 traps in pKVM
  KVM: arm64: Factor out setting HCRX_EL2 traps into separate function
  KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected
  KVM: x86: Add infrastructure for secure TSC
  KVM: x86: Push down setting vcpu.arch.user_set_tsc
  ...
2 parents 27bd3ce + 782f9fe commit edb0e8f

142 files changed: +4125 additions, -1993 deletions


Documentation/virt/kvm/api.rst

Lines changed: 22 additions & 0 deletions

@@ -1000,6 +1000,10 @@ blobs in userspace. When the guest writes the MSR, kvm copies one
 page of a blob (32- or 64-bit, depending on the vcpu mode) to guest
 memory.
 
+The MSR index must be in the range [0x40000000, 0x4fffffff], i.e. must reside
+in the range that is unofficially reserved for use by hypervisors. The min/max
+values are enumerated via KVM_XEN_MSR_MIN_INDEX and KVM_XEN_MSR_MAX_INDEX.
+
 ::
 
     struct kvm_xen_hvm_config {
@@ -8258,6 +8262,24 @@ KVM exits with the register state of either the L1 or L2 guest
 depending on which executed at the time of an exit. Userspace must
 take care to differentiate between these cases.
 
+7.37 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS
+-------------------------------------
+
+:Architectures: arm64
+:Target: VM
+:Parameters: None
+:Returns: 0 on success, -EINVAL if vCPUs have been created before enabling this
+          capability.
+
+This capability changes the behavior of the registers that identify a PE
+implementation of the Arm architecture: MIDR_EL1, REVIDR_EL1, and AIDR_EL1.
+By default, these registers are visible to userspace but treated as invariant.
+
+When this capability is enabled, KVM allows userspace to change the
+aforementioned registers before the first KVM_RUN. These registers are VM
+scoped, meaning that the same set of values are presented on all vCPUs in a
+given VM.
+
 8. Other capabilities.
 ======================

Documentation/virt/kvm/arm/fw-pseudo-registers.rst

Lines changed: 14 additions & 1 deletion

@@ -116,7 +116,7 @@ The pseudo-firmware bitmap register are as follows:
     ARM DEN0057A.
 
 * KVM_REG_ARM_VENDOR_HYP_BMAP:
-    Controls the bitmap of the Vendor specific Hypervisor Service Calls.
+    Controls the bitmap of the Vendor specific Hypervisor Service Calls[0-63].
 
   The following bits are accepted:
 
@@ -127,6 +127,19 @@ The pseudo-firmware bitmap register are as follows:
   Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP:
     The bit represents the Precision Time Protocol KVM service.
 
+* KVM_REG_ARM_VENDOR_HYP_BMAP_2:
+    Controls the bitmap of the Vendor specific Hypervisor Service Calls[64-127].
+
+  The following bits are accepted:
+
+  Bit-0: KVM_REG_ARM_VENDOR_HYP_BIT_DISCOVER_IMPL_VER
+    This represents the ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_VER_FUNC_ID
+    function-id. This is reset to 0.
+
+  Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_DISCOVER_IMPL_CPUS
+    This represents the ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_CPUS_FUNC_ID
+    function-id. This is reset to 0.
+
 Errors:
 
 ======= =============================================================

Documentation/virt/kvm/arm/hypercalls.rst

Lines changed: 59 additions & 0 deletions

@@ -142,3 +142,62 @@ region is equal to the memory protection granule advertised by
 |                     |          |    +---------------------------------------------+
 |                     |          |    | ``INVALID_PARAMETER (-3)``                  |
 +---------------------+----------+----+---------------------------------------------+
+
+``ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_VER_FUNC_ID``
+------------------------------------------------------
+Request the target CPU implementation version information and the number of target
+implementations for the Guest VM.
+
++---------------------+-------------------------------------------------------------+
+| Presence:           | Optional; KVM/ARM64 Guests only                             |
++---------------------+-------------------------------------------------------------+
+| Calling convention: | HVC64                                                       |
++---------------------+----------+--------------------------------------------------+
+| Function ID:        | (uint32) | 0xC6000040                                       |
++---------------------+----------+--------------------------------------------------+
+| Arguments:          | None                                                        |
++---------------------+----------+----+---------------------------------------------+
+| Return Values:      | (int64)  | R0 | ``SUCCESS (0)``                             |
+|                     |          |    +---------------------------------------------+
+|                     |          |    | ``NOT_SUPPORTED (-1)``                      |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R1 | Bits [63:32] Reserved/Must be zero          |
+|                     |          |    +---------------------------------------------+
+|                     |          |    | Bits [31:16] Major version                  |
+|                     |          |    +---------------------------------------------+
+|                     |          |    | Bits [15:0] Minor version                   |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R2 | Number of target implementations            |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R3 | Reserved / Must be zero                     |
++---------------------+----------+----+---------------------------------------------+
+
+``ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_CPUS_FUNC_ID``
+-------------------------------------------------------
+
+Request the target CPU implementation information for the Guest VM. The Guest kernel
+will use this information to enable the associated errata.
+
++---------------------+-------------------------------------------------------------+
+| Presence:           | Optional; KVM/ARM64 Guests only                             |
++---------------------+-------------------------------------------------------------+
+| Calling convention: | HVC64                                                       |
++---------------------+----------+--------------------------------------------------+
+| Function ID:        | (uint32) | 0xC6000041                                       |
++---------------------+----------+----+---------------------------------------------+
+| Arguments:          | (uint64) | R1 | selected implementation index               |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R2 | Reserved / Must be zero                     |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R3 | Reserved / Must be zero                     |
++---------------------+----------+----+---------------------------------------------+
+| Return Values:      | (int64)  | R0 | ``SUCCESS (0)``                             |
+|                     |          |    +---------------------------------------------+
+|                     |          |    | ``INVALID_PARAMETER (-3)``                  |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R1 | MIDR_EL1 of the selected implementation     |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R2 | REVIDR_EL1 of the selected implementation   |
+|                     +----------+----+---------------------------------------------+
+|                     | (uint64) | R3 | AIDR_EL1 of the selected implementation     |
++---------------------+----------+----+---------------------------------------------+

Documentation/virt/kvm/devices/arm-vgic-its.rst

Lines changed: 4 additions & 1 deletion

@@ -126,7 +126,8 @@ KVM_DEV_ARM_VGIC_GRP_ITS_REGS
 ITS Restore Sequence:
 ---------------------
 
-The following ordering must be followed when restoring the GIC and the ITS:
+The following ordering must be followed when restoring the GIC, ITS, and
+KVM_IRQFD assignments:
 
 a) restore all guest memory and create vcpus
 b) restore all redistributors
@@ -139,6 +140,8 @@ d) restore the ITS in the following order:
    3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES)
    4. Restore GITS_CTLR
 
+e) restore KVM_IRQFD assignments for MSIs
+
 Then vcpus can be started.
 
 ITS Table ABI REV0:

Documentation/virt/kvm/devices/arm-vgic-v3.rst

Lines changed: 11 additions & 1 deletion

@@ -291,8 +291,18 @@ Groups:
     | Aff3 | Aff2 | Aff1 | Aff0 |
 
   Errors:
-
     ======= =============================================
     -EINVAL vINTID is not multiple of 32 or info field is
             not VGIC_LEVEL_INFO_LINE_LEVEL
     ======= =============================================
+
+  KVM_DEV_ARM_VGIC_GRP_MAINT_IRQ
+   Attributes:
+
+    The attr field of kvm_device_attr encodes the following values:
+
+    bits:     | 31 .... 5 | 4 .... 0 |
+    values:   |   RES0    |  vINTID  |
+
+    The vINTID specifies which interrupt is generated when the vGIC
+    must generate a maintenance interrupt. This must be a PPI.

Documentation/virt/kvm/locking.rst

Lines changed: 2 additions & 2 deletions

@@ -196,7 +196,7 @@ writable between reading spte and updating spte. Like below case:
 The Dirty bit is lost in this case.
 
 In order to avoid this kind of issue, we always treat the spte as "volatile"
-if it can be updated out of mmu-lock [see spte_has_volatile_bits()]; it means
+if it can be updated out of mmu-lock [see spte_needs_atomic_update()]; it means
 the spte is always atomically updated in this case.
 
 3) flush tlbs due to spte updated
@@ -212,7 +212,7 @@ function to update spte (present -> present).
 
 Since the spte is "volatile" if it can be updated out of mmu-lock, we always
 atomically update the spte and the race caused by fast page fault can be avoided.
-See the comments in spte_has_volatile_bits() and mmu_spte_update().
+See the comments in spte_needs_atomic_update() and mmu_spte_update().
 
 Lockless Access Tracking:

arch/arm64/include/asm/cpucaps.h

Lines changed: 2 additions & 0 deletions

@@ -71,6 +71,8 @@ cpucap_is_possible(const unsigned int cap)
 	 * KVM MPAM support doesn't rely on the host kernel supporting MPAM.
 	 */
 		return true;
+	case ARM64_HAS_PMUV3:
+		return IS_ENABLED(CONFIG_HW_PERF_EVENTS);
 	}
 
 	return true;

arch/arm64/include/asm/cpufeature.h

Lines changed: 5 additions & 23 deletions

@@ -525,29 +525,6 @@ cpuid_feature_extract_unsigned_field(u64 features, int field)
 	return cpuid_feature_extract_unsigned_field_width(features, field, 4);
 }
 
-/*
- * Fields that identify the version of the Performance Monitors Extension do
- * not follow the standard ID scheme. See ARM DDI 0487E.a page D13-2825,
- * "Alternative ID scheme used for the Performance Monitors Extension version".
- */
-static inline u64 __attribute_const__
-cpuid_feature_cap_perfmon_field(u64 features, int field, u64 cap)
-{
-	u64 val = cpuid_feature_extract_unsigned_field(features, field);
-	u64 mask = GENMASK_ULL(field + 3, field);
-
-	/* Treat IMPLEMENTATION DEFINED functionality as unimplemented */
-	if (val == ID_AA64DFR0_EL1_PMUVer_IMP_DEF)
-		val = 0;
-
-	if (val > cap) {
-		features &= ~mask;
-		features |= (cap << field) & mask;
-	}
-
-	return features;
-}
-
 static inline u64 arm64_ftr_mask(const struct arm64_ftr_bits *ftrp)
 {
 	return (u64)GENMASK(ftrp->shift + ftrp->width - 1, ftrp->shift);
@@ -866,6 +843,11 @@ static __always_inline bool system_supports_mpam_hcr(void)
 	return alternative_has_cap_unlikely(ARM64_MPAM_HCR);
 }
 
+static inline bool system_supports_pmuv3(void)
+{
+	return cpus_have_final_cap(ARM64_HAS_PMUV3);
+}
+
 int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
 bool try_emulate_mrs(struct pt_regs *regs, u32 isn);

arch/arm64/include/asm/cputype.h

Lines changed: 17 additions & 23 deletions

@@ -245,6 +245,16 @@
 
 #define read_cpuid(reg)			read_sysreg_s(SYS_ ## reg)
 
+/*
+ * The CPU ID never changes at run time, so we might as well tell the
+ * compiler that it's constant. Use this function to read the CPU ID
+ * rather than directly reading processor_id or read_cpuid() directly.
+ */
+static inline u32 __attribute_const__ read_cpuid_id(void)
+{
+	return read_cpuid(MIDR_EL1);
+}
+
 /*
  * Represent a range of MIDR values for a given CPU model and a
  * range of variant/revision values.
@@ -280,30 +290,14 @@ static inline bool midr_is_cpu_model_range(u32 midr, u32 model, u32 rv_min,
 	return _model == model && rv >= rv_min && rv <= rv_max;
 }
 
-static inline bool is_midr_in_range(u32 midr, struct midr_range const *range)
-{
-	return midr_is_cpu_model_range(midr, range->model,
-				       range->rv_min, range->rv_max);
-}
-
-static inline bool
-is_midr_in_range_list(u32 midr, struct midr_range const *ranges)
-{
-	while (ranges->model)
-		if (is_midr_in_range(midr, ranges++))
-			return true;
-	return false;
-}
+struct target_impl_cpu {
+	u64 midr;
+	u64 revidr;
+	u64 aidr;
+};
 
-/*
- * The CPU ID never changes at run time, so we might as well tell the
- * compiler that it's constant. Use this function to read the CPU ID
- * rather than directly reading processor_id or read_cpuid() directly.
- */
-static inline u32 __attribute_const__ read_cpuid_id(void)
-{
-	return read_cpuid(MIDR_EL1);
-}
+bool cpu_errata_set_target_impl(u64 num, void *impl_cpus);
+bool is_midr_in_range_list(struct midr_range const *ranges);
 
 static inline u64 __attribute_const__ read_cpuid_mpidr(void)
 {

arch/arm64/include/asm/hypervisor.h

Lines changed: 1 addition & 0 deletions

@@ -6,6 +6,7 @@
 
 void kvm_init_hyp_services(void);
 bool kvm_arm_hyp_service_available(u32 func_id);
+void kvm_arm_target_impl_cpu_init(void);
 
 #ifdef CONFIG_ARM_PKVM_GUEST
 void pkvm_init_hyp_services(void);
