Commit 6c370dc

Merge branch 'kvm-guestmemfd' into HEAD
Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgy or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which, similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences between guest_memfd files and memfd files (and every other memory subsystem) are that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states.

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace.

The immediate and driving use case for guest_memfd is Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must prevent host userspace from accessing guest memory irrespective of hardware behavior.
Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allowed guest protections (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mapping sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance.

A guest-first memory subsystem also provides a clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at large.

The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge.
Pending post-merge work includes:
- hugepage support
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here
- SNP and TDX support

There are two non-KVM patches buried in the middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).
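As a rough illustration of the KVM_SET_MEMORY_ATTRIBUTES request mentioned in the merge message, the sketch below fills in a request that marks a guest physical range private or shared. It is not the kernel's code: it assumes 4 KiB pages and uses a local mirror of the struct kvm_memory_attributes layout from the api.rst change in this commit so that it builds without recent kernel headers. The names gmem_attr_req, gmem_build_attr_req, GMEM_PAGE_SIZE and GMEM_ATTRIBUTE_PRIVATE are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; real code should query the host page size. */
#define GMEM_PAGE_SIZE 4096ULL

/* Bit value as given for KVM_MEMORY_ATTRIBUTE_PRIVATE in this commit's api.rst. */
#define GMEM_ATTRIBUTE_PRIVATE (1ULL << 3)

/* Hypothetical local mirror of the documented struct kvm_memory_attributes. */
struct gmem_attr_req {
	uint64_t address;
	uint64_t size;
	uint64_t attributes;
	uint64_t flags;
};

/*
 * Fill *req so that [gpa, gpa + size) is marked private or shared.
 * Returns 0 on success, -EINVAL if the range is empty, misaligned, or wraps.
 */
static int gmem_build_attr_req(struct gmem_attr_req *req, uint64_t gpa,
			       uint64_t size, bool make_private)
{
	assert(req != NULL);

	/* The documentation requires page-aligned address and size. */
	if (size == 0 || (gpa % GMEM_PAGE_SIZE) || (size % GMEM_PAGE_SIZE))
		return -EINVAL;
	if (gpa + size < gpa)	/* reject address wrap-around */
		return -EINVAL;

	req->address = gpa;
	req->size = size;
	req->attributes = make_private ? GMEM_ATTRIBUTE_PRIVATE : 0;
	req->flags = 0;		/* reserved for future extensions, must be 0 */
	return 0;
}
```

In real userspace the same values would go into the kernel's own struct kvm_memory_attributes and be passed to the KVM_SET_MEMORY_ATTRIBUTES vm ioctl. Because there is no "get" API, the caller would also record the new state of the range itself.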
2 parents b85ea95 + 5d74316 commit 6c370dc

53 files changed: +3065 -307 lines changed

Documentation/virt/kvm/api.rst

Lines changed: 201 additions & 0 deletions
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.

+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).

+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -6192,6 +6211,130 @@ to know what fields can be changed for the system register described by
 ``op0, op1, crn, crm, op2``. KVM rejects ID register values that describe a
 superset of the features supported by the system.

+4.140 KVM_SET_USER_MEMORY_REGION2
+---------------------------------
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest. All fields shared with
+KVM_SET_USER_MEMORY_REGION behave identically. Userspace can set
+KVM_MEM_GUEST_MEMFD in flags to have KVM bind the memory region to a given
+guest_memfd range of [guest_memfd_offset, guest_memfd_offset + memory_size).
+The target guest_memfd must point at a file created via KVM_CREATE_GUEST_MEMFD
+on the current VM, and the target range must not be bound to any other memory
+region. All standard bounds checks apply (use common sense).
+
+::
+
+  struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size; /* bytes */
+	__u64 userspace_addr; /* start of the userspace allocated memory */
+	__u64 guest_memfd_offset;
+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
+  };
+
+A KVM_MEM_GUEST_MEMFD region _must_ have a valid guest_memfd (private memory) and
+userspace_addr (shared memory). However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address. The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e. consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state. At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns. Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
+
+4.141 KVM_SET_MEMORY_ATTRIBUTES
+-------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes (in)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
+of guest physical memory.
+
+::
+
+  struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+  };
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)
+
+The address and size must be page aligned. The supported attributes can be
+retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES. If
+executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
+supported by that VM. If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
+returns all attributes supported by KVM. The only attribute defined at this
+time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
+guest private memory.
+
+Note, there is no "get" API. Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
+4.142 KVM_CREATE_GUEST_MEMFD
+----------------------------
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_create_guest_memfd (in)
+:Returns: a file descriptor on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it. guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped. Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+  struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+  };
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm". The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory. This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost
+migration of a virtual machine.
+
+KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
+and more specifically via the guest_memfd and guest_memfd_offset fields in
+"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
+into the guest_memfd instance. For a given guest_memfd file, there can be at
+most one mapping per page, i.e. binding multiple memory regions to a single
+guest_memfd range is not allowed (any number of memory regions can be bound to
+a single guest_memfd file, but the bound ranges must not overlap).
+
+See KVM_SET_USER_MEMORY_REGION2 for additional details.
+
 5. The kvm_run structure
 ========================

@@ -6824,6 +6967,30 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.

+::
+
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault. The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access. When clear, indicates the fault occurred on a
+   shared access.
+
+Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'! errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::

		/* KVM_EXIT_NOTIFY */
@@ -7858,6 +8025,27 @@ This capability is aimed to mitigate the threat that malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.

+7.34 KVM_CAP_MEMORY_FAULT_INFO
+------------------------------
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT for more information.
+
 8. Other capabilities.
 ======================

@@ -8596,6 +8784,19 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.

+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types. The 1-setting of bit @n
+means the VM type with value @n is supported. Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM	0
+  #define KVM_X86_SW_PROTECTED_VM	1
+
 9. Known KVM API problems
 =========================

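The KVM_EXIT_MEMORY_FAULT rules documented in the api.rst change above can be sketched as a few userspace helpers. This is a minimal sketch under stated assumptions, not code from the series. It uses a local mirror of the documented kvm_run.memory_fault fields and takes the numeric KVM_EXIT_MEMORY_FAULT value as a parameter, since that value is not shown in this excerpt. The names fault_info, fault_info_valid, fault_is_private and fault_repeats are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: fall back to the generic Linux value if libc lacks EHWPOISON. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Bit value as given for KVM_MEMORY_EXIT_FLAG_PRIVATE in this commit's api.rst. */
#define FAULT_FLAG_PRIVATE (1ULL << 3)

/* Hypothetical local mirror of the documented kvm_run.memory_fault fields. */
struct fault_info {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/*
 * memory_fault is valid only if KVM_RUN failed with EFAULT or EHWPOISON
 * *and* exit_reason is KVM_EXIT_MEMORY_FAULT (passed in by the caller).
 */
static bool fault_info_valid(int run_ret, int run_errno, uint32_t exit_reason,
			     uint32_t memory_fault_exit)
{
	if (run_ret >= 0)
		return false;
	if (run_errno != EFAULT && run_errno != EHWPOISON)
		return false;
	return exit_reason == memory_fault_exit;
}

/* True if the faulting access was to private memory. */
static bool fault_is_private(const struct fault_info *f)
{
	assert(f != NULL);
	return (f->flags & FAULT_FLAG_PRIVATE) != 0;
}

/* Guard against retrying KVM_RUN forever on the same annotated fault. */
static bool fault_repeats(const struct fault_info *prev,
			  const struct fault_info *cur)
{
	assert(prev != NULL && cur != NULL);
	return prev->gpa == cur->gpa && prev->size == cur->size &&
	       prev->flags == cur->flags;
}
```

A VMM run loop might call fault_info_valid() right after KVM_RUN returns, attempt to resolve the fault only when it returns true, and give up when fault_repeats() reports that the same fault came back after a retry.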
arch/arm64/include/asm/kvm_host.h

Lines changed: 0 additions & 2 deletions
@@ -954,8 +954,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
			      struct kvm_vcpu_events *events);

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);

arch/arm64/kvm/Kconfig

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ menuconfig KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select KVM_GENERIC_HARDWARE_ENABLING
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
 	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO

arch/loongarch/include/asm/kvm_host.h

Lines changed: 0 additions & 1 deletion
@@ -183,7 +183,6 @@ void kvm_flush_tlb_all(void);
 void kvm_flush_tlb_gpa(struct kvm_vcpu *vcpu, unsigned long gpa);
 int kvm_handle_mm_fault(struct kvm_vcpu *vcpu, unsigned long badv, bool write);

-#define KVM_ARCH_WANT_MMU_NOTIFIER
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end, bool blockable);
 int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);

arch/loongarch/kvm/Kconfig

Lines changed: 1 addition & 1 deletion
@@ -26,9 +26,9 @@ config KVM
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_GENERIC_DIRTYLOG_READ_PROTECT
 	select KVM_GENERIC_HARDWARE_ENABLING
+	select KVM_GENERIC_MMU_NOTIFIER
 	select KVM_MMIO
 	select KVM_XFER_TO_GUEST_WORK
-	select MMU_NOTIFIER
 	select PREEMPT_NOTIFIERS
 	help
 	  Support hosting virtualized guest machines using

arch/mips/include/asm/kvm_host.h

Lines changed: 0 additions & 2 deletions
@@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t start_gfn, gfn_t end_gfn);
 pgd_t *kvm_pgd_alloc(void);
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 /* Emulation */
 enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
 int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);

arch/mips/kvm/Kconfig

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ config KVM
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_VCPU_ASYNC_IOCTL
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select INTERVAL_TREE
 	select KVM_GENERIC_HARDWARE_ENABLING
 	help

arch/powerpc/include/asm/kvm_host.h

Lines changed: 0 additions & 2 deletions
@@ -63,8 +63,6 @@

 #include <linux/mmu_notifier.h>

-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define HPTEG_CACHE_NUM (1 << 15)
 #define HPTEG_HASH_BITS_PTE 13
 #define HPTEG_HASH_BITS_PTE_LONG 12

arch/powerpc/kvm/Kconfig

Lines changed: 4 additions & 4 deletions
@@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
 config KVM_BOOK3S_PR_POSSIBLE
 	bool
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER

 config KVM_BOOK3S_HV_POSSIBLE
 	bool
@@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
 	tristate "KVM for POWER7 and later using hypervisor mode in host"
 	depends on KVM_BOOK3S_64 && PPC_POWERNV
 	select KVM_BOOK3S_HV_POSSIBLE
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	select CMA
 	help
 	  Support running unmodified book3s_64 guest kernels in
@@ -194,7 +194,7 @@ config KVM_E500V2
 	depends on !CONTEXT_TRACKING_USER
 	select KVM
 	select KVM_MMIO
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500 guest kernels in virtual machines on
 	  E500v2 host processors.
@@ -211,7 +211,7 @@ config KVM_E500MC
 	select KVM
 	select KVM_MMIO
 	select KVM_BOOKE_HV
-	select MMU_NOTIFIER
+	select KVM_GENERIC_MMU_NOTIFIER
 	help
 	  Support running unmodified E500MC/E5500/E6500 guest kernels in
 	  virtual machines on E500MC/E5500/E6500 host processors.

arch/powerpc/kvm/book3s_hv.c

Lines changed: 1 addition & 1 deletion
@@ -6210,7 +6210,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
 	}

 	srcu_idx = srcu_read_lock(&kvm->srcu);
-	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
 		struct kvm_memory_slot *memslot;
 		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
 		int bkt;
