
Commit 77ac707

yamahata authored and bonzini committed
KVM: x86/tdp_mmu: Propagate building mirror page tables
Integrate hooks for mirroring page table operations for cases where TDX will set PTEs or link page tables.

Like other CoCo technologies, TDX has the concept of private and shared memory. For TDX the private and shared mappings are managed on separate EPT roots. The private half is managed indirectly through calls into a protected runtime environment called the TDX module, while the shared half is managed within KVM in normal page tables.

Since calls into the TDX module are relatively slow, walking private page tables by making calls into the TDX module would not be efficient. Because of this, previous changes have taught the TDP MMU to keep a mirror root, which is a separate, unmapped TDP root that private operations can be directed to. Currently this root is disconnected from any actual guest mapping. Now add plumbing to propagate changes to the "external" page tables being mirrored. Just create the x86_ops for now; leave plumbing the operations into the TDX module for future patches.

Add two operations for setting up external page tables: one for linking new page tables and one for setting leaf PTEs. Don't add an op for configuring the root PFN, as TDX handles this itself. Also don't provide a way to set permissions on the PTEs, as TDX doesn't support it. This results in MMU "mirroring" support that is very targeted towards TDX. Since it is likely there will be no other user, the main benefit of making the support generic is to keep TDX specific *looking* code outside of the MMU. As a generic feature it will make enough sense from TDX's perspective, and for developers unfamiliar with the TDX architecture it expresses the general concepts so that they can continue to work in the code.

TDX MMU support will exclude certain MMU operations, so only plug in the mirroring x86 ops where they will be needed. For setting/linking, only hook tdp_mmu_set_spte_atomic(), which is used for mapping and linking PTs. Don't bother hooking tdp_mmu_iter_set_spte(), as it is only used for setting PTEs in operations unsupported by TDX: splitting huge pages and write protecting. Sprinkle KVM_BUG_ON()s to document in code that these paths are not supported for mirrored page tables. For zapping operations, leave those for near-future changes.

Many operations in the TDP MMU depend on atomicity of the PTE update. While the mirror PTE on KVM's side can be updated atomically, the update that happens inside the external operations (S-EPT updates via TDX module call) can't happen atomically with the mirror update. The following race could result during two vCPUs populating private memory:
* vcpu 1: atomically update 2M level mirror EPT entry to be present
* vcpu 2: read 2M level EPT entry that is present
* vcpu 2: walk down into 4K level EPT
* vcpu 2: atomically update 4K level mirror EPT entry to be present
* vcpu 2: set_external_spte() to update 4K secure EPT entry => error because 2M secure EPT entry is not populated yet
* vcpu 1: link_external_spt() to update 2M secure EPT entry

Prevent this by setting the mirror PTE to FROZEN_SPTE while the external operations are performed. Only write the actual mirror PTE value once the external operations have completed. When trying to set a PTE to present and encountering a frozen SPTE, retry the fault.

By doing this the race is prevented as follows:
* vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
* vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
* vcpu 2: find that the EPT entry is frozen; abandon the page table walk to resume guest execution
* vcpu 1: link_external_spt() to update 2M secure EPT entry
* vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
* vcpu 2: resume guest execution

Depending on vcpu 1's progress, vcpu 2 may hit the EPT violation again or make progress in guest execution.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-15-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
1 parent de1bf90 commit 77ac707

3 files changed: +94 −13 lines changed
arch/x86/include/asm/kvm-x86-ops.h

Lines changed: 2 additions & 0 deletions

@@ -94,6 +94,8 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_external_spt)
+KVM_X86_OP_OPTIONAL(set_external_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)

arch/x86/include/asm/kvm_host.h

Lines changed: 7 additions & 0 deletions

@@ -1759,6 +1759,13 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	/* Update external mapping with page table link. */
+	int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 void *external_spt);
+	/* Update the external page table from spte getting set. */
+	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 kvm_pfn_t pfn_for_gfn);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);

arch/x86/kvm/mmu/tdp_mmu.c

Lines changed: 85 additions & 13 deletions

@@ -440,6 +440,59 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
+static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
+{
+	if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+		struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);
+
+		WARN_ON_ONCE(sp->role.level + 1 != level);
+		WARN_ON_ONCE(sp->gfn != gfn);
+		return sp->external_spt;
+	}
+
+	return NULL;
+}
+
+static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+						  gfn_t gfn, u64 old_spte,
+						  u64 new_spte, int level)
+{
+	bool was_present = is_shadow_present_pte(old_spte);
+	bool is_present = is_shadow_present_pte(new_spte);
+	bool is_leaf = is_present && is_last_spte(new_spte, level);
+	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	int ret = 0;
+
+	KVM_BUG_ON(was_present, kvm);
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	/*
+	 * We need to lock out other updates to the SPTE until the external
+	 * page table has been modified. Use FROZEN_SPTE similar to
+	 * the zapping case.
+	 */
+	if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
+		return -EBUSY;
+
+	/*
+	 * Use different call to either set up middle level
+	 * external page table, or leaf.
+	 */
+	if (is_leaf) {
+		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+	} else {
+		void *external_spt = get_external_spt(gfn, new_spte, level);
+
+		KVM_BUG_ON(!external_spt, kvm);
+		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, external_spt);
+	}
+	if (ret)
+		__kvm_tdp_mmu_write_spte(sptep, old_spte);
+	else
+		__kvm_tdp_mmu_write_spte(sptep, new_spte);
+	return ret;
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -540,11 +593,10 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
 }
 
-static inline int __must_check __tdp_mmu_set_spte_atomic(struct tdp_iter *iter,
+static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
+							 struct tdp_iter *iter,
 							 u64 new_spte)
 {
-	u64 *sptep = rcu_dereference(iter->sptep);
-
 	/*
 	 * The caller is responsible for ensuring the old SPTE is not a FROZEN
 	 * SPTE.  KVM should never attempt to zap or manipulate a FROZEN SPTE,
@@ -553,15 +605,27 @@ static inline int __must_check __tdp_mmu_set_spte_atomic(struct tdp_iter *iter,
 	 */
 	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
 
-	/*
-	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
-	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
-	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
-	 * the current value, so the caller operates on fresh data, e.g. if it
-	 * retries tdp_mmu_set_spte_atomic()
-	 */
-	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
-		return -EBUSY;
+	if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
+		int ret;
+
+		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
+						iter->old_spte, new_spte, iter->level);
+		if (ret)
+			return ret;
+	} else {
+		u64 *sptep = rcu_dereference(iter->sptep);
+
+		/*
+		 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+		 * and does not hold the mmu_lock.  On failure, i.e. if a
+		 * different logical CPU modified the SPTE, try_cmpxchg64()
+		 * updates iter->old_spte with the current value, so the caller
+		 * operates on fresh data, e.g. if it retries
+		 * tdp_mmu_set_spte_atomic()
+		 */
+		if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+			return -EBUSY;
+	}
 
 	return 0;
 }
@@ -591,7 +655,7 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
-	ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
+	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
 	if (ret)
 		return ret;
 
@@ -631,6 +695,14 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
 	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+
+	/*
+	 * Users that do non-atomic setting of PTEs don't operate on mirror
+	 * roots, so don't handle it and bug the VM if it's seen.
+	 */
+	if (is_mirror_sptep(sptep))
+		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
+
 	return old_spte;
 }
