This repository has been archived by the owner on Mar 24, 2022. It is now read-only.

Instance::run panics if another instance trapped in the same thread #323

roman-kashitsyn opened this issue Oct 8, 2019 · 10 comments

@roman-kashitsyn
Contributor

It looks like thread-local state becomes infected once an instance traps (e.g. by executing unreachable): when the next instance is created in the same thread and run, the run panics with the following backtrace:

thread '<unnamed>' panicked at 'enabling or changing the signal stack succeeds: Sys(EPERM)', src/libcore/result.rs:1084:5
stack backtrace:
   0: std::sys_common::backtrace::print
   1: std::panicking::default_hook::{{closure}}
   2: std::panicking::default_hook
   3: <std::panicking::begin_panic::PanicPayload<A> as core::panic::BoxMeUp>::get
   4: std::panicking::continue_panic_fmt
   5: std::panicking::try::do_call
   6: <T as core::any::Any>::type_id
   7: core::ptr::real_drop_in_place
   8: core::result::Result<T,E>::expect
             at /private/tmp/nix-build-rustc-1.38.0.drv-0/rustc-1.38.0-src/src/libcore/result.rs:879
   9: lucet_runtime_internals::instance::signals::<impl lucet_runtime_internals::instance::Instance>::with_signals_on
             at /Users/lifted/Projects/dfinity/rs/.cargo-home/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/lucet-runtime-internals-0.1.1/src/instance/signals.rs:65
  10: lucet_runtime_internals::instance::Instance::run_func
             at /Users/lifted/Projects/dfinity/rs/.cargo-home/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/lucet-runtime-internals-0.1.1/src/instance.rs:612
  11: lucet_runtime_internals::instance::Instance::run

The panic comes from this code in with_signals_on:

impl Instance {
    pub(crate) fn with_signals_on<F, R>(&mut self, f: F) -> Result<R, Error>
    where
        F: FnOnce(&mut Instance) -> Result<R, Error>,
    {
        // Set up the signal stack for this thread. Note that because signal stacks are per-thread,
        // rather than per-process, we do this for every run, while the signal handler is installed
        // only once per process.
        let guest_sigstack = SigStack::new(
            self.alloc.slot().sigstack,
            SigStackFlags::empty(),
            libc::SIGSTKSZ,
        );
        let previous_sigstack = unsafe { sigaltstack(Some(guest_sigstack)) } // <-- PANICS
            .expect("enabling or changing the signal stack succeeds");

You can reproduce the issue using the following small crate:

src/main.rs

use lucet_runtime::{DlModule, MmapRegion, Region};
use lucet_runtime_internals::alloc::Limits;
use lucetc::Lucetc;

fn main() {
    let wasm = wabt::wat2wasm(
        r#"
            (module
              (func (export "trap") unreachable)
              (func (export "test")))
        "#,
    )
    .unwrap();

    let lucetc = Lucetc::try_from_bytes(wasm.as_slice()).unwrap();
    lucetc.shared_object_file("mod.so").unwrap();
    let module = DlModule::load("mod.so").unwrap();
    let memory: std::sync::Arc<MmapRegion> = MmapRegion::create(1, &Limits::default()).unwrap();

    // Run "trap" first, then "test": after the trap, the second run panics on macOS
    // because both instances share the same thread.
    for func_name in &["trap", "test"] {
        let mut instance_handle = memory.new_instance_builder(module.clone()).build().unwrap();
        println!("{:?}", instance_handle.run(func_name, &[]));
    }
}

Cargo.toml

[package]
name = "min-example"
version = "0.1.0"
edition = "2018"

[dependencies]
lucet-runtime = "0.1.1"
lucetc = "0.1.1"
lucet-runtime-internals = "0.1.1"
wabt = "0.7.4"
@roman-kashitsyn
Contributor Author

The issue disappears if every instance is created in a separate thread:

src/main.rs

use lucet_runtime::{DlModule, MmapRegion, Region};
use lucet_runtime_internals::alloc::Limits;
use lucetc::Lucetc;

fn main() {
    let wasm = wabt::wat2wasm(
        r#"
            (module
              (func (export "trap") unreachable)
              (func (export "test")))
        "#,
    )
    .unwrap();

    let lucetc = Lucetc::try_from_bytes(wasm.as_slice()).unwrap();
    lucetc.shared_object_file("mod.so").unwrap();
    let module = DlModule::load("mod.so").unwrap();

    // Create and run each instance on its own thread; the panic no longer occurs.
    for func_name in &["trap", "test"] {
        let module = module.clone();
        let handler = std::thread::spawn(move || {
            let memory: std::sync::Arc<MmapRegion> =
                MmapRegion::create(1, &Limits::default()).unwrap();
            let mut instance_handle = memory.new_instance_builder(module).build().unwrap();
            println!("{:?}", instance_handle.run(func_name, &[]));
        });
        handler.join().unwrap();
    }
}

@acfoltzer
Contributor

Hi @roman-kashitsyn, thanks for the report! Unfortunately I'm so far unable to replicate this issue, at least on my Linux machine. Can you share more about your OS environment?

Also, the "0.1.1" versions of the Lucet packages on crates.io are pretty old at this point, so it may be worth trying the current versions from master. That said, I tried your example with both, and it succeeded both times.

@roman-kashitsyn
Contributor Author

roman-kashitsyn commented Oct 8, 2019

Sorry, I forgot to mention that I observe the issue on macOS 10.14 and 10.15. I'll try to reproduce it on current master.

@acfoltzer
Contributor

I got access to a Mac to test this out, and can confirm that it's also an issue with master. Mac OS is a pretty experimental platform for us; most of the developers and our deployment environment are x86-64 Linux. As a result, we may not be able to get to the bottom of this very soon, but we're happy to provide guidance if you're interested in troubleshooting.

It appears there's something strange going on with the semantics of alternate signal handler stacks on Mac. When a Lucet guest traps, a SIGILL is raised and handled by the runtime, which swaps back to the host context and returns an error. After the swap to the host context, the thread is no longer on the alternate signal stack, but getting EPERM from that call suggests the OS still believes the signal stack is active. Perhaps there's a system call we need to make only on Macs to convince the OS that we've jumped away from the handler?
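
One way to check what the kernel thinks is to query the current thread's signal-stack state with sigaltstack(NULL, ...) and look for SS_ONSTACK. The following is a standalone diagnostic sketch (not Lucet code); calling something like it just before the failing sigaltstack in with_signals_on should show whether macOS still reports the thread as being on the alternate stack after the swap back to the host context:

#include <signal.h>
#include <stdio.h>

/* Query the kernel's view of the current thread's signal stack.
   If ss_flags contains SS_ONSTACK, the kernel still considers the thread
   to be executing on the alternate stack, and sigaltstack() calls that try
   to change it will fail with EPERM. */
void print_sigstack_state(const char *label) {
  stack_t current;
  if (sigaltstack(NULL, &current) < 0) {
    perror("sigaltstack(NULL, ...)");
    return;
  }
  printf("%s: ss_flags=%#x (SS_ONSTACK=%d, SS_DISABLE=%d)\n", label,
         current.ss_flags,
         (current.ss_flags & SS_ONSTACK) != 0,
         (current.ss_flags & SS_DISABLE) != 0);
}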

@roman-kashitsyn
Contributor Author

I'll check whether the problem is still reproducible if I call sigaltstack with the SS_DISABLE flag before setting the new stack.
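
For reference, such a probe might look roughly like the sketch below (hypothetical and untested; reset_and_install_sigstack is a made-up helper). Note that if the kernel still thinks the thread is on the old alternate stack, the SS_DISABLE call itself may fail with the same EPERM:

#include <signal.h>
#include <stdio.h>

/* Hypothetical probe: explicitly disable the current alternate signal stack
   before installing a new one. */
int reset_and_install_sigstack(void *new_sp, size_t new_size) {
  stack_t disable;
  disable.ss_sp = NULL;
  disable.ss_size = 0;
  disable.ss_flags = SS_DISABLE;
  if (sigaltstack(&disable, NULL) < 0) {
    perror("sigaltstack(SS_DISABLE)");
    return -1;
  }

  stack_t ss;
  ss.ss_sp = new_sp;
  ss.ss_size = new_size;
  ss.ss_flags = 0;
  if (sigaltstack(&ss, NULL) < 0) {
    perror("sigaltstack(install)");
    return -1;
  }
  return 0;
}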

It's weird that the problem is not reproducible on Linux; the man page says:

       EPERM  An attempt was made to change the alternate signal stack while
              it was active (i.e., the process was already executing on the
              current alternate signal stack).

As far as I understand, that means it's illegal to change the alternate signal stack while the thread is still executing on it.

@acfoltzer
Contributor

We should never be swapping out the alt stack while the signal handler is running, only after we've swapped back to the host context. Linux must correctly recognize when we've jumped away from the handler, whereas Mac OS still believes it's running.

@roman-kashitsyn
Contributor Author

roman-kashitsyn commented Oct 16, 2019

I tried to construct a small C program that reproduces the macOS issue with sigaltstack, but I didn't manage to trigger it. Here is the program I tried:

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>

static volatile int G_caught_sig = 0;

static char G_sig_stack_1[SIGSTKSZ];
static char G_sig_stack_2[SIGSTKSZ];

void handler(int sig, siginfo_t *_info, void *_arg) {
  G_caught_sig = sig;
}

int main() {
  /* Install the first alternate signal stack. */
  stack_t ss, old_ss;
  ss.ss_size = SIGSTKSZ;
  ss.ss_flags = 0;
  ss.ss_sp = &G_sig_stack_1;

  if (sigaltstack(&ss, &old_ss) < 0) {
    perror("sigaltstack");
    exit(EXIT_FAILURE);
  }

  struct sigaction sa;
  sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
  sa.sa_sigaction = &handler;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGILL, &sa, NULL) < 0) {
    perror("sigaction");
    exit(EXIT_FAILURE);
  }

  /* Trigger the handler; it returns normally, so the kernel goes through sigreturn. */
  raise(SIGILL);

  printf("Caught signal = %d\n", G_caught_sig);

  /* After the handler has returned, try to switch to a second alternate stack. */
  ss.ss_sp = &G_sig_stack_2;
  if (sigaltstack(&ss, &old_ss) < 0) {
    perror("sigaltstack");
    exit(EXIT_FAILURE);
  }

  return 0;
}

Update: One interesting difference I observed is that the syscall trace of this small C program contains sigreturn (the syscall that cleans up after a signal handler), while the trace of the Rust program in the issue description doesn't contain it.

@acfoltzer
Contributor

Update: One interesting difference I observed is that the syscall trace of this small C program contains sigreturn (the syscall that cleans up after a signal handler), while the trace of the Rust program in the issue description doesn't contain it.

This is more evidence that Mac OS behaves differently when jumping out of a signal handler, rather than returning from it. Unfortunately, many of the signals that a guest raises when faulting can't be immediately resolved, so the same signal would just keep being raised if we returned from the handler. I wonder if there's a syscall similar to sigreturn that we could call to make it clear to the OS that we're no longer on the signal stack?

@roman-kashitsyn
Contributor Author

roman-kashitsyn commented Oct 17, 2019

Unfortunately, many of the signals that a guest raises when faulting can't be immediately resolved, so the same signal would just keep being raised if we returned from the handler.

Yes, that makes a lot of sense. I wrote another small C program that uses longjmp to get out of a signal handler to avoid being stuck in an infinite loop.

#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <setjmp.h>
#include <unistd.h>
#include <sys/mman.h>

static jmp_buf jump_buffer;

static volatile int G_caught_sig = 0;
static volatile char *G_memory = NULL;

static char G_sig_stack_1[SIGSTKSZ];
static char G_sig_stack_2[SIGSTKSZ];

void handler(int sig, siginfo_t *_info, void *_arg) {
  G_caught_sig = sig;
  longjmp(jump_buffer, sig);
}

int main() {
  /* Map a page with no access permissions; writing to it faults (SIGBUS on macOS). */
  G_memory = mmap(0, getpagesize(), PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (G_memory == MAP_FAILED) {
    perror("mmap");
    exit(EXIT_FAILURE);
  }

  stack_t ss, old_ss;
  ss.ss_size = SIGSTKSZ;
  ss.ss_flags = 0;
  ss.ss_sp = &G_sig_stack_1;

  if (sigaltstack(&ss, &old_ss) < 0) {
    perror("sigaltstack");
    exit(EXIT_FAILURE);
  }

  struct sigaction sa;
  sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
  sa.sa_sigaction = &handler;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGBUS, &sa, NULL) < 0) {
    perror("sigaction");
    exit(EXIT_FAILURE);
  }

  /* Fault by writing to the protected page; the handler longjmps back here. */
  while (setjmp(jump_buffer) == 0) {
    printf("Setting memory at %p\n", G_memory);
    G_memory[0] = 'z';
  }

  printf("Caught signal = %d\n", G_caught_sig);

  /* After jumping out of the handler, try to switch to a second alternate stack. */
  ss.ss_sp = &G_sig_stack_2;
  if (sigaltstack(&ss, &old_ss) < 0) {
    perror("sigaltstack");
    exit(EXIT_FAILURE);
  }

  return 0;
}

The program terminates successfully.

The syscall trace still contains sigreturn (most likely it's called by longjmp). I also took a quick look at the latest available xnu kernel sources; it appears that sigreturn is the only place where the SA_ONSTACK flag that causes the EPERM error is cleared.

@acfoltzer
Contributor

Ah, it looks like sigreturn is more flexible than the one available on Linux. We might be able to use it to jump back to the host context on a fault, instead of our own Context::set_from_signal. Or, we might have to use a C trampoline that we can setjmp from, so that we can longjmp out of the handler without having to get our hands dirty with what appears to be an undocumented syscall.
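
For illustration, the trampoline variant might look roughly like the sketch below; host_trampoline, fault_handler_jump, and run_guest are made-up names rather than Lucet APIs, and it assumes that longjmp on macOS goes through sigreturn, as the traces above suggest:

#include <setjmp.h>
#include <signal.h>

/* Sketch only. The handler would be installed with sigaction() using
   SA_ONSTACK | SA_SIGINFO, as in the programs above. */
static _Thread_local jmp_buf host_jmp_buf;

/* Instead of swapping contexts directly, jump back to the host-side
   trampoline. Per the traces above, longjmp appears to go through sigreturn
   on macOS, which clears the kernel's "on the alternate stack" state. */
static void fault_handler_jump(int sig, siginfo_t *info, void *ctx) {
  (void)info;
  (void)ctx;
  longjmp(host_jmp_buf, sig);
}

/* Host-side wrapper around a guest call. Returns 0 if the guest finished
   normally, or the signal number if it faulted. */
static int host_trampoline(void (*run_guest)(void)) {
  int sig = setjmp(host_jmp_buf);
  if (sig == 0) {
    run_guest(); /* may fault; the handler then longjmps back here */
    return 0;
  }
  return sig;
}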

Again, we probably won't be able to dive deeply into this in the short term, but this is very helpful diagnostic information for us. Thank you!
