Race condition in background thread #46

Closed
chrisdickinson opened this issue Jan 8, 2024 · 0 comments · Fixed by #52
@chrisdickinson (Contributor) commented:

Tracking this one here. I've been investigating intermittent failures & hangs in CI over the last month or so.

This seems to appear most frequently on Node on Windows, in the "Test writes that span multiple blocks (w/small buffer)" test (which is exactly the failure mode that test was designed to catch!).

I managed to replicate the behavior locally by running a "bake" command that runs just this single test in a tight loop against a single platform. I modified the test to transfer an 18,428-byte buffer through a 16-byte shared scratch space (leaving 12 bytes of room for data per block transfer); the failure crops up randomly after 20-80 iterations. Strangely, while it's possible to force a failure on Deno by artificially delaying the worker thread in the critical section, Deno does not seem to exhibit the behavior otherwise.

I have reason to think the bug is down to the semaphore pattern I'm using to transfer the data. I'm going to take a look today and see if I can't replace it with something more off-the-shelf; if not, we might mark `runInWorker` as an unstable feature while we work on a fix.
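
Roughly, the pattern in question looks like the sketch below. This is an illustrative reconstruction, not the SDK's actual code: the buffer layout, the `DRAINED` sentinel, and the `sendBlocks` helper are all assumptions made for the sake of the example.

```ts
// Illustrative reconstruction of the scratch-buffer handshake; not the SDK's
// actual code. Layout assumption: the first 4 bytes of the SharedArrayBuffer
// hold a flag, and the remaining 12 bytes carry one block of payload.
const sab = new SharedArrayBuffer(16);
const flag = new Int32Array(sab, 0, 1);
const scratch = new Uint8Array(sab, 4);

// Sentinel meaning "the consumer has drained the scratch area". The real code
// uses a different encoding (the diagram in the fix commit below uses 4); -1 is
// used here only so the sentinel can't collide with a small block length.
const DRAINED = -1;
Atomics.store(flag, 0, DRAINED); // start out in the drained state

// Producer side: copy `data` across one scratch-buffer-sized block at a time.
// This must run on a thread that is allowed to block (e.g. a worker), because
// Atomics.wait blocks the calling thread.
function sendBlocks(data: Uint8Array): void {
  for (let offset = 0; offset < data.length; offset += scratch.length) {
    // Wait for the consumer to signal that the previous block was consumed.
    let seen = Atomics.load(flag, 0);
    while (seen !== DRAINED) {
      Atomics.wait(flag, 0, seen, 50);
      seen = Atomics.load(flag, 0);
    }
    const block = data.subarray(offset, offset + scratch.length);
    scratch.set(block);
    Atomics.store(flag, 0, block.length); // publish "N bytes are ready"
    Atomics.notify(flag, 0);
  }
}

// The consumer on the other thread mirrors this: wait for a positive flag,
// read that many bytes out of `scratch`, then store DRAINED and notify.
```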

@chrisdickinson chrisdickinson added the bug Something isn't working label Jan 8, 2024
@chrisdickinson chrisdickinson self-assigned this Jan 8, 2024
@chrisdickinson chrisdickinson moved this to In Progress in Extism 1.0 Jan 8, 2024
chrisdickinson added a commit that referenced this issue Jan 8, 2024
Because we're still working out kinks in the implementation [1], users can
specifically request that plugins run in the background, but plugins no longer
do so by default, even where background threads are available. Once [1] is
fixed, we're likely to make threading the default again.

[1]: #46
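
(For reference, opting back in looks something like the sketch below; `createPlugin` is the SDK's entry point, but treat the exact option names and output accessors here as an assumption rather than documentation.)

```ts
// Hypothetical usage sketch; option names and return types are assumptions.
import createPlugin from '@extism/extism';

const plugin = await createPlugin('./plugin.wasm', {
  useWasi: true,
  runInWorker: true, // explicitly opt in to running the plugin on a background thread
});

const output = await plugin.call('example_export', 'input data');
console.log(output?.text());
await plugin.close();
```
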
chrisdickinson added a commit that referenced this issue Jan 12, 2024
Oh my god, Atomics.

The linked issue explains the behavior we're fixing in more detail. But
the specific behavior we were running into (I _think_) goes a little something
like this: Imagine we have two threads.

```
  worker thread              host thread
        |                        |
        |                        |
        o postMessage ---------> | - calls hostFunc, waits for completion
        x wait for flag to != 4  x - wait for flag to == 4; writes 1
        |  ↖︎_____________________o   scratchbuffer's worth of info, sets flag to "N" bytes
        |                        x - wait for flag to == 4
        | reads N bytes    _____↗︎|
        | sets flag to 4  /      |
        o _______________/       |
        x wait for flag to != 4  | - all done, set flag to N, wait for
        | ↖︎______________________o   flag to == 4 again
        |                     __↗︎|
        | all done           /   |
        | set flag to 4     /    |
        | return to wasm   /     |
        |                 /      |
        o _______________/       |
        ↓                        ↓

```

We had a couple of problems:

1. In the first postMessage, we didn't wait for the flag to == 4 before writing
   data back.
2. We implemented waits as a single `Atomics.wait{,Async}` call with a MAX_WAIT
   timeout.
3. We trusted the value that came out of `Atomics.load` directly after the
   `Atomics.wait` (see the sketch below).
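
Concretely, the wait helper had roughly this shape (paraphrased from memory; `MAX_WAIT`, the function name, and the timeout value are illustrative, not the literal source):

```ts
// Paraphrased sketch of the buggy shape; names and timeout value are illustrative.
const MAX_WAIT = 500; // ms

function waitForFlagToLeave(flag: Int32Array, expected: number): number {
  // Problem 2: a single wait with a timeout. If the wait times out, we fall
  // through here exactly once and never re-check.
  Atomics.wait(flag, 0, expected, MAX_WAIT);

  // Problem 3: trusting this load blindly. Nothing verifies the value actually
  // moved on from `expected`, so the caller can act on a stale flag and skip an
  // entire flush of the scratch buffer.
  return Atomics.load(flag, 0);
}
```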

The first problem was pretty straightforward to fix: the two threads now
explicitly agree that the shared array buffer is in a known state, rather than
relying on it implicitly being in the correct state. (That was an assumption I
slipped into: if the main thread is executing, what other value could the flag
have? After all, we set the flag before we called `postMessage`! This turns out
to be a whole _class_ of bug.)

The other two problems were more surprising. Looking into other semaphore
implementations, I noticed that they combined `wait` with a loop, and further
verified that the value loaded directly after the `wait` had actually changed.
This was the proximate cause of the bug: we had a single wait, but it was
possible for the value observed after that wait not to have changed. That meant
skipping an entire flush of the buffer, which would permanently misalign the
two threads.
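
The fix, in rough outline, puts the wait in a loop and only returns once the loaded value has actually moved on from the value we were waiting on (again a paraphrase of the shape, not the literal patch):

```ts
// Paraphrased sketch of the fixed wait: loop until the value really changes.
const MAX_WAIT = 500; // ms

function waitUntilFlagChanges(flag: Int32Array, expected: number): number {
  let value = Atomics.load(flag, 0);
  while (value === expected) {
    // Re-arm the wait each time through: a timeout or spurious wakeup just
    // sends us around the loop again instead of letting stale state through.
    Atomics.wait(flag, 0, expected, MAX_WAIT);
    value = Atomics.load(flag, 0);
  }
  return value;
}
```

On the main thread, where blocking waits aren't allowed, the same loop shape
applies with `Atomics.waitAsync` and `await` in place of the blocking
`Atomics.wait`.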

This has an interesting effect on performance: Bun, browsers, and Node appear
to perform just as well as they did before, minus the intermittent errors.
Deno, on the other hand, hits a hideous slowdown: the test jumps from roughly
3 seconds on the other platforms to 18-20 seconds. I'm investigating what's
going on there, but I'm surprised to see how differently two V8-backed JS
runtimes perform in practice. I've left the `runInWorker` flag defaulted to
"off" in the meantime while I dig into this.

Fixes #46.
@github-project-automation github-project-automation bot moved this from In Progress to Done in Extism 1.0 Jan 23, 2024