Race condition in background thread #46

Closed
chrisdickinson opened this issue Jan 8, 2024 · 0 comments · Fixed by #52
@chrisdickinson (Contributor) commented:

Tracking this one here. I've been investigating intermittent failures & hangs in CI over the last month or so.

This seems to appear most frequently on Node on Windows, in the "Test writes that span multiple blocks (w/small buffer)" test (which is exactly the failure mode that test was designed to catch!).

I managed to replicate the behavior locally by running a "bake" command that runs just this single test in a tight loop against a single platform. I modified the test to transfer an 18,428-byte buffer through a 16-byte shared scratch space (leaving 12 bytes of room for data per block transfer); the failure crops up randomly after 20-80 iterations. Strangely, while it's possible to force a failure on Deno by artificially delaying the worker thread in the critical section, Deno does not seem to exhibit the behavior otherwise.

I have reason to think the bug is down to the semaphore pattern I'm using to transfer the data. I'm going to take a look today and see if I can't replace it with something more off-the-shelf; if not, we might mark `runInWorker` as an unstable feature while we work on a fix.
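
Roughly, the pattern in question looks like the sketch below. This is an illustrative reconstruction, not the SDK's actual code: the buffer layout, the `DRAINED` sentinel, and the `sendBlocks` helper are all assumptions made for the sake of the example.

```ts
// Illustrative reconstruction of the scratch-buffer handshake; not the SDK's
// actual code. Layout assumption: the first 4 bytes of the SharedArrayBuffer
// hold a flag, and the remaining 12 bytes carry one block of payload.
const sab = new SharedArrayBuffer(16);
const flag = new Int32Array(sab, 0, 1);
const scratch = new Uint8Array(sab, 4);

// Sentinel meaning "the consumer has drained the scratch area". The real code
// uses a different encoding (the diagram in the fix commit below uses 4); -1 is
// used here only so the sentinel can't collide with a small block length.
const DRAINED = -1;
Atomics.store(flag, 0, DRAINED); // start out in the drained state

// Producer side: copy `data` across one scratch-buffer-sized block at a time.
// This must run on a thread that is allowed to block (e.g. a worker), because
// Atomics.wait blocks the calling thread.
function sendBlocks(data: Uint8Array): void {
  for (let offset = 0; offset < data.length; offset += scratch.length) {
    // Wait for the consumer to signal that the previous block was consumed.
    let seen = Atomics.load(flag, 0);
    while (seen !== DRAINED) {
      Atomics.wait(flag, 0, seen, 50);
      seen = Atomics.load(flag, 0);
    }
    const block = data.subarray(offset, offset + scratch.length);
    scratch.set(block);
    Atomics.store(flag, 0, block.length); // publish "N bytes are ready"
    Atomics.notify(flag, 0);
  }
}

// The consumer on the other thread mirrors this: wait for a positive flag,
// read that many bytes out of `scratch`, then store DRAINED and notify.
```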

@chrisdickinson chrisdickinson added the bug Something isn't working label Jan 8, 2024
@chrisdickinson chrisdickinson self-assigned this Jan 8, 2024
@chrisdickinson chrisdickinson moved this to In Progress in Extism 1.0 Jan 8, 2024
chrisdickinson added a commit that referenced this issue Jan 8, 2024
Because we're still working out kinks in the implementation [1], users can
specifically request that plugins run in the background, but plugins no longer
do so by default, even where background threads are available. Once [1] is
fixed, we're likely to make threading the default again.

[1]: #46
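
(For reference, opting back in looks something like the sketch below; `createPlugin` is the SDK's entry point, but treat the exact option names and output accessors here as an assumption rather than documentation.)

```ts
// Hypothetical usage sketch; option names and return types are assumptions.
import createPlugin from '@extism/extism';

const plugin = await createPlugin('./plugin.wasm', {
  useWasi: true,
  runInWorker: true, // explicitly opt in to running the plugin on a background thread
});

const output = await plugin.call('example_export', 'input data');
console.log(output?.text());
await plugin.close();
```
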
chrisdickinson added a commit that referenced this issue Jan 12, 2024
Oh my god, Atomics.

The linked issue explains the behavior we're fixing in more detail. But
the specific behavior we were running into (I _think_) goes a little something
like this: Imagine we have two threads.

```
  worker thread              host thread
        |                        |
        |                        |
        o postMessage ---------> | - calls hostFunc, waits for completion
        x wait for flag to != 4  x - wait for flag to == 4; writes 1
        |  ↖︎_____________________o   scratchbuffer's worth of info, sets flag to "N" bytes
        |                        x - wait for flag to == 4
        | reads N bytes    _____↗︎|
        | sets flag to 4  /      |
        o _______________/       |
        x wait for flag to != 4  | - all done, set flag to N, wait for
        | ↖︎______________________o   flag to == 4 again
        |                     __↗︎|
        | all done           /   |
        | set flag to 4     /    |
        | return to wasm   /     |
        |                 /      |
        o _______________/       |
        ↓                        ↓

```

We had a couple of problems:

1. In the first postMessage, we didn't wait for the flag to == 4 before writing
   data back.
2. We implemented waits as a single `Atomics.wait{,Async}` call with a MAX_WAIT
   timeout.
3. We trusted the value that came out of `Atomics.load` directly after the
   `Atomics.wait` (see the sketch below).
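
Concretely, the wait helper had roughly this shape (paraphrased from memory; `MAX_WAIT`, the function name, and the timeout value are illustrative, not the literal source):

```ts
// Paraphrased sketch of the buggy shape; names and timeout value are illustrative.
const MAX_WAIT = 500; // ms

function waitForFlagToLeave(flag: Int32Array, expected: number): number {
  // Problem 2: a single wait with a timeout. If the wait times out, we fall
  // through here exactly once and never re-check.
  Atomics.wait(flag, 0, expected, MAX_WAIT);

  // Problem 3: trusting this load blindly. Nothing verifies the value actually
  // moved on from `expected`, so the caller can act on a stale flag and skip an
  // entire flush of the scratch buffer.
  return Atomics.load(flag, 0);
}
```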

The first problem was pretty straightforward to fix: the two threads now
explicitly agree that the shared array buffer is in a known state, rather than
relying on it implicitly being in the correct state. (That was an assumption I
slipped into: if the main thread is executing, what other value could the flag
have? After all, we set the flag before we called `postMessage`! This turns out
to be a whole _class_ of bug.)

The other two problems were more surprising. Looking into other semaphore
implementations, I noticed that they combined `wait` with a loop, and further
verified that the value loaded directly after the `wait` had actually changed.
This was the proximate cause of the bug: we had a single wait, but it was
possible for the value observed after that wait not to have changed. That meant
skipping an entire flush of the buffer, which would permanently misalign the
two threads.
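
The fix, in rough outline, puts the wait in a loop and only returns once the loaded value has actually moved on from the value we were waiting on (again a paraphrase of the shape, not the literal patch):

```ts
// Paraphrased sketch of the fixed wait: loop until the value really changes.
const MAX_WAIT = 500; // ms

function waitUntilFlagChanges(flag: Int32Array, expected: number): number {
  let value = Atomics.load(flag, 0);
  while (value === expected) {
    // Re-arm the wait each time through: a timeout or spurious wakeup just
    // sends us around the loop again instead of letting stale state through.
    Atomics.wait(flag, 0, expected, MAX_WAIT);
    value = Atomics.load(flag, 0);
  }
  return value;
}
```

On the main thread, where blocking waits aren't allowed, the same loop shape
applies with `Atomics.waitAsync` and `await` in place of the blocking
`Atomics.wait`.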

This has an interesting effect on performance: Bun, browsers, and Node appear
to perform just as well as they did before, minus the intermittent errors.
Deno, on the other hand, hits a hideous slowdown: the test jumps from roughly
3 seconds on the other platforms to 18-20 seconds. I'm investigating what's
going on there, but I'm surprised to see how differently two V8-backed JS
runtimes perform in practice. I've left the `runInWorker` flag defaulted to
"off" in the meantime while I dig into this.

Fixes #46.
@github-project-automation github-project-automation bot moved this from In Progress to Done in Extism 1.0 Jan 23, 2024