fix: address race condition when copying blocks between threads on node
Oh my god, Atomics.

The linked issue explains the behavior we're fixing in more detail. But the specific behavior we were running into (I _think_) goes a little something like this:

Imagine we have two threads.

```
worker thread                   host thread
|                               |
|                               |
o   postMessage --------->      |   - calls hostFunc, waits for completion
x   wait for flag to != 4       x   - wait for flag to == 4; writes 1
|         ↖︎_____________________o     scratchbuffer's worth of info, sets flag to "N" bytes
|                               x   - wait for flag to == 4
|   reads N bytes         _____↗︎|
|   sets flag to 4      /       |
o       _______________/        |
x   wait for flag to != 4       |   - all done, set flag to N, wait for
|        ↖︎______________________o     flag to == 4 again
|                            __↗︎|
|   all done                /   |
|   set flag to 4          /    |
|   return to wasm        /     |
|                        /      |
o        _______________/       |
↓                               ↓
```

We had a couple of problems:

1. In the first postMessage, we didn't wait for the flag to == 4 before writing data back.
2. We implemented waits as a single `Atomics.wait{,Async}` with a MAX_WAIT timeout.
3. We trusted the value that came out of `Atomics.load` directly after the `Atomics.wait`.

The first problem was pretty straightforward to fix. The fix merely makes the two threads agree that the shared array buffer is in a certain state rather than relying on it implicitly being in the correct state. (That was an assumption I slipped into: if the main thread is executing, what other value could the flag have? After all, we set the flag before we called `postMessage`! This turns out to be a _class_ of bug.)

The other two problems were more surprising. Looking into other semaphore implementations, I saw that they combined `wait` with a loop, and further ensured that the value they loaded directly after the `wait` had actually changed. This was the proximate cause of the bug: we had a single wait, sure, but it was possible for the value loaded after the wait to be unchanged. This meant skipping an entire flush of the buffer, which would permanently misalign the two threads.
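As a rough illustration of the wait-in-a-loop pattern described above (not the code in this commit), a minimal sketch might look like the following. The names `FLAG_INDEX`, `MAX_WAIT_MS`, `setFlag`, and `waitForChange`, the timeout values, and the flag layout are all assumptions made for the example.

```typescript
// Illustrative sketch only: FLAG_INDEX, MAX_WAIT_MS, setFlag and
// waitForChange are hypothetical names, not the ones used in this commit.
const FLAG_INDEX = 0;
const MAX_WAIT_MS = 500; // assumed per-wait timeout

// Publish a new flag value and wake any thread blocked in Atomics.wait.
function setFlag(flag: Int32Array, value: number): void {
  Atomics.store(flag, FLAG_INDEX, value);
  Atomics.notify(flag, FLAG_INDEX);
}

// Block until the flag holds something other than `expected`, then return
// the value that was actually observed. A single Atomics.wait is never
// trusted: the flag is re-loaded and re-checked on every pass of the loop.
function waitForChange(
  flag: Int32Array,
  expected: number,
  deadlineMs: number = 5000,
): number {
  const start = Date.now();
  for (;;) {
    const observed = Atomics.load(flag, FLAG_INDEX);
    if (observed !== expected) {
      return observed; // the flag really did change
    }
    const remaining = deadlineMs - (Date.now() - start);
    if (remaining <= 0) {
      throw new Error("timed out waiting for the flag to change");
    }
    // Returns "ok", "not-equal" or "timed-out"; whichever it is, loop
    // around and check the flag again instead of assuming it changed.
    Atomics.wait(flag, FLAG_INDEX, expected, Math.min(MAX_WAIT_MS, remaining));
  }
}
```

Under these assumptions, the worker side of the diagram would call something like `waitForChange(flag, 4)` and treat the returned value as the byte count, while the host would wait for the flag to return to 4 before writing again. The overall deadline exists only to keep the sketch from spinning forever.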
This has an interesting effect on performance: Bun, browsers, and Node appear to perform just as well as they did before, minus the errors we saw. Deno, on the other hand, hits a hideous slowdown: the test jumps from taking 3 seconds on other platforms to 18-20 seconds. I'm investigating what's going on there, but I'm surprised to see how differently two V8-backed JS platforms perform in practice. I've left the `runInWorker` flag defaulted to "off" in the meantime while I dig into this.

Fixes #46.
1 parent 702d794, commit d14493d
Showing 6 changed files with 223 additions and 204 deletions.