fix: address race condition when copying blocks between threads on node #52
Conversation
Oh my god, Atomics.

The linked issue explains the behavior we're fixing in more detail. But the specific behavior we were running into (I _think_) goes a little something like this: imagine we have two threads.

```
worker thread                      host thread
  |                                   |
  |                                   |
  o postMessage --------->            | - calls hostFunc, waits for completion
  x wait for flag to != 4             x - wait for flag to == 4; writes 1
  | ↖︎_____________________o               scratchbuffer's worth of info,
  |                                       sets flag to "N" bytes
  |                                   x - wait for flag to == 4
  | reads N bytes  _____↗︎             |
  | sets flag to 4 /                  |
  o _______________/                  |
  x wait for flag to != 4             | - all done, set flag to N, wait for
  | ↖︎______________________o              flag to == 4 again
  |                  __↗︎              |
  | all done        /                 |
  | set flag to 4  /                  |
  | return to wasm /                  |
  |               /                   |
  o ______________/                   |
  ↓                                   ↓
```

We had a couple of problems:

1. In the first postMessage, we didn't wait for the flag to == 4 before writing data back.
2. We implemented waits as a single `Atomics.wait{,Async}` with a MAX_WAIT timeout.
3. We trusted the value that came out of `Atomics.load` directly after the `Atomics.wait`.

The first problem was pretty straightforward to fix. This merely makes the two threads agree that the shared array buffer is in a certain state rather than relying on it implicitly being in the correct state. (Which is an assumption I slipped into: if the main thread is executing, what other value could the flag have? After all, we set the flag before we called `postMessage`! --this turns out to be a _class_ of bug.)

The second two problems were more surprising: looking into other semaphore implementations, I was surprised to see that they combined `wait` with a loop, and further ensured that the value they loaded directly after the `wait` had actually changed. This was the proximate cause of the bug: we had a single wait, sure, but it was possible for the observed value loaded after the wait not to change. This meant skipping an entire flush of the buffer, which would permanently misalign the two threads.
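Concretely, the fixed writer side of the handshake looks something like the following sketch. The names (`READY`, `writeBlock`), the flag index, and the decision to store the byte count in the flag are illustrative assumptions based on the description above, not the SDK's actual code:

```typescript
// A minimal sketch of the writer-side handshake, assuming a single Int32
// flag at index 0 and the sentinel 4 meaning "buffer is free". These names
// and protocol details are hypothetical.
const READY = 4;

function writeBlock(flag: Int32Array, scratch: Uint8Array, bytes: Uint8Array): void {
  // Fix for problem 1: agree on state before writing, instead of assuming
  // the buffer is already free.
  let seen = Atomics.load(flag, 0);
  while (seen !== READY) {
    Atomics.wait(flag, 0, seen, 100); // bounded wait, then re-load (problems 2 & 3)
    seen = Atomics.load(flag, 0);
  }
  // Copy one scratch buffer's worth of data, publish the length, wake the reader.
  scratch.set(bytes, 0);
  Atomics.store(flag, 0, bytes.length);
  Atomics.notify(flag, 0, 1);
}

// Single-threaded demo: the flag already reads READY, so no blocking occurs.
const sab = new SharedArrayBuffer(68);
const flag = new Int32Array(sab, 0, 1);
const scratch = new Uint8Array(sab, 4);
Atomics.store(flag, 0, READY);
writeBlock(flag, scratch, new Uint8Array([10, 20, 30]));
```

The reader side mirrors this: loop until the flag reads a non-`READY` value, consume that many bytes, then store `READY` and notify.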
This has an interesting effect on performance: Bun, browsers, and Node appear to perform just as well as they did before, minus the errors we saw before. Deno, on the other hand, hits a hideous slowdown -- the test jumps from taking 3 seconds on other platforms to 18-20 seconds. I'm investigating what's going on there, but I'm surprised to see how different two V8-backed JS platforms perform in practice. I've left the `runInWorker` flag defaulted to "off" in the meantime while I dig into this. Fixes #46.
Force-pushed from ef6eda1 to 34492a4.
```
bake:
    while just _test; do true; done

bake filter='.*':
    while just _test '{{ filter }}'; do true; done
```
This change made it possible to run the tests in a tighter loop, surfacing the bug more quickly.
```
//
// - https://github.com/nodejs/node/pull/44409
// - https://github.com/denoland/deno/issues/14786
const timer = setInterval(() => {}, 0);
```
This code moved – the timer used to be created per `AtomicsWaitAsync`; now we create it once at the start of the invocation and run it until we're done with the invocation.
```
this.output = output;
this.outputOffset = SAB_BASE_OFFSET;
this.flag = new Int32Array(this.output);
this.wait(0);
```
This is the "first bug" I talked about – now we `.wait(0)` before attempting to write anything.
```
    this.worker.terminate();
    this.worker = null as any;
  }
}
```
These functions moved up to put the `RingBufferWriter` closer to `#handleInvoke`.
```
Atomics.notify(this.flag, RingBufferWriter.SAB_IDX, 1);

// wait for the thread to read the data out...
const result = AtomicsWaitAsync(this.flag, RingBufferWriter.SAB_IDX, targetOffset, MAX_WAIT);
```
Note that we only wait once in this version of the code.
```
@@ -363,15 +363,14 @@ if (typeof WebAssembly === 'undefined') {
  });

  test('test writes that span multiple blocks (w/small buffer)', async () => {
    const res = await fetch('http://localhost:8124/src/mod.test.ts');
    const result = await res.text();
    const value = '9:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'.repeat(18428 / 34);
```
Fix a silly test stability issue: this test used to pass the content of the test file around. Anytime the test file changed, so did the length of the buffer. The unit test in question failed when transferring 18,428 bytes, so we pin it there. Additionally, we use a 34-character repeating readable ASCII sequence to make it easier to spot "missing" runs of characters.
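The arithmetic checks out: the repeated unit is exactly 34 characters (the printable ASCII range '9' through 'Z'), and 18,428 is an exact multiple of 34 (542 repeats), so the payload length no longer depends on the test file:

```typescript
// Stable test payload: length is pinned at 18,428 bytes regardless of the
// contents of mod.test.ts, and the repeating readable sequence makes
// "missing" runs of characters easy to spot by eye.
const unit = '9:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ';
const value = unit.repeat(18428 / 34); // 542 repeats
```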
```
  }
} while (value <= SAB_BASE_OFFSET);

this.#available = Atomics.load(this.flag, 0);
```
This is the second issue we ran into – sometimes we'd receive a `wait` signal for the flag but, on reading the value, it would turn out to be the old value. The loop ensures we wait until an `Atomics.load` gives us the expected, changed value.
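This wait-then-re-check loop can be demonstrated in isolation. A sketch using `node:worker_threads` (index 0 and the values 0/42 are arbitrary illustrations, not the SDK's actual buffer layout):

```typescript
import { Worker } from 'node:worker_threads';

const sab = new SharedArrayBuffer(4);
const flag = new Int32Array(sab);
Atomics.store(flag, 0, 0);

// The worker publishes a new value and notifies. The waiting side still
// loops: a wakeup (or a timeout) alone doesn't guarantee the next load
// observes the changed value.
const worker = new Worker(
  `const { workerData } = require('node:worker_threads');
   const flag = new Int32Array(workerData);
   Atomics.store(flag, 0, 42);
   Atomics.notify(flag, 0);`,
  { eval: true, workerData: sab },
);

let seen = Atomics.load(flag, 0);
while (seen === 0) {
  Atomics.wait(flag, 0, 0, 100); // bounded wait, then re-load and re-check
  seen = Atomics.load(flag, 0);
}
void worker.terminate();
```

Without the loop, a timeout firing just before the worker's store would leave the old value in hand, which is exactly the misalignment described above.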
Congrats, you win all the branch name points today!!!
Admittedly, I would need to spend a full day+ on this to fully understand, but in lieu of that - nothing looks problematic! Would be good to have @bhelx still take a look though.
Great work!
It's tricky for me to give a strong opinion on this as it's pretty in the weeds. I worry about some of these components (specifically the RingBufferWriter) because it feels like the kind of thing that could have subtle bugs that are really tricky to debug and test. But I'm assuming we're in uncharted territory here and we're gonna need to write some of this low level code ourselves.