
fix: address race condition when copying blocks between threads on node #52

Conversation

chrisdickinson
Contributor

Oh my god, Atomics.

The linked issue explains the behavior we're fixing in more detail. But the specific behavior we were running into (I think) goes a little something like this: Imagine we have two threads.

```
  worker thread              host thread
        |                        |
        |                        |
        o postMessage ---------> | - calls hostFunc, waits for completion
        x wait for flag to != 4  x - wait for flag to == 4; writes 1
        |  ↖︎_____________________o   scratchbuffer's worth of info, sets flag to "N" bytes
        |                        x - wait for flag to == 4
        | reads N bytes    _____↗︎|
        | sets flag to 4  /      |
        o _______________/       |
        x wait for flag to != 4  | - all done, set flag to N, wait for
        | ↖︎______________________o   flag to == 4 again
        |                     __↗︎|
        | all done           /   |
        | set flag to 4     /    |
        | return to wasm   /     |
        |                 /      |
        o _______________/       |
        ↓                        ↓
```
We had a few problems:

  1. In the first `postMessage`, we didn't wait for the flag to == 4 before writing data back.
  2. We implemented waits as a single `Atomics.wait{,Async}` with a `MAX_WAIT` timeout.
  3. We trusted the value that came out of `Atomics.load` directly after the `Atomics.wait`.

The first problem was pretty straightforward to fix: it makes the two threads explicitly agree that the shared array buffer is in a certain state, rather than relying on it implicitly being in the correct state. (That reliance was an assumption I slipped into: if the main thread is executing, what other value could the flag have? After all, we set the flag before we called `postMessage`! This turns out to be a whole class of bug.)
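A minimal sketch of that handshake (the names `READY` and `waitForReady` are hypothetical, not the PR's actual API): before touching the shared buffer, the writer blocks until the flag explicitly reads the agreed value. Here the flag is pre-set so the sketch runs single-threaded; in the real code the other thread stores it.

```javascript
// Sketch only -- READY and waitForReady are hypothetical names.
const READY = 4;
const sab = new SharedArrayBuffer(4);
const flag = new Int32Array(sab);

// In the real code, the other thread stores READY when it finishes reading;
// we pre-set it here so this single-threaded sketch runs to completion.
Atomics.store(flag, 0, READY);

// Writer side: explicitly wait for the agreed state before writing anything.
function waitForReady() {
  let value = Atomics.load(flag, 0);
  while (value !== READY) {
    // If the value already changed, wait() returns 'not-equal' immediately.
    Atomics.wait(flag, 0, value, 500); // bounded wait; re-check below
    value = Atomics.load(flag, 0);
  }
}

waitForReady();
// ...only now is it safe to write into the shared buffer...
```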

The other two problems were more surprising. Looking into other semaphore implementations, I noticed that they combined `wait` with a loop, and further ensured that the value loaded directly after the `wait` had actually changed. This was the proximate cause of the bug: we had a single wait, but it was possible for the value observed after the wait to be unchanged. That meant skipping an entire flush of the buffer, which would permanently misalign the two threads.

This has an interesting effect on performance: Bun, browsers, and Node appear to perform just as well as they did before, minus the errors. Deno, on the other hand, hits a hideous slowdown -- the test jumps from 3 seconds on other platforms to 18-20 seconds. I'm investigating what's going on there, but I'm surprised to see how differently two V8-backed JS platforms perform in practice. I've left the `runInWorker` flag defaulted to "off" in the meantime while I dig into this.

Fixes #46.

@chrisdickinson chrisdickinson force-pushed the chris/20240108-what-is-a-semaphore-a-miserable-pile-of-atomics branch from ef6eda1 to 34492a4 Compare January 12, 2024 23:51
```
bake:
    while just _test; do true; done

bake filter='.*':
    while just _test '{{ filter }}'; do true; done
```

This change made it possible to run the tests in a tighter loop, surfacing the bug more quickly.

//
// - https://github.com/nodejs/node/pull/44409
// - https://github.com/denoland/deno/issues/14786
const timer = setInterval(() => {}, 0);

This code moved – the timer used to be created per AtomicsWaitAsync, now we create it once at the start of the invocation and run it until we're done with the invocation.
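The per-invocation pattern described might look roughly like this (a sketch; `invoke` is a hypothetical stand-in for the real invocation path, not the PR's actual function):

```javascript
// Sketch: one keep-alive interval per invocation, torn down when it completes.
async function invoke(run) {
  // A no-op interval keeps the event loop alive so Atomics.waitAsync
  // promises can settle (see the linked node/deno issues above).
  const timer = setInterval(() => {}, 0);
  try {
    return await run();
  } finally {
    clearInterval(timer); // cleared once, at the end of the invocation
  }
}
```

Hoisting the timer to the invocation boundary means one setup/teardown per host call instead of one per `AtomicsWaitAsync`.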

this.output = output;
this.outputOffset = SAB_BASE_OFFSET;
this.flag = new Int32Array(this.output);
this.wait(0);

This is the "first bug" I talked about – now we .wait(0) before attempting to write anything.

this.worker.terminate();
this.worker = null as any;
}
}

These functions moved up to put the `RingBufferWriter` closer to `#handleInvoke`.

Atomics.notify(this.flag, RingBufferWriter.SAB_IDX, 1);

// wait for the thread to read the data out...
const result = AtomicsWaitAsync(this.flag, RingBufferWriter.SAB_IDX, targetOffset, MAX_WAIT);

Note that we only wait once in this version of the code.

@@ -363,15 +363,14 @@ if (typeof WebAssembly === 'undefined') {
});

test('test writes that span multiple blocks (w/small buffer)', async () => {
const res = await fetch('http://localhost:8124/src/mod.test.ts');
const result = await res.text();
const value = '9:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'.repeat(18428 / 34);

Fix a silly test stability issue: this test used to pass the content of the test file around. Anytime the test file changed, so did the length of the buffer. The unit test in question failed when transferring 18,428 bytes, so we pin it there. Additionally, we use a 34-character repeating readable ASCII sequence to make it easier to spot "missing" runs of characters.

}
} while (value <= SAB_BASE_OFFSET);

this.#available = Atomics.load(this.flag, 0);

This is the second issue we ran into – sometimes we'd receive a wait signal for the flag but, on reading the value, it would turn out to be the old value. The loop ensures we wait until an `Atomics.load` gives us the expected, changed value.

@nilslice
Member

Congrats, you win all the branch name points today!!!


@nilslice nilslice left a comment


Admittedly, I would need to spend a full day+ on this to fully understand, but in lieu of that - nothing looks problematic! Would be good to have @bhelx still take a look though.

Great work!


@bhelx bhelx left a comment


It's tricky for me to give a strong opinion on this as it's pretty in the weeds. I worry about some of these components (specifically the RingBufferWriter) because it feels like the kind of thing that could have subtle bugs that are really tricky to debug and test. But I'm assuming we're in uncharted territory here and we're gonna need to write some of this low level code ourselves.

@chrisdickinson chrisdickinson merged commit e48cec9 into main Jan 23, 2024
4 checks passed
@chrisdickinson chrisdickinson deleted the chris/20240108-what-is-a-semaphore-a-miserable-pile-of-atomics branch January 23, 2024 19:58
Successfully merging this pull request may close these issues.

Race condition in background thread