
fix: address race condition when copying blocks between threads on node #52

Conversation

chrisdickinson
Contributor

Oh my god, Atomics.

The linked issue explains the behavior we're fixing in more detail. But the specific behavior we were running into (I think) goes a little something like this: Imagine we have two threads.

```
  worker thread              host thread
        |                        |
        |                        |
        o postMessage ---------> | - calls hostFunc, waits for completion
        x wait for flag to != 4  x - wait for flag to == 4; writes 1
        |  ↖︎_____________________o   scratchbuffer's worth of info, sets flag to "N" bytes
        |                        x - wait for flag to == 4
        | reads N bytes    _____↗︎|
        | sets flag to 4  /      |
        o _______________/       |
        x wait for flag to != 4  | - all done, set flag to N, wait for
        | ↖︎______________________o   flag to == 4 again
        |                     __↗︎|
        | all done           /   |
        | set flag to 4     /    |
        | return to wasm   /     |
        |                 /      |
        o _______________/       |
        ↓                        ↓
```
We had a few problems:

  1. In the first `postMessage`, we didn't wait for the flag to == 4 before writing data back.
  2. We implemented waits as a single `Atomics.wait{,Async}` with a `MAX_WAIT` timeout.
  3. We trusted the value that came out of `Atomics.load` directly after the `Atomics.wait`.

The first problem was pretty straightforward to fix: it makes the two threads explicitly agree that the shared array buffer is in a certain state, rather than relying on it implicitly being in the correct state. (That reliance was an assumption I slipped into: if the main thread is executing, what other value could the flag have? After all, we set the flag before we called `postMessage`! This turns out to be a whole class of bug.)
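A minimal sketch of that handshake (the names `READY` and `waitForReady` are hypothetical, not the PR's actual API): before touching the shared buffer, the writer blocks until the flag explicitly reads the agreed value. Here the flag is pre-set so the sketch runs single-threaded; in the real code the other thread stores it.

```javascript
// Sketch only -- READY and waitForReady are hypothetical names.
const READY = 4;
const sab = new SharedArrayBuffer(4);
const flag = new Int32Array(sab);

// In the real code, the other thread stores READY when it finishes reading;
// we pre-set it here so this single-threaded sketch runs to completion.
Atomics.store(flag, 0, READY);

// Writer side: explicitly wait for the agreed state before writing anything.
function waitForReady() {
  let value = Atomics.load(flag, 0);
  while (value !== READY) {
    // If the value already changed, wait() returns 'not-equal' immediately.
    Atomics.wait(flag, 0, value, 500); // bounded wait; re-check below
    value = Atomics.load(flag, 0);
  }
}

waitForReady();
// ...only now is it safe to write into the shared buffer...
```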

The other two problems were more surprising. Looking into other semaphore implementations, I noticed that they combined `wait` with a loop, and further ensured that the value loaded directly after the `wait` had actually changed. This was the proximate cause of the bug: we had a single wait, but it was possible for the value observed after the wait to be unchanged. That meant skipping an entire flush of the buffer, which would permanently misalign the two threads.

This has an interesting effect on performance: Bun, browsers, and Node appear to perform just as well as they did before, minus the errors. Deno, on the other hand, hits a hideous slowdown -- the test jumps from 3 seconds on other platforms to 18-20 seconds. I'm investigating what's going on there, but I'm surprised to see how differently two V8-backed JS platforms perform in practice. I've left the `runInWorker` flag defaulted to "off" in the meantime while I dig into this.

Fixes #46.

@chrisdickinson chrisdickinson force-pushed the chris/20240108-what-is-a-semaphore-a-miserable-pile-of-atomics branch from ef6eda1 to 34492a4 Compare January 12, 2024 23:51
```
bake:
    while just _test; do true; done

bake filter='.*':
    while just _test '{{ filter }}'; do true; done
```

This change made it possible to run the tests in a tighter loop, surfacing the bug more quickly.

//
// - https://github.com/nodejs/node/pull/44409
// - https://github.com/denoland/deno/issues/14786
const timer = setInterval(() => {}, 0);

This code moved – the timer used to be created per AtomicsWaitAsync, now we create it once at the start of the invocation and run it until we're done with the invocation.
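The per-invocation pattern described might look roughly like this (a sketch; `invoke` is a hypothetical stand-in for the real invocation path, not the PR's actual function):

```javascript
// Sketch: one keep-alive interval per invocation, torn down when it completes.
async function invoke(run) {
  // A no-op interval keeps the event loop alive so Atomics.waitAsync
  // promises can settle (see the linked node/deno issues above).
  const timer = setInterval(() => {}, 0);
  try {
    return await run();
  } finally {
    clearInterval(timer); // cleared once, at the end of the invocation
  }
}
```

Hoisting the timer to the invocation boundary means one setup/teardown per host call instead of one per `AtomicsWaitAsync`.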

this.output = output;
this.outputOffset = SAB_BASE_OFFSET;
this.flag = new Int32Array(this.output);
this.wait(0);

This is the "first bug" I talked about – now we .wait(0) before attempting to write anything.

this.worker.terminate();
this.worker = null as any;
}
}

These functions moved up to put the `RingBufferWriter` closer to `#handleInvoke`.

Atomics.notify(this.flag, RingBufferWriter.SAB_IDX, 1);

// wait for the thread to read the data out...
const result = AtomicsWaitAsync(this.flag, RingBufferWriter.SAB_IDX, targetOffset, MAX_WAIT);

Note that we only wait once in this version of the code.

@@ -363,15 +363,14 @@ if (typeof WebAssembly === 'undefined') {
});

test('test writes that span multiple blocks (w/small buffer)', async () => {
const res = await fetch('http://localhost:8124/src/mod.test.ts');
const result = await res.text();
const value = '9:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ'.repeat(18428 / 34);

Fix a silly test stability issue: this test used to pass the content of the test file around. Anytime the test file changed, so did the length of the buffer. The unit test in question failed when transferring 18,428 bytes, so we pin it there. Additionally, we use a 34-character repeating readable ASCII sequence to make it easier to spot "missing" runs of characters.

}
} while (value <= SAB_BASE_OFFSET);

this.#available = Atomics.load(this.flag, 0);

This is the second issue we ran into – sometimes we'd receive a wait signal for the flag but, on reading the value, it would turn out to be the old value. The loop ensures we wait until an `Atomics.load` gives us the expected, changed value.

@nilslice
Member

Congrats, you win all the branch name points today!!!


@nilslice nilslice left a comment


Admittedly, I would need to spend a full day+ on this to fully understand, but in lieu of that - nothing looks problematic! Would be good to have @bhelx still take a look though.

Great work!


@bhelx bhelx left a comment


It's tricky for me to give a strong opinion on this as it's pretty in the weeds. I worry about some of these components (specifically the RingBufferWriter) because it feels like the kind of thing that could have subtle bugs that are really tricky to debug and test. But I'm assuming we're in uncharted territory here and we're gonna need to write some of this low level code ourselves.

@chrisdickinson chrisdickinson merged commit e48cec9 into main Jan 23, 2024
4 checks passed
@chrisdickinson chrisdickinson deleted the chris/20240108-what-is-a-semaphore-a-miserable-pile-of-atomics branch January 23, 2024 19:58
Successfully merging this pull request may close these issues.

Race condition in background thread