Run more tests in parallel #35

Closed
raphlinus opened this issue Oct 30, 2021 · 6 comments

@raphlinus

I've been doing some of my own GPU testing, and have been able to trigger failures on a test basically identical to "message passing" on my AMD 5700 XT; my code is at: googlefonts/compute-shader-101#11

Alan Baker pointed me at your repo, and I am very happy to see systematic litmus testing. I was curious whether your litmus tests would also trigger similar failures. They do not (with barriers in place; I can see failures when running purely relaxed). I've been trying to bisect exactly what causes the difference, and I think it's parallelism: I was running a huge number of tests in one dispatch, whereas you're basically running a single litmus test.

I think there's a lot of value to running more tests in parallel, both to catch things which might otherwise slip through, and also to provide higher testing throughput. It's a nontrivial change to the code though. I'm filing this issue partly to ask whether you're interested in pursuing it, or open to such changes.
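To sketch what I mean by running many tests in one dispatch (this is just an illustration of the pattern, not my actual shader; the binding names and result encoding are made up): each workgroup runs its own independent copy of the message-passing litmus, so a single dispatch exercises many instances at once.

```wgsl
// Sketch: one message-passing instance per workgroup, so a dispatch of N
// workgroups runs N litmus instances in parallel. Names are illustrative.
@group(0) @binding(0) var<storage, read_write> data: array<u32>;
@group(0) @binding(1) var<storage, read_write> flag: array<atomic<u32>>;
@group(0) @binding(2) var<storage, read_write> weak: array<u32>;

@compute @workgroup_size(2)
fn main(@builtin(workgroup_id) wid: vec3<u32>,
        @builtin(local_invocation_id) lid: vec3<u32>) {
  let i = wid.x;                      // which litmus instance this workgroup runs
  var r0 = 0u;
  if (lid.x == 0u) {
    data[i] = 1u;                     // writer: non-atomic store to data
  } else {
    r0 = atomicLoad(&flag[i]);        // reader: atomic load of the flag
  }
  storageBarrier();                   // both invocations hit this barrier
  if (lid.x == 0u) {
    atomicStore(&flag[i], 1u);        // writer: publish the flag
  } else {
    let r1 = data[i];                 // reader: unconditional (racy) load
    // r0 == 1 together with r1 == 0 is the weak message-passing outcome
    weak[i] = select(0u, 1u, r0 == 1u && r1 == 0u);
  }
}
```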

@reeselevine
Owner

Hey @raphlinus, thanks for reaching out! Exciting to see other people interested in the minutiae of WebGPU's memory model :).

I took a look at your example; the parallelism aspect is very interesting and definitely worth exploring. I'd be curious to see whether we could combine the stressing we do in our tests with the parallelism from your tests, to get higher overall throughput and maybe reveal some more interesting behaviors. So yes, I'm open to these types of changes!

There's one other difference I found in our tests: our non-atomic variant of message passing (from the dropdown menu at https://users.soe.ucsc.edu/~reeselevine/webgpu/tests/message-passing/) includes an if condition before the final non-atomic read. This is because if the two barriers don't end up synchronizing the threads, there would otherwise be a data race between the read of the "data" variable and the corresponding write. However, it's possible that this conditional is affecting our results.
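Roughly, the reader side of that guarded variant looks like this (a sketch of the pattern, not the exact shader source; names are illustrative):

```wgsl
// Sketch of the guarded reader: the non-atomic read of data only happens when
// the flag was observed set, so there is no unconditional data race.
@group(0) @binding(0) var<storage, read_write> data: u32;
@group(0) @binding(1) var<storage, read_write> flag: atomic<u32>;

fn guarded_reader() -> u32 {
  let r0 = atomicLoad(&flag);
  storageBarrier();
  var r1 = 0u;
  if (r0 > 0u) {      // the conditional before the final non-atomic read
    r1 = data;
  }
  return r0 * 2u + r1; // encode the observed outcome
}
```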

I went ahead and wrote a non-atomic variant that drops the conditional (#37), and you should see an option (nonatomic1) on the website that will run this variant. I'm curious whether you see any violations with that one. If the basic stress or interleaving parameter settings don't reveal anything and you have a little time, you could try going to https://users.soe.ucsc.edu/~reeselevine/webgpu/tuning/ and running some different configurations with the two non-atomic variants of message passing selected (maybe 20-30 configurations at 1000 iterations each).

@reeselevine
Owner

@tyler-utah also has an AMD GPU (5500), so we're going to see if he can get any violations as well.

@raphlinus
Author

Well, that's very interesting indeed. Adding an if (flag > 0u) guard around the data load makes a huge difference: basically all weak behavior goes away (and that's true even if I elide the storage barriers, in which case weak behavior would be permitted). To summarize my measurements: in your version of the test, the only time I see weak behavior is when the barriers are off and there is no if guard. (I'm running your test code both in Chrome Canary and in my own Rust wgpu code, btw.)

If my understanding is correct, in this non-atomic version the unconditional read is a race, which I think makes my test invalid (it would be UB in C), but I'm not 100% sure of the implications. I've also tested an atomic version where both data and flag are atomic<u32>, and the results are basically the same as in the non-atomic case: the if guard makes all weak behavior go away (barrier or no), while the unconditional load shows a lot of weak behavior (again, barrier or no).
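To spell out the all-atomic configuration I mean (a sketch with made-up names, not my exact test code):

```wgsl
// Sketch of the all-atomic configuration: both data and flag are atomic<u32>,
// barriers are in place, and the final load is unconditional. Per the
// discussion above, observing r0 == 1 with r1 == 0 here would be a violation
// of the memory model, since every access is atomic and there is no data race.
@group(0) @binding(0) var<storage, read_write> data: atomic<u32>;
@group(0) @binding(1) var<storage, read_write> flag: atomic<u32>;

fn writer() {
  atomicStore(&data, 1u);
  storageBarrier();
  atomicStore(&flag, 1u);
}

fn reader() -> vec2<u32> {
  let r0 = atomicLoad(&flag);
  storageBarrier();
  let r1 = atomicLoad(&data);   // unconditional, but atomic, so not a race
  return vec2<u32>(r0, r1);
}
```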

Unless I'm missing something (very likely, this stuff is confusing), I believe that the configuration of parallel tests, atomic data and flag, barriers, and an unconditional load is a real live violation of the memory model, and as such is likely worth tracking down.

There are still some mysteries. If you look at my draft prefix sum implementation, you'll see that there the data read (which happens to be atomic, though I don't think that's very important) is protected by an if guard based on the flag read. I believe this code is correct, though of course it's much more complicated, so there's a lot more that can go wrong. In particular, there's looping control flow around the flag read, though that should at least be workgroup-uniform.
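The guarded pattern in that code is roughly the following (a hypothetical sketch, not the actual prefix sum implementation; part_flag and part_data are made-up names):

```wgsl
// Hypothetical sketch of the guarded, loop-based flag read: spin on the flag
// for partition j, and only read the published value once the flag is set.
@group(0) @binding(0) var<storage, read_write> part_flag: array<atomic<u32>>;
@group(0) @binding(1) var<storage, read_write> part_data: array<atomic<u32>>;

fn wait_and_read(j: u32) -> u32 {
  var result = 0u;
  loop {
    let f = atomicLoad(&part_flag[j]);     // looping control flow around the flag read
    if (f != 0u) {
      result = atomicLoad(&part_data[j]);  // data read guarded by the flag read
      break;
    }
  }
  return result;
}
```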

I have spent some time experimenting with your version of the test, including fiddling with the stress parameters a bit, but have not (yet) found a configuration that I read as a violation of the memory model.

@raphlinus
Author

I think the change to your page is wrong, btw. It doesn't affect my results, because I mostly ran the tests in Rust, but I also patched the web version locally:

```diff
diff --git a/pages/tests/message-passing.js b/pages/tests/message-passing.js
index 5477701..ab885cf 100644
--- a/pages/tests/message-passing.js
+++ b/pages/tests/message-passing.js
@@ -54,7 +54,7 @@ const variants = {
 0.3: atomicStore(y, 1)`, `1.1: let r0 = atomicLoad(y)
 1.2: storageBarrier()
 1.4: let r1 = *x`]),
-    shader: barrierMessagePassingNA
+    shader: barrierMessagePassingNARacy
   }

 }
```

@reeselevine
Owner

Ah, yes that is a bug, thanks for catching that.

I agree: if you can still see violations with all the accesses made atomic, that would definitely be a memory model bug. I think the non-atomic variant would end up being undefined, but since the atomic variant is also showing violations, I don't think that ends up being too important.

I'd like to spend some time seeing if we can reproduce your memory model bug in the browser framework; I might try writing a parallel shader like yours and see if that works. It would be great to incorporate the parallelism into the stressing framework.

@reeselevine
Owner

After lots of conversations and work, we now have parallel tests! #40 has the changes, but it's probably easier to just check out how they work on the website: https://gpuharbor.ucsc.edu/.
