Run more tests in parallel #35

Closed
raphlinus opened this issue Oct 30, 2021 · 6 comments

@raphlinus

I've been doing some of my own GPU testing, and have been able to trigger failures on a test basically identical to "message passing" on my AMD 5700 XT; my code is at: googlefonts/compute-shader-101#11

Alan Baker pointed me at your repo, and I am very happy to see systematic litmus testing. I was curious whether your litmus tests would also trigger similar failures. They do not (with barriers in place; I can see failures when running purely relaxed). I've been trying to bisect exactly what causes the difference, and I think it's parallelism: I was running a huge number of tests in one dispatch, whereas you're basically running a single litmus test.

I think there's a lot of value to running more tests in parallel, both to catch things which might otherwise slip through, and also to provide higher testing throughput. It's a nontrivial change to the code though. I'm filing this issue partly to ask whether you're interested in pursuing it, or open to such changes.
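To sketch what I mean by running many tests in one dispatch (this is just an illustration of the pattern, not my actual shader; the binding names and result encoding are made up): each workgroup runs its own independent copy of the message-passing litmus, so a single dispatch exercises many instances at once.

```wgsl
// Sketch: one message-passing instance per workgroup, so a dispatch of N
// workgroups runs N litmus instances in parallel. Names are illustrative.
@group(0) @binding(0) var<storage, read_write> data: array<u32>;
@group(0) @binding(1) var<storage, read_write> flag: array<atomic<u32>>;
@group(0) @binding(2) var<storage, read_write> weak: array<u32>;

@compute @workgroup_size(2)
fn main(@builtin(workgroup_id) wid: vec3<u32>,
        @builtin(local_invocation_id) lid: vec3<u32>) {
  let i = wid.x;                      // which litmus instance this workgroup runs
  var r0 = 0u;
  if (lid.x == 0u) {
    data[i] = 1u;                     // writer: non-atomic store to data
  } else {
    r0 = atomicLoad(&flag[i]);        // reader: atomic load of the flag
  }
  storageBarrier();                   // both invocations hit this barrier
  if (lid.x == 0u) {
    atomicStore(&flag[i], 1u);        // writer: publish the flag
  } else {
    let r1 = data[i];                 // reader: unconditional (racy) load
    // r0 == 1 together with r1 == 0 is the weak message-passing outcome
    weak[i] = select(0u, 1u, r0 == 1u && r1 == 0u);
  }
}
```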

@reeselevine
Owner

Hey @raphlinus, thanks for reaching out! Exciting to see other people interested in the minutiae of WebGPU's memory model :).

I took a look at your example; the parallelism aspect is very interesting and definitely worth exploring. I'd be curious to see whether we could combine the stressing we do in our tests with the parallelism from your tests, to get higher overall throughput and maybe reveal some more interesting behaviors. So yes, I'm open to these types of changes!

There's one other difference I found in our tests: our non-atomic variant of message passing (from the dropdown menu at https://users.soe.ucsc.edu/~reeselevine/webgpu/tests/message-passing/) includes an if condition before the final non-atomic read. This is because if the two barriers don't end up synchronizing the threads, there would otherwise be a data race between the read of the "data" variable and the corresponding write. However, it's possible that this conditional is affecting our results.
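Roughly, the reader side of that guarded variant looks like this (a sketch of the pattern, not the exact shader source; names are illustrative):

```wgsl
// Sketch of the guarded reader: the non-atomic read of data only happens when
// the flag was observed set, so there is no unconditional data race.
@group(0) @binding(0) var<storage, read_write> data: u32;
@group(0) @binding(1) var<storage, read_write> flag: atomic<u32>;

fn guarded_reader() -> u32 {
  let r0 = atomicLoad(&flag);
  storageBarrier();
  var r1 = 0u;
  if (r0 > 0u) {      // the conditional before the final non-atomic read
    r1 = data;
  }
  return r0 * 2u + r1; // encode the observed outcome
}
```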

I went ahead and wrote a non-atomic variant that drops the conditional (#37), and you should see an option (nonatomic1) on the website that will run this variant. I'm curious whether you see any violations with that one. If the basic stress or interleaving parameter settings don't reveal anything and you have a little time, you could try going to https://users.soe.ucsc.edu/~reeselevine/webgpu/tuning/ and running some different configurations with the two non-atomic variants of message passing selected (maybe 20-30 configurations at 1000 iterations each).

@reeselevine
Owner

@tyler-utah also has an AMD GPU (5500), so we're going to see if he can get any violations as well.

@raphlinus
Author

Well, that's very interesting indeed. Adding an if (flag > 0u) guard around the data load makes a huge difference: basically all weak behavior goes away (and that's true even if I elide the storage barriers, in which case weak behavior would be permitted). To summarize my measurements: in your version of the test, the only time I see weak behavior is when the barriers are off and there is no if guard. (I'm running your test code both in Chrome Canary and in my own Rust wgpu code, btw.)

If my understanding is correct, in this non-atomic version the unconditional read is a race, which I think makes my test invalid (it would be UB in C), but I'm not 100% sure of the implications. I've also tested an atomic version where both data and flag are atomic<u32>, and the results are basically the same as in the non-atomic case: the if guard makes all weak behavior go away (barrier or no), while the unconditional load shows a lot of weak behavior (again, barrier or no).
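To spell out the all-atomic configuration I mean (a sketch with made-up names, not my exact test code):

```wgsl
// Sketch of the all-atomic configuration: both data and flag are atomic<u32>,
// barriers are in place, and the final load is unconditional. Per the
// discussion above, observing r0 == 1 with r1 == 0 here would be a violation
// of the memory model, since every access is atomic and there is no data race.
@group(0) @binding(0) var<storage, read_write> data: atomic<u32>;
@group(0) @binding(1) var<storage, read_write> flag: atomic<u32>;

fn writer() {
  atomicStore(&data, 1u);
  storageBarrier();
  atomicStore(&flag, 1u);
}

fn reader() -> vec2<u32> {
  let r0 = atomicLoad(&flag);
  storageBarrier();
  let r1 = atomicLoad(&data);   // unconditional, but atomic, so not a race
  return vec2<u32>(r0, r1);
}
```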

Unless I'm missing something (very likely, this stuff is confusing), I believe that the configuration of parallel tests, atomic data and flag, barriers, and an unconditional load is a real live violation of the memory model, and as such is likely worth tracking down.

There are still some mysteries. If you look at my draft prefix sum implementation, you'll see that there the data read (which happens to be atomic, though I don't think that's very important) is protected by an if guard based on the flag read. I believe this code is correct, though of course it's much more complicated, so there's a lot more that can go wrong. In particular, there's looping control flow around the flag read, though that should at least be workgroup-uniform.
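The guarded pattern in that code is roughly the following (a hypothetical sketch, not the actual prefix sum implementation; part_flag and part_data are made-up names):

```wgsl
// Hypothetical sketch of the guarded, loop-based flag read: spin on the flag
// for partition j, and only read the published value once the flag is set.
@group(0) @binding(0) var<storage, read_write> part_flag: array<atomic<u32>>;
@group(0) @binding(1) var<storage, read_write> part_data: array<atomic<u32>>;

fn wait_and_read(j: u32) -> u32 {
  var result = 0u;
  loop {
    let f = atomicLoad(&part_flag[j]);     // looping control flow around the flag read
    if (f != 0u) {
      result = atomicLoad(&part_data[j]);  // data read guarded by the flag read
      break;
    }
  }
  return result;
}
```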

I have spent some time experimenting with your version of the test, including fiddling with the stress parameters a bit, but have not (yet) found a configuration that I read as a violation of the memory model.

@raphlinus
Author

I think the change to your page is wrong, btw. It doesn't affect my results, because I mostly ran the tests in Rust, but I also patched the web version locally:

```diff
diff --git a/pages/tests/message-passing.js b/pages/tests/message-passing.js
index 5477701..ab885cf 100644
--- a/pages/tests/message-passing.js
+++ b/pages/tests/message-passing.js
@@ -54,7 +54,7 @@ const variants = {
 0.3: atomicStore(y, 1)`, `1.1: let r0 = atomicLoad(y)
 1.2: storageBarrier()
 1.4: let r1 = *x`]),
-    shader: barrierMessagePassingNA
+    shader: barrierMessagePassingNARacy
   }

 }
```

@reeselevine
Owner

Ah, yes that is a bug, thanks for catching that.

I agree: if you can still see violations with all the accesses made atomic, that would definitely be a memory model bug. I think the non-atomic variant would end up being undefined, but since the atomic variant is also showing violations, I don't think that ends up being too important.

I'd like to spend some time seeing if we can reproduce your memory model bug in the browser framework; I might try writing a parallel shader like yours and see if that works. It would be great to incorporate the parallelism into the stressing framework.

@reeselevine
Owner

After lots of conversations and work, we now have parallel tests! #40 has the changes, but it's probably easier to just check out how they work on the website: https://gpuharbor.ucsc.edu/.
