MIGRAPHX_TRACE_EVAL=1 with DLM test_parity_gelu changes behavior. #1877
Performing a git bisect to determine the point where this stops passing. Originally thought this was an error with #1835, but that change is not in the target release/rocm-rel-5.6.0 branch.
Spent the last few days rebuilding and rolling back containers and getting continuously nonsensical answers when running things. Got into the weeds with compile options, thinking maybe I needed to twiddle the config for hiprtc/MLIR. That wasn't the case. Took a step back after watching a football game (Yay Ticats), and although the code between the 5.5 and 5.6 ROCm releases is identical, there must be some underlying part of the code that was broken then too, and we're likely running into an odd race condition here. Realized that we have the following parts in the trace path of things:
but on the non-tracing side we just run through generic_eval().
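For context, here's a rough illustration of that asymmetry (my own sketch, not the actual MIGraphX source; `traced_eval`, `sync_and_print`, and `instruction` are made-up names): the traced path reads each result back, which forces a sync per instruction, while the non-traced path just dispatches everything back to back.

```cpp
// Hypothetical sketch of the asymmetry described above (not the actual
// MIGraphX source): with tracing, each instruction's result is read back,
// which implies synchronization; without tracing, work is simply enqueued.
#include <functional>
#include <vector>

struct instruction { std::function<void()> run; };

void traced_eval(const std::vector<instruction>& prog,
                 const std::function<void()>& sync_and_print) {
    for (const auto& ins : prog) {
        ins.run();
        sync_and_print(); // reading the result back implies a device sync
    }
}

void generic_eval(const std::vector<instruction>& prog) {
    for (const auto& ins : prog)
        ins.run(); // no per-instruction synchronization
}
```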
Originally this was done because it was assumed that, since we're tracing and pulling data back from the card, we'll need to read the results, thus requiring the synchronization. I've instead added the following and had success with the gelu test.
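The exact snippet isn't reproduced above, so as a hedged guess at the shape of the change, it would look something like forcing the stream to drain before results are read back on the host:

```cpp
// Hypothetical sketch of the kind of change described (the actual patch is
// not shown in the comment above): block the host until all work queued on
// the stream has completed before results are read back.
#include <hip/hip_runtime.h>
#include <cstdio>

void finish_stream(hipStream_t stream) {
    if (hipStreamSynchronize(stream) != hipSuccess)
        std::fprintf(stderr, "hipStreamSynchronize failed\n");
}
```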
I think I'll need to remove this in the case where stream sync is enabled, so we're not always performing a stream.wait() underneath async calls and thus blocking, but I think we still require synchronization here in the non-async cases, and it's potentially not implicit when calling through onnxruntime from what I've seen. I'm retesting this in the UIF container before I update the ticket in the morning, but I've gotten this to work with the default build options and am seeing if I can repeat the success with the change before I refine something tomorrow morning for review.
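A hypothetical illustration of that conditional-sync idea (the `async_enabled` flag is a made-up stand-in for however the stream-sync option is actually exposed):

```cpp
// Hypothetical illustration of the idea above: only issue the blocking wait
// when the caller is not driving the streams asynchronously.
#include <hip/hip_runtime.h>

void maybe_sync(hipStream_t stream, bool async_enabled) {
    if (!async_enabled) {
        // Synchronous callers still need the results ready before the host
        // reads them.
        (void)hipStreamSynchronize(stream);
    }
    // Async callers are expected to wait on the stream/event themselves.
}
```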
Seeing the same behavior mirrored on the remote system using the UIF container build as well. It appears we're unable to properly use async streams right now on Navi21 without this throwing an error. Adding an async check around this yields the same behavior, since the ORT driver is running async.
Ripped out the syncs in the trace path just to see what would happen. I can consistently fail with trace=1 and 0 now, but I'm able to get passes with trace=2. Interestingly enough, I see a hipDeviceSynchronize instead of a hipStreamSynchronize, which precedes a memcpy before the first launched kernel finishes.
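For reference, my own summary of the difference in scope between those two calls (not taken from the trace itself):

```cpp
// Reference snippet: hipDeviceSynchronize blocks the host until *all* streams
// on the device are idle, while hipStreamSynchronize only waits for the work
// queued on one stream.
#include <hip/hip_runtime.h>

void sync_examples(hipStream_t stream) {
    (void)hipStreamSynchronize(stream); // waits for this stream only
    (void)hipDeviceSynchronize();       // waits for every stream on the device
}
```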
I'm assuming this insertion of the hipDeviceSynchronize is because of the copy_from instruction in the trace section of the code. Regardless, it looks like the other two kernels are happening after a hipMemcpy occurs. Taking a look at MI250, I see the following output, which also has the execution of the kernel before the stream wait, and then the first hipMemcpy occurs.
ping @pfultz2 Shouldn't we be waiting on the first event before this memcpy occurs here?
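For what it's worth, the ordering I'd expect here, sketched in plain HIP (not the actual ORT/MIGraphX code), is to record an event after the producing work and make the copy's stream wait on it before the device-to-host copy:

```cpp
// Generic HIP sketch (not the actual ORT/MIGraphX code) of making a
// device-to-host copy wait for work that ran on another stream.
#include <hip/hip_runtime.h>

void ordered_copy(hipStream_t compute, hipStream_t copy,
                  void* dst, const void* src, size_t bytes) {
    hipEvent_t done;
    (void)hipEventCreate(&done);
    (void)hipEventRecord(done, compute);     // mark completion of the producer
    (void)hipStreamWaitEvent(copy, done, 0); // copy stream waits for it
    (void)hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToHost, copy);
    (void)hipEventDestroy(done);
}
```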
Looking at more HIP command output between the MI250 and Navi using ltrace. Seeing variations in who is making which HIP calls: on Navi it's libc10 and torch, whereas on the MI250 only libtorch is being used. It seems like the async calls are only on the MI250 and are called by libtorch. Got curious with the test itself, and I decided to up the data size in the test on the Navi21 system for the hidden layer (2, 5, 10, 20, 100x) and found that with a 100x larger hidden layer, Navi21 passes this parity test compared to the case that fails. Grabbing a capture of this and going to use it to compare to the failing case along with the ltrace calls. This is looking like a data race condition now, since I can change the outcome based on how much data is used.
It appears the failing case is missing a hipMalloc before the hipMemcpy that precedes some initial kernels, ahead of a set of hipMemcpyWithStream and hipStreamCreateWithFlags calls. Not sure if this is related to data output or not. Will post ltraces once I've finished capturing them. Above is the first discrepancy between the cases.
It appears I'm correct in thinking this is a data race. In the failing case with the stock hidden size (768), we attempt to perform a copyDeviceToHost() before the kernel has completed with the result. This isn't the case if we up the workload, as it appears the hipMemcpy calls stall out the run, so the kernel completion and the copy happen later. What's interesting is why we're not seeing the hipMalloc on the stock-size run. It appears that when I inflate the hidden layer in this gelu test, there's a 4 kB malloc at the beginning of the run. Need to look more into the ltrace from this, although it appears that when using ltrace the failing run now "passes", most likely due to the added delay from tracking the HIP calls.
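A minimal toy reproduction of that failure mode in plain HIP (not the gelu test itself) would look like this, where the D2H copy is queued on a second stream with no dependency on the kernel, so small workloads can read back stale data while large ones happen to finish in time:

```cpp
// Minimal sketch of the race described above (toy kernel, not the gelu test):
// the D2H copy is issued on a second stream with no dependency on the kernel,
// so it can read the buffer before the kernel has written it.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void fill(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    (void)hipMalloc(&d, n * sizeof(float));
    (void)hipMemset(d, 0, n * sizeof(float));

    hipStream_t compute, copy;
    (void)hipStreamCreate(&compute);
    (void)hipStreamCreate(&copy);

    std::vector<float> h(n);
    hipLaunchKernelGGL(fill, dim3((n + 255) / 256), dim3(256), 0, compute, d, n);
    // Racy: no hipStreamWaitEvent / hipStreamSynchronize between kernel and copy.
    (void)hipMemcpyAsync(h.data(), d, n * sizeof(float), hipMemcpyDeviceToHost, copy);
    (void)hipStreamSynchronize(copy);

    std::printf("h[0] = %f (may be 0 if the copy raced the kernel)\n", h[0]);
    (void)hipFree(d);
    return 0;
}
```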
Varied the size of the MI250 hidden layer to the point where I've gotten it down to 1 hidden layer. I don't see any large change in how the data is handled based on the variable size. Stream waits and syncs seem to work as intended here. I can't get the MI250 to fail like the Navi21 system by reducing the data size. Zipping since the full run is 25 MB.
Looking like this is not a MIGraphX issue. We're doing everything as intended, it seems.
Found the fix upstream in ORT, related to how we perform CopyTensorAsync() in onnxruntime. Changed the hipMemcpy() to a hipMemcpyAsync() while syncing to the correct event and stream (rough sketch of the change below). Verified for both SWDEV-408437 and SWDEV-408438. Moving this to the review state, but this is entirely out of context for MIGraphX now.
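I don't have the exact diff inline here; roughly, the shape of the change is something like the following sketch (assumed signature and names, not the actual onnxruntime CopyTensorAsync code):

```cpp
// Rough sketch of the kind of change described (assumed signature, not the
// actual onnxruntime CopyTensorAsync implementation): replace the blocking
// hipMemcpy with hipMemcpyAsync on the provider's stream, ordered after the
// producing work via an event.
#include <hip/hip_runtime.h>

hipError_t copy_tensor_async(void* dst, const void* src, size_t bytes,
                             hipStream_t stream, hipEvent_t ready) {
    hipError_t err = hipStreamWaitEvent(stream, ready, 0); // wait for producer
    if (err != hipSuccess) return err;
    // Was: hipMemcpy(dst, src, bytes, hipMemcpyDeviceToHost);
    return hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToHost, stream);
}
```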
The PR is now in ORT and will be part of this week's update.
Ref: SWDEV-401749
The DLM change failures were related to issues with Python package updates for the newer UIF container.
Currently the last test not working is test_parity_gelu, found in onnxruntime and invoked via the DLM tag migx_onnxrt_gelu_benchmarks.
The test result seems to pass/fail depending on whether the MIGRAPHX_TRACE_EVAL=1 flag is set on Navi21.
Currently I have debugged this in DLM and manually built everything from the target branch release/rocm-rel-5.6.0, and am able to reproduce.