[Bug] MIGraphX EP seeing HipMemcpy via onnxruntime::GPUDataTransfer::CopyTensor that breaks multi-stream execution #16774
Comments
Ping @PeixuanZuo @cloudhan @ytaous. Let me know if you know the best person to debug this.
Additional debug output from enabling verbose logging via --log_verbose for the parity tests.
Is there a way for us to add a sync between kernels or wait on stream completion? It appears we're not performing the sync before each run. What is odd is that we don't observe this behavior on our MI250 card at all.
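For reference, a minimal sketch (not ORT code; `launch_test_kernel` is a hypothetical stand-in for whatever the parity test launches) of the kind of per-run stream synchronization being asked about here:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical stand-in for the kernel launch done by each test run.
void launch_test_kernel(hipStream_t /*stream*/) { /* enqueue work on the stream */ }

void run_with_per_run_sync(hipStream_t stream, int num_runs) {
  for (int i = 0; i < num_runs; ++i) {
    launch_test_kernel(stream);
    // Block the host until everything queued on this stream has finished,
    // so the next run (and any copies) cannot overlap with this one.
    hipError_t err = hipStreamSynchronize(stream);
    if (err != hipSuccess) {
      // Surface the failure, e.g. via hipGetErrorString(err).
    }
  }
}
```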
I am not quite sure how far back you have tried to trace and failed to find a stable point. If possible, could you please try 13495 to see if the stream problem is present right before that commit?
Sure. Currently tried: f4cd35f. Rolled back to my original commit where the MIGraphX stream-related things were created (October 2022), but there are no changes for the no_optimize flags in the parity tests.
@TedThemistokleous Is this still a problem after your PR? |
Describe the issue
Currently running through a set of parity tests found in /onnxruntime/onnxruntime/test/python/transformers/, primarily test_parity_gelu.py and test_parity_layernorm.py.
We're experiencing out-of-order memcopies that seem to occur during kernel execution on our Navi21 card.
Here's an example output when we use ROCm tracing tools to view the sequence of events (captured with rocprof and then viewed with perfetto/chrome://tracing):
I'm able to trigger this case consistently and have cut down the GELU test to only perform 2 test runs per kernel, which always fails on the second. I found that when we run only 1 test, this out-of-order error never happens.
I've also noticed that if I increase the hidden layer size in the test_parity_gelu.py test, I can reach a point (around 100x the hidden layer size) where the tests always pass and we don't get an overlap.
I've cut down test_parity_gelu.py on a separate branch of my ORT fork off mainline: https://github.com/TedThemistokleous/onnxruntime/tree/debug_parity_tests.
The behavior goes away entirely if we add a sync between every single kernel run, thus undoing multi-stream execution.
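For contrast, a minimal sketch (an assumption about an alternative, not ORT's actual implementation) of expressing the same dependency with HIP events, which keeps multi-stream execution while still ordering a copy behind the kernels that produce its source:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical helper: make the copy on copy_stream wait for work already
// enqueued on compute_stream, without synchronizing the whole device.
void copy_after_compute(hipStream_t compute_stream, hipStream_t copy_stream,
                        void* dst, const void* src, size_t bytes) {
  hipEvent_t done;
  hipEventCreateWithFlags(&done, hipEventDisableTiming);

  // ... kernels enqueued on compute_stream ...
  hipEventRecord(done, compute_stream);      // mark the end of the compute work
  hipStreamWaitEvent(copy_stream, done, 0);  // copy_stream waits for that point
  hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToDevice, copy_stream);

  hipEventDestroy(done);
}
```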
The reason I'm bringing this up to ONNX Runtime is that after a few weeks of debugging this (configuration, previous builds, etc.) I've been unable to find a working stable point using the Navi21 card (gfx1030).
From a recent GDB stack trace with the test, I've found the following around the hipMemcpy that's being called via onnxruntime::GPUDataTransfer::CopyTensor. Here's the stack trace I mentioned.
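For context, a minimal sketch (an illustrative assumption, not the actual GPUDataTransfer::CopyTensor code) of the distinction under investigation: a plain hipMemcpy is issued on the default stream with no explicit ordering against kernels queued on other streams, while hipMemcpyAsync enqueued on the producing stream stays ordered behind those kernels:

```cpp
#include <hip/hip_runtime.h>

// Copy issued on the default stream: no explicit ordering against kernels
// still running on other streams, which can show up as an "out of order"
// copy in the trace.
void copy_default_stream(void* dst, const void* src, size_t bytes) {
  hipMemcpy(dst, src, bytes, hipMemcpyDeviceToDevice);
}

// Copy enqueued on the stream that produced `src`: the copy waits for the
// kernels already queued on that stream before it starts.
void copy_stream_ordered(void* dst, const void* src, size_t bytes,
                         hipStream_t producing_stream) {
  hipMemcpyAsync(dst, src, bytes, hipMemcpyDeviceToDevice, producing_stream);
}
```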
Urgency
Urgent. Blocking builds of ROCm
Target platform
Navi21
Build script
Error / output
Tests fail due to accuracy errors for test_parity_gelu.py and test_parity_layernorm.py
For layernorm
Visual Studio Version
No response
GCC / Compiler Version
No response