Apparent data race in onnxruntime on aarch64 #32899
A new Issue was created by @dan131riley Dan Riley. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core, reconstruction FYI @riga @mialiu149 |
New categories assigned: core,reconstruction @Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Should we inform ONNXRuntime developers? |
It's perhaps worth trying with a newer onnxruntime?
|
We can start upgrading ONNXRuntime from 1.3.0 to 1.6.0 next week and see if the error persists. Also, depending on whether the current onnx models in cmssw are compatible, this could have a positive effect on inference performance (see e.g. #32883). |
With the updated ONNXRuntime, aarch64 is getting assertion failures and exceptions with accompanying stack traces, or sometimes just a segmentation fault. In these there are frequently multiple threads raising the same exception. |
I have a theory about the onnxruntime crashes, which I'm going to throw out here for comments while I try to reproduce the problem (as the first step to implementing a fix). Many of the crashes and assertion failures point back to IExecutionFrame::ReleaseMLValueImpl.
As I understand the use of onnxruntime (which is not very well), all the threads are sharing an execution context (not sure if that's the right term), so values held by that shared context can be released from more than one thread. And now the critical point: that release path does not look synchronized. So my guess is that simultaneous, unsynchronized access during release is corrupting the shared state. Plausible? |
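To make the hypothesis concrete, here is a minimal sketch of the kind of unsynchronized release being described, assuming a frame shared across threads whose values can be dropped concurrently. ExecutionFrame and ReleaseValue are simplified stand-ins for illustration, not onnxruntime's actual classes:

```cpp
// Minimal stand-in for the hypothesized race: several threads share one
// "execution frame" and release values from it without synchronization.
// ExecutionFrame/ReleaseValue are illustrative names, not onnxruntime's API.
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

struct ExecutionFrame {
  std::vector<std::shared_ptr<int>> values;  // state shared across threads

  // Mirrors the shape of ReleaseMLValueImpl: drop the value at an index.
  // With no lock, two threads can touch the same shared_ptr object at once,
  // which is a data race (shared_ptr is only thread-safe across *distinct*
  // copies, not for concurrent writes to the same object).
  void ReleaseValue(std::size_t idx) { values[idx].reset(); }
};

int main() {
  ExecutionFrame frame;
  for (int i = 0; i < 1024; ++i)
    frame.values.push_back(std::make_shared<int>(i));

  // Two threads releasing overlapping ranges: a deliberate race.
  std::thread t1([&] {
    for (std::size_t i = 0; i < frame.values.size(); ++i) frame.ReleaseValue(i);
  });
  std::thread t2([&] {
    for (std::size_t i = 0; i < frame.values.size(); ++i) frame.ReleaseValue(i);
  });
  t1.join();
  t2.join();
}
```

Built with -fsanitize=thread, a program like this reports a data race on the shared entries, which is the signature this theory predicts.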
It looks like I was correct that the problem is at least partially simultaneous access to the |
It looks like cms-externals/onnxruntime#6 helps, but I'm still seeing assertion failures, stack trace below. This stack trace should be impossible: there are three threads inside the same locked section, and I have other stack traces from before the shared pointer fixes where the threads are clearly obeying the lock, so there's some code path where the lock is getting ignored or released early. Stack trace:
|
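One way to surface that kind of violation directly is a diagnostic wrapper that counts how many threads are inside the critical section: if the underlying lock misbehaves, the count exceeds one and an assertion fires at the point of failure rather than later. This DebugMutex is a hypothetical illustration, not code from the issue or the patches:

```cpp
// Hypothetical diagnostic: a mutex wrapper that asserts single ownership.
// If the underlying lock implementation is broken, two threads can pass
// lock() simultaneously and the holder count exceeds one.
#include <atomic>
#include <cassert>
#include <mutex>

class DebugMutex {
  std::mutex m_;
  std::atomic<int> holders_{0};

 public:
  void lock() {
    m_.lock();
    // With a correct mutex the previous holder count is always zero here.
    assert(holders_.fetch_add(1) == 0 && "two threads inside the lock");
  }
  void unlock() {
    assert(holders_.fetch_sub(1) == 1 && "unlock without matching lock");
    m_.unlock();
  }
};

int main() {
  DebugMutex mu;
  std::lock_guard<DebugMutex> hold(mu);  // asserts fire only if the lock is broken
}
```

Used as a drop-in via std::lock_guard, this would turn an "impossible" stack trace into an immediate assertion in the offending code path.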
New theory: onnxruntime uses the Google nsync library for locks, not the C++ standard library implementations. For arm64, nsync has an assembler implementation of its synchronization primitives, and the ARM synchronization primitives leave a lot of behavior implementation-defined. The Cavium ThunderX cores in the techlab systems are not based on any standard design (e.g., Cortex) and are known to have made some unconventional choices wrt implementation-defined behavior. So my current guess is that the nsync primitives are relying on common choices for implementation-defined behavior, and thus aren't reliable on the ThunderX. This could explain the intermittent lock failures with nominally impossible stack traces, and also why the crashes don't replicate at all on my Apple Silicon M1. Currently testing with a patch which replaces the nsync implementations with the ones from the standard library. |
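As a sketch of what that substitution looks like, assuming the locking goes through wrapper types like onnxruntime's OrtMutex/OrtCondVar (the exact aliasing here is an assumption for illustration, not the contents of the patch):

```cpp
// Illustrative sketch of backing the project's mutex/condition-variable
// wrappers with the C++ standard library instead of nsync's hand-written
// arm64 primitives. Not the actual contents of cms-externals/onnxruntime#7.
#include <condition_variable>
#include <mutex>

// Where the nsync-backed build would wrap nsync::nsync_mu and
// nsync::nsync_cv, alias the standard-library types instead. These are
// implemented on top of the platform's pthreads, avoiding any reliance on
// implementation-defined ARM behavior in hand-rolled assembler.
using OrtMutex = std::mutex;
using OrtCondVar = std::condition_variable;

// Callers that already use standard lock idioms are unchanged:
void wait_until_ready(OrtMutex& m, OrtCondVar& cv, bool& ready) {
  std::unique_lock<OrtMutex> lk(m);
  cv.wait(lk, [&] { return ready; });  // releases m while blocked
}
```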
cms-externals/onnxruntime#7 should fix the underlying problem that the Google nsync library is not reliable on the Cavium ThunderX techlab systems. cms-externals/onnxruntime#6 only addressed the symptoms, which fortunately narrowed the problem enough to make it obvious that the underlying synchronization library was what was broken. |
The new theory seems to be correct (🤞). |
I understood it was @dan131riley's plan to make a PR (or issue) to upstream ONNXRuntime. In the meantime we can probably close this issue? |
Yes, I think we've seen enough IBs to declare this fixed. |
+1 |
As mentioned in #31123 we see occasional thread-related crashes in onnxruntime on aarch64. Unlike many threading crashes, this one is reproducible in gdb at the few percent level. The stack trace shows that the other threads in the same routine are blocked on a mutex, so there may be a bug in the mutex logic. Unfortunately, valgrind on aarch64 doesn't understand the instruction used by nsync_mu_semaphore_p (it looks like this uses standard C++ atomics, so there may be a general issue with valgrind and C++ atomics).
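As a side note on the valgrind point, a small probe like the one below can check whether valgrind on a given aarch64 host decodes the atomic instructions the compiler emits; it is a generic test, not code taken from nsync:

```cpp
// Tiny probe for valgrind's aarch64 atomics support: a compare-exchange on a
// std::atomic compiles to exclusive load/store pairs (LDAXR/STLXR) or, with
// -march=armv8.1-a and up, single LSE instructions (e.g. CASAL) that some
// valgrind versions cannot decode. Generic test, not extracted from nsync.
#include <atomic>
#include <cstdio>

int main() {
  std::atomic<int> sem{1};
  int expected = 1;
  // The same shape of operation a semaphore "P" performs: atomically take
  // the count down if it is available.
  bool ok = sem.compare_exchange_strong(expected, 0, std::memory_order_acq_rel);
  std::printf("acquired: %d, value now: %d\n", ok ? 1 : 0, sem.load());
  return 0;
}
```

If valgrind reports an unhandled instruction on a program like this, the problem is general to C++ atomics on that host rather than specific to nsync_mu_semaphore_p; older valgrind versions are known not to decode the ARMv8.1 LSE atomics.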