Fix a deadlock bug in EigenNonBlockingThreadPool.h #23098

snnn · 2024-12-13T06:42:51Z

Description

In July 2024, an Intel engineer created a PR (#21545) that reduced the spin count of our Eigen thread pool. I then found that it triggered a deadlock bug in EigenNonBlockingThreadPool.h.

Here is how to reproduce the issue:

Create an ARM64 VM in Azure with only 4 vCPUs; I used Standard_D4plds_v5. The more CPUs you have, the less likely you are to see the bug.
Create a local branch and apply the following changes to EigenNonBlockingThreadPool.h:

-    constexpr int log2_spin = 20;
-    const int spin_count = allow_spinning_ ? (1ull << log2_spin) : 0;
-    const int steal_count = spin_count / 100;
+    //constexpr int log2_spin = 20;
+    const int spin_count = 10000;
+    const int steal_count = 100;

Build the source code locally:
python3 tools/ci_build/build.py --skip_submodule_sync --parallel --config RelWithDebInfo --build_dir b1 --update --build
Run the following script:

#!/bin/bash
for i in {1..1000}
do
    ./onnx_test_runner -c 1 -j 1 -x  model.onnx
done

Soon an "onnx_test_runner" process will get stuck in the loop: one CPU will be at 100% usage while all the other CPUs are idle. Sorry, I cannot make the model file public, but there is nothing special about it.
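To put the patched constants from the reproduction steps in perspective, here is a minimal sketch of the arithmetic only. The names `SpinConfig`, `OriginalSpinConfig`, `ReproSpinConfig` and `CheckSpinConfigs` are hypothetical and do not exist in EigenNonBlockingThreadPool.h; only the expressions are taken from the diff.

```cpp
#include <cassert>

// Hypothetical container for the two constants touched by the diff.
struct SpinConfig {
  int spin_count;
  int steal_count;
};

// Values as computed by the original (unpatched) expressions in the diff.
constexpr SpinConfig OriginalSpinConfig(bool allow_spinning) {
  constexpr int log2_spin = 20;
  const int spin_count = allow_spinning ? static_cast<int>(1ull << log2_spin) : 0;
  return SpinConfig{spin_count, spin_count / 100};
}

// Values hard-coded by the reproduction patch.
constexpr SpinConfig ReproSpinConfig() { return SpinConfig{10000, 100}; }

// Usage example: the patch shrinks the spin budget by roughly two orders of magnitude.
inline void CheckSpinConfigs() {
  assert(OriginalSpinConfig(true).spin_count == 1048576);
  assert(OriginalSpinConfig(true).steal_count == 10485);
  assert(ReproSpinConfig().spin_count == 10000);
}
```

With a much shorter spin phase, worker threads reach the blocking path far more often, which is presumably why the hang shows up within about a thousand runs.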

Furthermore, to confirm that you are seeing the same bug, attach gdb to the hung process and examine each worker thread (the threads that are idle). To examine the Nth thread, run the following gdb commands:

thr N
f 8
p {this->queue.back_._M_i & (1024-1), this->queue.front_._M_i & (1024-1)}

The last command prints two integers. If they differ, the thread's worker queue is not empty, so the thread should not be waiting there idle. That is the bug I mean.
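The check done by the last gdb command can be sketched in C++ as follows. This is only an illustration under assumptions: `QueueIndices` and `LooksNonEmpty` are hypothetical names, and the only things taken from the gdb expression are the `back_` and `front_` members and the `1024-1` mask.

```cpp
#include <atomic>
#include <cassert>

// Same mask as in the gdb expression: keep only the low 10 bits of each index.
constexpr unsigned kIndexMask = 1024 - 1;

// Hypothetical stand-in for the two atomic indices inspected in gdb.
struct QueueIndices {
  std::atomic<unsigned> back_{0};
  std::atomic<unsigned> front_{0};
};

// Returns true when the masked indices differ, i.e. the queue looks non-empty.
inline bool LooksNonEmpty(const QueueIndices& q) {
  const unsigned back = q.back_.load(std::memory_order_relaxed) & kIndexMask;
  const unsigned front = q.front_.load(std::memory_order_relaxed) & kIndexMask;
  return back != front;
}

// Usage example: bits above the mask are ignored by the comparison.
inline void CheckLooksNonEmpty() {
  QueueIndices q;
  q.back_.store(1024 + 5);
  q.front_.store(2048 + 5);
  assert(!LooksNonEmpty(q));  // both masked values are 5
  q.front_.store(6);
  assert(LooksNonEmpty(q));   // 5 vs 6
}
```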

I have a hypothesis about the root cause, but unfortunately I couldn't fully prove it. I think something like Heisenberg's uncertainty principle is at play here and kept me from seeing what was actually happening: when two threads run simultaneously on multiple CPUs, I want to know the relative order of their actual execution, but I cannot get that information without adding extra synchronization to the threads, which in turn may change the real behavior. Anyway, my approach when debugging the issue was to add a logical clock to the threads. The clock was just an atomic integer counter that can be read and written by multiple threads; whoever reads it must also increase it by one at the same time. Let's assume there are two threads: a producer that produces tasks and a consumer that executes them. I believe I observed the following sequence:
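A logical clock of the kind described above could look roughly like this. It is a minimal sketch, not the instrumentation actually used: `LogicalClock`, `TraceEvent` and `Record` are hypothetical names, and the read-and-increment is assumed to be a single `fetch_add`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical shared logical clock: every read also advances the counter,
// so two events never receive the same tick.
class LogicalClock {
 public:
  // Returns the current tick and increments the counter in one atomic step.
  std::uint64_t Tick() { return counter_.fetch_add(1, std::memory_order_seq_cst); }

 private:
  std::atomic<std::uint64_t> counter_{0};
};

// Hypothetical trace record: which thread did what, and at which tick.
struct TraceEvent {
  std::uint64_t tick;
  int thread_id;
  const char* what;
};

// Stamps an event with the next tick of the shared clock.
inline TraceEvent Record(LogicalClock& clock, int thread_id, const char* what) {
  return TraceEvent{clock.Tick(), thread_id, what};
}

// Usage example: ticks are handed out in strictly increasing order.
inline void CheckLogicalClock() {
  LogicalClock clock;
  const TraceEvent a = Record(clock, 1, "PushBack");
  const TraceEvent b = Record(clock, 2, "SetBlocked");
  assert(a.tick < b.tick);
}
```

Note that the `fetch_add` is itself a synchronization point between the threads, which is exactly the kind of added synchronization that may change the behavior being observed.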

1. Producer thread: called PushBack, which inserted a new task into the consumer thread's worker queue.
2. Consumer thread: called SetBlocked and entered the mutex region.
3. Producer thread: called EnsureAwake() and loaded the status.
4. Consumer thread: inside SetBlocked, went to sleep.
5. Producer thread: in EnsureAwake, skipped the alert because it believed the consumer thread was spinning.

Since nobody woke up the consumer thread, it idled there forever even though its worker queue was not empty.
(I have low confidence in the above.)
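The hypothesized interleaving can be replayed deterministically in a single-threaded toy model. This is only a sketch of the hypothesis, not of the real thread pool: `Status`, `Model`, `ReplayHypothesizedInterleaving` and `IsLostWakeup` are hypothetical names, and the stale view of the queue seen by the consumer is an explicit assumption standing in for the suspected weak-memory-ordering effect.

```cpp
#include <cassert>

// Hypothetical, simplified consumer states.
enum class Status { kSpinning, kBlocked };

// Hypothetical model of the shared state of one producer and one consumer.
struct Model {
  int queue_size = 0;
  Status consumer_status = Status::kSpinning;
  bool consumer_asleep = false;
  bool wakeup_sent = false;
};

inline Model ReplayHypothesizedInterleaving() {
  Model m;
  // Producer: PushBack inserts a task into the consumer's queue.
  m.queue_size += 1;
  // Consumer: enters SetBlocked. ASSUMPTION: it does not yet observe the push.
  const int queue_size_seen_by_consumer = 0;
  // Producer: EnsureAwake loads the consumer status (still "spinning").
  const Status status_seen_by_producer = m.consumer_status;
  // Consumer: saw an empty queue, so it marks itself blocked and sleeps.
  if (queue_size_seen_by_consumer == 0) {
    m.consumer_status = Status::kBlocked;
    m.consumer_asleep = true;
  }
  // Producer: only alerts a consumer it believes to be blocked.
  if (status_seen_by_producer == Status::kBlocked) {
    m.wakeup_sent = true;
    m.consumer_asleep = false;
  }
  return m;
}

// A lost wake-up: the consumer sleeps, has work queued, and nobody alerted it.
inline bool IsLostWakeup(const Model& m) {
  return m.consumer_asleep && m.queue_size > 0 && !m.wakeup_sent;
}
```

In this model, `IsLostWakeup(ReplayHypothesizedInterleaving())` is true. If the consumer had observed the pushed task before deciding to sleep, the model would not reach that state, which is why the observed behavior is surprising.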

This is very counterintuitive, because at step 4, before going to sleep, the consumer thread should have seen that the queue was not empty. My explanation is that ARM has a weaker memory model than x86, and we have been used to x86 for too long.

My fix is to enable an assert. Although I don't believe the assert will ever fire, the updated code effectively inserts a memory barrier there to ensure total ordering: std::atomic's exchange function is a read-modify-write operation, while the store function is write-only. I tried changing the store to use a stronger memory order, but that didn't fix the problem. Semantically, since we are about to read the queue, we need a read barrier here.
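The difference between the two operations can be sketched as follows. This is an illustration under assumptions, not the actual patch: the `ThreadStatus` values, the function names, and the value checked by the assert are all hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified status values.
enum class ThreadStatus : int { Spinning, Blocking, Blocked };

// Write-only: publishes the new status but never reads the old one.
inline void SetStatusWithStore(std::atomic<ThreadStatus>& status, ThreadStatus next) {
  status.store(next, std::memory_order_release);
}

// Read-modify-write: installs the new status and returns the previous one.
inline ThreadStatus SetStatusWithExchange(std::atomic<ThreadStatus>& status,
                                          ThreadStatus next) {
  return status.exchange(next, std::memory_order_seq_cst);
}

// Sketch of "enabling an assert": the assert needs the previous value,
// which forces the use of exchange instead of store.
// ASSUMPTION: the expected previous state is Blocking.
inline void EnterBlocked(std::atomic<ThreadStatus>& status) {
  const ThreadStatus prev = SetStatusWithExchange(status, ThreadStatus::Blocked);
  assert(prev == ThreadStatus::Blocking);
  (void)prev;  // avoid an unused-variable warning when asserts are compiled out
}
```

The sketch only shows why checking the old value turns a plain write into a read-modify-write; it does not by itself demonstrate that the deadlock is fixed.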

Still, I am not fully convinced. It would be better if I could add an assert there that finds a contradiction and aborts the process. If you have an idea for how to prove it, please let me know.

Motivation and Context

Five weeks ago @goldsteinn suggested that I replace every std::memory_order_relaxed with std::memory_order_seq_cst. I didn't take his suggestion, because any change to this file could make the bug non-reproducible, which does not mean the bug is fixed. I have found many different ways to make the bug disappear or become harder to reproduce. ("Harder" means I need to run the test process 10K or 100K times to get a hang instead of 1K.)

@yuslepukhin and @tlh20 also helped me a lot.

goldsteinn · 2024-12-16T18:13:42Z

Does this unblock #21545?

### Description This PR fixes a deadlock bug in EigenNonBlockingThreadPool.h. It only happens on platforms with a weakly ordered memory model, such as ARM64.

snnn · 2024-12-16T19:19:43Z

Sorry, I don't know much about that one.

snnn requested a review from yuslepukhin December 13, 2024 06:42

yuslepukhin previously approved these changes Dec 13, 2024

update

35f2461

snnn dismissed yuslepukhin’s stale review via 35f2461 December 13, 2024 19:15

snnn force-pushed the snnn-patch-8 branch from 7dada0b to 35f2461 Compare December 13, 2024 19:15

goldsteinn reviewed Dec 13, 2024

snnn added 2 commits December 13, 2024 19:42

fix no exception issue

f0ecf9c

avoid using exception

f5b48d1

snnn requested a review from yuslepukhin December 13, 2024 23:48

yuslepukhin approved these changes Dec 13, 2024

snnn merged commit 2ff66b8 into main Dec 16, 2024
95 checks passed

snnn deleted the snnn-patch-8 branch December 16, 2024 17:05
