Tasks are enqueued but not executed in TBB #86
We see a loop in the asleep_list in the private_server object.
Hello. Did you try to reproduce it with a newer version of the library? Could you please provide a reproducer?
@ntfshard, the issue is reproduced with TBB 2018 Update 5. I would not expect much difference with TBB 2019.
Hi Maksim,
I am not sure if this email reaches you. What we see is that a node already in the sleep list is being inserted into the list again, causing the list to break into two pieces, with the root pointing to a circular list.
We made the sleeping-threads list a doubly linked list and added an assert in try_insert_in_asleep_list() to ensure that a node is not inserted when it is already in the list, and we hit that assert, indicating that the worker is already in the list.
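For illustration, here is a minimal, hypothetical sketch (simplified standalone types, not the actual TBB sources) of how prepending a node that is already in an intrusive singly linked list produces exactly the shape described above: the root ends up pointing into a cycle, and whatever followed the node is detached.

struct Node { Node* next; };

// If w is already somewhere in the list, its old predecessor still points at w,
// while w now points at the old root -- so a traversal from the root loops forever,
// and the nodes that used to follow w are no longer reachable from the root.
void prepend(Node*& root, Node* w) {
    w->next = root;   // overwrites w's old forward link
    root = w;
}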
(gdb) bt
#0 0x00007f1de0d861d7 in raise () from /lib64/libc.so.6
#1 0x00007f1de0d878c8 in abort () from /lib64/libc.so.6
#2 0x00007f1de0d7f146 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f1de0d7f1f2 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f1de1959bda in tbb::internal::rml::private_server::try_insert_in_asleep_list (this=0x4ad5c00, t=...) at /home/sangarshp/sb1/third_party/tbb-2018_U5/src/tbb/private_server.cpp:369
#5 0x00007f1de1958bdd in tbb::internal::rml::private_worker::run (this=0x4add980) at /home/sangarshp/sb1/third_party/tbb-2018_U5/src/tbb/private_server.cpp:286
#6 0x00007f1de19589f8 in tbb::internal::rml::private_worker::thread_routine (arg=0x4add980) at /home/sangarshp/sb1/third_party/tbb-2018_U5/src/tbb/private_server.cpp:231
#7 0x00007f1de1baadc5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007f1de0e4873d in clone () from /lib64/libc.so.6
(gdb)
(gdb) p my_server
$6 = (tbb::internal::rml::private_server &) @0x4ad5c00: {<tbb::internal::rml::tbb_server> = {<rml::server> = {<rml::versioned_object> = {
_vptr.versioned_object = 0x7f1de1b9ceb0 <vtable for tbb::internal::rml::private_server+16>}, <No data fields>}, <No data fields>}, <tbb::internal::no_copy> = {<tbb::internal::no_assign> =
{<No data fields>}, <No data fields>}, my_client = @0x4ad3400, my_n_thread = 256, my_stack_size = 4194304,
my_slack = {<tbb::internal::atomic_impl_with_arithmetic<int, int, char>> = {<tbb::internal::atomic_impl<int>> = {my_storage = {my_value = 0}}, <No data fields>}, <No data fields>},
my_ref_count = {<tbb::internal::atomic_impl_with_arithmetic<int, int, char>> = {<tbb::internal::atomic_impl<int>> = {my_storage = {my_value = 257}}, <No data fields>}, <No data fields>},
my_thread_array = 0x4ad6000,
my_asleep_list_root = {<tbb::internal::atomic_impl_with_arithmetic<tbb::internal::rml::private_worker*, long, tbb::internal::rml::private_worker>> = {<tbb::internal::atomic_impl<tbb::internal::rml::private_worker*>> = {my_storage = {my_value = 0x4addb00}}, <No data fields>}, <No data fields>},
my_asleep_list_mutex = {<tbb::internal::mutex_copy_deprecated_and_disabled> = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>}, <No data fields>},
flag = 1 '\001', static is_rw_mutex = false, static is_recursive_mutex = false, static is_fair_mutex = false},
my_net_slack_requests = {<tbb::internal::atomic_impl_with_arithmetic<int, int, char>> = {<tbb::internal::atomic_impl<int>> = {my_storage = {my_value = 0}}, <No data fields>}, <No data fields>}}
(gdb)
(gdb) p *this
$7 = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>}, my_state = {<tbb::internal::atomic_impl<tbb::internal::rml::private_worker::state_t>> = {
my_storage = {my_value = tbb::internal::rml::private_worker::st_normal}}, <No data fields>}, my_server = @0x4ad5c00, my_client = @0x4ad3400, my_index = 243, my_thread_monitor = {
my_cookie = {my_epoch = {<tbb::internal::atomic_impl_with_arithmetic<unsigned long, unsigned long, char>> = {<tbb::internal::atomic_impl<unsigned long>> = {my_storage = {
my_value = 4920888}}, <No data fields>}, <No data fields>}}, in_wait = {<tbb::internal::atomic_impl<bool>> = {my_storage = {my_value = true}}, <No data fields>}, spurious = false,
my_sema = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>},
my_sem = {<tbb::internal::atomic_impl_with_arithmetic<int, int, char>> = {<tbb::internal::atomic_impl<int>> = {my_storage = {my_value = 1}}, <No data fields>}, <No data fields>}},
notify_count = 4920887}, my_handle = 139765850949376, my_next = 0x4adda00, my_prev = 0x4addf80, wait_count = 13, sleep_count = 4920889}
(gdb)
(gdb) p my_next
$8 = (tbb::internal::rml::private_worker *) 0x4adda00
(gdb) p my_prev
$9 = (tbb::internal::rml::private_worker *) 0x4addf80
(gdb) p my_server.my_asleep_list_root
$10 = {<tbb::internal::atomic_impl_with_arithmetic<tbb::internal::rml::private_worker*, long, tbb::internal::rml::private_worker>> = {<tbb::internal::atomic_impl<tbb::internal::rml::private_worker*>> = {my_storage = {my_value = 0x4addb00}}, <No data fields>}, <No data fields>}
(gdb)
This is the patch we applied
diff --git a/src/rml/server/thread_monitor.h b/src/rml/server/thread_monitor.h
index 4ddd5bf..a10aec1 100644
--- a/src/rml/server/thread_monitor.h
+++ b/src/rml/server/thread_monitor.h
@@ -78,7 +78,7 @@ public:
friend class thread_monitor;
tbb::atomic<size_t> my_epoch;
};
- thread_monitor() : spurious(false), my_sema() {
+ thread_monitor() : spurious(false), my_sema(), notify_count(0) {
my_cookie.my_epoch = 0;
ITT_SYNC_CREATE(&my_sema, SyncType_RML, SyncObj_ThreadMonitor);
in_wait = false;
@@ -133,6 +133,7 @@ private:
tbb::atomic<bool> in_wait;
bool spurious;
tbb::internal::binary_semaphore my_sema;
+ int notify_count;
#if USE_PTHREAD
static void check( int error_code, const char* routine );
#endif
@@ -240,6 +241,7 @@ inline void thread_monitor::notify() {
my_cookie.my_epoch = my_cookie.my_epoch + 1;
bool do_signal = in_wait.fetch_and_store( false );
if( do_signal )
+ notify_count++;
my_sema.V();
}
diff --git a/src/tbb/private_server.cpp b/src/tbb/private_server.cpp
index ae25e57..e4458c0 100644
--- a/src/tbb/private_server.cpp
+++ b/src/tbb/private_server.cpp
@@ -25,7 +25,7 @@
#include "scheduler_common.h"
#include "governor.h"
#include "tbb_misc.h"
-
+#include <cassert>
using rml::internal::thread_monitor;
namespace tbb {
@@ -76,6 +76,13 @@ private:
//! Link for list of workers that are sleeping or have no associated thread.
private_worker* my_next;
+ private_worker* my_prev;
+
+ //Should be one , if it is two or more , then it was like like woken up,no job,sleep again;
+ int wait_count;
+
+ // number of times worker went to commit wait, this is compared with notify count
+ int sleep_count;
friend class private_server;
@@ -95,7 +102,8 @@ private:
protected:
private_worker( private_server& server, tbb_client& client, const size_t i ) :
my_server(server), my_client(client), my_index(i),
- my_thread_monitor(), my_handle(), my_next()
+ my_thread_monitor(), my_handle(), my_next(NULL), my_prev(NULL),
+ wait_count(0), sleep_count(0)
{
my_state = st_init;
}
@@ -267,13 +275,16 @@ void private_worker::run() {
::rml::job& j = *my_client.create_one_job();
while( my_state!=st_quit ) {
if( my_server.my_slack>=0 ) {
+ wait_count = 0;
my_client.process(j);
} else {
+ wait_count++;
thread_monitor::cookie c;
// Prepare to wait
my_thread_monitor.prepare_wait(c);
// Check/set the invariant for sleeping
if( my_state!=st_quit && my_server.try_insert_in_asleep_list(*this) ) {
+ sleep_count++;
my_thread_monitor.commit_wait(c);
my_server.propagate_chain_reaction();
} else {
@@ -333,6 +344,8 @@ private_server::private_server( tbb_client& client ) :
for( size_t i=0; i<my_n_thread; ++i ) {
private_worker* t = new( &my_thread_array[i] ) padded_private_worker( *this, client, i );
t->my_next = my_asleep_list_root;
+ if (my_asleep_list_root)
+ my_asleep_list_root->my_prev = t;
my_asleep_list_root = t;
}
}
@@ -353,7 +366,12 @@ inline bool private_server::try_insert_in_asleep_list( private_worker& t ) {
// it sees us sleeping on the list and wakes us up.
int k = ++my_slack;
if( k<=0 ) {
+ assert(!t.my_next);
+ assert(!t.my_prev);
+ assert(&t != my_asleep_list_root);
t.my_next = my_asleep_list_root;
+ if (my_asleep_list_root)
+ my_asleep_list_root->my_prev = &t;
my_asleep_list_root = &t;
return true;
} else {
@@ -383,6 +401,10 @@ void private_server::wake_some( int additional_slack ) {
}
// Pop sleeping worker to combine with claimed unit of slack
my_asleep_list_root = (*w++ = my_asleep_list_root)->my_next;
+ assert(!(*(w-1))->my_prev);
+ if (my_asleep_list_root)
+ my_asleep_list_root->my_prev = NULL;
+ (*(w-1))->my_next = NULL;
}
if( additional_slack ) {
// Contribute our unused slack to my_slack.
…________________________________
From: Sangarshan Pillareddy
Sent: Wednesday, September 12, 2018 11:47:54 AM
To: 01org/tbb; 01org/tbb
Cc: Author; Anantharamu Suryanarayana; Yuvaraja Mariappan
Subject: Re: [01org/tbb] Tasks are enqueued but not executed in TBB (#86)
Yes, we tried with the latest release, TBB 2018 Update 5. It is very hard to reproduce; we see this crash once in a while (7 to 24 days, a bit random).
We have a core file and will share the TBB state with you.
Regards,
Sangarshan
Hi Alex,
>>>>
@ntfshard, the issue is reproduced with TBB 2018 Update 5. I would not expect much difference with TBB 2019.
@Sangarshan, we failed to figure out how it could happen. The asleep_list is used only under the lock, and the logic is quite primitive. We even supposed that some memory barriers could be broken, but there are a lot of other places where that would reveal other issues. So it would be great if you could share a core file with us. Is it big?
What is your hardware and software configuration (CPU and OS)?
>>
Here are the configuration details:
[heat-admin@compute01 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2200.000
BogoMIPS: 4399.98
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47
[heat-admin@compute01 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)
[heat-admin@compute01 ~]$ uname -a
Linux 3.10.0-514.26.2.el7.x86_64
Regards,
Sangarshan
This is the fix that we propose. (Please ignore the counters; we used them for debugging purposes only.)
Related information is here: https://en.wikipedia.org/wiki/Spurious_wakeup
It looks like I could reproduce the situation where a worker thread was active while remaining in the asleep list. I am not sure if it relates to the spurious wakeup issue, because my reproducer fails on Windows as well (or maybe something else is broken). Moreover, the underlying sync primitives are protected from spurious wakeups (e.g. semaphore.h:205). In addition, the spurious wakeup issue usually relates to condition variables, not semaphores. I will continue the investigation and notify you if I figure something out.
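As background for the condition-variable remark above, here is a generic C++11 sketch (not TBB code) of the usual defense against spurious wakeups: the waiter re-checks its predicate in a loop, so a wakeup without a real notification is harmless. Semaphores keep an explicit count and do not need such a loop, which is consistent with the comment that the underlying primitives are protected.

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool ready = false;

void waiter() {
    std::unique_lock<std::mutex> lock(m);
    while (!ready)        // the loop absorbs spurious wakeups
        cv.wait(lock);
    // the predicate is guaranteed to hold here
}

void notifier() {
    {
        std::lock_guard<std::mutex> lock(m);
        ready = true;
    }
    cv.notify_one();
}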
Thanks, Alexey. I do see some comments in the code that allude to spurious wakeups, such as in thread_monitor.h:215 in thread_monitor::prepare_wait. In any case, I am glad that you are also able to reproduce the situation where the sleep list gets corrupted (becomes circular) and then gets into a bad state. Please let us know if you can find the real reason why a thread becomes active when it is supposed to be sleeping. Irrespective of that, to handle such a situation we have done the following. Do you see any issue with this approach of making the list doubly linked and then checking before inserting into the list? If the worker is already present in the list, we just restore my_slack and return true, and the caller behaves as if the worker was indeed inserted into the sleep list.
Thanks
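For concreteness, here is a rough sketch of the workaround described above, using simplified, hypothetical Worker/Server types rather than the real TBB classes. Field names mirror the patch; treating "restore my_slack" as undoing the increment is an assumption, and the locking and atomics of the real code are omitted.

#include <cstddef>

struct Worker {
    Worker* my_next;
    Worker* my_prev;
};

struct Server {
    Worker* my_asleep_list_root;
    int     my_slack;   // the real field is atomic and the list is mutex-protected

    // Returns true if the calling worker may go to sleep.
    bool try_insert_in_asleep_list(Worker& t) {
        int k = ++my_slack;
        if (k > 0) {                  // there is demand for workers: do not sleep
            --my_slack;
            return false;
        }
        // Duplicate check, enabled by the extra my_prev back link.
        if (t.my_next || t.my_prev || &t == my_asleep_list_root) {
            --my_slack;               // "restore my_slack" (assumed to mean undoing the ++)
            return true;              // behave as if the worker had been inserted
        }
        t.my_prev = NULL;
        t.my_next = my_asleep_list_root;
        if (my_asleep_list_root)
            my_asleep_list_root->my_prev = &t;
        my_asleep_list_root = &t;
        return true;
    }
};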
It looks like I observed an inconsistent state of the …
As for the workaround: perhaps it will work, but while we do not know the root cause it can only hide some symptoms (and are we sure about other side effects?).
Hi, thanks for looking into this.
My email, in case you want to reach out to me directly, is … Thanks once again; I really appreciate it.
And Sangarshan can be reached at [email protected]
Just to further clarify: in our testing we do hit the condition of duplicate insertion (very rarely, though), and when it does happen, the fix takes effect and the daemon continues to function normally, as far as I know.
During testing, it was found that the TBB sleeping-threads singly linked list was corrupted and had become circular. This seemingly caused the my_slack count to get permanently stuck at -1, as the sleeping-list traversal would potentially never end. Using a specific assert, it was confirmed that duplicate insertion did happen. Fixed it by turning the sleeping-threads singly linked list into a doubly linked list and then making sure that a thread already in the list is never prepended back as the head of the list. uxlfoundation/oneTBB#86 Closes-Bug: #1684993 Change-Id: I6773eb8dddd849cebb695a59864a9da2ce2faa17 Depends-On: Iec821e3b08c3825cf2789a70bf53621650c66516
@Rombie, since you said above that you rebuild the TBB library for Red Hat Linux: could you try the original TBB binaries from our packages? And when you rebuild, do you use the makefiles provided with TBB, or your own build system? Also, please specify the compiler and any special command-line options, so that we can try to reproduce the issue with the same compilation settings.
Sorry, I did not realize that you had posted a response to this. Somehow I don't get an email indicating activity on this issue. Anyway, thanks once again for your kind support! We build TBB for different OS distributions, including Red Hat; in this particular case we used CentOS 7. Yes, we use the makefiles provided by the TBB library as-is and pretty much just run make. https://github.com/Juniper/contrail-controller/blob/master/lib/tbb/SConscript#L28 We do not use any special flags while running make, as you can see at line number 28 in the link provided above; we just run make from the top level. The compiler is gcc as provided with CentOS 7. I don't have that system available anymore, as it has been re-imaged. IIRC, we had seen this TBB issue on Ubuntu too (but that was years ago). Did you get a chance to open and analyze the core file from here, where duplicate insertion into the list was clearly caught by the assert we added during our testing? https://drive.google.com/file/d/13LSgIIrLMkM4RdPYJ6s-BG79SINseo_0/view?usp=sharing Thanks so much once again! Please feel free to email me directly as well at [email protected] or [email protected] if you need any additional information.
Btw, just to be clear, we always used the binaries directly as provided in the distribution upstream. In order to use our proposed fix, we now build and distribute libtbb.so.2 from within our rpms.
During testing, it was found that the TBB sleeping-threads singly linked list was corrupted and had become circular. This seemingly caused the my_slack count to get permanently stuck at -1, as the sleeping-list traversal would potentially never end. Using a specific assert, it was confirmed that duplicate insertion did happen. Fixed it by turning the sleeping-threads singly linked list into a doubly linked list and then making sure that a thread already in the list is never prepended back as the head of the list. uxlfoundation/oneTBB#86 With this change, libtbb.so.2 is provided directly from the contrail-lib package. Change-Id: I07416601cd9be658d75309caa0917d3c61d9e427 Closes-Bug: #1684993
In TBB 2019 Update 2, we strengthened the code in … In our testing, we do not observe spurious wakeups or failed assertions. Moreover, we believe the code in … Could you please check if this latest TBB update works in your environment?
During testing, it was found that the TBB sleeping-threads singly linked list was corrupted and had become circular. This seemingly caused the my_slack count to get permanently stuck at -1, as the sleeping-list traversal would potentially never end. Using a specific assert, it was confirmed that duplicate insertion did happen. Fixed it by turning the sleeping-threads singly linked list into a doubly linked list and then making sure that a thread already in the list is never prepended back as the head of the list. Also take commit db50e81 to fix the build with newer TBB. uxlfoundation/oneTBB#86 Change-Id: I83a7d267d4558c02715c675db5fc08f58f1208e2 Depends-On: I79bcd7192ea7c7db732b191503d0747e4b0ff229 Closes-Bug: #1684993
Since there has been no relevant activity for quite a long time, I propose closing the issue. Interested people can always reopen it if they deem it useful.
Hi,
We are running into a TBB scheduler issue where we see that tasks are getting enqueued but not executed.
We see the same issue with both TBB versions: TBB 2018 Update 5 and the initial TBB 4.3 release.
In our application, the master thread instantiates the scheduler with a thread count of 8. The priority is the same for all tasks, and the tbb::task::enqueue() method is used for enqueuing tasks. We use one TBB worker thread for I/O, and it runs forever.
In the current state, the number of active threads in the arena is 1, and that thread is used for I/O. According to the TBB state, new work is available and my_max_workers_requested is 8, but the server's my_slack value is set to -1; all worker threads (except the one used for I/O) are in commit wait, waiting for a wake-up signal. It never recovers from this state. Can you please share some details on what could be wrong?
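For reference, a minimal, hypothetical sketch of the usage pattern described above: a scheduler instantiated with 8 threads and fire-and-forget tasks submitted via tbb::task::enqueue(). DemoTask, the loop count, and the final sleep are illustrative only and are not taken from the actual application.

#include <tbb/task.h>
#include <tbb/task_scheduler_init.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical fire-and-forget task, mirroring the usage described above.
class DemoTask : public tbb::task {
    int id;
public:
    explicit DemoTask(int i) : id(i) {}
    tbb::task* execute() {                 // old task API: return NULL when done
        std::printf("task %d executed\n", id);
        return NULL;
    }
};

int main() {
    tbb::task_scheduler_init init(8);      // scheduler instantiated with 8 threads
    for (int i = 0; i < 16; ++i)
        tbb::task::enqueue(*new(tbb::task::allocate_root()) DemoTask(i));
    // Crude wait so the enqueued tasks get a chance to run before the scheduler
    // is torn down; the real application keeps running indefinitely.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    return 0;
}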
Arena object elements:
my_task_stream = {tbb::internal::no_copy = {tbb::internal::no_assign = {}, }, population = {0, 26300, 0}, lanes = {0x3169e68,
Thanks in advance,
Sangarshan