Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reentrant deadlock #154

Closed
nmldiegues opened this issue Mar 31, 2023 · 5 comments · Fixed by #186
Closed

Reentrant deadlock #154

nmldiegues opened this issue Mar 31, 2023 · 5 comments · Fixed by #186

Comments

@nmldiegues
Copy link

We're using bb8 (thanks for doing it!) in a service handling lots of traffic, with bb8 used millions of times per minute.
Today we hit a deadlock that seems to be induced by a single task/thread scenario due to reentrancy:

Thread 3 (Thread 0x7f4d193fe700 (LWP 99758)):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000563841825ea5 in parking_lot_core::thread_parker::imp::ThreadParker::futex_wait () at /cfsetup_build/vendor/parking_lot_core/src/thread_parker/linux.rs:112
#2  parking_lot_core::thread_parker::imp::{impl#0}::park () at /cfsetup_build/vendor/parking_lot_core/src/thread_parker/linux.rs:66
#3  parking_lot_core::parking_lot::park::{closure#0}<parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#0}, parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#1}, parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#2}> () at /cfsetup_build/vendor/parking_lot_core/src/parking_lot.rs:635
#4  parking_lot_core::parking_lot::with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#0}, parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#1}, parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#2}>> () at /cfsetup_build/vendor/parking_lot_core/src/parking_lot.rs:207
#5  parking_lot_core::parking_lot::park<parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#0}, parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#1}, parking_lot::raw_mutex::{impl#3}::lock_slow::{closure_env#2}> () at /cfsetup_build/vendor/parking_lot_core/src/parking_lot.rs:600
#6  parking_lot::raw_mutex::RawMutex::lock_slow () at src/raw_mutex.rs:262
#7  0x0000563841a3ef55 in parking_lot::raw_mutex::{impl#0}::lock () at vendor/parking_lot/src/raw_mutex.rs:72
#8  lock_api::mutex::Mutex::lock<parking_lot::raw_mutex::RawMutex, bb8::internals::PoolInternals<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>> () at vendor/lock_api/src/mutex.rs:214
#9  bb8::internals::{impl#4}::drop<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>> () at vendor/bb8/src/internals.rs:191
#10 0x000056384185007a in core::ptr::drop_in_place<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ptr/mod.rs:490
#11 core::ptr::drop_in_place<core::option::Option<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ptr/mod.rs:490
#12 core::ptr::drop_in_place<core::cell::UnsafeCell<core::option::Option<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ptr/mod.rs:490
#13 core::ptr::drop_in_place<futures_channel::lock::Lock<core::option::Option<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ptr/mod.rs:490
#14 core::ptr::drop_in_place<futures_channel::oneshot::Inner<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ptr/mod.rs:490
#15 alloc::sync::Arc::drop_slow<futures_channel::oneshot::Inner<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/alloc/src/sync.rs:1123
#16 0x0000563841914acc in alloc::sync::{impl#27}::drop<futures_channel::oneshot::Inner<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/alloc/src/sync.rs:1745
#17 core::ptr::drop_in_place<alloc::sync::Arc<futures_channel::oneshot::Inner<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ptr/mod.rs:490
#18 core::ptr::drop_in_place<futures_channel::oneshot::Sender<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>>> () at /rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/ptr/mod.rs:490
#19 futures_channel::oneshot::Sender::send<bb8::internals::InternalsGuard<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>>> () at vendor/futures-channel/src/oneshot.rs:343
#20 0x0000563841a251d7 in bb8::internals::PoolInternals::put<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>> () at vendor/bb8/src/internals.rs:74
#21 0x0000563841836ac4 in bb8::inner::PoolInner::put_back<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>> () at vendor/bb8/src/inner.rs:166
#22 0x0000563841b139f7 in bb8::api::{impl#10}::drop<quicksilver_client::conn_pool::QuicksilverConnManager<quicksilver_client::conn::MemcachedConn, quicksilver_client::new::{async_fn#0}::{closure_env#0}, quicksilver_client::new::{async_fn#0}::{closure#0}::{async_block_env#0}>> () at vendor/bb8/src/api.rs:364

This is using tag v0.8.0.

If you ignore the boilerplate, you'll see essentially that:

  1. a connection was being returned to the pool
  2. that logic acquires the internals lock
  3. then it calls PoolInternals#PoolInternals
  4. which creates a InternalsGuard
  5. now while the InternalsGuard is in the call stack, everything gets dropped (this is possible in our case because requests are handled in a tokio runtime with a timeout per request, and timeout happened here)
  6. so InternalsGuard#drop is called, which tries to acquire the internals lock <-- and voila, we hit a deadlock since the lock is not reentrant
@djc
Copy link
Owner

djc commented Apr 1, 2023

Thanks for digging into this! Would you be able to submit a PR to fix it?

@nmldiegues
Copy link
Author

nmldiegues commented Apr 3, 2023

Thanks for digging into this! Would you be able to submit a PR to fix it?

Fixing it with parking_lot should be trivial because we can just use https://docs.rs/parking_lot/latest/parking_lot/type.ReentrantMutex.html --- see 5c1fe4a for the example patch on top of v0.8.0 (see below for why I did this)

However I see this repo very recently had this change #152 which makes that trivial fix impossible.

Any other fix is going to require a design change as to how the connection is managed throughout the code, which puts us at a crossroads here.

nmldiegues added a commit to nmldiegues/bb8 that referenced this issue Apr 3, 2023
@djc
Copy link
Owner

djc commented May 26, 2023

Any other fix is going to require a design change as to how the connection is managed throughout the code, which puts us at a crossroads here.

What kind of design changes? Would it introduce a lot of extra complexity or just move the code around?

@drdrsh
Copy link

drdrsh commented May 27, 2023

@nmldiegues I think I am running into the same issue and I am able to reproduce it. I tried your patch I am a getting panic

panicked at 'already borrowed: BorrowMutError', /usr/local/cargo/git/checkouts/bb8-c4fdc7c4996dd882/5c1fe4a/bb8/src/internals.rs:196

@djc
Copy link
Owner

djc commented Jan 26, 2024

I've submitted #186 which tries to fix this.

@djc djc closed this as completed in #186 Jan 29, 2024
nicolasvan added a commit to OneSignal/pgcat that referenced this issue Mar 18, 2024
To get djc/bb8#186 and djc/bb8#189
which fix potential deadlocks (djc/bb8#154)
magec pushed a commit to OneSignal/pgcat that referenced this issue Oct 23, 2024
To get djc/bb8#186 and djc/bb8#189
which fix potential deadlocks (djc/bb8#154)
magec pushed a commit to OneSignal/pgcat that referenced this issue Oct 23, 2024
To get djc/bb8#186 and djc/bb8#189
which fix potential deadlocks (djc/bb8#154).

Also, this (djc/bb8#225) was needed to prevent a connection
leak which was conveniently spotted in our integration tests.
drdrsh pushed a commit to postgresml/pgcat that referenced this issue Oct 28, 2024
* Update bb8 to 0.8.6

To get djc/bb8#186 and djc/bb8#189
which fix potential deadlocks (djc/bb8#154).

Also, this (djc/bb8#225) was needed to prevent a connection
leak which was conveniently spotted in our integration tests.

* Ignore ./.bundle (created by dev console)

---------

Co-authored-by: Jose Fernandez (magec) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants