
Make sure all arbiters are joined #4841

Open
matklad opened this issue Sep 17, 2021 · 5 comments
Labels
C-housekeeping Category: Refactoring, cleanups, code quality Groomed Node Node team P-low Priority: low T-node Team: issues relevant to the node experience team

Comments

@matklad
Contributor

matklad commented Sep 17, 2021

In a bunch of places, we create new actix arbiters:

```
chain/client/src/client_actor.rs:
132 ) -> Result {
133:     let state_parts_arbiter = Arbiter::new();
134     let self_addr = ctx.address();

chain/network/src/peer_manager.rs:
543     // Start every peer actor on separate thread.
544:     let arbiter = Arbiter::new();
545     let peer_counter = self.peer_counter.clone();

integration-tests/tests/network/peer_handshake.rs:
229     run_actix(async move {
230:         let arbiter = Arbiter::new();
231         let port = open_port();

nearcore/src/lib.rs:
347
348:     let arbiter = Arbiter::new();
349
```
An Arbiter is basically a std::thread::JoinHandle, and it is best practice to make sure they are joined. There's a Tolstoy-length [blog post](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) about the general issue, but the TL;DR in the Rust context is:

  • If you never call join, the thread can outlive the entity that spawned it.
  • This may cause unwanted interference in tests: of two sequentially running tests, the second might be affected by work spawned, but not joined, by the first.
  • This may cause data loss under certain circumstances. A Rust program exits when the main function returns; any still-running threads are abruptly terminated. In particular, Drops are not invoked, so things like BufWriter may fail to flush their data.
  • If a thread panics, the panic is reported as the Result of the join method. If you don't join, you don't know whether the thread panicked!

I suggest auditing all calls to Arbiter::new and making sure that something joins the arbiters in the end. Often, the most convenient place for that is Drop:

```rust
struct SomeActor {
    arb: Option<Arbiter>,
}

impl Drop for SomeActor {
    fn drop(&mut self) {
        let mut arb = self.arb.take().unwrap();
        arb.stop();
        arb.join().unwrap();
    }
}
```

Note that Arbiter has two "stopping" methods: stop asynchronously asks the arbiter to stop, while .join() actually waits for the work to complete. I think we need to use both.

@matklad matklad added C-housekeeping Category: Refactoring, cleanups, code quality T-node Team: issues relevant to the node experience team labels Sep 17, 2021
@matklad
Contributor Author

matklad commented Sep 17, 2021

cc @mina86, IIRC you did some work on tests which were flaky because the database was not properly closed by the end of the test. This may or may not be related to that.

@mina86
Contributor

mina86 commented Sep 17, 2021

I don’t recall working on tests but this may be related to #3266

@bowenwang1996
Collaborator

Related to #5340

@stale

stale bot commented Apr 10, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@matklad
Contributor Author

matklad commented Aug 5, 2022

I think we no longer have an acute problem with the DB not being closed properly, thanks to our RocksDB::block_until_all_instances_are_dropped hack. But today I figured out a workflow to sanity-check whether any test leaks a database:

```diff
diff --git a/core/store/src/db/rocksdb.rs b/core/store/src/db/rocksdb.rs
index 280c33ae1..debd3bcc1 100644
--- a/core/store/src/db/rocksdb.rs
+++ b/core/store/src/db/rocksdb.rs
@@ -453,6 +453,20 @@ pub(crate) static ROCKSDB_INSTANCES_COUNTER: Lazy<(Mutex<usize>, Condvar)> =
     Lazy::new(|| (Mutex::new(0), Condvar::new()));
 
 impl RocksDB {
+    pub fn db_leak_guard() -> impl Drop {
+        struct Guard;
+
+        impl Drop for Guard {
+            fn drop(&mut self) {
+                if !std::thread::panicking() {
+                    assert_eq!(*ROCKSDB_INSTANCES_COUNTER.0.lock().unwrap(), 0);
+                }
+            }
+        }
+        assert_eq!(*ROCKSDB_INSTANCES_COUNTER.0.lock().unwrap(), 0);
+        Guard
+    }
+
     /// Blocks until all RocksDB instances (usually 0 or 1) gracefully shutdown.
     pub fn block_until_all_instances_are_dropped() {
         let (lock, cvar) = &*ROCKSDB_INSTANCES_COUNTER;
@@ -556,6 +570,7 @@ impl Drop for RocksDB {
             env.set_background_threads(4);
         }
         self.db.cancel_all_background_work(true);
+        std::thread::sleep(std::time::Duration::from_millis(25));
     }
 }
```

It's rather easy to add that to tests, and to run them with `--test-threads 1`.

No action required, just something we might need in the future if this resurfaces.

@exalate-issue-sync exalate-issue-sync bot added the P-low Priority: low label Sep 6, 2022
@gmilescu gmilescu added the Node Node team label Oct 19, 2023