
Make sure all arbiters are joined #4841

Open
matklad opened this issue Sep 17, 2021 · 5 comments
Labels
C-housekeeping Category: Refactoring, cleanups, code quality Groomed Node Node team P-low Priority: low T-node Team: issues relevant to the node experience team

Comments

@matklad
Contributor

matklad commented Sep 17, 2021

In a bunch of places, we create new actix arbiters:

```
chain/client/src/client_actor.rs:
132 ) -> Result {
133:     let state_parts_arbiter = Arbiter::new();
134     let self_addr = ctx.address();

chain/network/src/peer_manager.rs:
543     // Start every peer actor on separate thread.
544:     let arbiter = Arbiter::new();
545     let peer_counter = self.peer_counter.clone();

integration-tests/tests/network/peer_handshake.rs:
229     run_actix(async move {
230:         let arbiter = Arbiter::new();
231         let port = open_port();

nearcore/src/lib.rs:
347
348:     let arbiter = Arbiter::new();
349
```
An Arbiter is basically a std::thread::JoinHandle, and it is best practice to make sure they are joined. There's a Tolstoy-length [blog post](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/) about the general issue, but the TL;DR in the Rust context is:

  • If you never call join, the thread can outlive the entity that spawned it.
  • This may cause unwanted interference in tests: of two sequentially running tests, the second might be affected by work spawned, but not joined, by the first.
  • This may cause data loss under certain circumstances. A Rust program exits when the main function returns; any still-running threads are abruptly terminated. In particular, Drops are not invoked, so things like BufWriter may fail to flush their data.
  • If a thread panics, the panic is reported as the Result of the join method. If you don't join, you don't know whether the thread panicked!

I suggest auditing all calls to Arbiter::new and making sure that something joins the arbiters in the end. Often, the most convenient place for that is Drop:

```rust
struct SomeActor {
    arb: Option<Arbiter>,
}

impl Drop for SomeActor {
    fn drop(&mut self) {
        let mut arb = self.arb.take().unwrap();
        arb.stop();
        arb.join().unwrap();
    }
}
```

Note that Arbiter has two "stopping" methods: stop asynchronously asks the arbiter to stop, while .join() actually waits for the work to complete. I think we need to use both.

@matklad matklad added C-housekeeping Category: Refactoring, cleanups, code quality T-node Team: issues relevant to the node experience team labels Sep 17, 2021
@matklad
Contributor Author

matklad commented Sep 17, 2021

cc @mina86, IIRC you did some work on tests which were flaky because the database was not properly closed by the end of the test. This may or may not be related to that.

@mina86
Contributor

mina86 commented Sep 17, 2021

I don’t recall working on tests but this may be related to #3266

@bowenwang1996
Collaborator

Related to #5340

@stale

stale bot commented Apr 10, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@matklad
Contributor Author

matklad commented Aug 5, 2022

I think we no longer have an acute problem with the DB not being closed properly, thanks to our RocksDB::block_until_all_instances_are_dropped hack. But today I figured out a workflow to sanity-check whether any test leaks a database:

```diff
diff --git a/core/store/src/db/rocksdb.rs b/core/store/src/db/rocksdb.rs
index 280c33ae1..debd3bcc1 100644
--- a/core/store/src/db/rocksdb.rs
+++ b/core/store/src/db/rocksdb.rs
@@ -453,6 +453,20 @@ pub(crate) static ROCKSDB_INSTANCES_COUNTER: Lazy<(Mutex<usize>, Condvar)> =
     Lazy::new(|| (Mutex::new(0), Condvar::new()));
 
 impl RocksDB {
+    pub fn db_leak_guard() -> impl Drop {
+        struct Guard;
+
+        impl Drop for Guard {
+            fn drop(&mut self) {
+                if !std::thread::panicking() {
+                    assert_eq!(*ROCKSDB_INSTANCES_COUNTER.0.lock().unwrap(), 0);
+                }
+            }
+        }
+        assert_eq!(*ROCKSDB_INSTANCES_COUNTER.0.lock().unwrap(), 0);
+        Guard
+    }
+
     /// Blocks until all RocksDB instances (usually 0 or 1) gracefully shutdown.
     pub fn block_until_all_instances_are_dropped() {
         let (lock, cvar) = &*ROCKSDB_INSTANCES_COUNTER;
@@ -556,6 +570,7 @@ impl Drop for RocksDB {
             env.set_background_threads(4);
         }
         self.db.cancel_all_background_work(true);
+        std::thread::sleep(std::time::Duration::from_millis(25));
     }
 }
```

It's rather easy to add that to tests, and to run them with `--test-threads 1`.

No action required, just something we might need in the future if this resurfaces.

@exalate-issue-sync exalate-issue-sync bot added the P-low Priority: low label Sep 6, 2022
@gmilescu gmilescu added the Node Node team label Oct 19, 2023