Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race condition between bootstrap pull and push #3721

Merged

Conversation

dsiganos
Copy link
Contributor

@dsiganos dsiganos commented Feb 8, 2022

The unit test bootstrap_processor.push_one fails intermittently and it
exposes a race condition in the bootstrap code of the node. #3533

The race condition is that after finishing with bootstrap frontiers pull,
the code is supposed to put the socket back into the idle bootstrap
connections queue. However it does that in a racy way. The problem is in
the function nano::frontier_req_client::received_frontier.

Near the bottom of that function, we should call:
connection->connections.pool_connection (connection);
before calling
promise.set_value (false);

Once the promise is given a value then
nano::bootstrap_attempt_legacy::run()
continues to call push_request in parallel with pool_connection and it is
a race condition.

Although this is a bug, this is likely not a major nor catastrophic bug
because the bootstrap push functionality is only an efficiency.

resolves #3533
resolves #3720

The unit test bootstrap_processor.push_one fails intermittently and it
exposes a race condition in the bootstrap code of the node. nanocurrency#3533

The race condition is that after finishing with bootstrap frontiers pull,
the code is supposed to put the socket back into the idle bootstrap
connections queue. However it does that in a racy way. The problem is in
the function nano::frontier_req_client::received_frontier.

Near the bottom of that function, we should call:
connection->connections.pool_connection (connection);
before calling
promise.set_value (false);

Once the promise is given a value then
nano::bootstrap_attempt_legacy::run()
continues to call push_request in parallel with pool_connection and it is
a race condition.

Although this is a bug, this is likely not a major nor catastrophic bug
because the bootstrap push functionality is only an efficiency.

resolves nanocurrency#3533
resolves nanocurrency#3720
Copy link
Contributor

@theohax theohax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2022-02-08T04:32:05.3477625Z + [[ Linux == \L\i\n\u\x ]]
2022-02-08T04:32:05.3477883Z + clang --version
2022-02-08T04:32:05.3478439Z ./ci/build-travis.sh: line 33: clang: command not found
2022-02-08T04:32:05.3478947Z + BACKTRACE=-DNANO_STACKTRACE_BACKTRACE=ON
2022-02-08T04:32:05.3480243Z + cmake '-GUnix Makefiles' -DACTIVE_NETWORK=nano_dev_network -DNANO_TEST=ON -DNANO_GUI=ON -DPORTABLE=1 -DNANO_WARN_TO_ERR=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_VERBOSE_MAKEFILE=ON -DBOOST_ROOT=/tmp/boost -DNANO_SHARED_BOOST=ON -DQt5_DIR=/usr/lib/x86_64-linux-gnu/cmake/Qt5 -DCI_TEST=1 -DNANO_STACKTRACE_BACKTRACE=ON ..
2022-02-08T04:32:05.3481095Z fatal: No names found, cannot describe anything.
2022-02-08T04:32:05.3482385Z CMake Warning at flatbuffers/CMake/Version.cmake:22 (message):
2022-02-08T04:32:05.3599182Z   git describe failed with exit code: 128
2022-02-08T04:32:05.3599477Z Call Stack (most recent call first):
2022-02-08T04:32:05.3599737Z   flatbuffers/CMakeLists.txt:559 (include)
2022-02-08T04:32:05.3600146Z 

Probably this makes the CI fail, I'll have a look into it. Fix itself looks good to me (y)

Oh, actually I think this is the culprit:

[ RUN      ] block_store.empty_bootstrap
+ core_test_res=124

Looks like a timeout?

@dsiganos
Copy link
Contributor Author

dsiganos commented Feb 8, 2022

Return value 124 is the timeout code of the command timeout.
The test block_store.empty_bootstrap hang.

@dsiganos dsiganos merged commit ddf07f1 into nanocurrency:develop Feb 8, 2022
@dsiganos dsiganos deleted the bootstrap_pull_push_race_issue3720 branch February 8, 2022 12:46
@theohax
Copy link
Contributor

theohax commented Feb 8, 2022

Return value 124 is the timeout code of the command timeout. The test block_store.empty_bootstrap hang.

Yes, looks like a timeout... probably a buggy test therefore

@zhyatt zhyatt added the bug label Feb 8, 2022
@zhyatt zhyatt added this to the V24.0 milestone Feb 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants