
Possible regression after #1144 #1148

Closed
Thegaram opened this issue Mar 30, 2020 · 7 comments · Fixed by #1163
Labels: bug (Something isn't working), P2 (Low Priority Issue that awaits fix)

@Thegaram (Contributor)

On my local machine, p2p_era_test fails reliably after #1144 (310dc73). The previous commit (5b6b7ca) works fine.

The process just keeps hanging (I waited about 5 minutes, not the full 20). I suspect this might be a deadlock or a circular-reference issue.

2020-03-30T07:53:37.055000Z TestFramework (INFO): 1 generate block 0x803ba1c88d95a70db639973c7e6c66b0d18b09c56e1710091b8e315c47510902
2020-03-30T07:53:37.125000Z TestFramework (INFO): 2 generate block 0xd9f92b994c50a7e2ec560d94a588d2359914f647a7fe8bdac24f9552be75b811
2020-03-30T07:53:37.182000Z TestFramework (INFO): stop 4
2020-03-30T07:55:46.420000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 201, in main
    self.run_test()
  File "./tests/full_node_tests/p2p_era_test.py", line 56, in run_test
    self.maybe_restart_node(chosen_peer, self.stop_probability, clean_p)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 390, in maybe_restart_node
    self.start_node(i, wait_time=120, phase_to_wait=("NormalSyncPhase"))
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 345, in start_node
    node.wait_for_recovery(phase_to_wait, wait_time)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 206, in wait_for_recovery
    self.wait_for_phase(phase_to_wait, wait_time=wait_time)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 219, in wait_for_phase
    raise AssertionError(f"Node did not reach any of {phases} after {wait_time} seconds, current phase is {current_phase}")
AssertionError: Node did not reach any of NormalSyncPhase after 120 seconds, current phase is CatchUpRecoverBlockFromDbPhase
2020-03-30T07:55:46.423000Z TestFramework (INFO): Stopping nodes

^CTraceback (most recent call last):
  File "./tests/full_node_tests/p2p_era_test.py", line 73, in <module>
    P2PTest().main()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 226, in main
    self.stop_nodes()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 380, in stop_nodes
    node.stop_node()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 249, in stop_node
    self.wait_until_stopped()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 285, in wait_until_stopped
    wait_until(self.is_node_stopped, timeout=timeout)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/util.py", line 245, in wait_until
    time.sleep(0.5)
KeyboardInterrupt
[node 4] Cleaning up leftover process
[node 7] Cleaning up leftover process
[node 6] Cleaning up leftover process
[node 5] Cleaning up leftover process

real	6m7.904s
user	1m53.080s
sys	5m39.304s
The program fails at iteration 1
Thegaram added the bug (Something isn't working) label Mar 30, 2020
Thegaram self-assigned this Mar 30, 2020
@Thegaram (Contributor, Author)

It seems that occasionally consensus_receiver is never closed. As a consequence, the Consensus Worker thread never terminates, so it keeps data_man: Arc<BlockDataManager> and consensus: SharedConsensusGraph alive, which prevents shutdown.

The only possible reason I can see is that the sender half of the channel is never dropped, which suggests SynchronizationGraph is never dropped. I fail to see how #1144 could cause this, as the only relevant change is moving data_man into the thread, which should not create cyclical references (data_man does not reference sync_graph).
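
As a minimal sketch of the ownership pattern I mean (all type and function names below are made up, this is not the actual conflux-rust code): a worker blocked on recv() keeps its Arc handles alive until every sender is dropped.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Illustrative stand-ins only; not the real conflux-rust definitions.
struct BlockDataManager;
struct ConsensusGraph;

fn spawn_consensus_worker(
    data_man: Arc<BlockDataManager>,
    consensus: Arc<ConsensusGraph>,
    consensus_receiver: mpsc::Receiver<u64>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // recv() only returns Err after *every* Sender clone has been
        // dropped. While it blocks, the thread owns `data_man` and
        // `consensus`, so their reference counts cannot reach zero.
        while let Ok(_block) = consensus_receiver.recv() {
            let _ = (&data_man, &consensus); // stand-in for real processing
        }
        // Reached only once the sender half (held by the sync graph in
        // this sketch) goes away.
    })
}

fn main() {
    let (sender, receiver) = mpsc::channel();
    let handle = spawn_consensus_worker(
        Arc::new(BlockDataManager),
        Arc::new(ConsensusGraph),
        receiver,
    );
    sender.send(1).unwrap();
    drop(sender); // without this drop, join() below would hang forever
    handle.join().unwrap();
}
```

In this shape, shutdown can only complete once whatever owns the sender is dropped first.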

Further analysis suggests that the IoService worker thread join hangs indefinitely.

  • As a result, we will not drop IoService.
  • As a result, we cannot drop NetworkServiceInner (due to drop order of the fields of NetworkService, see here).
  • As a result, we cannot drop SynchronizationProtocolHandler (registered here) and LightProvider (registered here).
  • As a result, we cannot drop SynchronizationGraph.

If this is the case, I do not know yet why IoService won't terminate.
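
For context, a rough sketch of how such a drop-order chain looks in Rust (field names and layout here are assumptions, not the real NetworkService definition): struct fields are dropped in declaration order, so a Drop impl that blocks on a join keeps every later field alive.

```rust
use std::thread::JoinHandle;

struct IoService {
    workers: Vec<JoinHandle<()>>,
}

impl Drop for IoService {
    fn drop(&mut self) {
        // If any worker loop never exits, this join blocks forever.
        for handle in self.workers.drain(..) {
            let _ = handle.join();
        }
    }
}

// Placeholder for the struct that owns the registered protocol handlers
// (SynchronizationProtocolHandler, LightProvider) and, through them,
// SynchronizationGraph.
struct NetworkServiceInner;

struct NetworkService {
    // Fields drop in declaration order: `io_service` first, then `inner`.
    // A hanging IoService::drop therefore keeps `inner` alive, which in
    // turn keeps SynchronizationGraph (and its channel sender) alive.
    io_service: IoService,
    inner: NetworkServiceInner,
}
```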

(This might be related to #939.)

fanlong added the P2 (Low Priority Issue that awaits fix) label Mar 31, 2020
@fanlong (Contributor) commented Mar 31, 2020

This looks very weird. It seems my PR exposed an existing bug in the code rather than introducing the issue itself.

@Thegaram (Contributor, Author)

I agree. It seems, though, that this issue is reliably reproducible only on my machine; CI seems fine, and @peilun-conflux was also unable to reproduce it.

MacBook Pro (Catalina 10.15.1)
2.3 GHz Dual-Core Intel Core i5
8 GB 2133 MHz LPDDR3
rustc 1.42.0 (b8cedc004 2020-03-09)

@peilun-conflux (Contributor) commented Mar 31, 2020

The failure to reach NormalSyncPhase is related to #1144.

Consensus may now process blocks in a different order, and the check here will never pass if the last block inserted into the sync graph is not the last block processed by the consensus graph.
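
As a hypothetical sketch of the shape of such a check (the names below are made up; the real predicate lives in the sync/consensus code and may differ):

```rust
/// Hypothetical phase-transition predicate, not the actual conflux-rust code.
fn ready_for_normal_sync(
    last_inserted_in_sync_graph: &[u8; 32],
    last_processed_by_consensus: &[u8; 32],
) -> bool {
    // If consensus processes blocks in a different order than they were
    // inserted into the sync graph, the block it processes last may never
    // equal the sync graph's last inserted block, so this stays false and
    // the node is stuck in CatchUpRecoverBlockFromDbPhase.
    last_inserted_in_sync_graph == last_processed_by_consensus
}

fn main() {
    let inserted = [1u8; 32];
    let processed = [2u8; 32];
    // With out-of-order processing these never match:
    assert!(!ready_for_normal_sync(&inserted, &processed));
}
```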

@peilun-conflux (Contributor)

The shutdown timeout is caused by the exact reason that @Thegaram mentioned in #939 here.

peilun-conflux self-assigned this Mar 31, 2020
fanlong self-assigned this Mar 31, 2020
@fanlong (Contributor) commented Mar 31, 2020

@peilun-conflux I will fix the sync issue with a new PR.

@peilun-conflux (Contributor)

> The shutdown timeout is caused by the exact reason that @Thegaram mentioned in #939 here.

OK, while trying to fix this shutdown timeout issue I noticed that it is directly caused by the IO Worker being blocked in its while loop while io_service joins these threads. Thus #939 is not directly related to this issue.
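
For reference, a minimal sketch of the blocked-worker pattern and one common way out (explicitly signalling shutdown before joining); this is not the actual IoService implementation, just the general shape under assumed names:

```rust
use std::sync::mpsc;
use std::thread;

// The message type is made up; only the shutdown pattern matters here.
enum IoMessage {
    Work(u64),
    Shutdown,
}

struct IoService {
    sender: mpsc::Sender<IoMessage>,
    worker: Option<thread::JoinHandle<()>>,
}

impl IoService {
    fn start() -> Self {
        let (sender, receiver) = mpsc::channel();
        let worker = thread::spawn(move || {
            // The worker blocks here; unless it is told to stop (or every
            // sender is dropped), this loop never ends and any join() on
            // the thread hangs.
            while let Ok(msg) = receiver.recv() {
                match msg {
                    IoMessage::Work(_token) => { /* handle an IO event */ }
                    IoMessage::Shutdown => break,
                }
            }
        });
        IoService { sender, worker: Some(worker) }
    }
}

impl Drop for IoService {
    fn drop(&mut self) {
        // Signal the worker to exit *before* joining it, so the join
        // cannot block forever on the loop above.
        let _ = self.sender.send(IoMessage::Shutdown);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

fn main() {
    let io = IoService::start();
    io.sender.send(IoMessage::Work(1)).unwrap();
    // `io` is dropped here: Drop sends Shutdown first, then joins cleanly.
}
```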
