
Possible regression after #1144 #1148

Closed
Thegaram opened this issue Mar 30, 2020 · 7 comments · Fixed by #1163
Labels: bug (Something isn't working), P2 (Low Priority Issue that awaits fix)

@Thegaram (Contributor)

On my local machine, p2p_era_test fails reliably after #1144 (310dc73). The previous commit (5b6b7ca) works fine.

The process just keeps hanging (I waited about 5 minutes, not the full 20). I suspect this might be a deadlock or a circular-reference issue.

2020-03-30T07:53:37.055000Z TestFramework (INFO): 1 generate block 0x803ba1c88d95a70db639973c7e6c66b0d18b09c56e1710091b8e315c47510902
2020-03-30T07:53:37.125000Z TestFramework (INFO): 2 generate block 0xd9f92b994c50a7e2ec560d94a588d2359914f647a7fe8bdac24f9552be75b811
2020-03-30T07:53:37.182000Z TestFramework (INFO): stop 4
2020-03-30T07:55:46.420000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 201, in main
    self.run_test()
  File "./tests/full_node_tests/p2p_era_test.py", line 56, in run_test
    self.maybe_restart_node(chosen_peer, self.stop_probability, clean_p)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 390, in maybe_restart_node
    self.start_node(i, wait_time=120, phase_to_wait=("NormalSyncPhase"))
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 345, in start_node
    node.wait_for_recovery(phase_to_wait, wait_time)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 206, in wait_for_recovery
    self.wait_for_phase(phase_to_wait, wait_time=wait_time)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 219, in wait_for_phase
    raise AssertionError(f"Node did not reach any of {phases} after {wait_time} seconds, current phase is {current_phase}")
AssertionError: Node did not reach any of NormalSyncPhase after 120 seconds, current phase is CatchUpRecoverBlockFromDbPhase
2020-03-30T07:55:46.423000Z TestFramework (INFO): Stopping nodes

^CTraceback (most recent call last):
  File "./tests/full_node_tests/p2p_era_test.py", line 73, in <module>
    P2PTest().main()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 226, in main
    self.stop_nodes()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_framework.py", line 380, in stop_nodes
    node.stop_node()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 249, in stop_node
    self.wait_until_stopped()
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/test_node.py", line 285, in wait_until_stopped
    wait_until(self.is_node_stopped, timeout=timeout)
  File "/Users/peter/work/repos/conflux-rust-clean/tests/test_framework/util.py", line 245, in wait_until
    time.sleep(0.5)
KeyboardInterrupt
[node 4] Cleaning up leftover process
[node 7] Cleaning up leftover process
[node 6] Cleaning up leftover process
[node 5] Cleaning up leftover process

real	6m7.904s
user	1m53.080s
sys	5m39.304s
The program fails at iteration 1
Thegaram added the bug (Something isn't working) label Mar 30, 2020
Thegaram self-assigned this Mar 30, 2020
@Thegaram (Contributor, Author)

It seems that occasionally consensus_receiver is never closed. As a consequence, the Consensus Worker thread never terminates, so it keeps data_man: Arc<BlockDataManager> and consensus: SharedConsensusGraph alive, which prevents shutdown.

The only possible reason I can see is that the sender half of the channel is never dropped, which suggests SynchronizationGraph is never dropped. I fail to see how #1144 could cause this, as the only relevant change is moving data_man into the thread, which should not create cyclical references (data_man does not reference sync_graph).
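
As a minimal sketch of the ownership pattern I mean (all type and function names below are made up, this is not the actual conflux-rust code): a worker blocked on recv() keeps its Arc handles alive until every sender is dropped.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Illustrative stand-ins only; not the real conflux-rust definitions.
struct BlockDataManager;
struct ConsensusGraph;

fn spawn_consensus_worker(
    data_man: Arc<BlockDataManager>,
    consensus: Arc<ConsensusGraph>,
    consensus_receiver: mpsc::Receiver<u64>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // recv() only returns Err after *every* Sender clone has been
        // dropped. While it blocks, the thread owns `data_man` and
        // `consensus`, so their reference counts cannot reach zero.
        while let Ok(_block) = consensus_receiver.recv() {
            let _ = (&data_man, &consensus); // stand-in for real processing
        }
        // Reached only once the sender half (held by the sync graph in
        // this sketch) goes away.
    })
}

fn main() {
    let (sender, receiver) = mpsc::channel();
    let handle = spawn_consensus_worker(
        Arc::new(BlockDataManager),
        Arc::new(ConsensusGraph),
        receiver,
    );
    sender.send(1).unwrap();
    drop(sender); // without this drop, join() below would hang forever
    handle.join().unwrap();
}
```

In this shape, shutdown can only complete once whatever owns the sender is dropped first.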

Further analysis suggests that the IoService worker thread join hangs indefinitely.

  • As a result, we will not drop IoService.
  • As a result, we cannot drop NetworkServiceInner (due to drop order of the fields of NetworkService, see here).
  • As a result, we cannot drop SynchronizationProtocolHandler (registered here) and LightProvider (registered here).
  • As a result, we cannot drop SynchronizationGraph.

If this is the case, I do not know yet why IoService won't terminate.
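
For context, a rough sketch of how such a drop-order chain looks in Rust (field names and layout here are assumptions, not the real NetworkService definition): struct fields are dropped in declaration order, so a Drop impl that blocks on a join keeps every later field alive.

```rust
use std::thread::JoinHandle;

struct IoService {
    workers: Vec<JoinHandle<()>>,
}

impl Drop for IoService {
    fn drop(&mut self) {
        // If any worker loop never exits, this join blocks forever.
        for handle in self.workers.drain(..) {
            let _ = handle.join();
        }
    }
}

// Placeholder for the struct that owns the registered protocol handlers
// (SynchronizationProtocolHandler, LightProvider) and, through them,
// SynchronizationGraph.
struct NetworkServiceInner;

struct NetworkService {
    // Fields drop in declaration order: `io_service` first, then `inner`.
    // A hanging IoService::drop therefore keeps `inner` alive, which in
    // turn keeps SynchronizationGraph (and its channel sender) alive.
    io_service: IoService,
    inner: NetworkServiceInner,
}
```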

(This might be related to #939.)

fanlong added the P2 (Low Priority Issue that awaits fix) label Mar 31, 2020
@fanlong (Contributor) commented Mar 31, 2020

This looks very weird. It seems my PR exposed an existing bug in the code rather than introducing the issue itself.

@Thegaram (Contributor, Author)

I agree. It seems, though, that this issue is reliably reproducible only on my machine; CI seems fine, and @peilun-conflux was also unable to reproduce it.

MacBook Pro (Catalina 10.15.1)
2.3 GHz Dual-Core Intel Core i5
8 GB 2133 MHz LPDDR3
rustc 1.42.0 (b8cedc004 2020-03-09)

@peilun-conflux (Contributor) commented Mar 31, 2020

The failure to reach NormalSyncPhase is related to #1144.

Consensus may now process blocks in a different order, and the check here will never pass if the last block inserted into the sync graph is not the last block processed by the consensus graph.
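
As a hypothetical sketch of the shape of such a check (the names below are made up; the real predicate lives in the sync/consensus code and may differ):

```rust
/// Hypothetical phase-transition predicate, not the actual conflux-rust code.
fn ready_for_normal_sync(
    last_inserted_in_sync_graph: &[u8; 32],
    last_processed_by_consensus: &[u8; 32],
) -> bool {
    // If consensus processes blocks in a different order than they were
    // inserted into the sync graph, the block it processes last may never
    // equal the sync graph's last inserted block, so this stays false and
    // the node is stuck in CatchUpRecoverBlockFromDbPhase.
    last_inserted_in_sync_graph == last_processed_by_consensus
}

fn main() {
    let inserted = [1u8; 32];
    let processed = [2u8; 32];
    // With out-of-order processing these never match:
    assert!(!ready_for_normal_sync(&inserted, &processed));
}
```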

@peilun-conflux (Contributor)

The shutdown timeout is caused by the exact reason that @Thegaram mentioned in #939 here.

peilun-conflux self-assigned this Mar 31, 2020
fanlong self-assigned this Mar 31, 2020
@fanlong (Contributor) commented Mar 31, 2020

@peilun-conflux I will fix the sync issue with a new PR.

@peilun-conflux (Contributor)

> The shutdown timeout is caused by the exact reason that @Thegaram mentioned in #939 here.

OK, while trying to fix this shutdown timeout issue I noticed that it is directly caused by the IO Worker being blocked in its while loop while io_service joins these threads. Thus #939 is not directly related to this issue.
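
For reference, a minimal sketch of the blocked-worker pattern and one common way out (explicitly signalling shutdown before joining); this is not the actual IoService implementation, just the general shape under assumed names:

```rust
use std::sync::mpsc;
use std::thread;

// The message type is made up; only the shutdown pattern matters here.
enum IoMessage {
    Work(u64),
    Shutdown,
}

struct IoService {
    sender: mpsc::Sender<IoMessage>,
    worker: Option<thread::JoinHandle<()>>,
}

impl IoService {
    fn start() -> Self {
        let (sender, receiver) = mpsc::channel();
        let worker = thread::spawn(move || {
            // The worker blocks here; unless it is told to stop (or every
            // sender is dropped), this loop never ends and any join() on
            // the thread hangs.
            while let Ok(msg) = receiver.recv() {
                match msg {
                    IoMessage::Work(_token) => { /* handle an IO event */ }
                    IoMessage::Shutdown => break,
                }
            }
        });
        IoService { sender, worker: Some(worker) }
    }
}

impl Drop for IoService {
    fn drop(&mut self) {
        // Signal the worker to exit *before* joining it, so the join
        // cannot block forever on the loop above.
        let _ = self.sender.send(IoMessage::Shutdown);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

fn main() {
    let io = IoService::start();
    io.sender.send(IoMessage::Work(1)).unwrap();
    // `io` is dropped here: Drop sends Shutdown first, then joins cleanly.
}
```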
