
fix: disconnected node was never ready #3312

Merged
merged 1 commit into tari-project:development on Sep 7, 2021

Conversation

@Cifko (Contributor) commented Sep 7, 2021

Description

Disconnected node was never ready.

How Has This Been Tested?

npm test -- --name "Full block sync with small reorg"
npm test -- --name "Pruned mode reorg simple"
npm test -- --name "Pruned mode reorg past horizon"

@SWvheerden (Collaborator) left a comment

I might be completely wrong here, but I think something else is wrong if this fixes the error.

handle_liveness_event is supposed to update the base_node or listening state with the ChainMetaData of neighboring nodes. It is not supposed to stop the node from broadcasting or from telling other clients its own ChainMetaData.

send_chain_metadata_to_event_publisher is called when the node receives a new ping or pong from a neighbor. So this PR will resend the entire list it just sent whenever this node starts a new ping round, even though that data is only supposed to be neighbor info. In the cucumber tests in question, the node is alone and thus never receives a ping or pong message; that's why this change "fixes" it. But the question is why the node should be unable to broadcast its own ChainMetaData just because it never receives other nodes' ChainMetaData.
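For orientation, here is a minimal sketch of the event flow under discussion. All names and types are hypothetical stand-ins, not the actual tari code: a pong adds a neighbor's metadata and publishes the list, while the change under review also publishes on every ping-round broadcast, even when the list is empty.

```rust
// Hypothetical sketch, not the actual tari types; u64 stands in for a
// peer's ChainMetadata.
enum LivenessEvent {
    ReceivedPong(u64),         // a neighbor answered with its metadata
    PingRoundBroadcast(usize), // a ping round started; payload = peers pinged
}

struct ChainMetadataService {
    peer_metadata: Vec<u64>,
}

impl ChainMetadataService {
    fn handle_liveness_event(&mut self, event: &LivenessEvent) {
        match event {
            LivenessEvent::ReceivedPong(md) => {
                // Normal path: record a neighbor's metadata, then publish.
                self.peer_metadata.push(*md);
                self.send_chain_metadata_to_event_publisher();
            }
            LivenessEvent::PingRoundBroadcast(_) => {
                // The behavior this PR adds: publish even though nothing was
                // received, so a lone node's listening state still gets an event.
                self.send_chain_metadata_to_event_publisher();
            }
        }
    }

    fn send_chain_metadata_to_event_publisher(&self) {
        println!("publishing {} neighbor entries", self.peer_metadata.len());
    }
}

fn main() {
    let mut svc = ChainMetadataService { peer_metadata: Vec::new() };
    svc.handle_liveness_event(&LivenessEvent::ReceivedPong(42)); // connected case
    svc.handle_liveness_event(&LivenessEvent::PingRoundBroadcast(0)); // lone node
}
```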

@aviator-app (bot) merged commit dfc6fd2 into tari-project:development on Sep 7, 2021
@sdbondi (Member) commented Sep 8, 2021

Yeah I tend to agree with @SWvheerden that something else is wrong (maybe there are no other peers to connect to in the cucumber tests?). The listening state responds to the event from the chain metadata service. So I'm going to say that it is triggering this code early in listening.rs.

```rust
if peer_metadata_list.is_empty() {
    debug!(
        target: LOG_TARGET,
        "No peer metadata to check. Continuing in listening state.",
    );

    if !self.is_synced {
        debug!(target: LOG_TARGET, "Initial sync achieved");
        self.is_synced = true;
        shared.set_state_info(StateInfo::Listening(ListeningInfo::new(true)));
    }
}
```

This may (need to check) be breaking the base node in a subtle way, where it says it is synced without actually checking whether it needs to sync first. Maybe that ends up being ok because it will kick back to is_synced == false if needed, but then we may as well start the base node with is_synced == true.

It is the chain metadata service's job to provide a good view of the chain metadata to whoever is listening, so publishing empty chain metadata is probably breaking that contract.

@Cifko (Contributor, Author) commented Sep 8, 2021

> Yeah I tend to agree with @SWvheerden that something else is wrong (maybe there are no other peers to connect to in the cucumber tests?).

That's true. There are no other nodes, so the node never receives a ping or pong; the only way to trigger anything is from the PingRoundBroadcast. Otherwise the node will be stuck in listening.rs at next_event, waiting for an update. Alternatively, we could have a pair of nodes in the tests that ping each other (I've just tested this on "Pruned mode reorg simple").

> This may (need to check) be breaking the base node in a subtle way, where it says it is synced without actually checking whether it needs to sync first. Maybe that ends up being ok because it will kick back to is_synced == false if needed, but then we may as well start the base node with is_synced == true.

How about we put the send_chain_metadata_to_event_publisher call inside an if num_peers == 0 check? I'm not sure, though, whether the number is 0 from the start (need to check).
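A minimal sketch of what that guard could look like, with hypothetical names and signature; the real call site in the chain metadata service will differ:

```rust
// Hypothetical sketch: only force a publish on a broadcast round when no
// peers were pinged, i.e. the node appears to be alone.
fn on_ping_round_broadcast(num_peers: usize, publish_metadata: impl Fn()) {
    if num_peers == 0 {
        // Nobody to ping, so no pong will ever arrive; publish the (empty)
        // list so the listening state is not left waiting forever.
        publish_metadata();
    }
    // With peers present, do nothing and wait for their pongs as before.
}

fn main() {
    on_ping_round_broadcast(0, || println!("publishing empty metadata list"));
    on_ping_round_broadcast(3, || println!("this closure is never called"));
}
```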

@sdbondi (Member) commented Sep 8, 2021

I checked: the liveness service will send the PingRoundBroadcast every configured interval even if there are no peers in the round, i.e. PingRoundBroadcast(0) does get sent. I think that check is better, as it will stop unnecessary events when we do have peers to talk to.
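As a rough model of that timer-driven behavior (assuming a tokio interval; the actual liveness service internals are not shown in this thread):

```rust
// Rough, hypothetical model: the broadcast round fires on a timer, so
// PingRoundBroadcast(0) is emitted even with zero peers selected per round.
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut round_timer = tokio::time::interval(Duration::from_millis(100));
    for _ in 0..3 {
        round_timer.tick().await;
        let num_peers = 0; // a disconnected node selects nobody to ping
        println!("PingRoundBroadcast({num_peers})");
    }
}
```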

@SWvheerden (Collaborator) commented
Found the actual problem.

Line 121 in listening.rs: the code sits here and waits, at let metadata_event = shared.metadata_event_stream.recv().await;, for an event to come through. This happens because the ChainMetaData service only sends out an event if it receives ChainMetadata.

This PR fixes the problem because it now sends out an empty Vec on every ping broadcast event, so the code does not get stuck there. The problem is that the piece of code that sets is_synced == true comes after that wait.
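A condensed, hypothetical reproduction of that stall, using a tokio broadcast channel as a stand-in for metadata_event_stream (none of this is the actual tari code):

```rust
// Hypothetical stand-in for the listening-state wait described above.
use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (tx, mut metadata_event_stream) = broadcast::channel::<Vec<u32>>(8);

    // Before this PR a disconnected node never produced an event, so the
    // recv().await below would park forever. The PR's empty-Vec send on each
    // ping broadcast round is what lets execution continue past this point.
    tx.send(Vec::new()).unwrap();

    let peer_metadata_list = metadata_event_stream.recv().await.unwrap();
    if peer_metadata_list.is_empty() {
        // Only now can the is_synced = true branch quoted earlier run.
        println!("no peer metadata to check; initial sync achieved");
    }
}
```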

@Cifko (Contributor, Author) commented Sep 8, 2021

So we agree we need to somehow trigger listening.rs to continue. There are two options as I see it:

  1. Change the code so that send_chain_metadata_to_event_publisher is only called when num_peers == 0.
  2. Change the tests to use a pair of nodes that ping each other, and revert the code to what it was before this PR.

I prefer option 1, as I encountered the problem outside of the cucumber tests (in my manual testing environment).

@sdbondi (Member) commented Sep 8, 2021

I think option 1 essentially says that we declare network silence after a few seconds, assume we're the only node, and are therefore synced. That seems fine for cucumber tests, but may be a bit quick on a real base node. The way we handle network silence isn't great; this responsibility feels like it should fall on the chain metadata service. Basically, if there are n rounds with no ping/pong peers, then we should broadcast a ChainMetadataEvent::NetworkSilence event that will be handled appropriately by the listening state (there is already a NetworkSilence StateEvent).
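One possible shape of that suggestion, sketched with hypothetical names and an assumed threshold (only the NetworkSilence StateEvent is confirmed by the comment to exist): count consecutive empty ping rounds and emit a silence event once n of them have passed.

```rust
// Hypothetical sketch; threshold and names are assumptions, not tari code.
const SILENCE_ROUNDS: usize = 3; // n rounds with no ping/pong peers

#[derive(Debug)]
enum ChainMetadataEvent {
    NetworkSilence, // would map to the existing NetworkSilence StateEvent
}

struct SilenceTracker {
    empty_rounds: usize,
}

impl SilenceTracker {
    fn on_ping_round(&mut self, num_peers: usize) -> Option<ChainMetadataEvent> {
        if num_peers == 0 {
            self.empty_rounds += 1;
            if self.empty_rounds == SILENCE_ROUNDS {
                return Some(ChainMetadataEvent::NetworkSilence);
            }
        } else {
            self.empty_rounds = 0; // peers are back; reset the counter
        }
        None
    }
}

fn main() {
    let mut tracker = SilenceTracker { empty_rounds: 0 };
    for round in 1..=4 {
        if let Some(event) = tracker.on_ping_round(0) {
            println!("round {round}: broadcasting {event:?}");
        }
    }
}
```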

Cifko added a commit to Cifko/tari that referenced this pull request Sep 10, 2021