
fix: update handling of SAF message propagation and deletion #3164

Merged: 4 commits merged into development on Aug 11, 2021

Conversation

philipr-za (Contributor)

Description

This PR adds two changes to the way SAF messages are handled to fix two subtle bugs spotted while developing cucumber tests.

The first issue was that when a Node propagates a SAF message it is storing to other nodes in its neighbourhood, the broadcast strategy it used only chose from currently connected base nodes. This meant that if the Node had an active connection to a Communication Client (wallet), it would not send the SAF message directly to that client but instead to other base nodes in the network region. As a result, the wallet would only receive new SAF messages when it actively requested them on connection, even though it was directly connected to the node.

This PR adds a new broadcast strategy called DirectOrClosestNodes, which first checks whether the node has a direct active connection to the destination and, if it does, sends the SAF message directly to it.
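
For illustration, a minimal sketch of that selection logic, assuming hypothetical NodeId and ConnectivityPool types rather than the real comms API:

```rust
// Illustrative sketch only: NodeId and ConnectivityPool are hypothetical
// stand-ins for the real comms types.
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct NodeId(u64);

struct ConnectivityPool {
    // Peers we currently hold an active connection to.
    active: HashMap<NodeId, ()>,
}

impl ConnectivityPool {
    fn has_active_connection(&self, peer: &NodeId) -> bool {
        self.active.contains_key(peer)
    }

    // Placeholder for "closest connected peers"; the real strategy ranks
    // connected base nodes by distance to the destination.
    fn closest_connected(&self, _dest: &NodeId, fanout: usize) -> Vec<NodeId> {
        self.active.keys().take(fanout).cloned().collect()
    }
}

/// DirectOrClosestNodes: send straight to the destination if we already
/// hold an active connection to it, otherwise fall back to the closest
/// connected nodes in the network region.
fn select_destinations(pool: &ConnectivityPool, dest: &NodeId, fanout: usize) -> Vec<NodeId> {
    if pool.has_active_connection(dest) {
        vec![dest.clone()]
    } else {
        pool.closest_connected(dest, fanout)
    }
}

fn main() {
    let dest = NodeId(42);
    let mut active = HashMap::new();
    active.insert(dest.clone(), ());
    let pool = ConnectivityPool { active };
    // Directly connected, so the message goes straight to the destination.
    assert_eq!(select_destinations(&pool, &dest, 8), vec![dest]);
}
```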

The second issue was a subtle problem where, when a node started to send SAF messages to a destination, it would remove the messages from the database based only on whether the outbound messages had been put onto the outbound message pipeline. The problem occurs when the TCP connection to that peer is actually broken: the sending of those messages fails at the end of the pipeline, but the SAF messages have already been deleted from the database.

This PR changes the way SAF messages are deleted. When a client asks a node for SAF messages, it also provides the timestamp of the most recent SAF message it has received. The node then sends all SAF messages it holds for that client since the timestamp, and deletes all SAF messages from before the specified timestamp. This serves as a form of ack that the client received the older messages at some point and they are no longer needed.
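
A rough sketch of the timestamp-as-ack pruning, assuming a hypothetical StoredMessage/SafDb pair standing in for the real store-and-forward database:

```rust
// Hypothetical StoredMessage and SafDb types; the real SAF store is keyed
// per destination and uses proper database-backed timestamps.

#[derive(Clone, Debug)]
struct StoredMessage {
    stored_at: u64, // seconds since epoch, for illustration
    body: Vec<u8>,
}

struct SafDb {
    // Messages held for a single destination, oldest first.
    messages: Vec<StoredMessage>,
}

impl SafDb {
    /// Return every message newer than `since`, and delete everything at or
    /// before it. The client-supplied `since` timestamp acts as an ack: the
    /// client has already received the older messages, so they can go.
    fn fetch_since_and_prune(&mut self, since: u64) -> Vec<StoredMessage> {
        let newer: Vec<StoredMessage> = self
            .messages
            .iter()
            .filter(|m| m.stored_at > since)
            .cloned()
            .collect();
        self.messages.retain(|m| m.stored_at > since);
        newer
    }
}

fn main() {
    let mut db = SafDb {
        messages: vec![
            StoredMessage { stored_at: 100, body: b"old".to_vec() },
            StoredMessage { stored_at: 200, body: b"new".to_vec() },
        ],
    };
    // The client reports it has everything up to t=100: it receives the
    // t=200 message, and the t=100 message is pruned as acknowledged.
    let sent = db.fetch_since_and_prune(100);
    assert_eq!(sent.len(), 1);
    assert_eq!(db.messages.len(), 1);
}
```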

How Has This Been Tested?

Unit tests have been updated to test this functionality.

Checklist:

  • I'm merging against the development branch.
  • I have squashed my commits into a single commit.

SWvheerden (Collaborator) previously approved these changes on Aug 6, 2021

Looks good, this is a good fix.

@sdbondi (Member) left a comment

Looks good - I had assumed that we would keep the broadcast strats the same and just make Closest send to the dest peer if a connection is available, rather than creating a new broadcast strat. (Perhaps a miscommunication, I had thought maybe a DirectOrClosest would be needed on the connectivity_manager.select_connections call, not as a broadcast strat)

Any reason/use-case that we need to keep the previous Closest behaviour around?
A quick look, I can't see a reason to have a new strat. If that is the case, I'd say rather fix ClosestNodes to send directly if connected.

Review thread on comms/dht/src/store_forward/service.rs (outdated, resolved)
@philipr-za (Contributor, Author) commented Aug 10, 2021

> A quick look, I can't see a reason to have a new strat. If that is the case, I'd say rather fix ClosestNodes to send directly if connected.

Cool, I was a bit hesitant to assume there was no use for it, so I left it in as-is just in case, but if you think it can be cut I will take it out. I think I will leave it called DirectOrClosestNodes just so that the name makes the behaviour clear.

@philipr-za (Contributor, Author) commented

> Looks good - I had assumed that we would keep the broadcast strats the same and just make Closest send to the dest peer if a connection is available, rather than creating a new broadcast strat. (Perhaps a miscommunication, I had thought maybe a DirectOrClosest would be needed on the connectivity_manager.select_connections call, not as a broadcast strat)
>
> Any reason/use-case that we need to keep the previous Closest behaviour around?
> A quick look, I can't see a reason to have a new strat. If that is the case, I'd say rather fix ClosestNodes to send directly if connected.

So I looked at this a bit and noted that the Join process still kind of needs ClosestNodes without the direct option. I also think there might be other applications where sending to a neighbourhood without doing the direct case might be useful.

The Join process might need an update in the future when we move to k-buckets, but I am going to leave that consideration for the future.

@stringhandler changed the title from "Update handling of SAF message propagation and deletion" to "fix: update handling of SAF message propagation and deletion" on Aug 10, 2021
aviator-app bot commented Aug 10, 2021

PR queued successfully. Your position in queue is: 3

aviator-app bot commented Aug 10, 2021

PR is on top of the queue now

philipr-za added a commit to philipr-za/tari that referenced this pull request Aug 10, 2021
# Based on tari-project#3164

This PR addresses the following scenario spotted by @stanimal:

- NodeA sends to NodeB (offline)
- NodeA goes offline
- NodeB receives the tx and cancels it (weird, I know)
- NodeA comes online and broadcasts the transaction
- NodeB is not aware of the transaction; the transaction is complete for NodeA

This is handled by adding logic so that if a FinalizedTransaction is received with no active Receive Protocols, the database is checked for a matching cancelled inbound transaction from the same pubkey. If one is found, the receiver might as well restart that protocol and accept the finalized transaction (sketched below, after the commit message).

A cucumber test is provided to test this case.

This required adding functionality to the Transaction and Output Manager services to reinstate a cancelled inbound transaction; unit tests are provided for that.
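
Purely as an illustrative sketch of that recovery flow, assuming hypothetical WalletDb and InboundTx types (the real wallet service API differs):

```rust
use std::collections::HashMap;

type TxId = u64;
type PublicKey = [u8; 32];

#[derive(Clone)]
struct InboundTx {
    source: PublicKey,
    cancelled: bool,
}

struct WalletDb {
    inbound: HashMap<TxId, InboundTx>,
}

/// Called when a finalized transaction arrives but no Receive Protocol is
/// active for it. If a matching cancelled inbound transaction from the same
/// pubkey exists, reinstate it so the finalized transaction can be accepted.
fn maybe_reinstate(db: &mut WalletDb, tx_id: TxId, source: &PublicKey) -> bool {
    match db.inbound.get_mut(&tx_id) {
        Some(tx) if tx.cancelled && &tx.source == source => {
            tx.cancelled = false; // restart the receive protocol from here
            true
        }
        _ => false, // unknown transaction or wrong sender: ignore
    }
}

fn main() {
    let sender = [7u8; 32];
    let mut db = WalletDb { inbound: HashMap::new() };
    db.inbound.insert(1, InboundTx { source: sender, cancelled: true });
    // NodeB cancelled tx 1; then NodeA's finalized transaction arrives.
    assert!(maybe_reinstate(&mut db, 1, &sender));
    assert!(!db.inbound[&1].cancelled);
}
```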
aviator-app bot commented Aug 10, 2021

PR failed to merge with reason: Some CI status(es) failed
Failed CI(s): ci/circleci: run-integration-tests

aviator-app bot commented Aug 10, 2021

PR queued successfully. Your position in queue is: 1

aviator-app bot commented Aug 10, 2021

PR failed to merge with reason: Some CI status(es) failed
Failed CI(s): ci/circleci: run-integration-tests

aviator-app bot commented Aug 11, 2021

PR queued successfully. Your position in queue is: 2

aviator-app bot commented Aug 11, 2021

PR is on top of the queue now

aviator-app bot merged commit cedb4ef into tari-project:development on Aug 11, 2021