Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating the Nodes domain with timedCancellable #475

Merged
merged 10 commits into from
Oct 7, 2022

Conversation

tegefaulkes
Copy link
Contributor

@tegefaulkes tegefaulkes commented Oct 5, 2022

Description

This PR addresses the timedCancellable changes for the nodes domain. Part of this work was done in #445. This is just to fully finish off the domain.

Issues Fixed

Tasks

  • 1. Where needed the nodes methods need to be updated to support cancellability.
  • 2. where needed the nodes methods need to use the timedCancellable decorator.
  • 3. parts of the code making use of these timedCancellable methods needs to be updated to exploit this new functionality.
  • 4. Tests need to be expanded to test a wide range of timeout and cancellation conditions.
  • 5. If possible use the transaction locking for the object maps. Object map locks should be removed in favour of transaction locks #443 This was already using LockBox. So it has already been addressed.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@ghost
Copy link

ghost commented Oct 5, 2022

👇 Click on the image for a new way to code review
  • Make big changes easier — review code in small groups of related files

  • Know where to start — see the whole change at a glance

  • Take a code tour — explore the change with an interactive tour

  • Make comments and review — all fully sync’ed with github

    Try it now!

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map Legend

@tegefaulkes
Copy link
Contributor Author

There is something funky happening with the context timer. I have a situation where we have a public method that is using the timedCancellable decorator to set up the ctx. The ctx is then passed into a private method. This private method isn't a timedCanceallbe, since it's private it just get's passed the context and never sets it up.

So what is happening is, when the first method nodeConnectionManager.acquireConnection(remoteNodeId1) is called, it sets up the timer. We can print it out and it's fine.

// Inside of acquireConnection
    console.log(ctx.timer.delay);
    console.log(ctx.timer.getTimeout());

// results in 
console.log
    20000

      at constructor_.acquireConnection (src/nodes/NodeConnectionManager.ts:178:13)

  console.log
    19988

      at constructor_.acquireConnection (src/nodes/NodeConnectionManager.ts:179:13)

However when it is passed into NodeConnectionmanager.getConnection() as a required parameter and we print the same details we get.

console.log
    20000

      at constructor_.getConnection (src/nodes/NodeConnectionManager.ts:282:13)

  console.log
    0

      at constructor_.getConnection (src/nodes/NodeConnectionManager.ts:283:13)

So I'm guessing the timer is getting cancelled for some reason?

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Oct 6, 2022

Be aware that the decorators are kind of "smart". There are some rules as how it sets up the ctx. If you're getting values that you don't expect, then just do a sanity check that these 2 ctxes are in fact the same ctx when you are propagating.

Also see my context tests, they show how you're supposed to be propagating the ctx.

@tegefaulkes
Copy link
Contributor Author

I see the problem. acquireConnection is a higher order function that returns a new function to be called. The timer it creates only exists for the duration of the acquireConnection call but not the existence of the returned function. So what's happening is that the timer is set up and immediately cancelled when the arrow function is returned. When the arrow function tries to get the delay it gets 0 from getTimeout since the timer was cancelled early.

The solution here is to make getConnection use timedCancellable even though it's a protected method and make acquireConnection pass through the context.

@CMCDragonkai
Copy link
Member

Yes, timedCancellable should be used even with protected methods. The advice of not using decorators only apply to protected methods when dealing with the @ready decorator.

@tegefaulkes
Copy link
Contributor Author

In that case, the higher order functions are still an issue? Since the timer it passes along can and will be cancelled early?

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Oct 6, 2022

I'm not sure what you mean by that.

If you use async function, whether a normal function declaration or arrow function. The interpreter forces the usage of Promise (and totally ignores the PromiseCancellable).

The timedCancellable must be always the outer wrapper.

const f = async () => { ... };
const g = timedCancellable(f);

// If you do this, the `h` promise is no longer cancellable
const h = async () => {
  return g();
};

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Oct 6, 2022

Ok early cancellation occurs due to the idea of "ownership".

If the timer is owned by the decorator or the HOF variant.

Then it will shut that down once the result is returned.

Only when the timer is propagated in and therefore not owned by the decorator or HOF variant will it not clean up after itself.

I follow the principle of:

"Whatever creates it, must destroy it".

Unless you have an explicit "ownership passing" operation.

@CMCDragonkai
Copy link
Member

So what's happening is that the timer is set up and immediately cancelled when the arrow function is returned.

Ok so if you have a higher order function that returns a lower function. And the higher order function is timed cancellable. Then yes this is what will happen. But it doesn't make sense to apply timedCancellable to the higher order function. It should be applied the lower order function if that's what you want.

@CMCDragonkai
Copy link
Member

In that case, you may not be able to use the decorator, you have to use the HOF variant. That's why I created both @timedCancellable AND timedCancellable. The former is "OOP style syntax", the latter is a functional style syntax.

@tegefaulkes
Copy link
Contributor Author

All the nodes tests are passing again. I need to take a look and ping node again. It doesn't seem to be respecting the timeout.

@tegefaulkes
Copy link
Contributor Author

tomorrow I need to:

  • Take a look at the pingNode timing out properly
  • Add any tests as needed.
  • Check the object map locking for NodeManagerConnectionManager. It doesn't use the database so it doesn't make sense to use transaction locks. I think it was already converted to use LockBox already.

@tegefaulkes
Copy link
Contributor Author

Maybe long running tasks need their own parameter. Right now I have them using the connConnectTime for the timeout. This doesn't make sense for some methods such as syncNodeGraph or refreshBucket. For long running tasks such as theses i'm adding a new parameter to NodeManager called longTaskTimeout and defaulted it to 2 min.

@tegefaulkes
Copy link
Contributor Author

The pingNode not respecting the timeout is an odd one. It seems that the chaining of signal abortion is broken. I've tracked it down to a specific condition that is causing it.

When calling nodeManager.pingNode it does two things, first calls nodeConnectionManager.findNode and then nodeConnectionManager.pingNode. both of these are taking the ctx context. Within nodeManager.pingNode the timer ending event and signal event both happen. But in the NodeConnectionManager.pingNode and deeper, only the timer end event happens. If I don't pass the ctx to the findNode the signal propagates through the nodeConnectionManager.pingNode like expected.

SO it seems that the findNode call completing is causing the signal chain to be broken. But the findNode is called and completes before calling the nodeConnectionManager.pingNode. How is it breaking the signal chain between nodeManager.pingNode and nodeConnectionManager.pingNode.?

@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Oct 7, 2022

The pingNode problem has been fixed. Turns out the timedCancellable decorator was mutating the ctx passed into it. It had the peculiar effect where instead of the signal chain branching out and diverging when f1 called f2 then f3. The mutation called the branching chains to flatten out to a single chain. ctx1 -> ctx2 -> ctx3. When f2 call returned, it's signal was cleaned up causing the ctx2 link to break. What we want is a chain of ctx1 -> ctx2, ctx1 -> ctx3. the mutation was causing the ctx1 in f1 to be replaced with ctx2 when f2 was called.

We fixed this by having the decorators copy the contexts before making changes and passing them along. This prevents the mutation and maintains the branching signal chains. We had to apply this fix to all 3 decorators.

Changes can be seen in commit fix: ctx is copied to avoid mutation side effects 0a2653a.

@tegefaulkes
Copy link
Contributor Author

As far as the tests go. I don't think we need to add any more. The only timeouts were really updated here. The two main tests for this, ping timing out and connections timing out already exist and were updated. Anything weird that could happen that would need more testing would come from either network or GRPC at this point.

I did add 1 test for if the NodeConnection timed out after the expected delay.

The `timedCancellable` methods in `NodeManager` and `NodeConnectionManager` should take the default timeouts dynamically from the constructed parameters.
`acquireConnection` returns an arrow function for getting the connection. As a result, the timer it created is cancelled immediately. This resulted in the connection timing out early.
This was causing a peculiar bug where if a `ctx` was passed to function calls twice in a method it could cause the signal abortion chain to be broken. In essence in stead of branching the chain out with multiple call paths. The mutation flattened the tree into a chain, when one branch completed the decorator clean up broke the abortion chain.
…imeout

This can still be overwritten but by default to total timeout depends on the sub methods.
@tegefaulkes tegefaulkes force-pushed the feature-nodes_cancellability branch from a987a62 to 69c0ffe Compare October 7, 2022 08:19
@tegefaulkes
Copy link
Contributor Author

This should be good to merge.

@tegefaulkes tegefaulkes marked this pull request as ready for review October 7, 2022 08:24
@tegefaulkes
Copy link
Contributor Author

tegefaulkes commented Oct 7, 2022

[nix-shell:~/matrixCode/polykey/js-polykey]$ npm run polykey -- nodes add --node-path ./tmp/A vr7bll6lhbd0d7pajgi1db  
gb35sasfu0j0ag65hr6fjquod9rc9vg 127.0.0.1 42439  
  
> [email protected] polykey  
> ts-node src/bin/polykey.ts "nodes" "add" "--node-path" "./tmp/A" "vr7bll6lhbd0d7pajgi1dbgb35sasfu0j0ag65hr6fjquod9  
rc9vg" "127.0.0.1" "42439"  
  
✔ Please enter the password … ********  
  
[nix-shell:~/matrixCode/polykey/js-polykey]$ npm run polykey -- nodes ping --node-path ./tmp/A vr7bll6lhbd0d7pajgi1d  
bgb35sasfu0j0ag65hr6fjquod9rc9vg  
  
> [email protected] polykey  
> ts-node src/bin/polykey.ts "nodes" "ping" "--node-path" "./tmp/A" "vr7bll6lhbd0d7pajgi1dbgb35sasfu0j0ag65hr6fjquod  
9rc9vg"  
  
Node is Active.

After the node has shutdown

[nix-shell:~/matrixCode/polykey/js-polykey]$ npm run polykey -- nodes ping --node-path ./tmp/A vr7bll6lhbd0d7pajgi1d  
bgb35sasfu0j0ag65hr6fjquod9rc9vg  
  
> [email protected] polykey  
> ts-node src/bin/polykey.ts "nodes" "ping" "--node-path" "./tmp/A" "vr7bll6lhbd0d7pajgi1dbgb35sasfu0j0ag65hr6fjquod  
9rc9vg"  
  
No response received  
ErrorCLINodePingFailed: Node was not online or not found. - No response received  
 exitCode      1  
 timestamp     Fri Oct 07 2022 19:50:51 GMT+1100 (Australian Eastern Daylight Time)  

A node it doesn't know

[nix-shell:~/matrixCode/polykey/js-polykey]$ npm run polykey -- nodes ping --node-path ./tmp/A vr7bll6lhbd0d7pajgi1d  
bgb35sasfu0j0ag65hr6fjquod2rc9vg                  
  
> [email protected] polykey  
> ts-node src/bin/polykey.ts "nodes" "ping" "--node-path" "./tmp/A" "vr7bll6lhbd0d7pajgi1dbgb35sasfu0j0ag65hr6fjquod  
2rc9vg"  
  
Failed to resolve node ID vr7bll6lhbd0d7pajgi1dbgb35sasfu0j0ag65hr6fjquod2rc9vg to an address.  
ErrorCLINodePingFailed: Node was not online or not found. - Failed to resolve node ID vr7bll6lhbd0d7pajgi1dbgb35sasf  
u0j0ag65hr6fjquod2rc9vg to an address.  
 exitCode      1  
 timestamp     Fri Oct 07 2022 19:48:58 GMT+1100 (Australian Eastern Daylight Time)  
 cause: ErrorPolykeyRemote: Remote error from RPC call  
   command     nodesPing  
   nodeId      vrrtjta93vhe7dputjb3441t6k2hg9504gb0tpjcjl3bh9d1bdqtg  
   host        127.0.0.1  
   port        39957  
   timestamp   Fri Oct 07 2022 19:48:58 GMT+1100 (Australian Eastern Daylight Time)  
   cause: ErrorNodeGraphNodeIdNotFound: Could not find NodeId  
     exitCode  67  
     timestamp Fri Oct 07 2022 19:48:58 GMT+1100 (Australian Eastern Daylight Time)

In terms of usage, I don't think it should be printing the error when it fails to ping. By design it exits with the error code of 1, but that is an expected failure.

@tegefaulkes
Copy link
Contributor Author

I'm merging this now. Next step, looking into the NAT test failures.

@tegefaulkes tegefaulkes merged commit e428206 into staging Oct 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Update Nodes domain to used timedCancellable
2 participants