-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring Nodes Domain: NodeConnection & NodeConnectionManager and Fixing Bugs #225
Comments
Consider the possibility that each "NodeConnection" object can be
considered a "Data Access Object" that is actually handed to the
relevant domains that require nodes to use. Then NodeManager is about
maintaining a lifecycle of `NodeConnection` objects. This also means no
other domains should maintain or keep the reference long-running. Node
connection objects should only be short-running, and a pool of them is
kept in the node manager.
This is similar to how database connection pools are done. See how async
postgresql is done in Nodejs, or asyncpg in Python. Whether we are
connecting to a database, or to another keynode, the architecture and
style should be the same.
…On 8/11/21 3:21 PM, Josh wrote:
Requirements of this design
Eventually, we'd like to facilitate some discussion about the
structure of the |nodes| domain. This would aim to improve the
abstractions and separate shared classes of functionalities.
Additional context
Currently the |nodes| domain has the following structure:
* |NodeManager|: the top-level class that exposes API functions to
the outside world. Contains and manages its own instance of
|NodeGraph|, as well as handling all instances of |NodeConnection|
objects. Currently, these functions are either:
o public API 'relays' that internally call |NodeGraph| functions
o connection-related functions (internally set up a
|NodeConnection| object and then call |NodeConnection| functions)
* |NodeGraph|: a data structure containing the implementation of
Kademlia, used to distribute and store the node ID -> node address
mappings in a keynode
* |NodeConnection|: a class that handles a single client -> server
connection between 2 keynodes (entry-point for agent to agent GRPC
functions and message creation)
|NodeManager| is currently the only public-facing entrypoint to the
nodes domain. Both |NodeGraph| and |NodeConnection| could be
considered as a layer below |NodeManager|.
Roger and I were briefly discussing this structure when the idea of
dependency injecting |NodeGraph| into |NodeManager| was brought up (in
line with the other domains of Polykey). However, |NodeGraph| is
currently taking |NodeManager| as a dependency in its constructor
(passed as |this| when |NodeGraph| is constructed in |NodeManager|'s
constructor):
// NodeManager.ts
constructor({ ...}) {
this.nodeGraph = new NodeGraph({
nodeManager:this,
...
})
}
This is required such that |NodeGraph| can call the connection-related
functionality specified in |NodeManager|.
A solution to resolve this mutual dependency would be to "abstract the
commonality". We could potentially remove the connections
functionality from |NodeManager| entirely, replacing it into a
|NodeConnections| class (basically a manager specifically for the
lifetime of |NodeConnection| objects).
But then we could ask ourselves what the point of this is? That is,
what is |NodeManager| actually responsible for anymore?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#225>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAE4OHM5TT6QG75PBJTS4VLT4ICG7ANCNFSM5B5QZ7AA>.
|
I think we should follow an old-school "Connection Pool" design pattern.
Basically how client-server databases like postgresql and mysql is
provided to web applications.
…On 8/11/21 3:21 PM, Josh wrote:
Requirements of this design
Eventually, we'd like to facilitate some discussion about the
structure of the |nodes| domain. This would aim to improve the
abstractions and separate shared classes of functionalities.
Additional context
Currently the |nodes| domain has the following structure:
* |NodeManager|: the top-level class that exposes API functions to
the outside world. Contains and manages its own instance of
|NodeGraph|, as well as handling all instances of |NodeConnection|
objects. Currently, these functions are either:
o public API 'relays' that internally call |NodeGraph| functions
o connection-related functions (internally set up a
|NodeConnection| object and then call |NodeConnection| functions)
* |NodeGraph|: a data structure containing the implementation of
Kademlia, used to distribute and store the node ID -> node address
mappings in a keynode
* |NodeConnection|: a class that handles a single client -> server
connection between 2 keynodes (entry-point for agent to agent GRPC
functions and message creation)
|NodeManager| is currently the only public-facing entrypoint to the
nodes domain. Both |NodeGraph| and |NodeConnection| could be
considered as a layer below |NodeManager|.
Roger and I were briefly discussing this structure when the idea of
dependency injecting |NodeGraph| into |NodeManager| was brought up (in
line with the other domains of Polykey). However, |NodeGraph| is
currently taking |NodeManager| as a dependency in its constructor
(passed as |this| when |NodeGraph| is constructed in |NodeManager|'s
constructor):
// NodeManager.ts
constructor({ ...}) {
this.nodeGraph = new NodeGraph({
nodeManager:this,
...
})
}
This is required such that |NodeGraph| can call the connection-related
functionality specified in |NodeManager|.
A solution to resolve this mutual dependency would be to "abstract the
commonality". We could potentially remove the connections
functionality from |NodeManager| entirely, replacing it into a
|NodeConnections| class (basically a manager specifically for the
lifetime of |NodeConnection| objects).
But then we could ask ourselves what the point of this is? That is,
what is |NodeManager| actually responsible for anymore?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#225>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAE4OHM5TT6QG75PBJTS4VLT4ICG7ANCNFSM5B5QZ7AA>.
|
To add to this as well, currently we don't check for "liveness" of an active Obviously, if we move to a connection pool pattern, then this will be handled with that. |
Furthermore, we need to be correctly handling networking errors. From the nodes CLI fixes MR:
Some notes regarding this from our sprint meeting a couple of weeks ago (9th August):
If we use the connection pool pattern, the above points need to be considered. |
So we already have the concept of a "connection pool", whereby no other domains maintain or hold a long-running reference to a |
I've been thinking about the relationship of the nodes domain relative to other domains. Currently the nodes domain gets injected to other domains to be used and those domains will often asks the nodes domain for node connections to work with. I realise that there's a hierarchy here. Nodes domain is a lower level domain compared to vaults and, gestalt graph and related. High and low level relate to how user-facing it is. User operations include CLI and API calls to the PK agent which then are handled by a grpc server which will call these domains. For example if vaults domain were at the same level as nodes domain, then it would mean that the call handler would interact with both domains directly and then pass any relevant objects from the nodes domain to the vaults domain. This is how I've often done my database structure like:
The reason to do this was to be able to share db connection instance with multiple operations and to potentially allow the controller to control nested contexts relating to the connection such as a transaction context across multiple operations. Furthermore connection resource lifetime was tied to the controller handling rather than to specific model operations. To a significant extent multiple domains may be involved in working against the same node in the same way multiple models may be working with the same DB connection. If the nodes domain is considered something lower level that is injected into the different domains, it would be like constructing models by injecting a db dependency in them. This works as long as we think that the "controller" which in our case is the gRPC API handlers doesn't have to control nested contexts for a given node connection used by multiple domains. This would mean that the domains like vaults and such is much more richer and performs controller-like functionality. If we find out that there needs to be operations done across domains, how do we deal with this atm? Either we abstract another domain on top and then use that like we are doing with vaults domain on top of nodes and db domain, or we eventually flatten the hierarchy and do alot more operations at the grpc handlers. The file structure right now lends towards a flattened hierarchy. But the internal connections through DI indicates a more complex hierarchy. It's usually better to have more flattened hierarchy and leave the composition at the very end for maximum flexibility but the trade off is more overhead and boilerplate. I'll need to see how the vaults use the nodes domain along with other domains to really get a good feel of how to proceed. |
@joshuakarp I think you have been focusing on nodes domain internal architecture and I'm also concerned about the relationship between nodes and other domains and how it will all work. |
Handling |
I've turned this issue into an epic and added all the related issues that can all be solved in one go with some restructuring of the nodes domain. This was based on the discussion on the nodes domain test failures happening since we were reviewing the CARL, session management and async-init fixes #281 applied there. I'll be fleshing out the specification here in the issue description regarding what we want to do. |
Spec has been updated, please review. This is going to be the biggest loose end before release. |
Implementation of this will require @tegefaulkes and @joshuakarp. |
I've been playing around with some timer mocking for the existing tests (https://jestjs.io/docs/timer-mocks). Ideally I could get these reduced and working before we start refactoring, such that it's easier to diagnose a test failure from implementation as opposed to attempting to diagnose issues with this after refactoring nodes. We can see the following test using import { sleep } from '@/utils';
describe('Timers', () => {
test('testing', async () => {
jest.useFakeTimers();
const s = sleep(10000);
jest.runAllTimers();
await s;
});
});
For the
What's interesting with the |
We'll be merging #289 before beginning work on this refactoring effort. The seed nodes work is simple enough to merge, such that we don't over-complicate it with this refactoring. |
As a very rough starting point, I'm estimating this would take about a week and a half (including testing and review). Currently scheduling it after the testnet deployment, from Monday Dec 6th to Wednesday Dec 15th. |
As another supporting task, @emmacasolin could also be assigned to figuring out the timer mocking for the nodes tests with this issue (this already has some relation to #264 too). Would also be a reasonable introduction to the |
@tegefaulkes will most likely be on leave when we're tackling this issue, so for now have unassigned him. |
For our |
There was also a point brought up whether it would be worthwhile to refactor the nodes domain to use the |
Yes it just means you would be doing |
A quick note for myself/@emmacasolin: when looking into connection timeouts for testing, have a read of this comment too #289 (comment). |
I've been having a think about this. Right now:
Given that |
Specifically with respect to the vaults domain, this refactoring effort shouldn't actually cause too many changes there. Previously, the vaults domain was already (previously incorrectly) performing operations on the const nodeConnection = await this.nodeManager.getConnectionToNode(
pullNodeId!,
);
const client = nodeConnection.getClient();
// ... do operations on client This functionality is actually what we intend to move towards, and as such, shouldn't require too many changes. We'll instead just need to wrap the operations on the client inside Notably, in the vaults domain, the following will need to be checked and changed:
The actual API of the vaults domain remains unchanged though. We simply inject |
After some discussion, a proposed architecture diagram:
In this architecture some desired APIs:
All other domains should be only using
Note that |
@joshuakarp we will need specced out APIs for of the nodes classes. Some notes from our discussion about comparing its API to the vaults domain:
// The vaults API is an interesting comparison as they have a low level API
public vaultManager.createVault(): Promise<void>
protected vaultManager.openVault()
protected vaultManager.closeVault()
public vaultManager.destroyVault(): Promise<void>
// If we follow `NodeConnectionManager`, we may also expose a higher level callback API
vaultManager.withVault(async (v1) => {
vaultManager.withVault(async (v2) => {
vaultOps.mv(v1, v2);
});
});
// Example of using this with vault ops which takes the vaults
vaultOps.mv(v1, v2) {
// dual contexts, nested commits, similar to python's `with`
// watch out, what happens if v1 and v2 are the same vault?
withCommit([v1, v2], async ([fs1, fs2]) => {
});
} |
Specification
The Nodes domain is an integration of all the other domains that enables Polykey to be a distributed and decentralised application. Right now it's design is causing its tests and its code to be quite brittle as other domains are being iterated on. This is slowing down our development process. At the same time, it has a number of functionalities that time to do such as pinging, these are now causing very slow tests up to 60 or 80 seconds for a single test to just do pinging.
The nodes domain require a significant refactoring to ensure stability and robustness and performance going into the future.
After a small review over the nodes codebase, here is the new proposed structure.
NodeConnectionManager
NodeGraph
(optional dependency)claimNode
NodeConnection
in a pool of node connectionswithConnection
to run an asynchronous callback that usesNodeConnection
NodeManager
's "wrapper" functions that performNodeConnection
operations would instead create their own internal functions (within their own domains) that use this higher-order functionwithConnection
to perform their connection functionality.NodeManager
should not have any of this anymoreNodeConnection
and closes the connection after timeoutGRPCClientAgent
, and provides some lifecycle features specific to the node connectionGRPCClientAgent
or es6 proxy or wrapper classclient
propertyNodeId
->NodeAddress
mappings in a keynodeBoth
NodeManager
andNodeConnectionManager
should be exposed at the high level inPolykeyAgent
and provided tocreateClientService
.Furthermore, the following lists the current usage of
NodeManager
forNodeConnection
operations (i.e. the places whereNodeConnectionManager
would instead need to be injected):Discovery
NodeManager
NodeGraph
- this is a special case that will need some potential refactoring, see Refactoring Nodes Domain: NodeConnection & NodeConnectionManager and Fixing Bugs #225NotificationsManager
VaultManager
The following diagram depicts these dependencies:
NodeManager
The
NodeManager
manages nodes, but does not manage the node connections.It is expected that
NodeManager
requiresNodeConnectionManager
which is shared as singleton with other domains.Operations relating to manipulating nodes should be on
NodeManager
such asclaimNode
and related operations.Most of the time this will be called by GRPC handlers, but some domains like discovery might be using the nodes domain. It is also what encapsulates access and mutation to the
NodeGraph
.NodeConnectionManager
The
NodeConnectionManager
manages a pool of node connections. This is what is being shared as a singleton to all the domains that need to contact a node and call their handlers. For exampleVaultManager
would not useNodeManager
but onlyNodeConnectionManager
. This also includes notifications domain.The connections are shared between each node connection.
Notice the lock being used to ensure that connection establishment and connection closure is serialised. Refer to:
https://gist.github.com/CMCDragonkai/f58f08e7eaab0430ed4467ca35527a42 for example. This pattern is also used in the vault manager.
A draft mock of the class:
In the
withConnection
call, this is the only way to acquire a connection to be used by other domains. If this is to be similar to theVaultManager
, that mean thatVaultManager.withVault
would need to exist.The locking side effects could be done in
NodeConnectionManager
or inside theNodeConnection
. However since this problem also exists in theVaultManager
, I believe we want to do the side-effects as much as possible in theNodeConnectionManager
. If locking during creation and destruction needs to occur, then it has to occur via callbacks that are passed into theNodeConnection
.NodeConnection
There may be a
NodeConnectionInternal
to have the actual methods whileNodeConnection
maybe a public interface similar toVault
andVaultInternal
.Usage
Any call to another keynode has to go through the
NodeConnection
. It is expected that callers will perform something like this:Such usage would occur in both
NodeManager
and other domains such asVaultManager
.Note that the
conn.client.doSomething()
may beconn.doSomething()
instead. This depends on howNodeConnection
is written.Either way it is expected that
doSomething
is the same gRPC call available inGRPCClientAgent
.This means it is up to the domains to manage and handle gRPC call behaviour which includes the usage of async generators, metadata and exceptions. Refer to #249 for details on this.
The reason we are doing this is that abstractions on the gRPC call side are domain-specific behaviour. We wouldn't want to centralise all of the domain call gRPC abstractions into the nodes domain, that would be quite brittle.
This means the vaults domain could if they wanted to create their own wrapper over the connection object such as:
Right now we shouldn't do any gRPC abstractions until #249 is addressed.
Lifecycle
The
withConn
method provides a "resource context". It's a higher order function that takes the asynchronous action that is to be performed with the connection.It has a similar pattern to our
poll
,retryAuth
,transact
,transaction
,_transaction
wrapper functions.The advantage of this pattern is that the node connection will not be closed (by the TTL) while the action is being executed.
The
withConn
will need to use theRWLock
pattern as provided insrc/utils.ts
. It has to acquire a read lock when executing the action.This enables multiple
withConn
to be executing concurrently without blocking each other.At the end of each
withConn
, theTLL
should be reset.When the TLL is expired, it may attempt to close the connection. To do this, a write lock is attempted. If the lock is already locked, then we skip. If we acquire the write lock, we can close the connection.
If the client were to be closed or terminated due to other reasons, then an exception should be thrown when accessing the
client
or its methods.This would be
ErrorNodeConnectionNotRunning
orErrorGRPCClientAgentNotRunning
orErrorGRPCClientNotRunning
.Other domains should not be keeping around references to
NodeConnection
it should only be used inside thewithConn
context.Hole Punching Relay
Hole punching is done between the
ForwardProxy
andReverseProxy
. Hole punching is not done explicitly in theNodeConnectionManager
orNodeConnection
.Instead, hole punching relay is what is done by
NodeConnectionManager
. This means you relay a hole punching message across the network so that the target Node will perform reverse hole punching in case we need it. This process should be done at the same time as a connection is established.This means
NodeConnectionManager
requires theNodeGraph
, the sameNodeGraph
that is used byNodeManager
.The graph is needed to do optimal routing of the the hole punching relay message.
This means
NodeConnectionManager
would need to be used bycreateAgentService
as the current keynode may be in the middle of a chain of relaying operations.Public Interface
The public interface of
NodeConnection
can be a limited version of the internal class. This can hide methods likedestroy
from the users ofNodeConnection
. This ensures that onlyNodeConnectionManager
can destroy the connection.Dealing with Termination/Errors
The underlying
GRPCClientAgent
may close/terminate for various reasons. In order to capture this behaviour, you must hook into the state changes in theClient
object. This will require changes tosrc/grpc/GRPCClient.ts
, as it currently does not expose theClient
object. Discussions on how to do this are in: #224 (comment)NodeGraph
NodeGraph is a complex data structure. It appears to be needed by both
NodeManager
andNodeConnectionManager
, but itself also needs to perform operations onNodeManager
. Whenever there is mutual recursion, it can be factored out as another common dependency "absract the commonality". However if this is not suitable, then it can be done by passingNodeManager
asthis
intoNodeGraph
.However if
NodeConnectionManager
needsNodeGraph
as well, then this is not possible unlessNodeConnectionManager
was also produced byNodeManager
.Further investigation is needed into the
NodeGraph
to understand how we can turn it into a data structure. Perhaps we can use a pattern of hooks whereNodeGraph
exposes a number of hook points which can be configured byNodeManager
and thus allowNodeGraph
to "callback"NodeManager
andNodeConnectionManager
.If
NodeGraph
can be made purely a data structure/database, then:getClosestLocalNodes
andgetClosestGlobalNodes
could be part of theNodeManager
, and not theNodeGraph
.NodeManager
, and instead you just plug it with actual values.There's only 3 places where
NodeGraph
is calling out to the world. All 3 can be done inNodeManager
instead. That meansNodeGraph
no longer needsNodeConnectionManager
.Testing
For
NodeConnectionManager
andNodeConnection
, it's essential that we don't duplicate tests forGRPCClientAgent
as that would be handled bytests/agent
. It must focus on just testing the connection related functionality. This is where mocks should be useful here, since we want to delay integration testing until the very end.There are 4 kinds of mocks we have to use here.
tests/bin/utils.retryAuth.test.ts
, function mocks can be very useful way of providing a dummy function that we can later instrument. This can be useful if we pursue a hook pattern withNodeGraph
.tests/bin/utils.retryAuth.test.ts
(jest.mock('prompts')
) this is high level mocking that should only be used when no other methods can be done. This is more complicated, but you can mock anything this way.Module/class
mocks.The tests for the nodes domain should be robust, reproducible, not flaky and complete within 10 - 30 seconds.
Larger integration tests should be done at a higher level, which will help us achieve #264.
This should ensure that nodes domain is more stable and does not break every time we change stuff in other domains.
Additional context
Tasks
NodeConnectionManager
NodeConnection
intoNodeConnectionManager
andNodeManager
destroy
call ofNodeConnection
(the same process as construction): Refactoring Nodes Domain: NodeConnection & NodeConnectionManager and Fixing Bugs #225 (comment)NodeGraph
andNodeManager
andNodeConnectionManager
NodeConnection
andNodeConnectionManager
testing. Remove any tests that test agent service handling or forward proxy and reverse proxy usage. Rely on the other domain tests to handle this.NodeConnection
to acquire their connections fromNodeConnectionManager
.The text was updated successfully, but these errors were encountered: