Consolidating Configuration Entropy and Nodes and Client Services Refactoring #552
Conversation
This PR is focused on having a consistent, user-friendly "application-level" configuration of PK. The fact that upstream libraries/domains may require a specific naming that doesn't match is not relevant for this. Upstream can change their naming (like aligning
Force-pushed from 66dc5c3 to 3064e24
@tegefaulkes there are
Removing these from
Rebased on staging, now to continue.
All paths that were in the defaults are now in:
These are basically application constants; they are not really user config, so there's no need to put them in
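For illustration, the idea is roughly the following shape (the entries below are hypothetical placeholders, not the actual constants in src/config.ts):

```ts
// Hypothetical sketch only: path constants grouped under `paths`, separate from defaults.
// The real names live in src/config.ts; these placeholders just show the separation.
const config = {
  paths: {
    stateBase: 'state', // placeholder: base directory for node state
    tokenBase: 'token', // placeholder: base directory for session tokens
  },
  defaultsUser: {
    // user-tunable defaults go here, not path constants
  },
};
```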
There are a bunch of system parameters that I'm moving to the system defaults. These parameters are tuned by the developers. They are not to be specified by the user. We are only raising them to the top level to have one place acting as the main constant specification. The naming of these parameters should be consistent and should have consistent meaning. And I'm already finding some strange aspects of these that I need your input on, @tegefaulkes. Firstly, for agent related networking, I'm noticing that there's overlap between node connection related parameters and the agent transport parameters. I've currently got them named like this in this PR:
Firstly, right now the entire agent transport is being encapsulated inside the nodes domain. Secondly, isn't there overlapping behaviour here? We have a connect time and timeout time for node connections. But doesn't this end up encapsulating the entire agent connection keep alive interval time and max idle timeout time? Let's say the node connection connect time is 2000ms.
In all these cases, this 2000ms is used across all the asynchronous operations and finally fed into the underlying transport. So in terms of timing hierarchy:
Right now the max idle time is set to 60000ms, while the node connection connect time is 2000ms. That actually means the total setup time is limited to 2000ms, right? Then there's also the
Then this means that after 60s without packet activity, the transport connection will be killed. But even if there is 60s of packet activity, if there is no "node" activity on a node connection, after 60s, the node connection will be killed. This means that as long as the keep alive is running, the underlying agent connection will continue running, but if no node-related operations are running for 60s, then the node connection is terminated. Then there's also the
So there are some overlaps and also specific constraints. Let's see if we can simplify all of this.
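To keep the relationships straight while discussing this, here is a rough sketch of the constraints involved. The names are illustrative only (not the final parameter names); the inequalities follow the comments later in this PR's config (connect timeout never greater than keep alive timeout, idle TTL greater than keep alive timeout):

```ts
// Illustrative timing hierarchy; not the actual PK types or parameter names.
type ConnectionTiming = {
  connectTimeoutTime: number; // time to establish a connection, e.g. 2_000
  keepAliveIntervalTime: number; // how often a keep alive is sent
  keepAliveTimeoutTime: number; // transport-level liveness timeout
  connectionIdleTimeoutTime: number; // node connection TTL (garbage collection of unused connections)
};

function checkTiming(t: ConnectionTiming): void {
  if (t.connectTimeoutTime > t.keepAliveTimeoutTime) {
    throw new RangeError('connect timeout should never exceed the keep alive timeout');
  }
  if (t.keepAliveIntervalTime >= t.keepAliveTimeoutTime) {
    throw new RangeError('keep alive interval must be less than the keep alive timeout');
  }
  if (t.keepAliveTimeoutTime > t.connectionIdleTimeoutTime) {
    throw new RangeError('node connection TTL should always exceed the keep alive timeout');
  }
}
```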
If this is true, then we can:
Now I do remember we talked about this. I was asking why not set the underlying connection to have an infinite idle timeout, then rely on the
Furthermore, such defaults don't need to be specified by the
I'm a fan of "terminal defaults" - either the minimum value, or the maximum value. That means either:
Such defaults ensure no assumptions about the end-user. Keep-alives become opt-in, and timeouts become opt-in. Applying this principle ensures that the default configuration from a library perspective is always the minimal behaviour, which is disabling keep-alives and disabling timeouts. However, applications are different and they should have tuned conventions. This is because applications have contextual understanding, libraries don't. So what does this mean for the config? I think:
I'm ordering them from highest level to lowest level. So config is ordered: RPC, nodes, client transport, then agent transport. Another thing is that if defaults are specified at the application level, then in each domain, their defaults are currently specified directly. That's not quite correct. They should just directly refer to the
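Something like the following sketch captures both points (all names here are illustrative; the domain class and its constructor option are hypothetical, while config.defaultsSystem.nodesConnectionConnectTimeoutTime is the parameter introduced in this PR):

```ts
import config from './config'; // illustrative import path

// Library-level "terminal defaults": keep alive and timeouts are opt-in,
// so a bare library user gets the minimal behaviour by default.
const libraryDefaults = {
  keepAliveIntervalTime: Infinity, // keep alive disabled unless opted in
  keepAliveTimeoutTime: Infinity, // never times out unless opted in
};

// A domain pulls its tuned default from src/config.ts rather than
// hard-coding the same number a second time.
class SomeDomainManager {
  protected connectTimeoutTime: number;
  public constructor({
    connectTimeoutTime = config.defaultsSystem.nodesConnectionConnectTimeoutTime,
  }: { connectTimeoutTime?: number } = {}) {
    this.connectTimeoutTime = connectTimeoutTime;
  }
}
```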
I think there's also the problem of https://github.com/MatrixAI/MatrixAI-Graph/issues/43. We also have to decide on the structuring of the timeout, especially if there is fan-out behaviour in the async code paths. We should label them correctly and ensure that we have specified behaviour here.
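As a sketch of one possible structure (this is an assumption about how we could do it, not the actual implementation): fix an absolute deadline once, and give each fanned-out operation only the time remaining on that deadline, rather than handing each one a fresh full timeout.

```ts
// Hedged sketch: share one deadline across fanned-out async operations.
async function withDeadline<T>(
  timeoutTime: number,
  ops: Array<(remainingTime: number) => Promise<T>>,
): Promise<Array<T>> {
  const deadline = Date.now() + timeoutTime;
  return await Promise.all(
    ops.map((op) => {
      // Each operation only gets whatever time is left on the shared deadline.
      const remainingTime = deadline - Date.now();
      if (remainingTime <= 0) {
        throw new Error('Deadline exceeded before the operation could start');
      }
      return op(remainingTime);
    }),
  );
}
```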
Force-pushed from 3064e24 to a2e0692
Ok, I did talk about this earlier about the issue with using
Need to reconcile this.
Force-pushed from a2e0692 to 3af5c05
Max idle timeout is a bit of a technical limitation of QUIC. It's the best way we can detect and handle a connection failure, so I'd prefer if it wasn't infinite. But for QUIC, it's used for both the idle timeout and the connection establishment timeout, so we can't really separate the two. If we set the idle timeout to infinite, then we wouldn't have a nice way to detect if a connection has failed without re-implementing a bunch of logic for checking connection level idleness. At the time we decided it was better to let QUIC handle this. As for the startup timeout, that needs to be less than the maxIdleTimeoutTime, since the idle time is essentially the timeout time for a connection in QUIC. We coded it so that if the start timeout was more than the idle timeout then the connection creation would throw. In my opinion, the idle timeout QUIC provides is too useful not to use, but given how it works, if we use it we can't have the start timeout be longer or infinite. I suppose we could default both to being infinite and have anything using them specify these parameters. But it seems like a bad idea to have any connections default to never timing out.
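In other words, the constraint described above could be checked up front, roughly like this (illustrative names only, not the js-quic API; treating Infinity as "idle timeout disabled" is just this sketch's convention):

```ts
// Sketch: because QUIC's max idle timeout also bounds connection establishment,
// connection creation should throw if the start timeout cannot fit inside it.
function checkStartTimeout(
  startTimeoutTime: number,
  maxIdleTimeoutTime: number,
): void {
  if (maxIdleTimeoutTime !== Infinity && startTimeoutTime >= maxIdleTimeoutTime) {
    throw new RangeError(
      `startTimeoutTime (${startTimeoutTime}ms) must be less than ` +
        `maxIdleTimeoutTime (${maxIdleTimeoutTime}ms)`,
    );
  }
}
```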
I'm going to try to abstract the agent transport parameters to the
Here's the revised configuration for the system:
All other parameters should be subsumed into this. In order to achieve this, both
The
Right now
This is all due to the implementation of
Correspondingly it makes sense to throw an exception if
So for
As for
However if
Furthermore, I'm removing some parameters that should be auto-derived, based on https://github.com/MatrixAI/MatrixAI-Graph/issues/43. Things like the node ping timeout time should really be automatically derived.
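For illustration only (the specific derivation below is made up, not the decided rule): a ping is just a small round trip, so its timeout can follow from the connect timeout instead of being tuned as an independent parameter.

```ts
// Hypothetical derivation: the ping timeout tracks the connect timeout
// rather than being configured separately.
const nodesConnectionConnectTimeoutTime = 15_000;
const nodesPingTimeoutTime = nodesConnectionConnectTimeoutTime; // derived, not configured
```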
Note that
@tegefaulkes the constraint can be checked in
So simply because
Also, if it turns out it's possible to actually mutate the config after you create the connection, that might actually solve the problem. I asked this, but you need to experiment to see if it is true. https://chat.openai.com/share/5a302ea7-6134-4020-84fc-015200de4aa2
Regarding interval time, their definitions all depend on the underlying implementation. In some simple cases it's just a fixed interval time. In other cases it is more efficient, as in QUIC. I believe any implementation we do should default to the simple interval, and let's not do anything fancy yet. So
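A minimal sketch of the "simple interval" version, assuming a hypothetical sendKeepAlivePacket() method on the connection object (smarter transports like QUIC may skip a ping when other traffic already demonstrates liveness):

```ts
// Send a keep alive on a fixed interval; return a stopper for connection teardown.
function startKeepAlive(
  conn: { sendKeepAlivePacket: () => void },
  keepAliveIntervalTime: number,
): () => void {
  const timer = setInterval(() => {
    conn.sendKeepAlivePacket();
  }, keepAliveIntervalTime);
  return () => clearInterval(timer);
}
```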
Ok, so inside the
It looks like this:
Located here: https://docs.quic.tech/src/quiche/lib.rs.html#1726
Anyway, this means if we do this:
The first call works because it's done before the connection copies it. The second one doesn't: the call succeeds, but the mutation doesn't do anything because it doesn't affect the underlying property.
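As a toy analogy of that behaviour (this is not quiche, just a sketch of copy-at-construction semantics):

```ts
// The connection copies the config value at construction time, so mutating the
// config object afterwards succeeds but has no effect on the live connection.
class ToyConfig {
  public maxIdleTimeout: number = 0;
}

class ToyConnection {
  protected readonly maxIdleTimeout: number;
  public constructor(config: ToyConfig) {
    this.maxIdleTimeout = config.maxIdleTimeout; // copied here, not referenced later
  }
  public idleTimeout(): number {
    return this.maxIdleTimeout;
  }
}

const toyConfig = new ToyConfig();
toyConfig.maxIdleTimeout = 60_000; // before construction: takes effect
const conn = new ToyConnection(toyConfig);
toyConfig.maxIdleTimeout = 120_000; // after construction: silently ignored
console.log(conn.idleTimeout()); // 60000
```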
Furthermore the
The solution is:
@amydevs for js-ws, I'm creating these 3 parameters:

```ts
/**
* Timeout for the transport connecting to the client service.
*
* This bounds the amount of time that the client transport will wait to
* establish a connection to the client service of a Polykey Agent.
*/
clientConnectTimeoutTime: 15_000, // 15 seconds
/**
* Timeout for the keep alive of the transport connection to the client
* service.
*
* It is reset upon sending or receiving any data on the client service
* transport connection.
*
* This is the default for both sides (client and server) of the connection.
*
* This should always be greater than the connect timeout.
*/
clientKeepAliveTimeoutTime: 30_000, // 30 seconds (3x of interval time)
/**
* Interval for the keep alive of the transport connection to the client
* service.
*
* This is the minimum interval time because transport optimisations may
* increase the effective interval time when a keep alive message is not
* necessary, possibly due to other data being sent or received on the
* connection.
*/
  clientKeepAliveIntervalTime: 10_000, // 10 seconds (keep alive timeout is 3x this)
```

Note that
Remember that in
This PR can install the
This is the current state now with all comments explaining their meaning.

```ts
defaultsSystem: {
/**
* Timeout for each RPC stream.
*
* The semantics of this timeout changes depending on the context of how it
* is used.
*
* It is reset upon sending or receiving any data on the stream. This is a
* one-shot timer on unary calls. This repeats for every chunk of data on
* streaming calls.
*
* This is the default for both client calls and server handlers. Both the
* client callers and server handlers can optionally override this default.
*
* When the server handler receives a desired timeout from the client call,
* the server handler will always choose the minimum of the timeouts between
* the client call and server handler.
*
* With respect to client calls, this timeout bounds the time that the client
* will wait for responses from the server, as well as the time to wait for
* additional data to be sent to the server.
*
* With respect to server handlers, this timeout bounds the time that the
* server waits to send data back to the client, as well as the time to wait
* for additional client data.
*
* Therefore it is expected that specific clients calls and server handlers
* will override this timeout to cater to their specific circumstances.
*/
rpcTimeoutTime: 15_000, // 15 seconds
/**
* Buffer size of the JSON RPC parser.
*
* This limits the largest parseable JSON message. Any JSON RPC message
* greater than this byte size will be rejected by closing the RPC stream
* with an error.
*
* This has no effect on raw streams as raw streams do not use any parser.
*/
rpcParserBufferSize: 64 * 1024, // 64 KiB
/**
* Timeout for the transport connecting to the client service.
*
* This bounds the amount of time that the client transport will wait to
* establish a connection to the client service of a Polykey Agent.
*/
clientServiceConnectTimeoutTime: 15_000, // 15 seconds
/**
* Timeout for the keep alive of the transport connection to the client
* service.
*
* It is reset upon sending or receiving any data on the client service
* transport connection.
*
* This is the default for both sides (client and server) of the connection.
*
* This should always be greater than the connect timeout.
*/
clientServiceKeepAliveTimeoutTime: 30_000, // 30 seconds (3x of interval time)
/**
* Interval for the keep alive of the transport connection to the client
* service.
*
* This is the minimum interval time because transport optimisations may
* increase the effective interval time when a keep alive message is not
* necessary, possibly due to other data being sent or received on the
* connection.
*/
clientServiceKeepAliveIntervalTime: 10_000, // 10 seconds
/**
* Concurrency pool limit when finding other nodes.
*
* This is the parallel constant in the kademlia algorithm. It controls
* how many parallel connections are used when attempting to find a node across
* the network.
*/
nodesConnectionFindConcurrencyLimit: 3,
/**
* Timeout for idle node connections.
*
* A node connection is idle if nothing is using the connection. A
* connection is being used when its resource counter is above 0.
*
* The resource counter of node connections is incremented above 0
* when a reference to the node connection is maintained, usually with
* the bracketing pattern.
*
* This has nothing to do with the data being sent or received on the
* connection. It's intended as a way of garbage collecting unused
* connections.
*
* This should always be greater than the keep alive timeout.
*/
nodesConnectionIdleTimeoutTime: 60_000, // 60 seconds
/**
* Timeout for establishing a node connection.
*
* This applies to both normal "forward" connections and "reverse"
* connections started by hole punching. Reverse connections
* are started by signalling requests that result in hole punching.
*
* This is the default for both client and server sides of the connection.
*
* Due to transport layer implementation peculiarities, this should never
* be greater than the keep alive timeout.
*/
nodesConnectionConnectTimeoutTime: 15_000, // 15 seconds
/**
* Timeout for the keep alive of the node connection.
*
* It is reset upon sending or receiving any data on the connection.
*
* This is the default for both sides (client and server) of the connection.
*
* This should always be greater than the connect timeout.
*/
nodesConnectionKeepAliveTimeoutTime: 30_000, // 30 seconds (3x of interval time)
/**
* Interval for the keep alive of the node connection.
*
* This is the minimum interval time because transport optimisations may
* increase the effective interval time when a keep alive message is not
* necessary, possibly due to other data being sent or received on the
* connection.
*/
nodesConnectionKeepAliveIntervalTime: 10_000, // 10 seconds
/**
* Interval for hole punching reverse node connections.
*/
nodesConnectionHolePunchIntervalTime: 1_000, // 1 second
},
/**
* Default user configuration.
* These are meant to be changed by the user.
* However the defaults here provide the average user experience.
*/
defaultsUser: {
nodePath: getDefaultNodePath(),
rootCertDuration: 31536000, // 31536000 seconds = 365 days
/**
* If using dual stack `::`, then this forces only IPv6 bindings.
*/
ipv6Only: false,
/**
* Agent host defaults to `::` dual stack.
* This is because the agent service is supposed to be public.
*/
agentServiceHost: '::',
agentServicePort: 0,
/**
* Client host defaults to `localhost`.
* This will depend on the OS configuration.
* Usually it will be IPv4 `127.0.0.1` or IPv6 `::1`.
* This is because the client service is private most of the time.
*/
clientServiceHost: 'localhost',
clientServicePort: 0,
},
```
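To show how the deep merge over these defaults could work (the merge helper and the example override below are illustrative only, not the final PolykeyAgentConfig API):

```ts
import config from './config'; // illustrative import path

type DeepPartial<T> = { [K in keyof T]?: DeepPartial<T[K]> };

// Recursively merge a partial user config over the defaults.
function deepMerge<T extends Record<string, any>>(
  defaults: T,
  overrides: DeepPartial<T>,
): T {
  const merged: Record<string, any> = { ...defaults };
  for (const [key, value] of Object.entries(overrides)) {
    merged[key] =
      value !== null && typeof value === 'object' && !Array.isArray(value)
        ? deepMerge(defaults[key] ?? {}, value)
        : value;
  }
  return merged as T;
}

// For example, a caller overriding just the client service port:
const effectiveUserConfig = deepMerge(config.defaultsUser, {
  clientServicePort: 1314,
});
```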
The agent level RPC is not fully managed by the Nodes domain.
Force-pushed from debf777 to 44609b9
Ok, so I've fixed up the history and pushed that up. #560 has been rebased on the new history as well. I'm going to do a final check and then merge this.
I'm deferring fixing tests to a new PR; we need a pre-release of this sooner rather than later. Off the top of my head, the following still needs to be addressed.
The 1. still implies code in
Well, I guess you could work on that independently of the WS and RPC. But #560 will need to do some testing to make sure that's working for Polykey-CLI.
I updated anything in
Was task 5. in the OP done too?
No link to an issue if there was even one.
No issue was made, it was just changed directly in this PR.
Ok
Description
The configuration of Polykey has suffered from a lot of entropy. This PR attempts to consolidate our configuration structure.
It's based on this comment: MatrixAI/Polykey-CLI#4 (comment)
A note about task 5. It's important to understand that in PolykeyAgent, we have a bunch of dependencies that can be injected, but their injection is really intended for mocking, which is only for testing, not intended for production usage. This is different from optional dependencies, which have legitimate use cases when injecting or not injecting. If we were to follow task 5. strictly, this would add a lot of boilerplate to managing optional injected dependencies for everything in PolykeyAgent. And right now, there's no easy way to manage this without manually setting up booleans and conditional checks for all these optional dependencies. Therefore... task 5. should only be applied to situations where injecting or not-injecting is a legitimate use case, such as between QUICSocket and QUICServer and QUICClient, and between EncryptedFS and DB. But optional dependencies in PolykeyAgent are the exception. IN FACT, because JS allows module mocking... we may even remove the ability to inject these dependencies and use jest.mock instead.

Issues Fixed
NodeID has changed. #386

Tasks
- *base to config.paths. As these are path constants. They are not defaults to be configured.
- config.defaultsUser
- config.defaultsSystem
- config.defaultsUser and have it be referenced everywhere there needs to be some spread of the parameters that will be defaulted by config.defaultsUser. This prevents entropy where parameter names get changed and we forget that they need to be consistent.
- src/config.ts so that way it is always consistent, even if it is a bit weird. It gives us one place to ensure compliance between parameters that should have the same name and same meaning.
- config parameter, and do a deep merge
- PolykeyAgent.createPolykeyAgent - create a PolykeyAgentConfig type
- bootstrapState - also needs to use BootstrapConfig, but it's a far simpler type
- NodeConnectionManager config
- RPCServer config
- Websocket config
- NodeConnectionManager and convert to using keys/utils/symmetric utilities rather than the HMAC SHA256.
- QUICSocket lifecycle into NodeConnectionManager. All of the QUIC transport will be now encapsulated within the nodes domain.
- [ ] 8. Move all of RPC and websocket into the client domain, so that way both transports are encapsulated in their respective domain. - this is done in Consolidating ClientService and js-ws and js-rpc integration #560
- nodes domain if client domain contains handlers. We end up having "client service" and "node service", respectively meaning serving clients and serving nodes.
- webcrypto polyfill has the correct Crypto type due to changes in Node's interface. Do this by using as Crypto.
- NodeConnectionManager and NodeConnection have both pull and push flows.
- NodeConnectionManager an extension of EventTarget just like NodeConnection. This prepares in relation to Integrate Observables to coordinate push-based Dataflow (with reactive properties) #444 for subsequent observable integration.

Final checklist
- Domain specific tests
- Full tests