feat: channelz improvements, idle timeout implementation #2677
I just submitted a change on the 1.10.x branch to run the code formatter and fix the one lint error. Can you rebase onto that so that we can focus on the substantive changes in this PR?
Force-pushed from b2da92a to e0b900d.
Done.
Also, can you rearrange the modified functions in |
Cleaned up server.ts further: reverted the closure (it originally needed a server ref, but I was able to make it work without one), so there are fewer lines of code to go through now.
I think there is a way of implementing the session idle timeout that is both simpler and more efficient than what you have here: for each session, track only the number of active streams, and call session.setTimeout with a handler that closes the session if the number of active streams is 0.
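The suggestion above could be sketched roughly as follows. This is only an illustration: `enableIdleClose`, `SessionLike`, `StreamLike`, and `maxIdleMs` are hypothetical names, and `SessionLike` stands in for the small subset of Node's `http2.ServerHttp2Session` used here (`on('stream')`, `setTimeout`, `close`).

```typescript
// Subset of a Node http2 stream used by this sketch (hypothetical name).
interface StreamLike {
  once(event: 'close', listener: () => void): unknown;
}

// Subset of http2.ServerHttp2Session used by this sketch (hypothetical name).
interface SessionLike {
  on(event: 'stream', listener: (stream: StreamLike) => void): unknown;
  setTimeout(msecs: number, callback: () => void): unknown;
  close(): void;
}

// Hypothetical helper: count active streams per session and close the session
// from the timeout handler only when no stream is in flight.
function enableIdleClose(session: SessionLike, maxIdleMs: number): () => number {
  let activeStreams = 0;
  session.on('stream', stream => {
    activeStreams += 1;
    stream.once('close', () => {
      activeStreams -= 1;
    });
  });
  session.setTimeout(maxIdleMs, () => {
    if (activeStreams === 0) {
      session.close();
    }
  });
  // Expose the counter for inspection (for example in a trace log).
  return () => activeStreams;
}
```

The only per-session state is a single counter, which is the simplification the comment is asking for.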
Alongside that change found a bug that was introduced in 1.10.x where sessions would be cleaned up based on the "allSessions" ref and not a specific server ref (now that there could be multiple servers it matters)
I don't believe that was a bug. The relevant change I see was in tryShutdown, which is supposed to be shutting down all sessions anyway. Also, there could always be multiple servers, but the internal representation was different in earlier versions.
packages/grpc-js/src/channelz.ts
@@ -598,27 +585,34 @@ function GetTopChannels(
   call: ServerUnaryCall<GetTopChannelsRequest__Output, GetTopChannelsResponse>,
   callback: sendUnaryData<GetTopChannelsResponse>
 ): void {
-  const maxResults = Number.parseInt(call.request.max_results);
+  const maxResults = parseInt(call.request.max_results, 10) || 100;
Where did this 100 come from?
The proto files say that the server selects a sane default if the request does not provide max_results; 100 is my choice for that default. The problem with 0 is that the current implementation would return no results. It could be a much higher number, but I think 100 per iteration is fine.
Especially because this is used in multiple places, it would be better to have a constant named something like DEFAULT_MAX_REQUESTS. Then it's clearer what the number is, and if we decide that a different value is more appropriate in the future we can change it in only one place.
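A minimal sketch of the named-constant approach. It assumes the field arrives as a string, as the parseInt call in the diff suggests. `DEFAULT_MAX_RESULTS` and `parseMaxResults` are hypothetical names, and treating non-positive values as "use the default" is an assumption of this sketch, not something the PR states.

```typescript
// Hypothetical constant; one place to change the default page size.
const DEFAULT_MAX_RESULTS = 100;

// Parse a max_results value and fall back to the default when the value is
// missing, not a number, or not positive (the last case is an assumption).
function parseMaxResults(raw: string | undefined): number {
  const parsed = Number.parseInt(raw ?? '', 10);
  return Number.isNaN(parsed) || parsed <= 0 ? DEFAULT_MAX_RESULTS : parsed;
}
```

Each handler that pages through results could then call the same helper instead of repeating the literal 100.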
packages/grpc-js/src/server.ts
sessionClosedByServer = true;

this.trace(
  `Connection dropped by max connection age: ${session.socket?.remoteAddress}`
Existing trace logs use string concatenation instead of string interpolation because of a previous report that it is faster. Please follow that convention here and in other instances.
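For illustration, the convention could look like the sketch below. `createTracer` and `traceConnectionDropped` are hypothetical stand-ins for the server's tracing hooks, not the library's actual functions.

```typescript
// Hypothetical tracer: records messages only when tracing is enabled.
function createTracer(enabled: boolean, sink: string[]): (text: string) => void {
  return text => {
    if (enabled) {
      sink.push(text);
    }
  };
}

// The message is built with `+` rather than a template literal, following the
// convention described in the review comment.
function traceConnectionDropped(
  trace: (text: string) => void,
  remoteAddress: string
): void {
  trace('Connection dropped by max connection age: ' + remoteAddress);
}
```

Either way, the argument string is built before the tracer decides whether to record it.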
packages/grpc-js/src/server.ts
sessionClosedByServer = true;
this.channelzTrace.addTrace(
  'CT_INFO',
  `Connection dropped due to error of a ping frame ${err.message} return in ${duration}`
Similar to the regular trace logs, I think we should avoid string interpolation in channelz trace logs, especially now that the string will always be created even if channelz is disabled.
I believe session.setTimeout would not be very useful, because its notion of activity is different from having no active in-flight requests. Keepalive pings would keep the session open forever, whereas a timeout on no active requests would force those sessions to be closed. There should be very little performance drop when an active session has many ongoing requests and rarely reaches 0 tracked streams, simply because setTimeout would almost never be called. Where this comes in handy is when there is no edge proxy that actively forces all streams into a few sessions. Recently I've started using https://cloud.google.com/load-balancing/docs/https with
The performance concern is the case where the number of active streams quickly switches between 0 and 1. I understand why
There is another problem here, independent of gRPC. A client/proxy should not be creating many connections to the same backend, and keeping them around without using them. There may be a configuration error or bug in your LB.
HTTP/2 between the load balancer and the instance can require significantly more TCP connections to the instance than HTTP(S). Connection pooling, an optimization that reduces the number of these connections with HTTP(S), is not currently available with HTTP/2. As a result, you might see high backend latencies because backend connections are made more frequently.
The above is just a drawback of Google's current HTTP/2 implementation. It could be solved with another edge proxy in between, but I am trying to avoid putting a managed proxy in between.
Take a look at the latest version of the implementation. I think it might work even better with .refresh() on the timer. Let me know.
I think the new timer strategy can be further simplified: set the timeout to the full max connection idle time, and then the handler logic can just be "if there are 0 active streams, close the connection". Refreshing the timer whenever the number of active streams becomes 0 guarantees that the timer only fires with 0 active streams if the max connection idle time has passed.
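The simplified strategy could be sketched like this. `IdleTracker`, `RefreshableTimer`, and `StartTimer` are hypothetical names; the timer is injected so that the sketch does not depend on a particular runtime. In Node, `startTimer` could be `(handler, ms) => setTimeout(handler, ms)`, since the returned `Timeout` object has a `refresh()` method that re-arms the timer even after it has fired.

```typescript
// Minimal view of a timer that can be re-armed (Node's Timeout has refresh()).
interface RefreshableTimer {
  refresh(): unknown;
}

// Hypothetical factory type for starting a one-shot timer.
type StartTimer = (handler: () => void, ms: number) => RefreshableTimer;

// Hypothetical per-session tracker implementing the strategy described above.
class IdleTracker {
  private activeStreams = 0;
  private readonly timer: RefreshableTimer;
  private readonly closeSession: () => void;

  constructor(startTimer: StartTimer, maxIdleMs: number, closeSession: () => void) {
    this.closeSession = closeSession;
    // The timer is armed once with the full idle period.
    this.timer = startTimer(() => {
      // Handler logic: close only if nothing is in flight.
      if (this.activeStreams === 0) {
        this.closeSession();
      }
    }, maxIdleMs);
  }

  onStreamStart(): void {
    this.activeStreams += 1;
  }

  onStreamEnd(): void {
    this.activeStreams -= 1;
    if (this.activeStreams === 0) {
      // Restart the full idle period from the moment the session became idle.
      this.timer.refresh();
    }
  }
}
```

Because the timer is refreshed on every transition to 0 active streams, a firing with 0 active streams means the session has been idle for the whole period.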
You are right. I fixed a bug as well: I forgot to actually update the idle date, and the this reference was wrong.
With the channelz class stub changes, I wonder if the split between channelz and non-channelz versions of the stream and session handlers is still worth it. That split is an optimization, to avoid doing any channelz work when it's disabled, but it results in duplicated code, and even more duplicated code in this change.
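As a rough illustration of why stubs could make the split unnecessary, the sketch below uses simplified, hypothetical shapes (`CallTracker`, `CountingCallTracker`, `CallTrackerStub`); the real tracker classes in channelz.ts record more than a single counter.

```typescript
// Simplified, hypothetical tracker interface shared by both implementations.
interface CallTracker {
  addCallStarted(): void;
  getCallsStarted(): number;
}

// Used when channelz is enabled: actually counts calls.
class CountingCallTracker implements CallTracker {
  private callsStarted = 0;
  addCallStarted(): void {
    this.callsStarted += 1;
  }
  getCallsStarted(): number {
    return this.callsStarted;
  }
}

// Used when channelz is disabled: same shape, no work.
class CallTrackerStub implements CallTracker {
  addCallStarted(): void {}
  getCallsStarted(): number {
    return 0;
  }
}

// The enabled/disabled decision is made once, at construction time.
function createCallTracker(channelzEnabled: boolean): CallTracker {
  return channelzEnabled ? new CountingCallTracker() : new CallTrackerStub();
}

// A single shared handler can then call the tracker unconditionally.
function onStream(tracker: CallTracker): void {
  tracker.addCallStarted();
}
```

With this shape, one handler serves both modes, at the cost of a no-op method call when channelz is disabled.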
packages/grpc-js/src/server.ts
@@ -972,7 +1006,7 @@ export class Server {
       for (const session of allSessions) {
         session.destroy(http2.constants.NGHTTP2_CANCEL as any);
       }
-    }, graceTimeMs).unref?.();
+    }, graceTimeMs).unref();
Please revert this, here and anywhere else you made this change. We need a null check on unref because that method doesn't exist on Electron.
packages/grpc-js/src/server.ts
  this.sessionHalfIdleTimeout,
  this,
  session
).unref(),
Similarly, change this to unref?.().
Now that makes sense, though on Electron it would leave the stored timer reference undefined, so we need to call unref afterwards :/
That's a good point. That behavior might be incorrect elsewhere.
Pretty much every timer wouldn't work correctly as of now. Is the server actually being used on Electron? It's no problem for me to fix the related bugs everywhere, but if we can avoid the extra logic, I would be happy :)
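One possible shape for the fix discussed in this thread is sketched below. `unrefIfSupported` is a hypothetical helper. The point is that an expression like `setTimeout(handler, ms).unref?.()` evaluates to undefined when `unref` is missing, so the timer has to be stored first and unref'd afterwards.

```typescript
// Hypothetical helper: call unref() only where the runtime provides it, and
// always return the original timer so the caller keeps a usable reference.
function unrefIfSupported<T>(timer: T): T {
  (timer as unknown as { unref?: () => void }).unref?.();
  return timer;
}

// Intended use (not executed here):
//   const timer = unrefIfSupported(setTimeout(handler, ms));
//   ...
//   clearTimeout(timer);
```

The helper returns the same object it was given, so code that later calls clearTimeout or refresh on the stored timer behaves the same whether or not `unref` exists.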
packages/grpc-js/src/server.ts
private onIdleTimeout(ctx: Server, session: http2.ServerHttp2Session) {
  const { socket } = session;
  ctx.trace(
As currently written, this trace log seems potentially noisy, without a lot of information relevant to debugging a problem with the idle timeout behavior. I would remove this log, or I would want to at least include the number of active streams, and either the lastIdle timestamp or the computed idleFor time.
packages/grpc-js/src/server.ts
-      streamTracker: new ChannelzCallTracker(),
+      streamTracker: this.channelzEnabled
+        ? new ChannelzCallTracker()
+        : new ChannelzCallTrackerStub(),
If we are in the _channelzSessionHandler, that implies that this.channelzEnabled === true. Branching on that again here is redundant.
Tried to address all existing concerns with code duplication, as well as kept perf optimizations in mind:
Let me know if the changes make sense and if you spot more bugs or logical issues.
I don't like this |
It's true, but how do we DRY up the codebase and keep it performant? This change sort of addresses both issues, but I understand that the expect-error is not ideal. It's fairly easy to "fix" by changing the visibility level of the property. The larger issue is whether or not we are OK with duplicated code and a perf hit for channelz (I'd prefer not to have any perf hit because of it). We could do binds, or create closures each time, like we currently do, but that adds GC pressure and the code paths can't be optimized properly. Ideally we want all functions to be monomorphic and inlined, with no GC where it's not needed. If we don't want to do anything about code duplication in this PR, I can revert the latest change, and keep
To be honest, I had forgotten that you were the one who originally split up Please do remove the |
I don't particularly like the *Ref objects that are used for tracking channelz data. To fully converge these paths, those need to be covered as well, and refs shouldn't be created when channelz is not enabled. I'll think more about how to converge both listeners into a single one with little perf hit. I will keep only the changes relevant to timers in this PR and open a new PR for the code convergence.
At this point, channelz refs are used as general purpose process-unique identifiers for various kinds of objects. When channelz is disabled, they're very lightweight, so I don't see that it's important to remove them.
Force-pushed from 8f49ea8 to 62e8ea9.
Reverted the sessionCtx change, kept the cleaned-up unref?.() calls, and added tests.
One thing I noticed is that there is no way for a client to stop reconnecting except by calling .close() on it. Is this intended? Shouldn't it connect lazily on the first outgoing call after the disconnect, instead of reconnecting the channel right away?
Actually, you're right, that is what it's supposed to do. It looks like that behavior got screwed up a little while ago.
packages/grpc-js/src/channelz.ts
if (trackedChild === undefined) {
  tracker.setElement(child.id, {
    // @ts-expect-error union issues
I don't like having this. What exactly is the error here? Are there any changes that could be made to the function or the data type to avoid that error?
It's just TypeScript being silly and not generalizing the type well enough. I'll see if I can work around that.
Adjusted this a little. It's more or less the same, just not an expect-error, so it will be more noticeable if something other than the typing bug goes wrong.
packages/grpc-js/src/server.ts
@@ -334,65 +370,61 @@ export class Server {

   private getChannelzSessionInfoGetter(
With this modified signature, this method is more accurately named getChannelzSessionInfo.
packages/grpc-js/src/channelz.ts
const id = getNextId();
const ref = { id, name, kind } as RefByType<R>;
if (channelzEnabled) {
  // @ts-expect-error typing issues
This is another instance that I didn't notice before. Please change this too.
I’ll fix it up similarly.
I noticed that the tests failed. Should I increase the deadline for the state change? Or maybe we can look at traces for this failure?
First, I'm just rerunning them, to check if that failure is consistent. You could try just increasing the timeout, to see if that works. Otherwise, you could add a |
I think this is in a reasonable state now. I would like to have less flakiness in the new tests, but I don't think that needs to be a blocker.
Merged 400147c into grpc:@grpc/grpc-js@1.10.x.
This is out in 1.10.2, along with a fix for the "client reconnects even when there are no active requests" behavior.
Hopefully you'll find the following code useful. It works on two areas:
channelz perf improvements
In the original implementation, arrays are used to track active servers, channels, subchannels, and sockets. That generally performs OK for the first three, but using delete on the sockets array creates a holey array, and iterating over it is extremely slow. For example, after a few thousand queries across many sessions, the leftover sessions have ids with large gaps between them, so you have to visit each undefined item in the array and check it; the cost grows with the array length rather than with the number of live sockets. To deal with that, I've used an ordered map based on red-black trees, which lets us iterate over many sockets with no problems and perform deletes fairly fast.
Furthermore, I've added "stubbed" tracker implementations. They don't currently cover all the calls, but they cover most of them, and more improvements could be made to the non-channelz paths. With stubbed calls that do nothing, we can have less branching in the areas shared between channelz and non-channelz code.
Added channelz and non-channelz session handlers.
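To illustrate the data-structure point, the sketch below swaps in the built-in `Map` rather than the red-black-tree ordered map that the PR describes. It relies on the assumption that ids are handed out in increasing order, so a `Map`'s insertion order matches id order. `SocketRegistry` and its methods are hypothetical names.

```typescript
// Hypothetical registry keyed by monotonically increasing ids.
class SocketRegistry<T> {
  private nextId = 1;
  private readonly entries = new Map<number, T>();

  add(value: T): number {
    const id = this.nextId++;
    this.entries.set(id, value);
    return id;
  }

  // Deleting removes the key entirely, so no "hole" is left behind.
  remove(id: number): void {
    this.entries.delete(id);
  }

  // Return up to maxResults live entries with id >= startId, in id order.
  page(startId: number, maxResults: number): Array<[number, T]> {
    const result: Array<[number, T]> = [];
    for (const [id, value] of this.entries) {
      if (id < startId) {
        continue;
      }
      if (result.length >= maxResults) {
        break;
      }
      result.push([id, value]);
    }
    return result;
  }
}
```

Iteration only visits live entries, so gaps between ids cost nothing. Unlike a tree-based ordered map, this sketch still scans linearly to reach `startId`, which a lower-bound lookup would avoid.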
idle session termination
More efficient than sending keepalive pings: creates a timer when a session has no active streams and destroys it when a stream is created.
Sessions are now tracked via their references.
Alongside that change, I found a bug that was introduced in 1.10.x where sessions would be cleaned up based on the "allSessions" ref and not a specific server ref (now that there could be multiple servers, it matters).
TODO:
dependency updates
prettier changes
To get the tests going, I had to "fix" linting issues from prettier.
Let me know your thoughts on the proposed changes.