Throw startup error if `TeleportReadyEvent` is not emitted #11725
Before this commit, the `TeleportReadyEvent` was only waited for when a process reload occurred. Thus, if a bug exists in the code that emits this event (as is currently the case, since the `MetricsReady` and `WindowsDesktopReady` events are never emitted), such a bug may go unnoticed for a while.

This commit ensures that the `TeleportReadyEvent` is always waited for on startup, and throws an error if the event is not emitted (after some timeout).

This commit also:
- removes the `MetricsReady` event (as this is not produced by a component that sends heartbeats, which is the case of every other event required by the `TeleportReadyEvent` event mapping)
- ensures that the `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
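To illustrate why a single missing input event is enough to keep `TeleportReadyEvent` from ever firing, here is a minimal sketch of an event mapping. It is not Teleport's supervisor code: the `readyMapping` type, its methods, and the `emit` callback are hypothetical names invented for this example, and the real mapping may be implemented quite differently.

```go
package main

import "sync"

// readyMapping is a hypothetical, simplified stand-in for an event mapping:
// it emits the "out" event once every event listed as input has been observed.
type readyMapping struct {
	mu      sync.Mutex
	pending map[string]struct{} // input events that have not been observed yet
	out     string              // event to emit when pending becomes empty
	fired   bool                // guards against emitting "out" more than once
	emit    func(event string)  // callback invoked exactly once with "out"
}

// newReadyMapping builds a mapping from the input events "in" to the event "out".
func newReadyMapping(out string, in []string, emit func(string)) *readyMapping {
	pending := make(map[string]struct{}, len(in))
	for _, name := range in {
		pending[name] = struct{}{}
	}
	return &readyMapping{pending: pending, out: out, emit: emit}
}

// observe records an input event and fires the output event the first time
// the set of pending inputs becomes empty. If any input is never observed,
// the output event is never emitted.
func (m *readyMapping) observe(event string) {
	m.mu.Lock()
	delete(m.pending, event)
	fire := len(m.pending) == 0 && !m.fired
	if fire {
		m.fired = true
	}
	m.mu.Unlock()

	if fire {
		m.emit(m.out)
	}
}
```

With a mapping built from, say, `WindowsDesktopReady` plus one other input, the output event is only emitted after both inputs have been observed. If a component never emits its input event, nothing downstream is notified, which is why waiting for the output event with a timeout on startup surfaces this kind of bug.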
72f4452 to 1c76b73
I tried to find objections to not waiting for the new process to emit `TeleportReady` before shutting down the old one, but I couldn't find any.
lib/service/service.go
Outdated
// Wait for the service to report that it has started.
startTimeoutCtx, startCancel := context.WithTimeout(ctx, signalPipeTimeout)
defer startCancel()
eventC := make(chan Event, 1)
srv.WaitForEvent(startTimeoutCtx, TeleportReadyEvent, eventC)
select {
case <-eventC:
	cfg.Log.Infof("Service has started successfully.")
case <-startTimeoutCtx.Done():
	warnOnErr(srv.Close(), cfg.Log)
	return trace.BadParameter("service has failed to start")
}
It looks like the `waitAndReload` logic already waits for `TeleportReadyEvent`, so after exiting `waitAndReload` the logic triggers a second wait for the `TeleportReadyEvent` event, but by then the `TeleportReadyEvent` event has already been broadcast.
Ah, good point. In a way this is ok since an already broadcast event will still be emitted if waited on, but there might be a better alternative.
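The behaviour described in this comment (a waiter that registers after the broadcast still gets the event) can be sketched as follows. This is only an illustration under stated assumptions: `replayBus`, `broadcast`, and `waitFor` are hypothetical names for this example and are not the API of Teleport's supervisor.

```go
package main

import "sync"

// replayBus is a hypothetical event bus that remembers every event it has
// broadcast, so that a waiter registering late is still notified.
type replayBus struct {
	mu      sync.Mutex
	seen    map[string]bool            // events that were already broadcast
	waiters map[string][]chan<- string // waiters registered before the broadcast
}

func newReplayBus() *replayBus {
	return &replayBus{
		seen:    make(map[string]bool),
		waiters: make(map[string][]chan<- string),
	}
}

// notify performs a non-blocking send, so callers should pass a buffered
// channel (capacity 1 is enough for a single event).
func notify(c chan<- string, name string) {
	select {
	case c <- name:
	default:
	}
}

// broadcast records the event and notifies every waiter registered so far.
func (b *replayBus) broadcast(name string) {
	b.mu.Lock()
	b.seen[name] = true
	ws := b.waiters[name]
	delete(b.waiters, name)
	b.mu.Unlock()

	for _, c := range ws {
		notify(c, name)
	}
}

// waitFor delivers name on c when it is broadcast. If the event was already
// broadcast, it is replayed to c immediately.
func (b *replayBus) waitFor(name string, c chan<- string) {
	b.mu.Lock()
	if b.seen[name] {
		b.mu.Unlock()
		notify(c, name)
		return
	}
	b.waiters[name] = append(b.waiters[name], c)
	b.mu.Unlock()
}
```

Under this design a second wait for an event that has already been broadcast returns right away instead of blocking until the timeout, which is the sense in which the double wait is harmless.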
Doesn't this PR remove that `WaitForEvent`?
(I think that a similar situation would already occur if two process reloads were triggered.)
> Doesn't this PR remove that `WaitForEvent`?
You're right. Sorry @smallinsky @espadolini, I misunderstood what `waitAndReload` was doing.
I pushed a new change (f73af18) to ensure that we always wait for the `TeleportReadyEvent` after calling `LocalSupervisor.Start()` (both on the initial startup and after a process reload), reverting the changes to `waitAndReload`.
CI seems to be passing now. Could you please take another look @espadolini?
…#11249)

Fixes #11065.

This commit:
- ensures that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, the goal is to detect a bug quickly when it is introduced:
1. if `componentCount` is lower than it should be, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should be, then an error is logged in function `processState.getStateLocked`.
* Throw startup error if `TeleportReadyEvent` is not emitted

Before this commit, the `TeleportReadyEvent` was only waited for when a process reload occurred. Thus, if a bug exists in the code that emits this event (as is currently the case, since the `MetricsReady` and `WindowsDesktopReady` events are never emitted), such a bug may go unnoticed for a while.

This commit ensures that the `TeleportReadyEvent` is always waited for on startup, and throws an error if the event is not emitted (after some timeout).

This commit also:
- removes the `MetricsReady` event (as this is not produced by a component that sends heartbeats, which is the case of every other event required by the `TeleportReadyEvent` event mapping)
- ensures that the `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function
* Throw startup error if `TeleportReadyEvent` is not emitted (#11725)

* Throw startup error if `TeleportReadyEvent` is not emitted

Before this commit, the `TeleportReadyEvent` was only waited for when a process reload occurred. Thus, if a bug exists in the code that emits this event (as is currently the case, since the `MetricsReady` and `WindowsDesktopReady` events are never emitted), such a bug may go unnoticed for a while.

This commit ensures that the `TeleportReadyEvent` is always waited for on startup, and throws an error if the event is not emitted (after some timeout).

This commit also:
- removes the `MetricsReady` event (as this is not produced by a component that sends heartbeats, which is the case of every other event required by the `TeleportReadyEvent` event mapping)
- ensures that the `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function

* Ensure stateOK is reported only when all components have sent updates (#11249)

Fixes #11065.

This commit:
- ensures that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, the goal is to detect a bug quickly when it is introduced:
1. if `componentCount` is lower than it should be, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should be, then an error is logged in function `processState.getStateLocked`.

* Make `PortList.Pop()` thread-safe (#11799)
* Revert "Make `PortList.Pop()` thread-safe (#11799)"

This reverts commit a17337d.

* Revert "Ensure stateOK is reported only when all components have sent updates (#11249)"

This reverts commit b749302.

* Revert "Throw startup error if `TeleportReadyEvent` is not emitted (#11725)"

This reverts commit 933e247.

* Revert "Fix ProxyKube not reporting its readiness (#12150)"

This reverts commit 6cdcfe7.
Fixes #11724.

Before this PR, the `TeleportReadyEvent` was only waited for when a process reload occurred. Thus, if a bug exists in the code that emits this event (as is currently the case, since the `MetricsReady` and `WindowsDesktopReady` events are never emitted; see #11724), such a bug may go unnoticed for a while.

This PR ensures that the `TeleportReadyEvent` is always waited for on startup, and throws an error if the event is not emitted (after some timeout).

This PR also:
- removes the `MetricsReady` event (as this is not produced by a component that sends heartbeats, which is the case of every other event required by the `TeleportReadyEvent` event mapping)
- ensures that the `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function

Testing

None.