Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stateOK reported in processState.getState() before all components have sent updates #11065

Open
vitorenesduarte opened this issue Mar 11, 2022 · 2 comments · Fixed by #11249
Open
Labels

Comments

@vitorenesduarte
Copy link
Contributor

Description

I think that processState.getState() can return stateOK before it should.

In the following piece of code it seems that stateOK can be returned as soon as a single component reports their status as ok, not all of them:

// Only return stateOK if *all* components are in stateOK.
if numNotOK == 0 && len(f.states) > 0 {
state = stateOK

Replacing len(f.states) > 0 with something like len(f.states) == componentsCount should fix it.

This was initially noticed in https://github.com/gravitational/cloud/issues/1432.

@vitorenesduarte
Copy link
Contributor Author

Started working on a fix in 35806dd but couldn't quickly figure out how to get the number of components given a TeleportProcess.

@vitorenesduarte vitorenesduarte self-assigned this Mar 18, 2022
vitorenesduarte pushed a commit that referenced this issue Apr 7, 2022
…#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.
vitorenesduarte pushed a commit that referenced this issue Apr 7, 2022
…#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.
vitorenesduarte pushed a commit that referenced this issue Apr 7, 2022
…#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.
vitorenesduarte pushed a commit that referenced this issue Apr 8, 2022
* Throw startup error if `TeleportReadyEvent` is not emitted (#11725)

* Throw startup error if `TeleportReadyEvent` is not emitted

Before this commit, the `TeleportReadyEvent` was only waited for when a
process reload occurred. Thus, if a bug exists in the code that emits
this event (as it's currently the case since the `MetricsReady` and
`WindowsDesktopReady` events are never emitted), such a bug may go
unnoticed for a while.

This commit ensures that the `TeleportReadyEvent` is always waited for
on startup, and throws an error if the event is not emitted (after some
timeout).

This commit also:
- removes the `MetricsReady` event (as this is not produced by a
  component that sends heartbeats, which is the case of every other
  event required by the `TeleportReadyEvent` event mapping)
- ensures that `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function

* Ensure stateOK is reported only when all components have sent updates (#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.

* Make `PortList.Pop()` thread-safe (#11799)
vitorenesduarte pushed a commit that referenced this issue Apr 8, 2022
* Throw startup error if `TeleportReadyEvent` is not emitted (#11725)

* Throw startup error if `TeleportReadyEvent` is not emitted

Before this commit, the `TeleportReadyEvent` was only waited for when a
process reload occurred. Thus, if a bug exists in the code that emits
this event (as it's currently the case since the `MetricsReady` and
`WindowsDesktopReady` events are never emitted), such a bug may go
unnoticed for a while.

This commit ensures that the `TeleportReadyEvent` is always waited for
on startup, and throws an error if the event is not emitted (after some
timeout).

This commit also:
- removes the `MetricsReady` event (as this is not produced by a
  component that sends heartbeats, which is the case of every other
  event required by the `TeleportReadyEvent` event mapping)
- ensures that `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function

* Ensure stateOK is reported only when all components have sent updates (#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.

* Make `PortList.Pop()` thread-safe (#11799)
@vitorenesduarte
Copy link
Contributor Author

Reopening as the PRs fixing this have been reverted (#12244) due to #12139 and then #12202.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants