Support health status interface #239

Closed
1 of 3 tasks
eyalkraft opened this issue Jun 23, 2022 · 12 comments · Fixed by #583
Labels: 8.7 Candidate, Team:Cloud Security, verified

Comments

@eyalkraft
Contributor

eyalkraft commented Jun 23, 2022

Motivation

Once libbeat is migrated to v2, cloudbeat should support the new health status interface to report detailed health information to the user, including when the user is required to upgrade.
This information will be surfaced in the UI, which is being tracked in .

Definition of done

  • If there's an error, cloudbeat should report it in a way viewable from Kibana.
  • If there's a mismatch between the integration and the cloudbeat version (which could mean missing rules/fetchers on cloudbeat's side), cloudbeat should indicate to the user that they need to upgrade their agent.
  • Make sure we understand what the behavior is once the status is not healthy: does the agent restart cloudbeat?

Out of scope

What is not included in this task

Related tasks/epics

@yashtewari
Contributor

yashtewari commented Nov 22, 2022

BeatV2Manager with the new status/health update interfaces is not yet as mature as we'd expected. It's not being used by the "core" beats yet, as described by this issue

So it would be risky to touch it for the v8.6.0 release. I'll add more details about this in another comment.

For now, I've set up the required information/logic for version compatibility, and also started using V1 of Manager.UpdateStatus so that it may be easier to update to V2 going forward.

The current behavior is the same as any other error in cloudbeat: the error about incompatible versions is logged and cloudbeat is restarted by the Agent. But this is only tentative and we should make a decision on it.

@yashtewari
Contributor

yashtewari commented Nov 23, 2022

I was able to send a Degraded signal using the V1 interface and make the Agent status on Kibana change to "Unhealthy".

Screenshot 2022-11-23 at 5 04 23 PM

We have two options now:

1. Fail fast: cloudbeat aborts when it is running an incompatible version

Pro: It makes doing upgrades to compatible versions "compulsory" in a way, so the user will immediately know that this action needs to be taken.
Con: If we change direction in the future on how we want to deal with version compatibility, we will still have to keep accounting for the current versions that abort on an incompatible version.

2. Mark cloudbeat as "degraded" when running incompatible versions but keep going

Pro: Doesn't disrupt flow and is more "forward compatible".
Con: Currently the status change to "degraded" doesn't reveal too much information on Kibana to the user. It's not clear from the UI which integration specifically makes the Agent "unhealthy", and so it's not easy to look at the right logs, find the problem, and rectify it.
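
A minimal sketch of the difference between the two options. The `Status` type, the `StatusReporter` interface, and the `recorder` are hypothetical stand-ins written for this example; they only mimic the shape of a status update call and are not the libbeat API.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the status values a beat can report.
type Status int

const (
	Running Status = iota
	Degraded
)

func (s Status) String() string {
	if s == Degraded {
		return "Degraded"
	}
	return "Running"
}

// StatusReporter is a hypothetical interface, not the libbeat one.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

// recorder remembers the last reported status so the example can inspect it.
type recorder struct {
	status Status
	msg    string
}

func (r *recorder) UpdateStatus(status Status, msg string) {
	r.status, r.msg = status, msg
}

var errIncompatible = errors.New("cloudbeat version is incompatible with the integration")

// failFast models option 1: return the error so the caller aborts.
func failFast(compatible bool) error {
	if !compatible {
		return errIncompatible
	}
	return nil
}

// degradeAndContinue models option 2: report Degraded and keep running.
func degradeAndContinue(compatible bool, r StatusReporter) {
	if !compatible {
		r.UpdateStatus(Degraded, errIncompatible.Error())
		return
	}
	r.UpdateStatus(Running, "")
}

func main() {
	fmt.Println("option 1 returns:", failFast(false))

	r := &recorder{}
	degradeAndContinue(false, r)
	fmt.Println("option 2 reports:", r.status, "-", r.msg)
}
```

With option 1 the caller sees an error and stops; with option 2 the process continues and only the reported status changes.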

@eyalkraft
Contributor Author

2 seems closer to what we want to end up with when the v2 health status reports work as expected

@oren-zohar
Collaborator

@eyalkraft @yashtewari agreed, let's go with the second option

@yashtewari
Contributor

Sounds good, PRs have been updated.

@olegsu
Contributor

olegsu commented Dec 12, 2022

I also think the second approach is better but it might confuse the users.
For example (theoretically):

  1. Running an 8.5 stack (Kibana and cloudbeat) and upgrading the stack to 8.6.
  2. As a result, the integration will also be bumped due to the auto-upgrade policy.
  3. The new integration specifies cloudbeat: ">= 8.6.0 <= 8.7.0".
  4. cloudbeat reports a degraded status when the only thing that changed was the stack (from the user's point of view). The only way to "fix" this is to upgrade all the agents in that policy, which will take time and work.

@eyalkraft
Contributor Author

eyalkraft commented Dec 13, 2022

@olegsu In case the specific benchmark wasn't affected, the integration shouldn't require a newer version of cloudbeat.

I'm not sure where in the integration you decided to include

  1. The new integration specifies cloudbeat: ">= 8.6.0 <= 8.7.0".

But ideally it belongs to a policy template and then for KSPM (which didn't change) you can specify
cloudbeat: ">= 8.5.0 <= 8.7.0"
And for CSPM (not these exact versions but you get the idea):
cloudbeat: ">= 8.6.0 <= 8.7.0"

And maybe not even include the <= 8.7.0
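
A rough sketch of how per-policy-template requirements like the ones above could be evaluated. The constraint syntax handling (only `>=` and `<=`, three-part versions), the function names, and the `kspm`/`cspm` map keys are assumptions made for this example, not how the integration or cloudbeat actually stores or parses them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "8.6.0" into [3]int{8, 6, 0}.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("invalid version %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return v, fmt.Errorf("invalid version %q: %w", s, err)
		}
		v[i] = n
	}
	return v, nil
}

// compare returns -1, 0, or 1 when a is lower than, equal to, or higher than b.
func compare(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// satisfies checks a version against a constraint such as ">= 8.5.0 <= 8.7.0".
// Only the >= and <= operators are handled in this sketch.
func satisfies(version, constraint string) (bool, error) {
	v, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	fields := strings.Fields(constraint)
	if len(fields)%2 != 0 {
		return false, fmt.Errorf("invalid constraint %q", constraint)
	}
	for i := 0; i < len(fields); i += 2 {
		bound, err := parseVersion(fields[i+1])
		if err != nil {
			return false, err
		}
		c := compare(v, bound)
		switch fields[i] {
		case ">=":
			if c < 0 {
				return false, nil
			}
		case "<=":
			if c > 0 {
				return false, nil
			}
		default:
			return false, fmt.Errorf("unsupported operator %q", fields[i])
		}
	}
	return true, nil
}

func main() {
	// Hypothetical requirements, one per policy template.
	requirements := map[string]string{
		"kspm": ">= 8.5.0 <= 8.7.0",
		"cspm": ">= 8.6.0 <= 8.7.0",
	}
	for _, name := range []string{"kspm", "cspm"} {
		ok, err := satisfies("8.5.0", requirements[name])
		fmt.Println(name, "satisfied by 8.5.0:", ok, err)
	}
}
```

Under these assumptions an 8.5.0 cloudbeat still satisfies the unchanged KSPM requirement after the package upgrade, while only the CSPM template asks for a newer agent.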

@olegsu
Contributor

olegsu commented Dec 13, 2022

I'm not sure where in the integration you decided to include

I was referring to the template, where the version will be added.

@olegsu In case the specific benchmark wasn't affected, the integration shouldn't require a newer version of cloudbeat.

You said it even better: the requirement applies when the benchmark was updated. I think the issue is that it will be updated regardless of the user's knowledge. From their point of view, the only thing that was done is upgrading the Elastic stack, and for some reason the KSPM integration is now degraded.

@eyalkraft
Contributor Author

@olegsu but specific policy templates (KSPM) including their cloudbeat version requirements can remain intact, even if we upgrade the integration package as a whole

@olegsu
Contributor

olegsu commented Dec 13, 2022

Update after sync with @eyalkraft
Instead of checking version compatibility, cloudbeat will check that all the benchmark types are recognized (part of a hard-coded list).
When an unknown type is added, cloudbeat will report a degraded status.
This addresses the case above.
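
A minimal sketch of that check, assuming a hard-coded set of benchmark names. The names in `supportedBenchmarks`, the `Status` type, and the message wording are illustrative assumptions and may not match what cloudbeat actually uses.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the status cloudbeat would report.
type Status string

const (
	Running  Status = "Running"
	Degraded Status = "Degraded"
)

// supportedBenchmarks is a hypothetical hard-coded list of recognized types.
var supportedBenchmarks = map[string]bool{
	"cis_k8s": true,
	"cis_eks": true,
}

// statusFor returns Degraded as soon as a configured benchmark type
// is not part of the hard-coded list, and Running otherwise.
func statusFor(benchmarks []string) (Status, string) {
	for _, b := range benchmarks {
		if !supportedBenchmarks[b] {
			return Degraded, fmt.Sprintf("benchmark %q is not recognized by this cloudbeat version; upgrade the agent", b)
		}
	}
	return Running, ""
}

func main() {
	status, msg := statusFor([]string{"cis_k8s"})
	fmt.Println(status, msg)

	// "cis_new" stands for a type introduced by a newer integration.
	status, msg = statusFor([]string{"cis_k8s", "cis_new"})
	fmt.Println(status, msg)
}
```

Because the check looks only at benchmark types, a stack upgrade that bumps the integration without adding new types leaves older agents healthy.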

@olegsu olegsu linked a pull request Dec 18, 2022 that will close this issue
2 tasks
@olegsu
Contributor

olegsu commented Feb 14, 2023

This is not trivial to test on BC, as it requires changing either the integration or cloudbeat to a custom version.

If there's an error, cloudbeat should report it in a way viewable from Kibana

This was not addressed in a general manner; we report errors on re-configuration issues only.

We have two more categories of issues that we might want to address

  1. Initialization issues: such as invalid credentials. Cloudbeat currently crashes with a message in the log.
  2. Runtime issues: with the same example, the credentials might become invalid or expire after some period of time. I am not sure what the outcome of this will be.

Update
I compiled a custom cloudbeat from the 8.7 branch with a minor change to report a degraded status every 30 seconds.
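
The actual change is not shown in this thread. A periodic report of this kind could look roughly like the sketch below, where `reportDegraded`, `demo`, and the `report` callback are hypothetical names and the callback stands in for whatever status update call cloudbeat uses.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reportDegraded calls report once per interval until ctx is cancelled
// and returns how many reports were sent.
func reportDegraded(ctx context.Context, interval time.Duration, report func(msg string)) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	count := 0
	for {
		select {
		case <-ctx.Done():
			return count
		case <-ticker.C:
			count++
			report(fmt.Sprintf("test: degraded report #%d", count))
		}
	}
}

// demo runs the loop with short durations so the example finishes quickly.
// The test described above used a 30 second interval instead.
func demo() int {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	return reportDegraded(ctx, 100*time.Millisecond, func(msg string) {
		fmt.Println(msg)
	})
}

func main() {
	fmt.Println("reports sent:", demo())
}
```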

Outcome

In Fleet UI the status of the agent is degraded (nothing points to cloudbeat)
image

Cloudbeat does not restart after almost 10 min of the test period
image

elastic-agent status --output json shows that cloudbeat is in degraded status, with the message that was reported.
image

elastic-agent status shows that one component is in degraded status but every running beat is healthy (this might be a bug @cmacknz)
image

@cmacknz
Member

cmacknz commented Feb 14, 2023

elastic-agent status shows that one component is in degraded status but every running beat is healthy (this might be a bug @cmacknz)

Run elastic-agent status --output=json to see the entire status report. We hide this by default, but we shouldn't: elastic/elastic-agent#2107.

@olegsu olegsu added the verified label for fixed and retested issues label Feb 17, 2023

6 participants