[test-failed]: Chrome UI Functional Tests1.test/functional/apps/status_page/index·ts - status page should display the server status #92299
Comments
Pinging @elastic/kibana-core (Team:Core) |
I am not sure what is causing the instance to go red. Until we know more, I am going to label this as a bug so we can investigate it with higher priority. |
Also to note, before the tests start the status is green.
|
|
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
cc: Alerting team - would the alerting framework being unavailable cause this? How would we debug it? |
You can do some amount of debugging by looking at the api status, via:
I just tried that with a new-ish 7.11.0 deployment, and can see there's a 503 error for the alert health code trying to access the |
Should mention that it makes that check against the task manager index every 5 minutes, so I'd expect if the 503 is transient, it will only affect a 5 minute window of time - of course that's still awful, but it would be interesting to know if this status ends up "clearing" over time. |
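For anyone debugging along the same lines, here is a minimal TypeScript sketch of scanning a parsed status payload for entries that are not green, so the failing plugin stands out. The payload shape (`overall`, `statuses`, `id`, `state`, `message`), the sample data, and the function name `findUnhealthy` are assumptions for illustration; the real status response may differ between versions.

```typescript
// Hypothetical shape of a status payload; the real response may differ.
interface StatusEntry {
  id: string; // e.g. a plugin identifier
  state: string; // e.g. "green", "yellow", "red"
  message?: string;
}

interface StatusPayload {
  overall: { state: string };
  statuses: StatusEntry[];
}

// Return every entry whose state is not "green".
function findUnhealthy(payload: StatusPayload): StatusEntry[] {
  return payload.statuses.filter((entry) => entry.state !== "green");
}

// Sample data, invented for illustration only.
const sample: StatusPayload = {
  overall: { state: "red" },
  statuses: [
    { id: "plugin:a", state: "green" },
    { id: "plugin:b", state: "red", message: "Service Unavailable" },
  ],
};

const unhealthy = findUnhealthy(sample);
```

Listing only the non-green entries makes it easier to see whether one plugin is dragging the overall status to red.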
Thanks @pmuellr, in my case it did not appear to clear. |
Just want to add that my cluster is on staging, which has seen a higher rate of Service Unavailable errors. |
How does this impact how alerting functions? |
As near as I can tell, it doesn't. In my deployment, while seeing the red status on alerting, I can run an existing alert and see it firing, meaning all sorts of things are in working order: alerting, task manager, saved objects, etc. Honestly, I'm not sure that we USE the health status to gate any function (it appears we don't for normal alert operations). We're just updating the core status every couple of minutes, and somewhere along the way, if an error occurs, it will get stuck on that error. It's also unfortunate that the status blob doesn't have any dates associated with the individual plugin status; I bet we'd see the date was some time ago. We'd also want to add dates to the blob that the alerting framework provides to the status, since there's the date we query task manager and also the date the task manager document was last updated, all of which would be good to have for diagnostic purposes. |
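The point about missing dates could look something like the following TypeScript sketch, which attaches a check time (and optionally the source document's last-update time) to a status value so that a stale red can be told apart from a fresh one. All names here (`DatedStatus`, `checkedAt`, `sourceUpdatedAt`, `isStale`) and the sample timestamps are hypothetical and not part of any existing status API.

```typescript
// Hypothetical status value carrying diagnostic timestamps.
interface DatedStatus {
  level: "green" | "red";
  message: string;
  checkedAt: Date; // when the health check ran
  sourceUpdatedAt?: Date; // when the underlying document was last updated, if known
}

// A status is considered stale if it is older than maxAgeMs relative to `now`.
function isStale(status: DatedStatus, now: Date, maxAgeMs: number): boolean {
  return now.getTime() - status.checkedAt.getTime() > maxAgeMs;
}

const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Invented example: a red status recorded at 10:00 UTC.
const stuck: DatedStatus = {
  level: "red",
  message: "503",
  checkedAt: new Date("2021-03-01T10:00:00Z"),
};

const staleAfterHalfHour = isStale(stuck, new Date("2021-03-01T10:30:00Z"), FIVE_MINUTES_MS);
const staleAfterTwoMinutes = isStale(stuck, new Date("2021-03-01T10:02:00Z"), FIVE_MINUTES_MS);
```

With a timestamp on each entry, a red status that is much older than the check interval would point to a stuck value rather than an ongoing failure.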
I have a case where the status is going red on Kibana, but I did not receive any 503 Service Unavailable errors. |
Just noting it looks like it is expected to fail on any http error code and does not clear. |
Hi @ymao1, thanks. On cloud we can't test the PR; however, I am also seeing this test fail in local testing, so if we can reproduce it there we can test it that way. In what I am seeing, Service Unavailable is not what is causing the issue. |
Okay, so I am seeing this occur in two different ways: 1) On cloud, a system that has been running for a while hits Service Unavailable and the status gets stuck on red. 2) Locally, when running the test through our functional tests. @ymao1
@elastic/kibana-core and @elastic/kibana-security teams when I run this test |
Server Log
|
https://localhost:5601/api/status?v8format=true shows not found
|
It looks like the alerting framework health check is failing with The alerting framework creates a task manager task to check the alert status. Then it periodically performs a I recently fixed another flaky functional test that was caused by trying to access a saved object mid migration when starting up the test and this feels very similar to me. |
After messaging with @ymao1, it looks like we have two different issues. One is just a test limitation with FTR. For the cloud one, we won't be able to test it until the PR is merged and included in the next cloud snapshot. |
@kobelb gave me this command:
This does cause the status to go red and throws a not found error. Is that what the functional tests are causing to happen? If so, is it a test issue, and how do we fix it? cc @elastic/kibana-core |
No, the test is just accessing the status page and asserting that the status is From your logs in #92299 (comment), it seems the failure was caused by a migration issue.
How often is the test failing exactly? |
Thanks @pgayvallet for taking a look. It has been failing consistently since 7.12 and was not failing before. |
This PR addressed the issue of the alerting status health check never recovering from a transient failure, which was brought to light by this issue. Although that PR does not resolve the overall issue, I believe it does resolve the alerting portion of it. |
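As a rough illustration of the difference between a health status that gets stuck and one that recovers, the TypeScript sketch below derives the status from the most recent check result instead of latching on the first error. This is not the code from the PR; `HealthTracker`, `CheckResult`, `record`, and `current` are hypothetical names, and the error string is invented.

```typescript
// Result of a single periodic health check (hypothetical type).
type CheckResult = { ok: true } | { ok: false; error: string };

class HealthTracker {
  private level: "green" | "red" = "green";
  private lastError: string | undefined;

  // Each result overwrites the previous one, so a later success
  // clears an earlier transient failure.
  record(result: CheckResult): void {
    if (result.ok) {
      this.level = "green";
      this.lastError = undefined;
    } else {
      this.level = "red";
      this.lastError = result.error;
    }
  }

  current(): { level: "green" | "red"; error?: string } {
    return this.lastError === undefined
      ? { level: this.level }
      : { level: this.level, error: this.lastError };
  }
}

const tracker = new HealthTracker();
tracker.record({ ok: false, error: "503 Service Unavailable" });
const duringOutage = tracker.current().level;
tracker.record({ ok: true });
const afterRecovery = tracker.current().level;
```

The key design choice is that a successful check explicitly resets the error state, rather than only ever writing to it on failure.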
Version: 7.12.1
Other test failures: Test Report: https://internal-ci.elastic.co/view/Stack%20Tests/job/elastic+estf-cloud-kibana-tests/1653/testReport/ |
Version: 7.13.0
Other test failures: Test Report: https://internal-ci.elastic.co/view/Stack%20Tests/job/elastic+estf-cloud-kibana-tests/1685/testReport/ |
This might be similar to the issue I described here: #94968 (comment) |
Given this update by @ymao1 I'm removing this bug from the Alerting feature/team. |
Version: 7.12.0
Class: Chrome UI Functional Tests1.test/functional/apps/status_page/index·ts
Stack Trace:
Other test failures:
Test Report: https://internal-ci.elastic.co/view/Stack%20Tests/job/elastic+estf-cloud-kibana-tests/1375/testReport/