[Fleet] Agent gets stuck in the updating state if the upgrade action fails #2508
Comments
@manishgupta-qasource Please review.
Secondary review for this ticket is Done.
The upgrade on the agent side starts and immediately fails with:
{"log.level":"info","@timestamp":"2023-02-23T19:12:04.841Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":116},"message":"Upgrading agent","log":{"source":"elastic-agent"},"version":"8.7.0","source_uri":"","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-02-23T19:12:04.849Z","log.logger":"transport","log.origin":{"file.name":"transport/tcp.go","file.line":52},"message":"DNS lookup failure \"test-artifacts.elastic.co\": lookup test-artifacts.elastic.co: no such host","ecs.version":"1.6.0"}
@amolnater-qasource can you try this again with debug logging enabled and upload the diagnostics? This is a real bug; I just want as much detail about it as possible.
I think this is an easy way to reproduce the agent getting stuck in the Updating state when the upgrade fails for any reason. Using an invalid URL for the artifact is just a very convenient way to trigger the problem.
Hi @cmacknz, thank you for looking into this.
Agent Logs:
Please let us know if anything else is required from our end.
@cmacknz Could you help investigate this? I looked at the logs, and I see the error with the artifacts URL, but there is nothing related to the upgrade after that. Is it possible that the agent didn't ack the action after the error?
Yes, I'll take another look; it is possible the agent isn't acknowledging the upgrade here.
I definitely think there are ways that the upgrade acknowledgment can get lost. The situation in this issue is almost the best case, though, because we can detect that the upgrade will never complete as soon as we attempt to download the upgrade artifact. We should be acknowledging the action in this case, so I am inclined to think the problem is on the agent side of things here.
Well, we aren't automatically setting the Error field of an action acknowledgement for non-retryable errors; see the one call in elastic-agent/internal/pkg/agent/application/dispatcher/dispatcher.go, lines 135 to 154 at e1b4c21.
The upgrade action is a retryable error, however, as it satisfies the retryable action interface: elastic-agent/internal/pkg/fleetapi/action.go, lines 64 to 84 and lines 294 to 295 at e1b4c21.
So we should be acking it here: elastic-agent/internal/pkg/agent/application/dispatcher/dispatcher.go, lines 247 to 261 at e1b4c21.
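To make the branch being discussed concrete, here is a rough sketch of that dispatch decision. The type and function names are assumptions for illustration only, not the actual elastic-agent code at the lines referenced above:

```go
package main

import "fmt"

// RetryableAction mirrors the idea behind the fleetapi interface referenced
// above: actions that carry retry bookkeeping. The method set here is an
// assumption for illustration.
type RetryableAction interface {
	RetryAttempt() int
	SetRetryAttempt(int)
}

// upgradeAction is a stand-in for an upgrade action that satisfies the
// retryable interface.
type upgradeAction struct{ attempt int }

func (a *upgradeAction) RetryAttempt() int     { return a.attempt }
func (a *upgradeAction) SetRetryAttempt(n int) { a.attempt = n }

// handleFailure shows the decision point under discussion: actions that
// satisfy the retryable interface are rescheduled instead of being acked
// immediately, while everything else should be acked with the error so
// Fleet can clear the Updating state.
func handleFailure(action interface{}, err error) {
	if r, ok := action.(RetryableAction); ok {
		r.SetRetryAttempt(r.RetryAttempt() + 1)
		fmt.Printf("scheduling retry %d after error: %v\n", r.RetryAttempt(), err)
		return
	}
	fmt.Printf("acking action with error: %v\n", err)
}

func main() {
	handleFailure(&upgradeAction{}, fmt.Errorf("DNS lookup failure"))
	handleFailure("some-non-retryable-action", fmt.Errorf("DNS lookup failure"))
}
```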
Based on the similar symptoms in #2433, I can see that the agent does acknowledge upgrade failures but the UI continues to show the agent as updating. Converting this to a Fleet issue.
Pinging @elastic/fleet (Team:Fleet)
Bug Conversion: Test case not required, as this particular checkpoint is already covered in the following testcase. Thanks
@kpollich @juliaElastic I raised this one's priority as it is causing many SDHs.
I'm also still experiencing the same issue I reported in elastic/kibana#2343, except that the first 5-10 agents in the batch are updating, where they weren't before. E.g., I'll schedule an upgrade for 200+ agents and only the first 7 will upgrade. They upgrade as expected if I schedule agents for "immediately."
I could reproduce this issue locally; this is what I'm seeing in the agent doc. The
When looking at the agent logs, I don't see anything related to retry, only a warn log with the failure.
I'm trying to add more agent logs to see what's going on. EDIT: I added some logging and scheduleRetry is being called.
Checking the agent this morning, I found that it went back to
According to the defaultRetryConfig, the retries should finish in about 2 hours, so I am not sure why it took half a day. I'll assign this back to
Repeated the upgrade test this morning with the invalid download source; the retries finish in about 2 hours and the agent goes back to
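For intuition about where the roughly two-hour figure comes from, here is a small sketch that sums an assumed retry schedule. The intervals below are illustrative stand-ins, not necessarily the real defaultRetryConfig values:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed retry schedule for the upgrade action; illustrative only.
	// With intervals like these, the final attempt happens roughly two
	// hours after the initial failure, which matches the behaviour
	// described above.
	intervals := []time.Duration{
		1 * time.Minute,
		5 * time.Minute,
		10 * time.Minute,
		15 * time.Minute,
		30 * time.Minute,
		1 * time.Hour,
	}

	var total time.Duration
	for i, d := range intervals {
		total += d
		fmt.Printf("retry %d scheduled %s after the previous attempt (cumulative %s)\n", i+1, d, total)
	}
	fmt.Printf("last retry fires about %s after the initial failure\n", total)
}
```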
@amolnater-qasource When you do the upgrade and wait 2 hours, does the agent still remain in the Updating state?
@juliaElastic Thanks for the detailed analysis of what the Elastic Agent is doing in this situation. Unless I am reading your analysis incorrectly, it seems that the Elastic Agent is working as designed. I think the confusion is the amount of time that the retry can take, which is almost 2 hours before the final try is performed and the Updating state is cleared. I see a couple of possible solutions:
I don't think any of these is a perfect solution; more visibility into what is happening on the agent for an upgrade might be a better path forward. Adding the ability for the Elastic Agent to say that it's going to retry, why it needs to retry, and when it will perform the retry would improve visibility into what is happening.
I think it would make sense to categorize some errors, like DNS lookup errors, as unretryable, as in your option 2. Improving the visibility of the retrying status would also help. There is already a related improvement raised: https://github.com/elastic/ingest-dev/issues/1621
That error could be transient if DNS is temporarily down, so marking it non-retryable is still questionable. Another possibility is that a new DNS entry has not fully propagated and a retry in a few minutes would succeed.
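To illustrate why this classification is tricky, here is a hedged sketch of what such a check might look like using only the Go standard library; this is not how the agent currently classifies errors, and even this version treats "no such host" as permanent, which the comment above argues is debatable:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isLikelyPermanent sketches one way an error could be classified as
// non-retryable. Even this is questionable: a *net.DNSError with IsNotFound
// set can still be transient (e.g. a record that has not propagated yet),
// which is exactly the concern raised above.
func isLikelyPermanent(err error) bool {
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		// Treat "no such host" as permanent only if it is not marked
		// temporary; other DNS failures stay retryable.
		return dnsErr.IsNotFound && !dnsErr.IsTemporary
	}
	return false
}

func main() {
	// ".invalid" is reserved and never resolves, so this reliably fails.
	_, err := net.LookupHost("test-artifacts.elastic.co.invalid")
	fmt.Println("lookup error:", err)
	fmt.Println("classified as permanent:", isLikelyPermanent(err))
}
```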
If the decision is made to go with option 1, I think the 2-hour window should be reflected in the Agent (upgrade) documentation, the Agent upgrade UI, or both. Having worked through multiple bumpy upgrade processes with Agent in the past year, I would very much appreciate such a sanity check while upgrading.
We need to make the UI show that the upgrade is retrying, including when the next retry will occur and why the previous attempt failed. This fix can be included as part of the upgrade state reporting redesign linked earlier. DNS errors can be transient and are a perfect example of a retryable error that the agent action retry mechanism should resolve automatically; we do not want the user to have to intervene here. The problem is that the UI doesn't tell the user what is happening or what is wrong. In the short term it might be best to add more info-level logging to indicate that the agent is actually retrying. I don't think we have a way to communicate the retry back to the UI; as far as the UI knows today, the agent upgrade is still pending a final result. We can add something like what Julia added for debugging:
The scheduled retry log above should include the time of the next scheduled retry and how many retries are remaining.
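As a hedged sketch of what such an info-level message could carry; the field names and the use of the standard library logger are assumptions for illustration, not the agent's actual logging code:

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Hypothetical values; in the agent these would come from the retry
	// scheduler's state for the upgrade action.
	attempt := 2
	maxRetries := 6
	backoff := 10 * time.Minute
	nextRetry := time.Now().Add(backoff)

	// The kind of info-level message proposed above: say that a retry is
	// scheduled, when it will run, and how many attempts remain.
	logger.Info("upgrade artifact download failed, retry scheduled",
		"attempt", attempt,
		"retries_remaining", maxRetries-attempt,
		"backoff", backoff.String(),
		"next_retry_at", nextRetry.Format(time.RFC3339),
	)
}
```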
We should also include the upgrade retry mechanism in the Fleet documentation in a more obvious way, as suggested by Wiegar.
We have revalidated this on the latest 8.7.1 BC2 kibana cloud environment and found it still reproducible. Observations:
Agent Logs:
Hosted Fleet Server Logs:
Build details:
Please let us know if anything else is required from our end.
In the latest diagnostics, I only see the one log indicating that the agent is moved to
Hi Team, we have revalidated this issue on the latest 8.9.0 BC4 kibana cloud environment and found it fixed now. Observations:
Build details:
Hence we are closing this issue and marking it as QA:Validated. Thanks!
Kibana version: 8.7 BC3 kibana cloud environment
Host OS and Browser version: All, All
Build details:
Preconditions:
Steps to reproduce:
Set the agent binary download source to https://test-artifacts.elastic.co/downloads/ (an invalid artifact URL), trigger an agent upgrade from Fleet, and observe that the agent remains in the Updating state throughout.
Expected Result:
Agent should fail the upgrade with the invalid agent binary and should reset for re-upgrade after some time, say 10 minutes.
Logs:
elastic-agent-diagnostics-2023-02-24T03-37-26Z-00.zip
Note:
Agent remains in the Updating state for over 12 hours.
Screenshots: