
[Fleet] Agent gets stuck in the updating state if the upgrade action fails #2508

Closed
amolnater-qasource opened this issue Feb 24, 2023 · 25 comments
Labels: bug, impact:medium, QA:Validated, Team:Elastic-Agent

Comments

@amolnater-qasource

Kibana version: 8.7 BC3 kibana cloud environment

Host OS and Browser version: All, All

Build details:

VERSION: 8.7 BC3
BUILD: 60804
COMMIT: abbdcf43f4e89f6fc085360252514aed7c032b4b

Preconditions:

  1. An 8.7 BC3 Kibana cloud environment should be available.
  2. A lower-version 8.6.2 agent should be installed.

Steps to reproduce:

  1. Navigate to Fleet > Settings and add an invalid agent binary download source: https://test-artifacts.elastic.co/downloads/
  2. Navigate to the Fleet > Agents tab.
  3. Trigger an agent upgrade for the lower-version agent.
  4. Observe that the agent never fails and remains stuck in Updating.

Expected Result:
The agent upgrade should fail with the invalid agent binary source, and the agent should be reset for re-upgrade after some time, say 10 minutes.

Logs:
elastic-agent-diagnostics-2023-02-24T03-37-26Z-00.zip

Note:

  • We have observed that the agent remained in the Updating state for over 12 hours.

Screenshots: (attached)

@amolnater-qasource added the bug, Team:Elastic-Agent-Control-Plane, and impact:medium labels Feb 24, 2023
@amolnater-qasource (Author)

@manishgupta-qasource Please review.

@manishgupta-qasource

Secondary review for this ticket is done.

@cmacknz (Member) commented Feb 27, 2023

The upgrade on the agent side starts and immediately fails with:

{"log.level":"info","@timestamp":"2023-02-23T19:12:04.841Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":116},"message":"Upgrading agent","log":{"source":"elastic-agent"},"version":"8.7.0","source_uri":"","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-02-23T19:12:04.849Z","log.logger":"transport","log.origin":{"file.name":"transport/tcp.go","file.line":52},"message":"DNS lookup failure \"test-artifacts.elastic.co\": lookup test-artifacts.elastic.co: no such host","ecs.version":"1.6.0"}

@amolnater-qasource can you try this again with debug logging enabled and upload the diagnostics? This is a real bug; I just want as much detail about it as possible.

@cmacknz (Member) commented Feb 27, 2023

I think this is an easy way to reproduce the agent getting stuck in the Updating state when the upgrade fails for any reason. Using an invalid URL for the artifact is just a very convenient way to trigger the problem.

@amolnater-qasource (Author)

Hi @cmacknz

Thank you for looking into this.
We have revalidated this issue on the latest 8.7.0 BC3 and triggered the upgrade after setting the agent's logging level to debug.

Screenshot: (attached)

Agent Logs:
elastic-agent-diagnostics-2023-02-28T07-22-23Z-00.zip

Please let us know if anything else is required from our end.
Thanks!

@cmacknz changed the title from "Agent gets stuck in updating throughout while upgrade with invalid agent binary" to "Agent gets stuck in the updating state if the upgrade action fails" Feb 28, 2023
@juliaElastic (Contributor) commented Mar 1, 2023

@cmacknz Could you help investigate this? I looked at the logs, and I see the error with the artifacts URL, but there is nothing related to the upgrade after that. Is it possible that the agent didn't ack the action after the error?

{"log.level":"info","@timestamp":"2023-02-28T07:18:14.565Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":116},"message":"Upgrading agent","log":{"source":"elastic-agent"},"version":"8.7.0","source_uri":"","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-02-28T07:18:14.571Z","log.logger":"transport","log.origin":{"file.name":"transport/tcp.go","file.line":52},"message":"DNS lookup failure \"test-artifacts.elastic.co\": lookup test-artifacts.elastic.co: no such host","ecs.version":"1.6.0"}

@cmacknz (Member) commented Mar 1, 2023

Yes, I'll take another look; it is possible the agent isn't acknowledging the upgrade here.

@cmacknz (Member) commented Mar 2, 2023

I definitely think there are ways that the upgrade acknowledgment can get lost. The situation in this issue is almost the best case, though, since we can detect that the upgrade will never complete as soon as we attempt to download the upgrade artifact.

We should be acknowledging the action in this case, so I am inclined to think the problem is on the agent side of things here.

@cmacknz (Member) commented Mar 2, 2023

Well, we aren't automatically setting the Error field of an action acknowledgement for non-retryable errors (see the one call to SetError for retryable errors):

var reportedErr error
for _, action := range actions {
    if err = ctx.Err(); err != nil {
        ad.errCh <- err
        return
    }
    if err := ad.dispatchAction(ctx, action, acker); err != nil {
        rAction, ok := action.(fleetapi.RetryableAction)
        if ok {
            rAction.SetError(err) // set the retryable action error to what the dispatcher returned
            ad.scheduleRetry(ctx, rAction, acker)
            continue
        }
        ad.log.Debugf("Failed to dispatch action '%+v', error: %+v", action, err)
        reportedErr = err
        continue
    }
    ad.log.Debugf("Successfully dispatched action: '%+v'", action)
}

The upgrade action is retryable, however, as it satisfies the RetryableAction interface:

// RetryableAction is an Action that may be scheduled for a retry.
type RetryableAction interface {
    ScheduledAction
    // RetryAttempt returns the retry-attempt number of the action
    // the retry_attempt number is meant to be an interal counter for the elastic-agent and not communicated to fleet-server or ES.
    // If RetryAttempt returns > 1, and GetError is not nil the acker should signal that the action is being retried.
    // If RetryAttempt returns < 1, and GetError is not nil the acker should signal that the action has failed.
    RetryAttempt() int
    // SetRetryAttempt sets the retry-attempt number of the action
    // the retry_attempt number is meant to be an interal counter for the elastic-agent and not communicated to fleet-server or ES.
    SetRetryAttempt(int)
    // SetStartTime sets the start_time of the action to the specified value.
    // this is used by the action-retry mechanism.
    SetStartTime(t time.Time)
    // GetError returns the error that is associated with the retry.
    // If it is a retryable action fleet-server should mark it as such.
    // Otherwise fleet-server should mark the action as failed.
    GetError() error
    // SetError sets the retryable action error
    SetError(error)
}

// RetryAttempt will return the retry_attempt of the action
func (a *ActionUpgrade) RetryAttempt() int {

So we should be acking it here:

func (ad *ActionDispatcher) scheduleRetry(ctx context.Context, action fleetapi.RetryableAction, acker acker.Acker) {
    attempt := action.RetryAttempt()
    d, err := ad.rt.GetWait(attempt)
    if err != nil {
        ad.log.Errorf("No more reties for action id %s: %v", action.ID(), err)
        action.SetRetryAttempt(-1)
        if err := acker.Ack(ctx, action); err != nil {
            ad.log.Errorf("Unable to ack action failure (id %s) to fleet-server: %v", action.ID(), err)
            return
        }
        if err := acker.Commit(ctx); err != nil {
            ad.log.Errorf("Unable to commit action failure (id %s) to fleet-server: %v", action.ID(), err)
        }
        return
    }
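
To make the contract above concrete, here is a minimal sketch, not the actual elastic-agent or fleet-server code, of how an acker could read a RetryableAction when deciding what to report: a positive RetryAttempt with an error means another retry is scheduled, while a negative one means the retries are exhausted and the action has failed (matching the "retrying" and "failed" upgrade_status values that come up later in this thread). The retryableAction interface and upgradeStatusFor helper below are illustrative names, not real symbols.

package main

import (
    "errors"
    "fmt"
)

// retryableAction mirrors only the parts of fleetapi.RetryableAction used below;
// the real interface lives in the elastic-agent fleetapi package.
type retryableAction interface {
    RetryAttempt() int
    GetError() error
}

// upgradeStatusFor is illustrative only: it maps the contract documented on
// RetryableAction to the upgrade_status values seen in this issue.
func upgradeStatusFor(a retryableAction) string {
    switch {
    case a.GetError() == nil:
        return "acked" // hypothetical label for a successful dispatch
    case a.RetryAttempt() < 1:
        return "failed" // retries exhausted, SetRetryAttempt(-1) was called
    default:
        return "retrying" // another attempt is still scheduled
    }
}

// fakeUpgrade stands in for an upgrade action in this sketch.
type fakeUpgrade struct {
    attempt int
    err     error
}

func (f fakeUpgrade) RetryAttempt() int { return f.attempt }
func (f fakeUpgrade) GetError() error   { return f.err }

func main() {
    fmt.Println(upgradeStatusFor(fakeUpgrade{attempt: 2, err: errors.New("DNS lookup failure")}))  // retrying
    fmt.Println(upgradeStatusFor(fakeUpgrade{attempt: -1, err: errors.New("DNS lookup failure")})) // failed
}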

@cmacknz (Member) commented Mar 31, 2023

Based on the similar symptoms in #2433, I can see that the agent does acknowledge upgrade failures, but the UI continues to show the agent as updating. Converting this to a Fleet issue.

@cmacknz changed the title from "Agent gets stuck in the updating state if the upgrade action fails" to "[Fleet] Agent gets stuck in the updating state if the upgrade action fails" Mar 31, 2023
@cmacknz removed the Team:Elastic-Agent-Control-Plane label Mar 31, 2023
@cmacknz added the Team:Fleet label Mar 31, 2023
@elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

@cmacknz transferred this issue from elastic/elastic-agent Mar 31, 2023
@dikshachauhan-qasource

Bug Conversion

Test case is not required, as this particular checkpoint is already covered in the following test case:

Thanks

@jlind23 (Contributor) commented Apr 5, 2023

@kpollich @juliaElastic I raised this one's priority as it is causing many SDHs.

@renzedj commented Apr 5, 2023

I'm also still experiencing the same issue I reported in elastic/kibana#2343, except that the first 5-10 agents in the batch are updating, where they weren't before. E.g., I'll schedule an upgrade for 200+ agents and only the first 7 will upgrade. They upgrade as expected if I schedule agents for "immediately."

@juliaElastic (Contributor) commented Apr 17, 2023

I could reproduce this issue locally; this is what I'm seeing in the agent doc. The retrying status was added here: elastic/fleet-server#1887
I'm going to look into why the retry is not happening and clearing the upgrade_started_at field.

          "last_checkin_message": "Running",
          "last_checkin_status": "online",
          "last_checkin": "2023-04-17T13:10:29Z",
          "upgrade_started_at": "2023-04-17T12:55:46.563Z",
          "upgraded_at": null,
          "upgrade_status": "retrying"

When looking at the agent logs, I don't see anything related to the retry, only a warning log with the failure.

14:55:47.746 [elastic_agent][info] Upgrading agent
14:55:47.826 [elastic_agent][warn] DNS lookup failure "test-artifacts.elastic.co": lookup test-artifacts.elastic.co: no such host

I'm trying to add more agent logs to see what's going on.
@elastic/elastic-agent-control-plane Hey, can anyone help investigate why the retry is not happening on the agent side?

EDIT: added some logging and scheduleRetry is being called.

17:15:26.571 [elastic_agent][info] Upgrading agent
17:15:27.700 [elastic_agent][error] Unable to download artifact
17:15:27.700 [elastic_agent][info] Schedule retry
17:20:30.793 [elastic_agent][info] Upgrading agent
17:20:32.066 [elastic_agent][error] Unable to download artifact
17:20:32.067 [elastic_agent][info] Schedule retry

Checking the agent this morning, I found that it went back to Healthy overnight. It seems the retry logic worked and eventually reached a max retry, though it took a few hours.


          "last_checkin_message": "Running",
          "last_checkin_status": "online",
          "last_checkin": "2023-04-18T06:42:18Z",
          "upgrade_started_at": null,
          "upgraded_at": null,
          "upgrade_status": "failed"

According to the defaultRetryConfig, the retries should finish in about 2 hours, so I am not sure why it took half a day.

I'll assign this back to the elastic-agent repo as the issue seems to be on the agent side. cc @pierrehilbert

@juliaElastic transferred this issue from elastic/kibana Apr 18, 2023
@juliaElastic (Contributor) commented Apr 18, 2023

I repeated the upgrade test this morning with the invalid download source, and this time the retries finished in about 2 hours and the agent went back to Healthy. Not sure if the previous delay was a one-off, but the behavior seems inconsistent.

06:55:46.992 INF marking agent upgrade as retrying ecs.version=1.6.0 fleet.access.apikey.id=UcLIj4cBBj2648jJnHXy fleet.agent.id=9fae3c0e-0998-4b69-b1a0-dbefe801fa0c http.request.id=01GY9JBR0BZBJ9ME0WNJ9VDZYM nEvents=1 retry_attempt=1 server.address= service.name=fleet-server
07:00:51.506 INF marking agent upgrade as retrying ecs.version=1.6.0 fleet.access.apikey.id=UcLIj4cBBj2648jJnHXy fleet.agent.id=9fae3c0e-0998-4b69-b1a0-dbefe801fa0c http.request.id=01GY9JN18VB5Q4ETV9D9G0NPKV nEvents=1 retry_attempt=2 server.address= service.name=fleet-server
07:05:54.345 INF marking agent upgrade as retrying ecs.version=1.6.0 fleet.access.apikey.id=UcLIj4cBBj2648jJnHXy fleet.agent.id=9fae3c0e-0998-4b69-b1a0-dbefe801fa0c http.request.id=01GY9JY98V8E8TC1086H23QRR3 nEvents=1 retry_attempt=3 server.address= service.name=fleet-server
07:15:57.303 INF marking agent upgrade as retrying ecs.version=1.6.0 fleet.access.apikey.id=UcLIj4cBBj2648jJnHXy fleet.agent.id=9fae3c0e-0998-4b69-b1a0-dbefe801fa0c http.request.id=01GY9KGP248AHWJCR7V87Y0YW2 nEvents=1 retry_attempt=4 server.address= service.name=fleet-server
07:31:01.500 INF marking agent upgrade as retrying ecs.version=1.6.0 fleet.access.apikey.id=UcLIj4cBBj2648jJnHXy fleet.agent.id=9fae3c0e-0998-4b69-b1a0-dbefe801fa0c http.request.id=01GY9MC8PFFSMX016JCKFHGQH8 nEvents=1 retry_attempt=5 server.address= service.name=fleet-server
08:05:17.359 INF marking agent upgrade as retrying ecs.version=1.6.0 fleet.access.apikey.id=UcLIj4cBBj2648jJnHXy fleet.agent.id=9fae3c0e-0998-4b69-b1a0-dbefe801fa0c http.request.id=01GY9PB11M8TJYGKYEYR23PEK1 nEvents=1 retry_attempt=6 server.address= service.name=fleet-server
09:05:21.192 INF marking agent upgrade as failed, agent logs contain failure message ecs.version=1.6.0 fleet.access.apikey.id=UcLIj4cBBj2648jJnHXy fleet.agent.id=9fae3c0e-0998-4b69-b1a0-dbefe801fa0c http.request.id=01GY9SS0A3G8RKQ3JJFQPS7MYW nEvents=1 retry_attempt=-1 server.address= service.name=fleet-server

@amolnater-qasource When you do the upgrade and wait 2 hours, does the agent still remain in Updating? We can look at the FS logs to see if the retries are happening in your case.
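
For a rough sense of where the "about 2 hours" comes from, here is a small sketch that simply sums a backoff schedule like the one the retry_attempt timestamps above suggest (roughly 5, 5, 10, 15, 30, and 60 minutes). These durations are read off the logs, not the literal defaultRetryConfig values.

package main

import (
    "fmt"
    "time"
)

func main() {
    // Approximate waits between upgrade retries, inferred from the fleet-server
    // log timestamps above; not the exact defaultRetryConfig values.
    waits := []time.Duration{
        5 * time.Minute, 5 * time.Minute, 10 * time.Minute,
        15 * time.Minute, 30 * time.Minute, 60 * time.Minute,
    }
    var elapsed time.Duration
    for i, w := range waits {
        elapsed += w
        fmt.Printf("retry_attempt=%d roughly %v after the first failure\n", i+1, elapsed)
    }
    // Prints ~2h5m in total, which matches the agent clearing Updating about
    // 2 hours after the upgrade was triggered.
    fmt.Printf("total: %v\n", elapsed)
}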

@blakerouse (Contributor)

@juliaElastic Thanks for the detailed analysis of what the Elastic Agent is doing in this situation. Unless I am reading your analysis incorrectly, it seems that the Elastic Agent is working as designed. I think the confusion is the amount of time that the retry can take: almost 2 hours before the final try is performed and the Updating state is cleared.

I see a couple of possible solutions:

  1. Do nothing and keep it as it is: 2 hours for it to fail.
  2. Catch the DNS lookup error and don't mark it as an error to retry, so the upgrade fails quickly (see the sketch below).
  3. Lower the retry configuration so it doesn't take so long.

I don't think any one of these is a perfect solution; more visibility into what is happening on the agent during an upgrade might be a better path forward. Adding the ability for the Elastic Agent to say that it's going to retry, why it needs to retry, and when it will perform the retry would improve the visibility of what is happening.
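
To make option 2 above concrete, here is a minimal sketch, assuming the downloader surfaces the standard library's *net.DNSError, of how a hard "no such host" failure could be treated as non-retryable while timeouts and temporary DNS problems keep retrying. The isRetryableDownloadError name is hypothetical, not an elastic-agent function, and even a "not found" answer can be transient, as discussed below.

package main

import (
    "errors"
    "fmt"
    "net"
)

// isRetryableDownloadError is an illustrative classifier: a definitive
// "no such host" answer is treated as non-retryable so the upgrade fails
// quickly, while timeouts and temporary DNS failures keep the existing
// retry behavior.
func isRetryableDownloadError(err error) bool {
    var dnsErr *net.DNSError
    if errors.As(err, &dnsErr) {
        if dnsErr.IsTimeout || dnsErr.IsTemporary {
            return true // likely transient, worth retrying
        }
        if dnsErr.IsNotFound {
            return false // the host does not resolve at all; fail the upgrade quickly
        }
    }
    return true // default to the current behavior for everything else
}

func main() {
    _, err := net.LookupHost("test-artifacts.elastic.co") // the invalid source from this issue
    fmt.Println("retryable:", isRetryableDownloadError(err))
}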

@juliaElastic (Contributor) commented Apr 18, 2023

I think it would make sense to categorize some errors, like the DNS lookup error, as non-retryable, as in your option 2.

Also, improving the visibility of the retrying status would help. There is already a related improvement raised: https://github.com/elastic/ingest-dev/issues/1621

@blakerouse (Contributor)

> I think it would make sense to categorize some errors, like the DNS lookup error, as non-retryable, as in your option 2.

That error could be transient, for example if DNS is temporarily down, so marking it as non-retryable is still questionable. Another possibility is that a new DNS entry has not fully propagated and a retry in a few minutes would succeed.

@WiegerElastic

> @juliaElastic Thanks for the detailed analysis of what the Elastic Agent is doing in this situation. Unless I am reading your analysis incorrectly, it seems that the Elastic Agent is working as designed. I think the confusion is the amount of time that the retry can take: almost 2 hours before the final try is performed and the Updating state is cleared.
>
> I see a couple of possible solutions:
>
> 1. Do nothing and keep it as it is: 2 hours for it to fail.
> 2. Catch the DNS lookup error and don't mark it as an error to retry, so the upgrade fails quickly.
> 3. Lower the retry configuration so it doesn't take so long.
>
> I don't think any one of these is a perfect solution; more visibility into what is happening on the agent during an upgrade might be a better path forward. Adding the ability for the Elastic Agent to say that it's going to retry, why it needs to retry, and when it will perform the retry would improve the visibility of what is happening.

If the decision is made to go with option 1, I think the 2-hour window should be reflected in the Agent (upgrade) documentation, the Agent upgrade UI, or both.

Having worked through multiple bumpy upgrade processes with Agent in the past year, I would very much appreciate such a sanity check while upgrading.

@cmacknz (Member) commented Apr 19, 2023

We need to make it so that the UI shows that the upgrade is retrying, including when the next retry will occur and why the previous attempt failed. This fix can be included as part of the upgrade state reporting redesign linked earlier.

DNS errors can be transient and are a perfect example of a retryable error that the agent action retry mechanism should be resolving automatically. We do not want the user to have to intervene here. The problem here is that the UI doesn't tell the user what is happening or what is wrong.

In the short term it might be best to add more info-level logging to indicate that the agent is actually retrying. I don't think we have a way to communicate the retry back to the UI; as far as the UI knows today, the agent upgrade is still pending a final result.

We can add something like what Julia added for debugging:

17:15:26.571 [elastic_agent][info] Upgrading agent
17:15:27.700 [elastic_agent][error] Unable to download artifact
17:15:27.700 [elastic_agent][info] Schedule retry
17:20:30.793 [elastic_agent][info] Upgrading agent
17:20:32.066 [elastic_agent][error] Unable to download artifact
17:20:32.067 [elastic_agent][info] Schedule retry

The scheduled retry log above should include the time of the next scheduled retry and how many retries are remaining.
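
As an illustration of what that richer "Schedule retry" line could carry, here is a sketch; logScheduledRetry and its parameters are placeholders, not actual elastic-agent symbols.

package main

import (
    "errors"
    "log"
    "time"
)

// logScheduledRetry sketches the richer retry message suggested above: it
// includes the cause, the time of the next attempt, and how many retries
// remain. The function and its parameters are illustrative only.
func logScheduledRetry(actionID string, cause error, nextAttempt, remaining int, wait time.Duration) {
    log.Printf("upgrade action %s failed: %v; retry %d scheduled for %s (%d retries remaining)",
        actionID, cause, nextAttempt, time.Now().Add(wait).Format(time.RFC3339), remaining)
}

func main() {
    logScheduledRetry(
        "fleet-action-id", // placeholder action ID
        errors.New(`DNS lookup failure "test-artifacts.elastic.co": no such host`),
        2, 4, 10*time.Minute,
    )
}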

@cmacknz (Member) commented Apr 19, 2023

We should also include the upgrade retry mechanism in the Fleet documentation in a more obvious way, as suggested by Wieger.

@cmacknz added the Team:Elastic-Agent label and removed the Team:Fleet label Apr 19, 2023
@amolnater-qasource (Author)

Hi @juliaElastic

We have revalidated this on the latest 8.7.1 BC2 Kibana cloud environment and found it still reproducible.

Observations:

  • The agent is still stuck in the updating state even 2 hours after the upgrade action failed.

Agent Logs:
elastic-agent-diagnostics-2023-04-19T06-48-23Z-00.zip

Hosted Fleet Server Logs:
elastic-agent-diagnostics-2023-04-19T06-49-10Z-00.zip

Build details:
VERSION: 8.7.1 BC2
BUILD: 61192
COMMIT: e33260d70181feecc89dc4f623c694856de380d8

Screenshot: (attached)

Please let us know if anything else is required from our end.
Thanks!

@juliaElastic (Contributor)

In the latest diagnostics, I only see one log indicating that the agent moved to the retrying state.

{"log.level":"info","@timestamp":"2023-04-19T04:48:42.368Z","message":"marking agent upgrade as retrying","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"ecs.version":"1.6.0","fleet.agent.id":"f9292ab4-b2fd-4ce8-a77f-65f382ad4d49","fleet.access.apikey.id":"OIjXl4cBJg7YE9ZcGuYA","retry_attempt":1,"service.name":"fleet-server","http.request.id":"01GYBXFSCN629S74HJ4V8YWR16","nEvents":1,"ecs.version":"1.6.0"}

@amolnater-qasource (Author)

Hi Team,

We have revalidated this issue on the latest 8.9.0 BC4 Kibana cloud environment and found it fixed.

Observations:

  • The agent goes Unhealthy and doesn't get stuck in the updating state if the upgrade action fails.
  • After some time it returns to Healthy, allowing the agent upgrade to be restarted.

Screenshot: (attached)

Build details:
VERSION: 8.9.0 BC4
BUILD: 64661
COMMIT: ddf0c1972e43898b6890ddb38f4c016e96538239

Hence we are closing this issue and marking it as QA:Validated.

Thanks!

@amolnater-qasource added the QA:Validated label Jul 17, 2023