Possible to not ACK upgrade action #760

blakerouse · 2022-07-21T14:04:56Z

Seems it is possible that an upgrade action can result in the Elastic Agent successfully upgrading, but never fully ACKing the action.

Seems to happen when an Elastic Agent is having connectivity issues with Fleet Server on successful upgrade. I have seen this issue report in an SDH and worked around the issue, but we need to track down how and why this happens. The upgrade process needs to be full proof and this seems to be an issue that needs to be tracked down and fixed.

For confirmed bugs, please report:

Version: 7.14.3 -> 7.14.4
Operating System: All Supported

Known causes:

Bulker swallows errors from elasticsearch when the error is specific to the document fleet-server#1894

joshdover · 2022-08-24T11:53:01Z

Agreed we should identify the root cause. I also opened this PR which could serve as a stop-gap to mark the agent as upgraded when we see that the version number changed: elastic/fleet-server#1731

joshdover · 2022-08-26T15:55:26Z

@scunningham and I took a look at this today and produced this summary of how Elastic Agent and Horde's implementation of action acking work as well as some some potential Fleet Server issues.

Fleet Server accepts ack requests with multiple action results in the same request
If writing any of those to ES fails, Fleet Server will mark that single action result as failed and still attempt the other action results. It will return non-200 response here and set the errors key as true.

Elastic Agent appears to ignore the HTTP status code here

elastic-agent/internal/pkg/fleetapi/ack_cmd.go

Line 114 in be3c48e

resp, err := e.client.Send(ctx, "POST", ap, nil, nil, bytes.NewBuffer(b))

but will reject the entire ack batch if there is no actions key:

elastic-agent/internal/pkg/fleetapi/ack_cmd.go

Lines 138 to 140 in be3c48e

    
           if ackResponse.Action != "acks" { 
        
           	return nil, client.ExtractError(bytes.NewReader(body)) 
        
           }

If the response contains errors: true, Agent will iterate through the list of action responses and re-queue any where the status code on the individual ack is >= 400:

elastic-agent/internal/pkg/fleetapi/acker/lazy/lazy_acker.go

Lines 99 to 112 in 9c0cb45

    
           if f.retrier != nil && resp != nil && resp.Errors { 
        
           	f.log.Error("lazy acker: partially failed ack batch") 
        
           	failed := make([]fleetapi.Action, 0) 
        
           	for i, res := range resp.Items { 
        
           		if res.Status >= http.StatusBadRequest { 
        
           			if i < len(actions) { 
        
           				failed = append(failed, actions[i]) 
        
           			} 
        
           		} 
        
           	} 
        
           	if len(failed) > 0 { 
        
           		f.log.Infof("lazy acker: partially failed ack batch, enqueue for retry: %#v", failed) 
        
           		f.retrier.Enqueue(failed) 
        
           	}

Horde doesn't look at the response body at all and will just retry the entire request, note that it does not do any batching so this is probably fine: https://github.com/elastic/horde/blob/6196b58f7cca06a0d7bf1d8bfb1ec1474cb2aa08/internal/pkg/drone/action.go#L116-L119

There's potentially an issue where Fleet Server:

Writes an action result to the index (we use explicit IDs)
A subsequent write operation fails, such as updating the agent document to make the agent as upgraded
We return an error to the client, so the client retries acking all of the actions in the batch
On the retry, the action ack fails due to their being a conflict on the actions that were already ack'd successfully. This is due to the fact that we use an explicit ID. We should probably change these to use partial updates to emulate an upsert. Now the agent document update never gets executed.

joshdover · 2022-09-12T13:09:03Z

@blakerouse I think this is where could use the most help from you right now. I'm working with @pjbertels to get a cluster where this has been reproduced in a scale test with debug logs enbled. In those cases we've been seeing agents POST'ing a successful 200 ack, but the ES document isn't really updated correctly.

joshdover · 2022-09-22T10:12:16Z

@blakerouse does it make sense to close this issue now or do we suspect there are other cuases basied elastic/fleet-server#1894?

blakerouse · 2022-09-22T14:43:47Z

@joshdover I am looking to keep this open until I can get all the 75k agents to upgrade successfully in scale testing. Once it has been confirmed it is completely fixed then I will close this.

I don't want to take an early victory just yet. Other than adding the version conflict retries and the proper return of errors, no other fix has been implemented. The change might expose other errors that need to be handled/fixed.

pjbertels · 2022-09-23T20:26:03Z

Had agents stuck in updating again in latest 8.5 SNAPSHOT 07ad0494. Upgraded 10021 and 849 were stuck in updating. All 10021 had the upgraded version of 8.2.1.

[16:08:59]           trend=online is ascending                                                                                                                                     perf_lib.py:164
                     status=FleetAgentStatus(total=10021, inactive=10021, online=9172, error=0, offline=0, updating=849, other=10870, events=0, doc_id=None, run_id=None,                         
                     timestamp=None, kuery='local_metadata.elastic.agent.version : 8.2.1', cluster_name=None)                                                                                     
                     prev_status=FleetAgentStatus(total=9459, inactive=10021, online=8610, error=0, offline=0, updating=849, other=10870, events=0, doc_id=None, run_id=None,                     
                     timestamp=None, kuery='local_metadata.elastic.agent.version : 8.2.1', cluster_name=None)                                                                                     
[16:14:20] ERROR    poll_agents_status reached max_iterations

blakerouse · 2022-09-27T14:03:19Z

The snapshot used in @pjbertels previous comment did not include the fix in elastic/fleet-server#1894.

The latest snapshot build does contain the fix, but BC2 and snapshot for 8.5 is current broken because of https://github.com/elastic/ingest-dev/issues/1282. This is preventing the ability to actually test the fix performed in elastic/fleet-server#1894.

Hopefully a quick fix to https://github.com/elastic/ingest-dev/issues/1282 is near, so we can get back to testing upgrade.

blakerouse · 2022-09-27T16:01:05Z

@pjbertels Was able to successfully get latest snapshot running with 8.5 and perform an upgrade. Resulted in ~5k still being stuck in upgrading. Digging into the logs now.

blakerouse · 2022-09-27T16:54:57Z

I was unable to find any errors being reported by Fleet Server. From the logs all upgrade actions where dispatched to the Elastic Agent's and all upgrade actions where ack'd.

The issue is that update request to elasticsearch to update the document seems to be only updating 1 of the fields.

	now := time.Now().UTC().Format(time.RFC3339)
	doc := bulk.UpdateFields{
		dl.FieldUpgradeStartedAt: nil,
		dl.FieldUpgradedAt:       now,
		dl.FieldUpgradeStatus:    "completed",
	}

	body, err := doc.Marshal()
	if err != nil {
		return errors.Wrap(err, "handleUpgrade marshal")
	}

	if err = ack.bulk.Update(ctx, dl.FleetAgents, agent.Id, body, bulk.WithRefresh(), bulk.WithRetryOnConflict(3)); err != nil {
		return errors.Wrap(err, "handleUpgrade update")
	}

Only the upgraded_at field is set (I have correlated the log message for the ACK to match the timestamp that gets set in this field). The other 2 fields upgrade_started_at and upgrade_status are not updated on the document.

At the moment I am in need of help from elasticsearch team, because I believe this could be an issue internal to elasticsearch.

jlind23 · 2022-09-27T17:46:51Z

Hey @dakrone we definitely need Elasticsearch team help here - If not you, would you please help us find someone to help on the issue above ^ which seems to potentially be an Elasticsearch bug

dakrone · 2022-09-27T18:01:51Z

I'm having a hard time parsing exactly what this is doing (being not at all familiar with agent's internals), from what I can guess, it's doing something like this:

PUT /myindex/_doc/1
{
  "foo":"bar"
}
  
POST /myindex/_update/1
{
  "doc": {
    "upgrade_started_at": null,
    "upgraded_at": 1234,
    "upgrade_status": "completed"
  }
}
  
GET /myindex/_doc/1

Which works fine for Elasticsearch.

If that's incorrect, can someone translate this into an Elasticsearch reproduction so I can try this out?

joshdover · 2022-09-27T18:28:30Z

I was unable to find any errors being reported by Fleet Server. From the logs all upgrade actions where dispatched to the Elastic Agent's and all upgrade actions where ack'd.

What's the confidence level that we're not getting any errors returned by ES? My understanding is that we may still be dropping logs in Cloud, but that may not matter to prove this depending on the logging.

+1 a simple repro would be helpful. Could we use two tight loops of concurrent updates similar to what is happening with upgrade acks and the checkin 30s heartbeat to emulate this?

blakerouse · 2022-09-27T18:50:26Z

@dakrone That is not what Fleet Server is doing. It is using _bulk to perform actions in bulk through its bulker system. So what actually would be happening is the following:

POST _bulk
{ "update" : {"_id" : "023e83e4-e3c7-46f6-9424-adb03e7ba980", "retry_on_conflict": "3", "_index" : ".fleet-agents"} }
{ "doc" : {"upgrade_started_at": null,"upgraded_at": 1234,"upgrade_status": "completed"} }

This bulk update action can occur with any other action that needs to occur with Fleet Server at an interval of 250ms or when the bulk buffer is full.

I am seeing that it works sometimes to update all the fields and in other cases it is not updating all fields.

dakrone · 2022-09-27T19:52:47Z

Okay, even doing something like this, which creates and then updates the document:

DELETE /myindex
  
GET /myindex/_doc/1
  
POST _bulk
{ "index" : {"_id" : "2", "retry_on_conflict": "3", "_index" : "myindex"} }
{ "upgraded_at": 1234 }
{ "update" : {"_id" : "2", "retry_on_conflict": "3", "_index" : "myindex"} }
{ "doc" : {"upgrade_started_at": null,"upgraded_at": 1234,"upgrade_status": "completed"} }

GET /myindex/_doc/2

Succeeds locally. Perhaps a reproduction with the tight loops would be necessary to see if this can be reproduced (if it can, please open an issue in the Elasticsearch repository).

nicpenning · 2022-09-27T21:38:01Z

Hey y'all, I stumbled across this issue because today in production I have 1 agent that is stuck "Updating". (It has been like this for a little while. This device likes to roam on and off the network very frequently)

Rebooting, stopping/starting service or assigning to other policies will not get the status away from Updating.

Furthermore, the Agent is at 8.4.0 (Kibana and .\elastic-agent.exe status say so) and I can't upgrade it because it is grayed out in some spots and in others where it tries to let me upgrade fails immediately. So I can't tell the health of this agent via the UI, however, the .\elastic-agent.exe status reveals everything is healthy.

I will be happy to share anything that might help the situation. I am sure removing and re-enrolling the agent will do the trick but I very much do not like doing that to fix these issues.

nicpenning · 2022-09-28T00:07:08Z

Additional detail: 8.4.1 on-premise stack, virtualized on modern Windows Server OS.

nicpenning · 2022-09-28T13:27:26Z

Further info:

When doing a search on the specific endpoint to .fleet-agent here is the redacted result:

{
        "_index": ".fleet-agents-7",
        "_id": "redacted-b019-4c98-8e99-d99b5a76cb93",
        "_score": 5.936952,
        "_source": {
          "access_api_key_id": "redacted",
          "action_seq_no": [
            623
          ],
          "active": true,
          "agent": {
            "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
            "version": "8.4.0"
          },
          "enrolled_at": "2022-08-19T20:37:19Z",
          "local_metadata": {
            "elastic": {
              "agent": {
                "build.original": "8.4.0 (build: 56a002dc5c96d3a79decbf28bb7994cf49f3a4da at 2022-08-19 22:35:28 +0000 UTC)",
                "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
                "log_level": "info",
                "snapshot": false,
                "upgradeable": true,
                "version": "8.4.0"
              }
            },
            "host": {
              "architecture": "x86_64",
              "hostname": "redacted",
              "id": "redacted-5f69-411e-b954-e603dc586f26",
              "ip": [
                "169.254.53.207/16",
                "172.31.32.1/20",
                "169.254.73.42/16",
                "169.254.124.209/16",
                "192.168.209.1/24",
                "192.168.30.1/24",
                "redacted/25",
                "::1/128",
                "127.0.0.1/8",
                "172.25.240.1/20",
                "172.27.32.1/20",
                "172.29.208.1/20"
              ],
              "mac": [
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted"
              ],
              "name": "redacted"
            },
            "os": {
              "family": "windows",
              "full": "Windows 10 Enterprise(10.0)",
              "kernel": "10.0.22000.978 (WinBuild.160101.0800)",
              "name": "Windows 10 Enterprise",
              "platform": "windows",
              "version": "10.0"
            }
          },
          "policy_id": "eb2a4e10-e2f3-11ea-8e88-e52736baa840",
          "type": "PERMANENT",
          "default_api_key_id": "redacted",
          "policy_output_permissions_hash": "5af4296cdba1c61fb98221f7996c8f80f6745cf00d521d8c3ecada41cefcc6c2",
          "default_api_key": "redacted:redacted",
          "updated_at": "2022-09-28T13:17:42Z",
          "last_checkin_status": "online",
          "last_checkin": "2022-09-28T13:17:40Z",
          "policy_revision_idx": 31,
          "policy_coordinator_idx": 1,
          "upgrade_started_at": "2022-09-02T14:45:21.497Z",
          "upgraded_at": null
        }

Two things in this output stick out to me (however, I have no clue how these should read). The first is that the device is Windows 11 but the data shows Windows 10. The second, which I think might be the issue, is that the upgraded_at is set to null, even though it started the upgrade on 9/2. Perhaps the null in the upgraded_at is why it is shown stuck at updating?

Please me know if this is helpful info or just annoying and I can buzz off :)

joshdover · 2022-09-28T13:37:20Z

Hi there @nicpenning, you do indeed seem to be encountering this same issue. I recommend trying this as a workaround, you can run this from Dev Tools in Kibana:

POST /api/fleet/agents/<id>/upgrade
{"force": true, "version": "8.4.1"}

nicpenning · 2022-09-28T13:39:33Z

So, I went ahead and updated the upgraded_at to 2022-09-02T14:55:21.497Z and it fixed the issue of showing updated.

POST .fleet-agents/_update/3e9b94cb-b019-4c98-8e99-d99b5a76cb93
{
  "doc": {
    "access_api_key_id": "redacted",
    "action_seq_no": [
      623
    ],
    "active": true,
    "agent": {
      "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
      "version": "8.4.0"
    },
    "enrolled_at": "2022-08-19T20:37:19Z",
    "local_metadata": {
      "elastic": {
        "agent": {
          "build.original": "8.4.0 (build: 56a002dc5c96d3a79decbf28bb7994cf49f3a4da at 2022-08-19 22:35:28 +0000 UTC)",
          "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
          "log_level": "info",
          "snapshot": false,
          "upgradeable": true,
          "version": "8.4.0"
        }
      },
      "host": {
        "architecture": "x86_64",
        "hostname": "redacted",
        "id": "redacted-5f69-411e-b954-e603dc586f26",
        "ip": [
          "169.254.53.207/16",
          "172.31.32.1/20",
          "169.254.73.42/16",
          "169.254.124.209/16",
          "192.168.209.1/24",
          "192.168.30.1/24",
          "redacted",
          "::1/128",
          "127.0.0.1/8",
          "172.25.240.1/20",
          "172.27.32.1/20",
          "172.29.208.1/20"
        ],
        "mac": [
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted"
        ],
        "name": "redacted"
      },
      "os": {
        "family": "windows",
        "full": "Windows 10 Enterprise(10.0)",
        "kernel": "10.0.22000.978 (WinBuild.160101.0800)",
        "name": "Windows 10 Enterprise",
        "platform": "windows",
        "version": "10.0"
      }
    },
    "policy_id": "eb2a4e10-e2f3-11ea-8e88-e52736baa840",
    "type": "PERMANENT",
    "default_api_key_id": "redacted",
    "policy_output_permissions_hash": "5af4296cdba1c61fb98221f7996c8f80f6745cf00d521d8c3ecada41cefcc6c2",
    "default_api_key": "redacted:redacted",
    "updated_at": "2022-09-28T13:23:32Z",
    "last_checkin_status": "online",
    "last_checkin": "2022-09-28T13:23:31Z",
    "policy_revision_idx": 31,
    "policy_coordinator_idx": 1,
    "upgrade_started_at": "2022-09-02T14:45:21.497Z",
    "upgraded_at": "2022-09-02T14:55:21.497Z"
  }
}

I will try the upgrade again.

nicpenning · 2022-09-28T13:53:08Z

I got the Upgrading Agent notification after I pushed the upgrade so that single field seemed to have been the culprit.

Furthermore, the upgrade went on to work. Not sure if any of this info helps your case but figured I would share. You have helped me take a look under the hood to see what could be broken and a simple field update was a work around. Hopefully this will be addressed to help improve the reliability of such awesome capabilities.

jlind23 · 2022-09-28T14:07:35Z

@joshdover @nicpenning then I assume that the following issue might resolve the UI display - elastic/kibana#139704

joshdover · 2022-09-28T14:10:44Z

@joshdover @nicpenning then I assume that the following issue might resolve the UI display - elastic/kibana#139704

No I don't think that will fix the issue. The issue is that Fleet Server (or maybe ES) is losing the writes to make the upgrade as completed. This is why updating the upgraded_at field makes the agent healthy again in the UI.

We still need to get to the bottom of where and why these updates are not being persisted in Elasticsearch when the agent ack's it's upgrade.

amitkanfer · 2022-09-29T14:27:29Z

@nicpenning @blakerouse , i had a short slack conversation with @tomcallahan and he suggested to enable audit logging in ES while this is happening. Did we try that?

dakrone · 2022-09-29T15:04:42Z

Better than audit logging would be the indexing slowlog: https://www.elastic.co/guide/en/elasticsearch/reference/8.4/index-modules-slowlog.html#index-slow-log

PUT /<your-index>/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "0ms",
  "index.indexing.slowlog.source": true
}

blakerouse · 2022-09-30T14:15:54Z

Going to keep this open until we get confirmation that all 75k upgrade during scale testing.

joshdover · 2022-09-30T14:25:55Z

What if a customer tries to upgrade the same set of agents at the same time from two different Kibanas.

What if Kibana crashes after setting some or all of the status fields, but before rolling out the action.

We may need to move away from a model where both Kibana and Fleet Server are writing to the .fleet-agents index. Ideally, it's only a single Fleet Server instance writing to to each agent document at a time, while the agent is connected to it. Kibana could interact with the agents via pushing new documents to the actions index which Fleet Server then consumes and uses to update the agent state.

We would need a way to easily query agents based on their pending, delivered, and ack'd actions so we could determine if an agent has a pending upgrade and dispaly this in the UI.

This wouldn't completely solve the problem though if you consider actions like unenrollment and policy reassignments. There are still potentially multiple actors that would want to update a single .fleet-agents document at the same time.

juliaElastic · 2022-09-30T14:31:50Z

It'd be good to get an isolated backport of that change to 8.4.x too in case there's another patch release

Raised elastic/kibana#142364

AndersonQ · 2022-09-30T15:50:20Z

We may need to move away from a model where both Kibana and Fleet Server are writing to the .fleet-agents index.

I agree 100%, please let's avoid more than one application writing to the same document.

amitkanfer · 2022-10-02T06:24:27Z

please let's avoid more than one application writing to the same document.

+1

joshdover · 2022-10-03T09:32:12Z

FYI I'm working on a draft proposal that will encompass addressing these issues

pjbertels · 2022-10-05T19:55:32Z

Tests at 50K and 75K confirm we no longer have agents stuck in 'updating' at the end of an upgrade.
Tested with image_tag:8.5.0-58624073-SNAPSHOT.

joshdover · 2022-10-06T09:42:05Z

Great news. @blakerouse I think we should go ahead and backport elastic/fleet-server#1896 to 8.4 in case we have a an additional 8.4 release

amitkanfer · 2022-10-08T11:23:09Z

@nicpenning can you please email me at [email protected] ?

nicpenning · 2022-10-12T19:44:30Z

FYI: Upgrading from 8.4.1 to 8.4.3 - We had 1 asset that got stuck updating and had a null value for upgraded_at out of the 15 we upgraded.

Will this be resolved in 8.5.0 or was it supposed to be resolved in 8.4.3?

nicpenning · 2022-10-12T19:54:48Z

POST /api/fleet/agents/<id>/upgrade
{"force": true, "version": "8.4.1"}

I tried this except for using 8.4.3 but I got back no handler found error:
{ "error": "no handler found for uri [/api/fleet/agents/redacted/upgrade?pretty=true] and method [POST]" }

jlind23 · 2022-10-13T04:52:20Z

@nicpenning this will be resolved in 8.5.0 as it was not backported to 8.4.3

dmgeurts · 2023-02-03T10:44:05Z

I am still having the same issue upgrading agents from 8.6.0 to 8.6.1. Initially, most agents were stuck in updating, then after a policy update most returned healthy, but some are still stubbornly stating updating even though the version listed is 8.6.1.

nicpenning · 2023-02-03T11:59:16Z

Confirmed we saw this in 8.6.0 -> 8.6.1.

2 out of 24 agents successfully upgraded, all others were "stuck" updating. I will note that this only occurred on the scheduled upgrades by setting it to an hour into the future and setting the time frame for over an hour. Any upgrades set to execute immediately worked without any hesitation.

In the end, we had to force the upgrades via the API to get upgraded. I believe then that there is an issue with the scheduled aspect of the upgrades.

dmgeurts · 2023-02-03T12:29:40Z

I updated the agents manually from the CLI, but that didn't resolve the updating status either. I'm not sure how to reset the status now, the agents appear happy.

How did you force things via the API?

nicpenning · 2023-02-03T12:45:04Z

elastic/kibana#135539 (comment)

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.6.1" }

dmgeurts · 2023-02-03T12:58:23Z

elastic/kibana#135539 (comment)

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.6.1" }

This changed the status to unhealthy until I restarted the agent and then it went back to updating. I should stress that our agents are already upgraded to 8.6.1.

Agent log:

12:55:52.674 elastic_agent [elastic_agent][warn] Skipping addition to action-queue, issue gathering start time from action id af1e004c-cf91-4de1-96ed-477d74c4cc53: action has no start time
12:55:52.689 elastic_agent [elastic_agent][info] Upgrading agent
12:55:52.689 elastic_agent [elastic_agent][error] Unable to clean downloads before update
12:55:55.698 elastic_agent [elastic_agent][info] download from https://internal.server.com/elk/beats/elastic-agent/elastic-agent-8.6.1-linux-x86_64.tar.gz completed in 2 seconds @ 191.9MBps
12:55:55.705 elastic_agent [elastic_agent][info] download from https://internal.server.com/elk/beats/elastic-agent/elastic-agent-8.6.1-linux-x86_64.tar.gz.sha512 completed in Less than a second @ +InfYBps
12:55:57.770 elastic_agent [elastic_agent][error] Failed to unpack upgrade artifact
Showing entries until Feb 3, 12:55:57

nicpenning · 2023-02-03T13:21:22Z

Okay, the difference with my agents is that they were still reporting 8.5.3 and stuck updating. After the force upgrade with API they upgraded with no issues and got out of the Updating state.

dmgeurts · 2023-02-03T13:27:07Z

I probably should have been a bit more patient, and not jumped the gun on manually updating the agents.

For the agents reporting to the fleet server I can unenroll and redo the installation, but having to do this for an increasing number of agents each time we upgrade is far from ideal. And as for the Fleet server itself, this is not so easy, it's also stuck in updating.

cmacknz · 2023-02-03T14:37:46Z

I am not sure if these reports are all experiencing the same root cause, without agent logs it is impossible to tell. It is unlikely to be the same bug described in this issue, just a similar symptom.

12:55:57.770 elastic_agent [elastic_agent][error] Failed to unpack upgrade artifact

This is coming from the code below. Is there an accompany error.message with the reason why the unpack failed? This means the agent successfully downloaded the artifact but could not untar or unzip it.

elastic-agent/internal/pkg/agent/application/upgrade/step_unpack.go

Lines 31 to 41 in 973af90

    
           if runtime.GOOS == windows { 
        
           	hash, err = unzip(u.log, archivePath) 
        
           } else { 
        
           	hash, err = untar(u.log, version, archivePath) 
        
           } 
        
           if err != nil { 
        
           	u.log.Errorw("Failed to unpack upgrade artifact", "error.message", err, "version", version, "file.path", archivePath, "hash", hash) 
        
           	return "", err 
        
           }

dmgeurts · 2023-02-03T14:59:51Z

The error is given after a reboot, I'm not sure if the agent or the fleet server would for some reason try to rerun the upgrade. Or if it fails because it's trying to reinstall the same version.

There are no further accompanying error messages. The rest of the messages show a normal restart of the agent and everything eventually returning healthy. Yet Fleet will still show the agent as updating.

I'm tempted to leave things as they are with 15 healthy agents and 8 agents showing as updating but healthy when checked locally. And then see how things go with 8.6.2.

cmacknz · 2023-02-03T15:24:16Z

I'm tempted to leave things as they are with 15 healthy agents and 8 agents showing as updating but healthy when checked locally.

Just to clarify, did the agents that show as updating complete the upgrade to the desired version successfully? If so, then I suspect this is a synchronization issue in the UI and it would be fine to leave the agents like this although we would like to know why this is. I'll need to find someone from the Fleet UI team to think about why this might be the case if this is what is happening.

dmgeurts · 2023-02-03T15:32:14Z

No, they rolled back to 8.6.0 after a while but remained as updating in Fleet. I couldn't retry as Fleet was stuck in updating for these agents. So I did what at the time I thought was the only thing I could do and did sudo elastic-agent update 8.6.1 from the agent machine. This showed me that the upgrade itself wasn't the issue as it completed fine. Fleet would then detect and show the new version, but the status never changed from updating to anything else. I wasn't aware I could do a force update from the Elastic console.

TBH, I must have had two different issues across my agents as some agents came back as healthy when I update the policy applied to them.

What is certain though is that it would be nice to have a way to force Fleet to check the status of the agent and if the agent is happy and on the right version then surely the status should be updated in Fleet.

As for the Fleet server itself showing as updating, I'm at a loss on how to fix that other than removing the Fleet server and reinstalling the agent there.

cmacknz · 2023-02-03T16:14:49Z

OK thanks we have one bug we just discovered where if an upgrade fails and then rolls back, the local agent downloads directory needs to be restored or the next upgrade will fail: #2222. The data/elastic-agent-$hash/downloads directory needs to be restored before retrying.

That would not have been the original problem but it may be affecting you now. If you can get us the logs for these failed upgrades we would love to look at them to confirm what is happening. If you have failed upgrades you will have multiple data/elastic-agent-$hash/logs directories at the root of the agent installation directory, grabbing those logs directories is what we would want. If you run the normal diagnostics command it will only get the logs for the currently running agent, not the one that failed to upgrade. Obtaining failed upgrade logs needs to be done manually for now, we are planning to automate this process eventually.

dmgeurts · 2023-02-03T16:19:46Z

I found that by issuing the upgrade command locally, the upgrade did succeed.

There's only one data/elastic-agent-$hash/logs folder:

user@host02:~$ sudo ls -al /opt/Elastic/Agent/data/
total 16
drwxrwx--- 4 root root 4096 Feb  2 23:35 .
drwxrwx--- 4 root root 4096 Feb  3 14:00 ..
-rw------- 1 root root    0 Jan 16 11:55 agent.lock
drwxr-xr-x 6 root root 4096 Feb  3 14:00 elastic-agent-b8553c
drwxr-x--- 2 root root 4096 Feb  3 14:00 tmp

user@host02:~$ sudo ls -al /opt/Elastic/Agent/data/elastic-agent-b8553c/logs/
total 22552
drwx------ 2 root root     4096 Feb  3 14:00 .
drwxr-xr-x 6 root root     4096 Feb  3 14:00 ..
-rw------- 1 root root 10485694 Feb  3 07:32 elastic-agent-20230202.ndjson
-rw------- 1 root root  4338164 Feb  3 17:17 elastic-agent-20230203-1.ndjson
-rw------- 1 root root  8242770 Feb  3 14:00 elastic-agent-20230203.ndjson

So my current state is that the agent upgrade succeeded, but the Fleet server remains unaware of it.

joshdover · 2023-02-06T10:06:08Z

So my current state is that the agent upgrade succeeded, but the Fleet server remains unaware of it.

@dmgeurts thanks for reporting this. I agree it's unlikely to be the same root cause. A few questions that may help us narrow down this part of the problem:

What exactly is the state in the UI? Is Fleet UI showing the correct version number after the manual upgrade but still shows the agent as "updating"?
Do you have any agent logs from this time period that you could share? We added an automatic retry on upgrade failures in 8.6.0. It may have emitted some helpful information in the logs.
Do you have a Logstash output configured on the agent policy you're using?

dmgeurts · 2023-02-06T10:53:55Z

What exactly is the state in the UI? Is Fleet UI showing the correct version number after the manual upgrade but still shows the agent as "updating"?

Yes, Fleet shows the version number as 8.6.1 and the status as updating.

Do you have any agent logs from this time period that you could share? We added an automatic retry on upgrade failures in 8.6.0. It may have emitted some helpful information in the logs.

Aren't the logs deleted when an upgrade succeeds? This is my assumption as the logs folder is inside the agent folder and I only have one agent folder under /opt/Elastic/Agent/data/, which is for v8.6.1. Or would they be the initial logs of the agent?

Do you have a Logstash output configured on the agent policy you're using?

No, I haven't created a LogStash output yet. All agents are still using the default (direct to Elasticsearch).

blakerouse added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jul 21, 2022

cmacknz added the 8.6-candidate label Aug 31, 2022

joshdover assigned blakerouse Sep 12, 2022

jlind23 mentioned this issue Sep 19, 2022

Elastic Agent can get stuck in Updating state on Fleet upgrade elastic/fleet-server#828

Closed

blakerouse mentioned this issue Sep 21, 2022

Bulker swallows errors from elasticsearch when the error is specific to the document elastic/fleet-server#1894

Closed

juliaElastic mentioned this issue Sep 30, 2022

changed order of update agents and actions elastic/kibana#142364

Merged

cmacknz mentioned this issue Oct 3, 2022

Elastic-agent gets stuck in Updating - How to troubleshoot and where to look for logs? elastic/fleet-server#1034

Closed

pjbertels closed this as completed Oct 5, 2022

Possible to not ACK upgrade action #760

Possible to not ACK upgrade action #760

Comments

blakerouse commented Jul 21, 2022 • edited by joshdover Loading

joshdover commented Aug 24, 2022

joshdover commented Aug 26, 2022 • edited Loading

joshdover commented Sep 12, 2022

joshdover commented Sep 22, 2022

blakerouse commented Sep 22, 2022

pjbertels commented Sep 23, 2022

blakerouse commented Sep 27, 2022 • edited Loading

blakerouse commented Sep 27, 2022

blakerouse commented Sep 27, 2022

jlind23 commented Sep 27, 2022

dakrone commented Sep 27, 2022

joshdover commented Sep 27, 2022

blakerouse commented Sep 27, 2022

dakrone commented Sep 27, 2022

nicpenning commented Sep 27, 2022 • edited Loading

nicpenning commented Sep 28, 2022

nicpenning commented Sep 28, 2022 • edited Loading

joshdover commented Sep 28, 2022

nicpenning commented Sep 28, 2022

nicpenning commented Sep 28, 2022

jlind23 commented Sep 28, 2022

joshdover commented Sep 28, 2022

amitkanfer commented Sep 29, 2022

dakrone commented Sep 29, 2022 • edited Loading

blakerouse commented Sep 30, 2022

joshdover commented Sep 30, 2022

juliaElastic commented Sep 30, 2022

AndersonQ commented Sep 30, 2022

amitkanfer commented Oct 2, 2022

joshdover commented Oct 3, 2022

pjbertels commented Oct 5, 2022

joshdover commented Oct 6, 2022

amitkanfer commented Oct 8, 2022

nicpenning commented Oct 12, 2022

nicpenning commented Oct 12, 2022 • edited Loading

jlind23 commented Oct 13, 2022

dmgeurts commented Feb 3, 2023

nicpenning commented Feb 3, 2023

dmgeurts commented Feb 3, 2023

nicpenning commented Feb 3, 2023 • edited Loading

dmgeurts commented Feb 3, 2023 • edited Loading

nicpenning commented Feb 3, 2023

dmgeurts commented Feb 3, 2023

cmacknz commented Feb 3, 2023

dmgeurts commented Feb 3, 2023

cmacknz commented Feb 3, 2023

dmgeurts commented Feb 3, 2023 • edited Loading

cmacknz commented Feb 3, 2023 • edited Loading

dmgeurts commented Feb 3, 2023 • edited Loading

joshdover commented Feb 6, 2023

dmgeurts commented Feb 6, 2023

blakerouse commented Jul 21, 2022 •

edited by joshdover

Loading

joshdover commented Aug 26, 2022 •

edited

Loading

blakerouse commented Sep 27, 2022 •

edited

Loading

nicpenning commented Sep 27, 2022 •

edited

Loading

nicpenning commented Sep 28, 2022 •

edited

Loading

dakrone commented Sep 29, 2022 •

edited

Loading

nicpenning commented Oct 12, 2022 •

edited

Loading

nicpenning commented Feb 3, 2023 •

edited

Loading

dmgeurts commented Feb 3, 2023 •

edited

Loading

dmgeurts commented Feb 3, 2023 •

edited

Loading

cmacknz commented Feb 3, 2023 •

edited

Loading

dmgeurts commented Feb 3, 2023 •

edited

Loading