
Possible to not ACK upgrade action #760

Closed · 1 task done
blakerouse opened this issue Jul 21, 2022 · 58 comments
Labels: 8.6-candidate, bug (Something isn't working), Project:FleetScaling, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

@blakerouse (Contributor) commented Jul 21, 2022

It seems possible for an upgrade action to result in the Elastic Agent successfully upgrading, but never fully ACKing the action.

This seems to happen when an Elastic Agent has connectivity issues with Fleet Server after a successful upgrade. I have seen this issue reported in an SDH and worked around it, but we need to track down how and why this happens. The upgrade process needs to be foolproof, so this needs to be tracked down and fixed.

For confirmed bugs, please report:

  • Version: 7.14.3 -> 7.14.4
  • Operating System: All Supported

Known causes:

blakerouse added the bug (Something isn't working) and Team:Elastic-Agent-Control-Plane labels on Jul 21, 2022
@joshdover (Contributor)

Agreed we should identify the root cause. I also opened this PR which could serve as a stop-gap to mark the agent as upgraded when we see that the version number changed: elastic/fleet-server#1731
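
For reference, a rough sketch of the kind of stop-gap check described above; the names and the reconcile function are hypothetical illustrations, not the code in elastic/fleet-server#1731.

// Hypothetical sketch: mark an agent as upgraded when its reported version
// changes while an upgrade is still recorded as in progress and no ACK arrived.
// AgentDoc and reconcileUpgrade are illustrative names, not fleet-server APIs.
package main

import "fmt"

type AgentDoc struct {
    Version          string // version stored in .fleet-agents
    UpgradeStartedAt string // non-empty while an upgrade is in flight
    UpgradedAt       string // set when the upgrade ACK is processed
}

// reconcileUpgrade returns true if the agent should be treated as upgraded
// even though no ACK was recorded, because the reported version changed.
func reconcileUpgrade(doc *AgentDoc, reportedVersion, now string) bool {
    if doc.UpgradeStartedAt != "" && doc.UpgradedAt == "" && reportedVersion != doc.Version {
        doc.Version = reportedVersion
        doc.UpgradeStartedAt = ""
        doc.UpgradedAt = now
        return true
    }
    return false
}

func main() {
    doc := &AgentDoc{Version: "7.14.3", UpgradeStartedAt: "2022-07-21T10:00:00Z"}
    if reconcileUpgrade(doc, "7.14.4", "2022-07-21T10:05:00Z") {
        fmt.Println("agent marked as upgraded without an explicit ACK")
    }
}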

@joshdover (Contributor) commented Aug 26, 2022

@scunningham and I took a look at this today and produced this summary of how Elastic Agent and Horde's implementation of action acking work, as well as some potential Fleet Server issues.

  • Fleet Server accepts ack requests with multiple action results in the same request
  • If writing any of those to ES fails, Fleet Server will mark that single action result as failed and still attempt the other action results. It will return a non-200 response here and set the errors key to true.
  • Elastic Agent appears to ignore the HTTP status code here:
    resp, err := e.client.Send(ctx, "POST", ap, nil, nil, bytes.NewBuffer(b))
    but will reject the entire ack batch if the response's action field is not "acks":
    if ackResponse.Action != "acks" {
        return nil, client.ExtractError(bytes.NewReader(body))
    }
  • If the response contains errors: true, the Agent will iterate through the list of action responses and re-queue any whose individual ack status code is >= 400 (see the response-shape sketch after this list):
    if f.retrier != nil && resp != nil && resp.Errors {
        f.log.Error("lazy acker: partially failed ack batch")
        failed := make([]fleetapi.Action, 0)
        for i, res := range resp.Items {
            if res.Status >= http.StatusBadRequest {
                if i < len(actions) {
                    failed = append(failed, actions[i])
                }
            }
        }
        if len(failed) > 0 {
            f.log.Infof("lazy acker: partially failed ack batch, enqueue for retry: %#v", failed)
            f.retrier.Enqueue(failed)
        }
    }
  • Horde doesn't look at the response body at all and will just retry the entire request; note that it does not do any batching, so this is probably fine: https://github.com/elastic/horde/blob/6196b58f7cca06a0d7bf1d8bfb1ec1474cb2aa08/internal/pkg/drone/action.go#L116-L119
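
For readers following along, the fields referenced above (Action, Errors, Items[].Status) imply an ack response shape roughly like the sketch below. This is a hedged illustration: only those three fields appear in the quoted agent code, and anything else here is an assumption.

// Rough sketch of the ack response shape implied by the agent code quoted above.
// Only Action, Errors, and Items[].Status appear in that code; the Message field
// is an assumption for illustration.
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

type AckItem struct {
    Status  int    // per-action HTTP-style status
    Message string // assumed field
}

type AckResponse struct {
    Action string // expected to be "acks"
    Errors bool   // true if any item failed
    Items  []AckItem
}

func main() {
    body := []byte(`{"action":"acks","errors":true,"items":[{"status":200},{"status":503}]}`)
    var resp AckResponse
    if err := json.Unmarshal(body, &resp); err != nil {
        panic(err)
    }
    // Mirror the lazy acker's rule: re-queue items with status >= 400.
    for i, it := range resp.Items {
        if it.Status >= http.StatusBadRequest {
            fmt.Printf("action %d would be re-queued (status %d)\n", i, it.Status)
        }
    }
}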

There's potentially an issue where Fleet Server:

  • Writes an action result to the index (we use explicit IDs)
  • A subsequent write operation fails, such as updating the agent document to mark the agent as upgraded
  • We return an error to the client, so the client retries acking all of the actions in the batch
  • On the retry, the action ack fails due to there being a conflict on the actions that were already ack'd successfully, because we use an explicit ID. We should probably change these to use partial updates to emulate an upsert (see the sketch after this list). As a result, the agent document update never gets executed.
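
A minimal sketch of the retry behaviour described in the last bullet, and of the upsert-style fix suggested there. It assumes a throwaway index name and a local Elasticsearch at http://localhost:9200 with security disabled; it is an illustration, not Fleet Server code. A create with an explicit _id returns 409 when the whole ack batch is replayed, while a partial update with doc_as_upsert is idempotent.

// Sketch only: compares explicit-ID creates (409 on retry) with
// doc_as_upsert partial updates (safe to replay). The index name
// "test-action-results" and the local ES URL are assumptions.
package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func send(method, url, body string) int {
    req, err := http.NewRequest(method, url, bytes.NewBufferString(body))
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    return resp.StatusCode
}

func main() {
    const es = "http://localhost:9200"

    // Explicit-ID create, then a replay of the same ack: the retry conflicts (201, then 409).
    fmt.Println(send("PUT", es+"/test-action-results/_create/ack-1", `{"action_id":"a1"}`))
    fmt.Println(send("PUT", es+"/test-action-results/_create/ack-1", `{"action_id":"a1"}`))

    // Upsert-style partial update: both attempts succeed (201, then 200).
    upsert := `{"doc":{"action_id":"a1"},"doc_as_upsert":true}`
    fmt.Println(send("POST", es+"/test-action-results/_update/ack-2", upsert))
    fmt.Println(send("POST", es+"/test-action-results/_update/ack-2", upsert))
}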

@joshdover (Contributor)

@blakerouse I think this is where we could use the most help from you right now. I'm working with @pjbertels to get a cluster where this has been reproduced in a scale test with debug logs enabled. In those cases we've been seeing agents POSTing a successful 200 ack, but the ES document isn't actually updated correctly.

@joshdover (Contributor)

@blakerouse does it make sense to close this issue now, or do we suspect there are other causes besides elastic/fleet-server#1894?

@blakerouse (Contributor, Author)

@joshdover I am looking to keep this open until I can get all 75k agents to upgrade successfully in scale testing. Once it has been confirmed that it is completely fixed, I will close this.

I don't want to declare an early victory just yet. Other than adding the version conflict retries and the proper return of errors, no other fix has been implemented. The change might expose other errors that need to be handled/fixed.

@pjbertels (Contributor)

Had agents stuck in updating again in the latest 8.5 SNAPSHOT (07ad0494). Upgraded 10021 agents, and 849 were stuck in updating. All 10021 had the upgraded version of 8.2.1.

[16:08:59]           trend=online is ascending                                                                                                                                     perf_lib.py:164
                     status=FleetAgentStatus(total=10021, inactive=10021, online=9172, error=0, offline=0, updating=849, other=10870, events=0, doc_id=None, run_id=None,                         
                     timestamp=None, kuery='local_metadata.elastic.agent.version : 8.2.1', cluster_name=None)                                                                                     
                     prev_status=FleetAgentStatus(total=9459, inactive=10021, online=8610, error=0, offline=0, updating=849, other=10870, events=0, doc_id=None, run_id=None,                     
                     timestamp=None, kuery='local_metadata.elastic.agent.version : 8.2.1', cluster_name=None)                                                                                     
[16:14:20] ERROR    poll_agents_status reached max_iterations 

@blakerouse (Contributor, Author) commented Sep 27, 2022

The snapshot used in @pjbertels' previous comment did not include the fix in elastic/fleet-server#1894.

The latest snapshot build does contain the fix, but BC2 and the snapshot for 8.5 are currently broken because of https://github.com/elastic/ingest-dev/issues/1282. This is preventing us from actually testing the fix made in elastic/fleet-server#1894.

Hopefully a quick fix to https://github.com/elastic/ingest-dev/issues/1282 is near, so we can get back to testing upgrades.

@blakerouse (Contributor, Author)

@pjbertels was able to successfully get the latest snapshot running with 8.5 and perform an upgrade. It resulted in ~5k agents still being stuck in upgrading. Digging into the logs now.

@blakerouse (Contributor, Author)

I was unable to find any errors being reported by Fleet Server. From the logs, all upgrade actions were dispatched to the Elastic Agents and all upgrade actions were ack'd.

The issue is that the update request to Elasticsearch to update the document seems to be updating only one of the fields.

	now := time.Now().UTC().Format(time.RFC3339)
	doc := bulk.UpdateFields{
		dl.FieldUpgradeStartedAt: nil,
		dl.FieldUpgradedAt:       now,
		dl.FieldUpgradeStatus:    "completed",
	}

	body, err := doc.Marshal()
	if err != nil {
		return errors.Wrap(err, "handleUpgrade marshal")
	}

	if err = ack.bulk.Update(ctx, dl.FleetAgents, agent.Id, body, bulk.WithRefresh(), bulk.WithRetryOnConflict(3)); err != nil {
		return errors.Wrap(err, "handleUpgrade update")
	}

Only the upgraded_at field is set (I have correlated the log message for the ACK to match the timestamp that gets set in this field). The other two fields, upgrade_started_at and upgrade_status, are not updated on the document.

At the moment I need help from the Elasticsearch team, because I believe this could be an issue internal to Elasticsearch.

@jlind23 (Contributor) commented Sep 27, 2022

Hey @dakrone, we definitely need the Elasticsearch team's help here. If not you, would you please help us find someone to look at the issue above, which seems to potentially be an Elasticsearch bug?

@dakrone (Member) commented Sep 27, 2022

I'm having a hard time parsing exactly what this is doing (being not at all familiar with the agent's internals). From what I can guess, it's doing something like this:

PUT /myindex/_doc/1
{
  "foo":"bar"
}
  
POST /myindex/_update/1
{
  "doc": {
    "upgrade_started_at": null,
    "upgraded_at": 1234,
    "upgrade_status": "completed"
  }
}
  
GET /myindex/_doc/1

Which works fine for Elasticsearch.

If that's incorrect, can someone translate this into an Elasticsearch reproduction so I can try this out?

@joshdover (Contributor)

I was unable to find any errors being reported by Fleet Server. From the logs, all upgrade actions were dispatched to the Elastic Agents and all upgrade actions were ack'd.

What's the confidence level that we're not getting any errors returned by ES? My understanding is that we may still be dropping logs in Cloud, but that may not matter to prove this depending on the logging.

+1, a simple repro would be helpful. Could we use two tight loops of concurrent updates, similar to what is happening with the upgrade acks and the 30s checkin heartbeat, to emulate this? A rough sketch follows.
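
To make that suggestion concrete, here is a hedged sketch of such a reproduction: one tight loop replaying the upgrade-ack style partial update and one replaying a checkin-style heartbeat update against the same document via _bulk. It assumes a local Elasticsearch at http://localhost:9200 with security disabled and a throwaway test-agents index; it is not Fleet Server code.

// Sketch only: two concurrent tight loops of _bulk partial updates against
// one document, to see whether any of the upgrade fields get lost. The index
// name, document ID, URL, and iteration counts are arbitrary assumptions.
package main

import (
    "bytes"
    "fmt"
    "net/http"
    "sync"
    "time"
)

const es = "http://localhost:9200"

func bulkUpdate(doc string) {
    body := `{ "update": { "_index": "test-agents", "_id": "agent-1", "retry_on_conflict": 3 } }` + "\n" +
        `{ "doc": ` + doc + ` }` + "\n"
    resp, err := http.Post(es+"/_bulk", "application/x-ndjson", bytes.NewBufferString(body))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()
}

func main() {
    // Seed the document so the partial updates have something to apply to.
    resp, err := http.Post(es+"/test-agents/_doc/agent-1", "application/json", bytes.NewBufferString(`{"last_checkin":null}`))
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    var wg sync.WaitGroup
    wg.Add(2)
    go func() { // upgrade-ack style partial update
        defer wg.Done()
        for i := 0; i < 500; i++ {
            bulkUpdate(`{"upgrade_started_at":null,"upgraded_at":"2022-09-27T00:00:00Z","upgrade_status":"completed"}`)
        }
    }()
    go func() { // checkin heartbeat, tightened from 30s into a loop
        defer wg.Done()
        for i := 0; i < 500; i++ {
            bulkUpdate(fmt.Sprintf(`{"last_checkin":"%s"}`, time.Now().UTC().Format(time.RFC3339)))
        }
    }()
    wg.Wait()
    fmt.Println("done; fetch test-agents/_doc/agent-1 and check whether all three upgrade fields are present")
}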

@blakerouse (Contributor, Author)

@dakrone That is not what Fleet Server is doing. It is using _bulk to perform actions in bulk through its bulker system. So what would actually be happening is the following:

POST _bulk
{ "update" : {"_id" : "023e83e4-e3c7-46f6-9424-adb03e7ba980", "retry_on_conflict": "3", "_index" : ".fleet-agents"} }
{ "doc" : {"upgrade_started_at": null,"upgraded_at": 1234,"upgrade_status": "completed"} }

This bulk update action can be combined with any other actions that Fleet Server needs to perform; the bulker flushes at an interval of 250ms or when the bulk buffer is full (see the flush-pattern sketch below).

I am seeing that sometimes all the fields are updated and in other cases not all fields are updated.
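
To illustrate the flush behaviour described above (not Fleet Server's actual bulker), here is a minimal sketch of the pattern: operations are queued and flushed either on a fixed interval or when the buffer fills, whichever comes first. The interval, buffer size, and names are illustrative.

// Minimal sketch of an interval-or-full flush loop, for illustration only.
package main

import (
    "fmt"
    "time"
)

func runBulker(ops <-chan string, flushInterval time.Duration, maxOps int) {
    ticker := time.NewTicker(flushInterval)
    defer ticker.Stop()

    buf := make([]string, 0, maxOps)
    flush := func(reason string) {
        if len(buf) == 0 {
            return
        }
        // In Fleet Server this would be a single _bulk request to Elasticsearch.
        fmt.Printf("flushing %d ops (%s)\n", len(buf), reason)
        buf = buf[:0]
    }

    for {
        select {
        case op, ok := <-ops:
            if !ok {
                flush("shutdown")
                return
            }
            buf = append(buf, op)
            if len(buf) >= maxOps {
                flush("buffer full")
            }
        case <-ticker.C:
            flush("interval")
        }
    }
}

func main() {
    ops := make(chan string)
    go func() {
        for i := 0; i < 10; i++ {
            ops <- fmt.Sprintf("update-%d", i)
            time.Sleep(40 * time.Millisecond)
        }
        close(ops)
    }()
    runBulker(ops, 250*time.Millisecond, 4)
}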

@dakrone (Member) commented Sep 27, 2022

Okay, even doing something like this, which creates and then updates the document:

DELETE /myindex
  
GET /myindex/_doc/1
  
POST _bulk
{ "index" : {"_id" : "2", "retry_on_conflict": "3", "_index" : "myindex"} }
{ "upgraded_at": 1234 }
{ "update" : {"_id" : "2", "retry_on_conflict": "3", "_index" : "myindex"} }
{ "doc" : {"upgrade_started_at": null,"upgraded_at": 1234,"upgrade_status": "completed"} }

GET /myindex/_doc/2

Succeeds locally. Perhaps a reproduction with the tight loops would be necessary to see if this can be reproduced (if it can, please open an issue in the Elasticsearch repository).

@nicpenning (Contributor) commented Sep 27, 2022

Hey y'all, I stumbled across this issue because today in production I have 1 agent that is stuck "Updating". (It has been like this for a little while. This device likes to roam on and off the network very frequently)

Rebooting, stopping/starting service or assigning to other policies will not get the status away from Updating.

Furthermore, the Agent is at 8.4.0 (Kibana and .\elastic-agent.exe status both say so) and I can't upgrade it: the option is grayed out in some spots, and in others where it does let me try, the upgrade fails immediately. So I can't tell the health of this agent via the UI; however, .\elastic-agent.exe status reveals everything is healthy.

[screenshots of the Fleet UI and elastic-agent status output]

I will be happy to share anything that might help the situation. I am sure removing and re-enrolling the agent will do the trick but I very much do not like doing that to fix these issues.

@nicpenning (Contributor)

Additional detail: 8.4.1 on-premise stack, virtualized on modern Windows Server OS.

@nicpenning (Contributor) commented Sep 28, 2022

Further info:

When doing a search for the specific endpoint in .fleet-agents, here is the redacted result:

{
        "_index": ".fleet-agents-7",
        "_id": "redacted-b019-4c98-8e99-d99b5a76cb93",
        "_score": 5.936952,
        "_source": {
          "access_api_key_id": "redacted",
          "action_seq_no": [
            623
          ],
          "active": true,
          "agent": {
            "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
            "version": "8.4.0"
          },
          "enrolled_at": "2022-08-19T20:37:19Z",
          "local_metadata": {
            "elastic": {
              "agent": {
                "build.original": "8.4.0 (build: 56a002dc5c96d3a79decbf28bb7994cf49f3a4da at 2022-08-19 22:35:28 +0000 UTC)",
                "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
                "log_level": "info",
                "snapshot": false,
                "upgradeable": true,
                "version": "8.4.0"
              }
            },
            "host": {
              "architecture": "x86_64",
              "hostname": "redacted",
              "id": "redacted-5f69-411e-b954-e603dc586f26",
              "ip": [
                "169.254.53.207/16",
                "172.31.32.1/20",
                "169.254.73.42/16",
                "169.254.124.209/16",
                "192.168.209.1/24",
                "192.168.30.1/24",
                "redacted/25",
                "::1/128",
                "127.0.0.1/8",
                "172.25.240.1/20",
                "172.27.32.1/20",
                "172.29.208.1/20"
              ],
              "mac": [
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted"
              ],
              "name": "redacted"
            },
            "os": {
              "family": "windows",
              "full": "Windows 10 Enterprise(10.0)",
              "kernel": "10.0.22000.978 (WinBuild.160101.0800)",
              "name": "Windows 10 Enterprise",
              "platform": "windows",
              "version": "10.0"
            }
          },
          "policy_id": "eb2a4e10-e2f3-11ea-8e88-e52736baa840",
          "type": "PERMANENT",
          "default_api_key_id": "redacted",
          "policy_output_permissions_hash": "5af4296cdba1c61fb98221f7996c8f80f6745cf00d521d8c3ecada41cefcc6c2",
          "default_api_key": "redacted:redacted",
          "updated_at": "2022-09-28T13:17:42Z",
          "last_checkin_status": "online",
          "last_checkin": "2022-09-28T13:17:40Z",
          "policy_revision_idx": 31,
          "policy_coordinator_idx": 1,
          "upgrade_started_at": "2022-09-02T14:45:21.497Z",
          "upgraded_at": null
        }
}

Two things in this output stick out to me (however, I have no clue how these should read). The first is that the device is Windows 11 but the data shows Windows 10. The second, which I think might be the issue, is that the upgraded_at is set to null, even though it started the upgrade on 9/2. Perhaps the null in the upgraded_at is why it is shown stuck at updating?

Please let me know if this is helpful info or just annoying, and I can buzz off :)

@joshdover (Contributor)

Hi there @nicpenning, you do indeed seem to be encountering this same issue. I recommend trying this as a workaround; you can run it from Dev Tools in Kibana:

POST /api/fleet/agents/<id>/upgrade
{"force": true, "version": "8.4.1"}

@nicpenning (Contributor)

So, I went ahead and updated upgraded_at to 2022-09-02T14:55:21.497Z and it fixed the issue of the agent showing as updating.

POST .fleet-agents/_update/3e9b94cb-b019-4c98-8e99-d99b5a76cb93
{
  "doc": {
    "access_api_key_id": "redacted",
    "action_seq_no": [
      623
    ],
    "active": true,
    "agent": {
      "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
      "version": "8.4.0"
    },
    "enrolled_at": "2022-08-19T20:37:19Z",
    "local_metadata": {
      "elastic": {
        "agent": {
          "build.original": "8.4.0 (build: 56a002dc5c96d3a79decbf28bb7994cf49f3a4da at 2022-08-19 22:35:28 +0000 UTC)",
          "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
          "log_level": "info",
          "snapshot": false,
          "upgradeable": true,
          "version": "8.4.0"
        }
      },
      "host": {
        "architecture": "x86_64",
        "hostname": "redacted",
        "id": "redacted-5f69-411e-b954-e603dc586f26",
        "ip": [
          "169.254.53.207/16",
          "172.31.32.1/20",
          "169.254.73.42/16",
          "169.254.124.209/16",
          "192.168.209.1/24",
          "192.168.30.1/24",
          "redacted",
          "::1/128",
          "127.0.0.1/8",
          "172.25.240.1/20",
          "172.27.32.1/20",
          "172.29.208.1/20"
        ],
        "mac": [
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted"
        ],
        "name": "redacted"
      },
      "os": {
        "family": "windows",
        "full": "Windows 10 Enterprise(10.0)",
        "kernel": "10.0.22000.978 (WinBuild.160101.0800)",
        "name": "Windows 10 Enterprise",
        "platform": "windows",
        "version": "10.0"
      }
    },
    "policy_id": "eb2a4e10-e2f3-11ea-8e88-e52736baa840",
    "type": "PERMANENT",
    "default_api_key_id": "redacted",
    "policy_output_permissions_hash": "5af4296cdba1c61fb98221f7996c8f80f6745cf00d521d8c3ecada41cefcc6c2",
    "default_api_key": "redacted:redacted",
    "updated_at": "2022-09-28T13:23:32Z",
    "last_checkin_status": "online",
    "last_checkin": "2022-09-28T13:23:31Z",
    "policy_revision_idx": 31,
    "policy_coordinator_idx": 1,
    "upgrade_started_at": "2022-09-02T14:45:21.497Z",
    "upgraded_at": "2022-09-02T14:55:21.497Z"
  }
}

[screenshot of the agent status in Fleet]

I will try the upgrade again.

@nicpenning (Contributor)

I got the Upgrading Agent notification after I pushed the upgrade, so that single field seems to have been the culprit.

Furthermore, the upgrade went on to work. Not sure if any of this info helps your case, but I figured I would share. You have helped me take a look under the hood to see what could be broken, and a simple field update was a workaround. Hopefully this will be addressed to help improve the reliability of such awesome capabilities.

[screenshot of the upgrade notification]

@jlind23 (Contributor) commented Sep 28, 2022

@joshdover @nicpenning then I assume that the following issue might resolve the UI display - elastic/kibana#139704

@joshdover (Contributor)

@joshdover @nicpenning then I assume that the following issue might resolve the UI display - elastic/kibana#139704

No, I don't think that will fix the issue. The issue is that Fleet Server (or maybe ES) is losing the writes that mark the upgrade as completed. This is why updating the upgraded_at field makes the agent healthy again in the UI.

We still need to get to the bottom of where and why these updates are not being persisted in Elasticsearch when the agent acks its upgrade.

@amitkanfer (Contributor)

@nicpenning @blakerouse, I had a short Slack conversation with @tomcallahan and he suggested enabling audit logging in ES while this is happening. Did we try that?

@dakrone (Member) commented Sep 29, 2022

Better than audit logging would be the indexing slowlog: https://www.elastic.co/guide/en/elasticsearch/reference/8.4/index-modules-slowlog.html#index-slow-log

PUT /<your-index>/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "0ms",
  "index.indexing.slowlog.source": true
}

@blakerouse (Contributor, Author)

Going to keep this open until we get confirmation that all 75k agents upgrade during scale testing.

@joshdover (Contributor)

  1. What if a customer tries to upgrade the same set of agents at the same time from two different Kibanas?

  2. What if Kibana crashes after setting some or all of the status fields, but before rolling out the action?

We may need to move away from a model where both Kibana and Fleet Server are writing to the .fleet-agents index. Ideally, only a single Fleet Server instance writes to each agent document at a time, while the agent is connected to it. Kibana could interact with the agents by pushing new documents to the actions index, which Fleet Server then consumes and uses to update the agent state.

We would need a way to easily query agents based on their pending, delivered, and ack'd actions so we could determine if an agent has a pending upgrade and display this in the UI.

This wouldn't completely solve the problem, though, if you consider actions like unenrollment and policy reassignments. There are still potentially multiple actors that would want to update a single .fleet-agents document at the same time.

@juliaElastic (Contributor)

It'd be good to get an isolated backport of that change to 8.4.x too in case there's another patch release

Raised elastic/kibana#142364

@AndersonQ (Member)

We may need to move away from a model where both Kibana and Fleet Server are writing to the .fleet-agents index.

I agree 100%, please let's avoid more than one application writing to the same document.

@amitkanfer (Contributor)

please let's avoid more than one application writing to the same document.

+1

@joshdover (Contributor)

FYI I'm working on a draft proposal that will encompass addressing these issues

@pjbertels (Contributor)

Tests at 50K and 75K confirm we no longer have agents stuck in 'updating' at the end of an upgrade.
Tested with image_tag:8.5.0-58624073-SNAPSHOT.

@joshdover (Contributor)

Great news. @blakerouse I think we should go ahead and backport elastic/fleet-server#1896 to 8.4 in case we have an additional 8.4 release.

@amitkanfer (Contributor)

@nicpenning can you please email me at [email protected] ?

@nicpenning (Contributor)

FYI: upgrading from 8.4.1 to 8.4.3, we had 1 agent out of the 15 we upgraded that got stuck updating and had a null value for upgraded_at.

Will this be resolved in 8.5.0 or was it supposed to be resolved in 8.4.3?

@nicpenning (Contributor) commented Oct 12, 2022

POST /api/fleet/agents/<id>/upgrade
{"force": true, "version": "8.4.1"}

I tried this, except using 8.4.3, but I got back a "no handler found" error:
{ "error": "no handler found for uri [/api/fleet/agents/redacted/upgrade?pretty=true] and method [POST]" }

@jlind23 (Contributor) commented Oct 13, 2022

@nicpenning this will be resolved in 8.5.0 as it was not backported to 8.4.3

@dmgeurts commented Feb 3, 2023

I am still having the same issue upgrading agents from 8.6.0 to 8.6.1. Initially, most agents were stuck in updating; then, after a policy update, most returned healthy, but some are still stubbornly showing updating even though the version listed is 8.6.1.

@nicpenning (Contributor)

Confirmed we saw this in 8.6.0 -> 8.6.1.

2 out of 24 agents successfully upgraded; all others were "stuck" updating. I will note that this only occurred with scheduled upgrades (scheduled an hour into the future with a time window of over an hour). Any upgrades set to execute immediately worked without any hesitation.

In the end, we had to force the upgrades via the API to get them upgraded. I believe there is an issue with the scheduled aspect of the upgrades.

@dmgeurts commented Feb 3, 2023

I updated the agents manually from the CLI, but that didn't resolve the updating status either. I'm not sure how to reset the status now; the agents appear happy.

How did you force things via the API?

@nicpenning (Contributor) commented Feb 3, 2023

elastic/kibana#135539 (comment)

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.6.1" }

@dmgeurts commented Feb 3, 2023

elastic/kibana#135539 (comment)

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.6.1" }

This changed the status to unhealthy until I restarted the agent and then it went back to updating. I should stress that our agents are already upgraded to 8.6.1.

Agent log:

12:55:52.674 elastic_agent [elastic_agent][warn] Skipping addition to action-queue, issue gathering start time from action id af1e004c-cf91-4de1-96ed-477d74c4cc53: action has no start time
12:55:52.689 elastic_agent [elastic_agent][info] Upgrading agent
12:55:52.689 elastic_agent [elastic_agent][error] Unable to clean downloads before update
12:55:55.698 elastic_agent [elastic_agent][info] download from https://internal.server.com/elk/beats/elastic-agent/elastic-agent-8.6.1-linux-x86_64.tar.gz completed in 2 seconds @ 191.9MBps
12:55:55.705 elastic_agent [elastic_agent][info] download from https://internal.server.com/elk/beats/elastic-agent/elastic-agent-8.6.1-linux-x86_64.tar.gz.sha512 completed in Less than a second @ +InfYBps
12:55:57.770 elastic_agent [elastic_agent][error] Failed to unpack upgrade artifact
Showing entries until Feb 3, 12:55:57

@nicpenning (Contributor)

Okay, the difference with my agents is that they were still reporting 8.5.3 and stuck updating. After the force upgrade with the API, they upgraded with no issues and got out of the Updating state.

@dmgeurts commented Feb 3, 2023

I probably should have been a bit more patient, and not jumped the gun on manually updating the agents.

For the agents reporting to the Fleet Server I can unenroll and redo the installation, but having to do this for an increasing number of agents each time we upgrade is far from ideal. And as for the Fleet Server itself, this is not so easy; it's also stuck in updating.

@cmacknz (Member) commented Feb 3, 2023

I am not sure if these reports are all experiencing the same root cause; without agent logs it is impossible to tell. It is unlikely to be the same bug described in this issue, just a similar symptom.

12:55:57.770 elastic_agent [elastic_agent][error] Failed to unpack upgrade artifact

This is coming from the code below. Is there an accompanying error.message with the reason why the unpack failed? This means the agent successfully downloaded the artifact but could not untar or unzip it.

if runtime.GOOS == windows {
    hash, err = unzip(u.log, archivePath)
} else {
    hash, err = untar(u.log, version, archivePath)
}
if err != nil {
    u.log.Errorw("Failed to unpack upgrade artifact", "error.message", err, "version", version, "file.path", archivePath, "hash", hash)
    return "", err
}

@dmgeurts commented Feb 3, 2023

The error is given after a reboot; I'm not sure if the agent or the Fleet Server would for some reason try to rerun the upgrade, or if it fails because it's trying to reinstall the same version.

There are no further accompanying error messages. The rest of the messages show a normal restart of the agent and everything eventually returning healthy. Yet Fleet will still show the agent as updating.

I'm tempted to leave things as they are with 15 healthy agents and 8 agents showing as updating but healthy when checked locally. And then see how things go with 8.6.2.

@cmacknz (Member) commented Feb 3, 2023

I'm tempted to leave things as they are with 15 healthy agents and 8 agents showing as updating but healthy when checked locally.

Just to clarify, did the agents that show as updating complete the upgrade to the desired version successfully? If so, then I suspect this is a synchronization issue in the UI and it would be fine to leave the agents like this although we would like to know why this is. I'll need to find someone from the Fleet UI team to think about why this might be the case if this is what is happening.

@dmgeurts commented Feb 3, 2023

No, they rolled back to 8.6.0 after a while but remained as updating in Fleet. I couldn't retry, as Fleet was stuck in updating for these agents. So I did what at the time I thought was the only thing I could do and ran sudo elastic-agent upgrade 8.6.1 from the agent machine. This showed me that the upgrade itself wasn't the issue, as it completed fine. Fleet would then detect and show the new version, but the status never changed from updating to anything else. I wasn't aware I could do a force upgrade from the Elastic console.

TBH, I must have had two different issues across my agents, as some agents came back as healthy when I updated the policy applied to them.

What is certain, though, is that it would be nice to have a way to force Fleet to check the status of the agent; if the agent is happy and on the right version, then surely the status should be updated in Fleet.

As for the Fleet server itself showing as updating, I'm at a loss on how to fix that other than removing the Fleet server and reinstalling the agent there.

@cmacknz (Member) commented Feb 3, 2023

OK, thanks. We have one bug we just discovered where, if an upgrade fails and then rolls back, the local agent downloads directory needs to be restored or the next upgrade will fail: #2222. The data/elastic-agent-$hash/downloads directory needs to be restored before retrying.

That would not have been the original problem, but it may be affecting you now. If you can get us the logs for these failed upgrades we would love to look at them to confirm what is happening. If you have failed upgrades you will have multiple data/elastic-agent-$hash/logs directories at the root of the agent installation directory; grabbing those logs directories is what we would want. If you run the normal diagnostics command it will only get the logs for the currently running agent, not the one that failed to upgrade. Obtaining failed upgrade logs needs to be done manually for now; we are planning to automate this process eventually.
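
Until that collection is automated, a rough sketch of gathering the per-hash logs directories into a single archive is below. The /opt/Elastic/Agent install path is the Linux default and an assumption; adjust the paths for your platform.

// Sketch only: zip up every data/elastic-agent-*/logs directory so the logs
// from failed upgrades can be shared. Paths are assumptions for Linux installs.
package main

import (
    "archive/zip"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

func main() {
    out, err := os.Create("agent-upgrade-logs.zip")
    if err != nil {
        panic(err)
    }
    defer out.Close()
    zw := zip.NewWriter(out)
    defer zw.Close()

    dirs, err := filepath.Glob("/opt/Elastic/Agent/data/elastic-agent-*/logs")
    if err != nil {
        panic(err)
    }
    for _, dir := range dirs {
        err := filepath.Walk(dir, func(path string, info os.FileInfo, walkErr error) error {
            if walkErr != nil || info.IsDir() {
                return walkErr
            }
            // Store each file under a path relative to the data directory.
            rel, err := filepath.Rel("/opt/Elastic/Agent/data", path)
            if err != nil {
                return err
            }
            w, err := zw.Create(rel)
            if err != nil {
                return err
            }
            f, err := os.Open(path)
            if err != nil {
                return err
            }
            defer f.Close()
            _, err = io.Copy(w, f)
            return err
        })
        if err != nil {
            panic(err)
        }
        fmt.Println("collected", dir)
    }
}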

@dmgeurts commented Feb 3, 2023

I found that by issuing the upgrade command locally, the upgrade did succeed.

There's only one data/elastic-agent-$hash/logs folder:

user@host02:~$ sudo ls -al /opt/Elastic/Agent/data/
total 16
drwxrwx--- 4 root root 4096 Feb  2 23:35 .
drwxrwx--- 4 root root 4096 Feb  3 14:00 ..
-rw------- 1 root root    0 Jan 16 11:55 agent.lock
drwxr-xr-x 6 root root 4096 Feb  3 14:00 elastic-agent-b8553c
drwxr-x--- 2 root root 4096 Feb  3 14:00 tmp

user@host02:~$ sudo ls -al /opt/Elastic/Agent/data/elastic-agent-b8553c/logs/
total 22552
drwx------ 2 root root     4096 Feb  3 14:00 .
drwxr-xr-x 6 root root     4096 Feb  3 14:00 ..
-rw------- 1 root root 10485694 Feb  3 07:32 elastic-agent-20230202.ndjson
-rw------- 1 root root  4338164 Feb  3 17:17 elastic-agent-20230203-1.ndjson
-rw------- 1 root root  8242770 Feb  3 14:00 elastic-agent-20230203.ndjson

So my current state is that the agent upgrade succeeded, but the Fleet server remains unaware of it.

@joshdover (Contributor)

So my current state is that the agent upgrade succeeded, but the Fleet server remains unaware of it.

@dmgeurts thanks for reporting this. I agree it's unlikely to be the same root cause. A few questions that may help us narrow down this part of the problem:

  • What exactly is the state in the UI? Is Fleet UI showing the correct version number after the manual upgrade but still shows the agent as "updating"?
  • Do you have any agent logs from this time period that you could share? We added an automatic retry on upgrade failures in 8.6.0. It may have emitted some helpful information in the logs.
  • Do you have a Logstash output configured on the agent policy you're using?

@dmgeurts commented Feb 6, 2023

  • What exactly is the state in the UI? Is Fleet UI showing the correct version number after the manual upgrade but still shows the agent as "updating"?

Yes, Fleet shows the version number as 8.6.1 and the status as updating.

  • Do you have any agent logs from this time period that you could share? We added an automatic retry on upgrade failures in 8.6.0. It may have emitted some helpful information in the logs.

Aren't the logs deleted when an upgrade succeeds? This is my assumption as the logs folder is inside the agent folder and I only have one agent folder under /opt/Elastic/Agent/data/, which is for v8.6.1. Or would they be the initial logs of the agent?

  • Do you have a Logstash output configured on the agent policy you're using?

No, I haven't created a LogStash output yet. All agents are still using the default (direct to Elasticsearch).
