[Ingest Manager] Update procedure: rollback on failure #21518

michalpristas · 2020-10-05T13:24:09Z

With the update procedure in place, we need a mechanism for the case when the update does not succeed.

For this we need an additional subprocess that is triggered on update and watches several indicators during a grace period:

health of the agent
health of the agent's subprocesses (Beats, Endpoint)
unusual restarts/crashes of the agent or its subprocesses

What counts as healthy is defined at the level of the agent/subprocess and provided to the watcher through some form of API.
If any component turns out to be failing during the grace period, the rollback procedure is initiated, which means:

symlinks are switched
paths are updated
configuration is recovered
active.commit.id is updated
update marker is removed
installation files are removed
failing binaries are removed

Food for thought: maybe we should keep the log files of the failing version for possible investigation.
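
To make the proposal above concrete, here is a minimal sketch of a watcher loop followed by an ordered rollback. It is not the Agent's actual implementation: the names `watch_grace_period`, `rollback`, `WatchResult`, and the parameters `max_restarts` and `poll_interval_s` are assumptions made for illustration, and the health checks are plain callables standing in for whatever API the agent and its subprocesses end up exposing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WatchResult:
    healthy: bool
    reason: str = ""


def watch_grace_period(
    checks: Dict[str, Callable[[], bool]],   # hypothetical: name -> health probe
    restart_count: Callable[[], int],        # hypothetical: restarts seen so far
    grace_period_s: float,
    poll_interval_s: float,
    max_restarts: int,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> WatchResult:
    """Poll every indicator until the grace period ends or one of them fails."""
    deadline = now() + grace_period_s
    while now() < deadline:
        for name, is_healthy in checks.items():
            if not is_healthy():
                return WatchResult(False, f"{name} reported unhealthy")
        if restart_count() > max_restarts:
            return WatchResult(False, "too many restarts or crashes")
        sleep(poll_interval_s)
    return WatchResult(True)


# Step names mirror the list in this issue; the order is taken from that list.
ROLLBACK_STEPS = [
    "switch symlinks",
    "update paths",
    "recover configuration",
    "update active.commit.id",
    "remove update marker",
    "remove installation files",
    "remove failing binaries",
]


def rollback(actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each rollback step in order and return the steps that completed."""
    done = []
    for step in ROLLBACK_STEPS:
        actions[step]()  # an exception here stops the rollback at this step
        done.append(step)
    return done
```

The clock and sleep functions are injected so the loop can be exercised in a test without waiting for a real grace period.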

elasticmachine · 2020-10-05T13:24:10Z

Pinging @elastic/ingest-management (Team:Ingest Management)

ph · 2020-11-04T15:30:22Z

Any progress on this front?

michalpristas · 2020-11-05T14:11:34Z

Update: I got rollback working on macOS. Killing the agent was detected and the agent was rolled back to the previous version.
TODO: Windows/Linux

ph · 2020-11-05T14:16:52Z

@EricDavisX I've talked with @michalpristas about how we can include that in end-to-end testing. Any idea how we should deal with that with QA?

michalpristas · 2020-11-05T14:19:09Z

The scenario I have in mind is as follows:

get version X
enroll
see fleet says version X
upgrade
see fleet says version Y
loop N times
- get pid
- kill -9 pid
endloop
wait for rollback to kick in (sleep 5 s)
see fleet says version X
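
The scenario above could be expressed as a test driver along these lines. This is only a sketch: `AgentHarness` and all of its methods are hypothetical placeholders for whatever the e2e framework provides to install, enroll, upgrade, and query Fleet, and the 5 second wait is taken from the scenario rather than from any measured rollback time.

```python
import time
from typing import Callable, Protocol


class AgentHarness(Protocol):
    """Hypothetical interface an e2e framework would have to implement."""

    def install(self, version: str) -> None: ...
    def enroll(self) -> None: ...
    def reported_version(self) -> str: ...   # what Fleet says the version is
    def upgrade(self, version: str) -> None: ...
    def pid(self) -> int: ...
    def kill(self, pid: int) -> None: ...     # e.g. kill -9 on POSIX


def run_rollback_scenario(
    agent: AgentHarness,
    old: str,
    new: str,
    kills: int,
    wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if Fleet reports the old version again after repeated kills."""
    agent.install(old)
    agent.enroll()
    if agent.reported_version() != old:
        raise RuntimeError("Fleet does not report the initial version")

    agent.upgrade(new)
    if agent.reported_version() != new:
        raise RuntimeError("Fleet does not report the upgraded version")

    # Crash the upgraded agent repeatedly so the watcher sees unusual restarts.
    for _ in range(kills):
        agent.kill(agent.pid())

    sleep(wait_s)  # give the rollback time to kick in
    return agent.reported_version() == old
```

A fixed sleep keeps the sketch short; a real test would more likely poll Fleet with a timeout.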

EricDavisX · 2020-11-05T17:25:15Z

I'd really like to get the upgrade feature in, based on the 7.10 Agent, as soon as we can... explicitly before we try to check in changes that enhance (and may break) the existing functionality. That ticket is: elastic/e2e-testing#341

@ph I'm not sure what else you may mean. We can talk offline, or can you expand a bit on what your concern is, please?

ph · 2020-11-05T19:08:33Z

Yes, it will be great to have it in. ++ @michalpristas is looking into it.

@EricDavisX I did not express myself correctly. The AC is "On upgrade error the Agent should roll back". My question is: in manual QA, do we want to trigger that scenario manually to assert the working behavior of that feature? I am not sure how easy it is to trigger, based on #21518 (comment).

EricDavisX · 2020-11-05T23:00:04Z

Oh, thanks! I assert this can be safely covered without manual UI testing, presuming the code paths in Agent are not OS-specific. That is, if we get the e2e-testing scenario implemented as above, along with the main upgrade test.

I will ask the QA team to document the expectation for rollback, but we can mark manual testing as not needed.

michalpristas · 2020-11-06T09:16:16Z

Actually, there's quite a bit of OS-specific code, mainly related to monitoring the agent process PID and kicking off the watching subprocess, which we don't want to die with the agent.
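
As an illustration of why this part is OS-specific, here is a generic sketch of starting a child process that is detached from its parent. It is not how the Agent does it; `detached_popen_kwargs` and `start_watcher` are hypothetical names, and the approach shown (a new session on POSIX, the `DETACHED_PROCESS` and `CREATE_NEW_PROCESS_GROUP` creation flags on Windows) is one common technique, assumed here for the example.

```python
import subprocess
import sys
from typing import Any, Dict, List

# Win32 process creation flags, spelled out numerically because the
# subprocess module only defines these constants when running on Windows.
DETACHED_PROCESS = 0x00000008
CREATE_NEW_PROCESS_GROUP = 0x00000200


def detached_popen_kwargs(platform: str = sys.platform) -> Dict[str, Any]:
    """Return the Popen keyword arguments that detach a child on this OS."""
    if platform == "win32":
        return {"creationflags": DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP}
    # POSIX: run the child in its own session so signals sent to the
    # parent's process group or terminal do not reach it.
    return {"start_new_session": True}


def start_watcher(cmd: List[str]) -> subprocess.Popen:
    """Start `cmd` detached from the current process, with no inherited stdio."""
    return subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        **detached_popen_kwargs(),
    )
```

Note that a service manager may still stop every process belonging to a service, so detaching alone may not be enough when the agent runs as a service.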

ph · 2020-11-06T12:28:07Z

@EricDavisX @michalpristas once it's in a working state, can we discuss how we test that feature? I would prefer to have automated tests here, but we only cover a fraction of our supported OSes.

EricDavisX · 2020-11-06T15:13:26Z

Happy to discuss anytime.

EricDavisX · 2020-11-18T23:25:26Z

@michalpristas Hi, I know you were working on automated tests. Did you have a WIP PR you wanted to discuss or a branch to point to? Happy to review, pull, and work on it with ya.

ph · 2020-11-23T13:29:39Z

PR is at #22537

michalpristas · 2020-12-10T11:11:04Z

Closing, as PR #22537 is merged.

michalpristas added enhancement Team:Ingest Management labels Oct 5, 2020

ph added v7.11.0 and removed enhancement labels Oct 14, 2020

michalpristas self-assigned this Oct 26, 2020

EricDavisX mentioned this issue Dec 8, 2020

[Elastic-Agent] Testing over Agent Roll-back on Failure #22993

Closed

michalpristas closed this as completed Dec 10, 2020
