[Ingest Manager] Update procedure: rollback on failure #21518

Closed
michalpristas opened this issue Oct 5, 2020 · 14 comments

@michalpristas
Contributor

With the update procedure in place, we need a mechanism for the case when the update procedure doesn't succeed.

For this we need an additional subprocess that is triggered on update and watches several indicators during a grace period:

  • health of agent
  • health of the agent's subprocesses (beats, endpoint)
  • unusual restarts/crashes of the agent or its subprocesses

What counts as healthy is defined at the level of the agent/subprocess and provided to the watcher through some form of API.
If any component turns out to be failing during the grace period, the rollback procedure is initiated, which means:

  • symlinks are switched
  • paths are updated
  • configuration is recovered
  • active.commit.id is updated
  • update marker is removed
  • installation files are removed
  • failing binaries are removed

Food for thought: maybe we should keep the log files of the failing version for possible investigation.

@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@ph ph added v7.11.0 and removed enhancement labels Oct 14, 2020
@michalpristas michalpristas self-assigned this Oct 26, 2020
@ph
Contributor

ph commented Nov 4, 2020

Any progress on this front?

@michalpristas
Contributor Author

Update: I got rollback working on macOS. Killing the agent was detected and the agent was rolled back to the previous version.
TODO: Windows/Linux

@ph
Contributor

ph commented Nov 5, 2020

@EricDavisX I've talked with @michalpristas about how we can include that in end2end testing. Any idea how we should deal with that in QA?

@michalpristas
Contributor Author

michalpristas commented Nov 5, 2020

The scenario I have in mind is as follows:

  • get version X
  • enroll
  • see fleet says version X
  • upgrade
  • see fleet says version Y
  • loop N times
    • get pid
    • kill -9 pid
  • endloop
  • wait for rollback to kick in (sleep 5 s)
  • see fleet says version X

@EricDavisX
Contributor

I'd really like to get the upgrade feature in, based on the 7.10 Agent, as soon as we can, explicitly before we try to check in changes that enhance (and may break) the existing functionality. That ticket is: elastic/e2e-testing#341

@ph I'm not sure what else you may mean? We can talk offline, or could you expand a bit on what your concern is, please?

@ph
Contributor

ph commented Nov 5, 2020

Yes it will be great to have it in++ @michalpristas is looking into it.

@EricDavisX I did not express myself correctly. The AC is "On upgrade error the Agent should rollback". My question is: in manual QA, do we want to trigger that scenario manually to assert the working behavior of that feature? I am not sure how easy it is to trigger based on #21518 (comment)

@EricDavisX
Contributor

Oh, thanks! I assert this can be safely covered without manual UI testing, presuming the code paths in Agent are not OS-specific. That is, if we get the e2e-testing scenario implemented as above, along with the main upgrade test.

I will ask the QA team to document the expectation for roll-back but we can mark it as not-needed manually.

@michalpristas
Contributor Author

Actually, there's quite a lot of OS-specific code, mainly related to monitoring the agent process PID and kicking off the watcher subprocess, which we don't want to die with the agent.

@ph
Contributor

ph commented Nov 6, 2020

@EricDavisX @michalpristas once it's in working shape, can we discuss how we test that feature? I would prefer to have an automated test here, but we only cover a fraction of our supported OSes.

@EricDavisX
Contributor

Happy to discuss anytime.

@EricDavisX
Contributor

@michalpristas Hi, I know you were working on automated tests, did you have a WIP PR you wanted to discuss or a branch to point to? Happy to review and pull and work it with ya.

@ph
Contributor

ph commented Nov 23, 2020

PR is at #22537

@michalpristas
Contributor Author

closing as PR #22537 is merged


4 participants