
[Ingest Manager] Support for upgrade rollback #22537

Merged
merged 43 commits into elastic:master
Dec 9, 2020

Conversation

michalpristas
Contributor

@michalpristas michalpristas commented Nov 11, 2020

What does this PR do?

This PR introduces a watch subcommand which is started as a subprocess at the point of upgrade.
This subcommand starts various watchers monitoring the status of the agent and its applications, as well as agent crashes.

If nothing bad is reported, the subprocess cleans up the older version of the agent and terminates.
If a FAILED status is reported, or the agent crashes more than 2 times within a minute, the agent is rolled back to the previous version and the new version is removed.

This PR does not report the UPGRADING state to Fleet, as state reporting will be changed during the upcoming days and that change would make this PR unnecessarily more complicated.

The watcher makes decisions based on a marker file; if it does not exist, the watcher terminates right away. A file lock ensures that no more than one watcher runs at a time.

To test it I built a snapshot version, installed it, and upgraded from Fleet.
For the positive scenario, the user needs to wait 10 minutes for the cleanup to occur.
For the negative scenario, `kill -9 {PID}` needs to be run several times, where PID is the process ID of the main agent process.

Why is it important?

More solid upgrade process.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@ph
Contributor

ph commented Nov 11, 2020

@michalpristas This PR doesn't have any automated tests to verify that behavior. I would be more comfortable having a few tests to cover it?

@michalpristas
Contributor Author

@ph Yes, I am thinking about a way to test this; I just wanted to publish this for review.

@ph
Contributor

ph commented Nov 12, 2020

Is that ready for early review?

@michalpristas
Contributor Author

@ph Yes, ready for early review and manual tests. I need to discuss automated e2e testing with manu, as this would be the best way to test it.

Contributor

@blakerouse blakerouse left a comment


This is looking really good. I added some comments on some more error cases that probably need to be handled.

I think the biggest issue at the moment is that it only supports systemd on Linux, and not sysv and upstart.

testCheckPeriod = 100 * time.Millisecond
)

func TestChecker(t *testing.T) {
Contributor


Nice! Really like the unit testing on this.

case <-time.After(statusCheckPeriod):
err := ch.agentClient.Connect(ctx)
if err != nil {
ch.log.Error(err, "Failed communicating to running daemon", errors.TypeNetwork, errors.M("socket", control.Address()))
Contributor


What happens if the new Agent starts running but there is a bug that never brings up the control socket? This case would not be detected by the PID watcher.

I think the inability to communicate with the Agent is something the ErrorChecker should report.

Maybe we give it more chances to fail, as the Agent could still be starting, but I think it needs to be reported.

Contributor


Was this solved? It doesn't seem like the failure to connect is handled.

ch.agentClient.Disconnect()
if err != nil {
ch.log.Error("failed retrieving agent status", err)
// agent is probably not running and this will be detected by pid watcher
Contributor


This falls under the comment above.

Contributor


Nicely done!

}

// NewLocker creates a Locker that locks the agent.lock file inside the provided directory.
func NewLocker(dir string) *Locker {
Contributor


We have a locker already used by the application, to ensure that no more than one Agent is running. Could we generalize that code to share it? This seems very similar.

Contributor Author


Yep, it is pretty much the same; I can generalize that.

// cannot remove self, this is expected
// fails with remove {path}\elastic-agent.exe: Access is denied
if runtime.GOOS == "windows" && strings.Contains(err.Error(), "elastic-agent.exe") && strings.Contains(err.Error(), "Access is denied") {
return true
Contributor


I have code in the Uninstall command that tries to handle this. It might be good to generalize that so it can be used here.

It uses a spawned cmd.exe to clean up the elastic-agent.exe after uninstall. The same could be used here by the watcher.


// revert active commit
if err := UpdateActiveCommit(prevHash); err != nil {
return err
Contributor


I think you need to rollback the symlink on failure.

Contributor Author


I was thinking about this in the sense that resetting the symlink is the most important thing on rollback; if the active commit or cleanup is not changed, it should not prevent running the older (correct) version instead of the failed new one.
If the restart fails, the correct agent should be started once the Service Manager restarts the agent, either on machine restart or for whatever other reason.


// Restart
if err := restartAgent(ctx); err != nil {
return err
Contributor


I think you need to rollback the symlink and active commit hash on failure.


// Init initializes os dependent properties.
func (ch *CrashChecker) Init(ctx context.Context) error {
dbusConn, err := dbus.New()
Contributor


I love the use of DBUS, but I think this is specific to systemd. We need to also support sysv and upstart. I think the service module I used for install/uninstall has the ability to get information about a service in a generalized way. I don't know if it will give the PID, but if so, that would ensure that it works on all Linux init systems.

Contributor Author


Yes, I was hoping for that as well, but it only provides a status, meaning Running, Stopped, etc.
I will check how I can provide PID providers for the other service managers.

// agent should be included in sudo one but in case it's not
// we're falling back to regular
p.piderFromCmd(ctx, "sudo", "launchctl", "list", install.ServiceName),
p.piderFromCmd(ctx, "launchctl", "list", install.ServiceName),
Contributor


@blakerouse @michalpristas I've seen the sudo reference; is this really the way to go? Is it possible that it fails?

Contributor


@ph I didn't notice the sudo. sudo should be removed; the Agent is already running at high-level permissions if it's running under the service manager, which it must be for self-upgrading.

Contributor Author

@michalpristas michalpristas Nov 19, 2020


I don't know why, but when I run it without sudo I don't see the elastic-agent service listed. I added it there because I was not able to retrieve the PID with the regular command.

Contributor Author


Edit: working when running as a service. Without sudo it won't work when using a DEV build and trying upgrades; I will add a DEV check and perform sudo only in that case, otherwise the sudoless command.

@EricDavisX
Contributor

Can you talk to why it takes around 10 minutes on a successful install to see the cleanup, if that is still the case?

@michalpristas
Contributor Author

@EricDavisX On a successful install we still wait for the grace period, in case something is wrong with Beats or the agent that doesn't manifest right away.

@EricDavisX
Contributor

ok thanks Michal.

@michalpristas
Contributor Author

@blakerouse Could you give it another look? I fixed the issues from the comments.

Contributor

@blakerouse blakerouse left a comment


This is really good work; super excited to see this land, and the overall quality of the failure-case handling is amazing.

I do just have that one question about what happens if it's not able to connect to the control socket. I think that might be the only missing piece here.

case <-time.After(statusCheckPeriod):
err := ch.agentClient.Connect(ctx)
if err != nil {
ch.log.Error(err, "Failed communicating to running daemon", errors.TypeNetwork, errors.M("socket", control.Address()))
Contributor


Was this solved? It doesn't seem like the failure to connect is handled.

ch.agentClient.Disconnect()
if err != nil {
ch.log.Error("failed retrieving agent status", err)
// agent is probably not running and this will be detected by pid watcher
Contributor


Nicely done!

Contributor

@blakerouse blakerouse left a comment


Thanks for the fix on the connection handling! Looks great!

@michalpristas michalpristas merged commit 374ef1f into elastic:master Dec 9, 2020
michalpristas added a commit to michalpristas/beats that referenced this pull request Dec 9, 2020
[Ingest Manager] Support for upgrade rollback (elastic#22537)
michalpristas added a commit that referenced this pull request Dec 9, 2020
[Ingest Manager] Support for upgrade rollback (#22537)