
[Ingest Manager] Support for upgrade rollback #22537

Merged
merged 43 commits into elastic:master
Dec 9, 2020

Conversation

michalpristas
Contributor

@michalpristas michalpristas commented Nov 11, 2020

What does this PR do?

This PR introduces a watch subcommand which is started as a subprocess at the point of upgrade.
This subcommand starts various watchers monitoring the status of the agent and its applications, as well as agent crashes.

If nothing bad is reported, the subprocess cleans up the older version of the agent and terminates.
If a FAILED status is reported, or the agent crashes more than 2 times within a minute, the agent is rolled back to the previous version and the new version is removed.

This PR does not report the UPGRADING state to Fleet, as state reporting will be changed during the upcoming days and that change would make this PR unnecessarily more complicated.

The watcher makes decisions based on a marker file; if it does not exist, the watcher terminates right away. A file lock ensures that no more than one watcher runs at a time.

To test it I built a snapshot version, installed it, and upgraded from Fleet.
For the positive scenario, the user needs to wait 10 minutes for the cleanup to occur.
For the negative scenario, `kill -9 {PID}` needs to be run several times, where PID is the process ID of the main agent process.

Why is it important?

More solid upgrade process.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@ph
Contributor

ph commented Nov 11, 2020

@michalpristas This PR doesn't have any automated tests to verify that behavior. I would be more comfortable having a few tests to cover it?

@michalpristas
Contributor Author

@ph Yes, I am thinking about a way to test this; I just wanted to publish this for review.

@ph
Contributor

ph commented Nov 12, 2020

Is that ready for early review?

@michalpristas
Contributor Author

@ph Yes, ready for early review and manual tests. I need to discuss automated e2e testing with manu, as this would be the best way to test it.

Contributor

@blakerouse blakerouse left a comment


This is looking really good. I added some comments on some more error cases that probably need to be handled.

I think the biggest issue at the moment is that it only supports systemd on Linux, and not sysv and upstart.

testCheckPeriod = 100 * time.Millisecond
)

func TestChecker(t *testing.T) {
Contributor


Nice! Really like the unit testing on this.

case <-time.After(statusCheckPeriod):
err := ch.agentClient.Connect(ctx)
if err != nil {
ch.log.Error(err, "Failed communicating to running daemon", errors.TypeNetwork, errors.M("socket", control.Address()))
Contributor


What happens if the new Agent starts running but there is a bug that never brings up the control socket? This case would not be detected by the PID watcher.

I think the inability to communicate with the Agent is something the ErrorChecker should report.

Maybe we give it more chances to fail, as the Agent could still be starting, but I think it needs to be reported.

Contributor


Was this solved? It doesn't seem like the failure to connect is handled.

ch.agentClient.Disconnect()
if err != nil {
ch.log.Error("failed retrieving agent status", err)
// agent is probably not running and this will be detected by pid watcher
Contributor


This falls under the comment above.

Contributor


Nicely done!

}

// NewLocker creates a Locker that locks the agent.lock file inside the provided directory.
func NewLocker(dir string) *Locker {
Contributor


We have a locker already used by the application, to ensure that no more than one Agent is running. Could we generalize that code to share it? This seems very similar.

Contributor Author


Yep, it is pretty much the same; I can generalize that.

// cannot remove self, this is expected
// fails with remove {path}\elastic-agent.exe: Access is denied
if runtime.GOOS == "windows" && strings.Contains(err.Error(), "elastic-agent.exe") && strings.Contains(err.Error(), "Access is denied") {
return true
Contributor


I have code in the Uninstall command that tries to handle this. It might be good to generalize that so it can be used here.

It uses a spawned cmd.exe to clean up the elastic-agent.exe after uninstall. The same could be used here by the watcher.


// revert active commit
if err := UpdateActiveCommit(prevHash); err != nil {
return err
Contributor


I think you need to rollback the symlink on failure.

Contributor Author


I was thinking about this in the sense that resetting the symlink is the most important thing on rollback; if the active commit or cleanup is not changed, it should not prevent running the older (correct) version instead of the failed new one.
If the restart fails, the correct agent should be started once the Service Manager restarts the agent, either on machine restart or for whatever other reason.


// Restart
if err := restartAgent(ctx); err != nil {
return err
Contributor


I think you need to rollback the symlink and active commit hash on failure.


// Init initializes os dependent properties.
func (ch *CrashChecker) Init(ctx context.Context) error {
dbusConn, err := dbus.New()
Contributor


I love the use of DBUS, but I think this is specific to systemd. We need to also support sysv and upstart. I think the service module I used for install/uninstall has the ability to get information about a service in a generalized way. I don't know if it will give the PID, but if so, that would ensure that it works on all Linux init systems.

Contributor Author


Yes, I was hoping for that as well, but it only provides a status, meaning Running, Stopped, etc.
I will check how I can provide PID providers for the other service managers.

// agent should be included in sudo one but in case it's not
// we're falling back to regular
p.piderFromCmd(ctx, "sudo", "launchctl", "list", install.ServiceName),
p.piderFromCmd(ctx, "launchctl", "list", install.ServiceName),
Contributor


@blakerouse @michalpristas I've seen the sudo reference; is this really the way to go? Is it possible that it fails?

Contributor


@ph I didn't notice the sudo. sudo should be removed; the Agent is already running at high-level permissions if it's running under the service manager, which it must be for self-upgrading.

Contributor Author

@michalpristas michalpristas Nov 19, 2020


I don't know why, but when I run it without sudo I don't see the elastic-agent service listed. I added it there because I was not able to retrieve the PID with the regular command.

Contributor Author


Edit: working when running as a service. Without sudo it won't work when using a DEV build and trying upgrades; I will add a DEV check and perform sudo only in that case, otherwise the sudoless command.

@EricDavisX
Contributor

Can you talk to why it takes around 10 minutes on a successful install to see the cleanup, if that is still the case?

@michalpristas
Contributor Author

@EricDavisX On a successful install we still wait for the grace period, in case something is wrong with Beats or the agent that doesn't manifest right away.

@EricDavisX
Contributor

ok thanks Michal.

@michalpristas
Contributor Author

@blakerouse Could you give it another look? I fixed the issues from the comments.

Contributor

@blakerouse blakerouse left a comment


This is really good work; super excited to see this land, and the overall quality of the failure-case handling is amazing.

I do just have that one question about what happens if it's not able to connect to the control socket. I think that might be the only missing piece here.

case <-time.After(statusCheckPeriod):
err := ch.agentClient.Connect(ctx)
if err != nil {
ch.log.Error(err, "Failed communicating to running daemon", errors.TypeNetwork, errors.M("socket", control.Address()))
Contributor


Was this solved? It doesn't seem like the failure to connect is handled.

ch.agentClient.Disconnect()
if err != nil {
ch.log.Error("failed retrieving agent status", err)
// agent is probably not running and this will be detected by pid watcher
Contributor


Nicely done!

Contributor

@blakerouse blakerouse left a comment


Thanks for the fix on the connection handling! Looks great!

@michalpristas michalpristas merged commit 374ef1f into elastic:master Dec 9, 2020
michalpristas added a commit to michalpristas/beats that referenced this pull request Dec 9, 2020
[Ingest Manager] Support for upgrade rollback (elastic#22537)
michalpristas added a commit that referenced this pull request Dec 9, 2020
[Ingest Manager] Support for upgrade rollback (#22537)