[Ingest Manager] Support for upgrade rollback #22537
Conversation
@michalpristas This PR doesn't have any automated tests to verify that behavior. I would be more comfortable with a few tests to cover it.
@ph Yes, I am thinking about a way to test this; I just wanted to publish it for review.
Is this ready for early review?
@ph Yes, ready for early review and manual tests. I need to discuss automated e2e testing with Manu, as that would be the best way to test it.
This is looking really good. I added some comments on some more error cases that probably need to be handled.
I think the biggest issue at the moment is that it only supports systemd on Linux, and not sysv and upstart.
	testCheckPeriod = 100 * time.Millisecond
)

func TestChecker(t *testing.T) {
Nice! Really like the unit testing on this.
case <-time.After(statusCheckPeriod):
	err := ch.agentClient.Connect(ctx)
	if err != nil {
		ch.log.Error(err, "Failed communicating to running daemon", errors.TypeNetwork, errors.M("socket", control.Address()))
What happens if the new Agent starts running but there is a bug that never brings up the control socket? This case would not be detected by the PID watcher.
I think the inability to communicate with the Agent is something the ErrorChecker should report.
Maybe we give it a few more chances to fail, as the Agent could still be starting, but I think it needs to be reported.
Was this solved? It doesn't seem like the failure to connect is handled.
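To make the concern above concrete, here is a minimal Go sketch (not the PR's actual code) of how the checker loop could count consecutive connection failures and report them, so a control socket that never comes up triggers a rollback even when the PID watcher sees a live process. The agentClient interface, the notify channel, and the failure budget are assumptions for illustration.

```go
package checker

import (
	"context"
	"fmt"
	"time"
)

// agentClient stands in for the control-socket client used by the checker;
// the Connect/Disconnect signatures are assumptions for this sketch.
type agentClient interface {
	Connect(ctx context.Context) error
	Disconnect()
}

// connChecker is a hypothetical variant of the error checker that also
// reports repeated connection failures instead of only logging them.
type connChecker struct {
	client      agentClient
	notify      chan error // consumed by the watcher to decide on rollback
	maxFailures int        // allow a few failures while the agent is still starting
}

func (c *connChecker) run(ctx context.Context, checkPeriod time.Duration) {
	failures := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(checkPeriod):
			if err := c.client.Connect(ctx); err != nil {
				failures++
				if failures >= c.maxFailures {
					c.notify <- fmt.Errorf("control socket unreachable after %d attempts: %w", failures, err)
				}
				continue
			}
			failures = 0 // healthy again, reset the budget
			c.client.Disconnect()
		}
	}
}
```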
ch.agentClient.Disconnect()
if err != nil {
	ch.log.Error("failed retrieving agent status", err)
	// agent is probably not running and this will be detected by pid watcher
This falls under the comment above.
Nicely done!
}

// NewLocker creates a Locker that locks the agent.lock file inside the provided directory.
func NewLocker(dir string) *Locker {
We have a locker already used by the application to ensure that no more than one Agent is running. Could we generalize that code so it can be shared? This seems very similar.
Yep, it is pretty much the same; I can generalize that.
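For reference, a minimal sketch of what a shared locker package could look like, assuming exclusive file creation rather than whatever OS-level advisory lock the real code ends up using; the package and method names are illustrative, not the PR's API.

```go
package filelock

import (
	"fmt"
	"os"
	"path/filepath"
)

// Locker guards a lock file so that only one process (the running agent or
// the upgrade watcher) holds it at a time.
type Locker struct {
	path string
	file *os.File
}

// NewLocker creates a Locker for the agent.lock file inside dir.
func NewLocker(dir string) *Locker {
	return &Locker{path: filepath.Join(dir, "agent.lock")}
}

// TryLock takes the lock, failing if another process already holds it.
// Exclusive creation is used here for simplicity; it does not survive a
// crashed holder the way an advisory flock would.
func (l *Locker) TryLock() error {
	f, err := os.OpenFile(l.path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return fmt.Errorf("another instance appears to hold %s: %w", l.path, err)
	}
	l.file = f
	return nil
}

// Unlock releases the lock and removes the lock file.
func (l *Locker) Unlock() error {
	if l.file == nil {
		return nil
	}
	if err := l.file.Close(); err != nil {
		return err
	}
	l.file = nil
	return os.Remove(l.path)
}
```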
// cannot remove self, this is expected
// fails with: remove {path}\elastic-agent.exe: Access is denied
if runtime.GOOS == "windows" && strings.Contains(err.Error(), "elastic-agent.exe") && strings.Contains(err.Error(), "Access is denied") {
	return true
I have code in the Uninstall command that tries to handle this. It might be good to generalize that so it can be used here.
It uses a spawned cmd.exe to clean up the elastic-agent.exe after uninstall. The same could be used here by the watcher.
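A rough sketch of that cmd.exe trick, under the assumption that a short ping-based delay is enough for the parent process to exit; the helper name, the delay, and the exact script are illustrative, not the Uninstall command's actual code.

```go
//go:build windows

package cleanup

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// removeSelfAfterExit spawns a detached cmd.exe that waits briefly and then
// deletes elastic-agent.exe, because a Windows process cannot delete its own
// running executable.
func removeSelfAfterExit(installDir string) error {
	exe := filepath.Join(installDir, "elastic-agent.exe")
	// "ping -n 3 127.0.0.1" is a common stand-in for sleep in cmd scripts.
	script := fmt.Sprintf(`ping -n 3 127.0.0.1 > nul && del /f /q "%s"`, exe)
	cmd := exec.Command("cmd.exe", "/C", script)
	// Start (not Run) so the helper keeps running after this process exits,
	// then release the handle on the child.
	if err := cmd.Start(); err != nil {
		return err
	}
	return cmd.Process.Release()
}
```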
// revert active commit
if err := UpdateActiveCommit(prevHash); err != nil {
	return err
I think you need to roll back the symlink on failure.
I was thinking about it this way: resetting the symlink is the most important thing on rollback. If the active commit or the cleanup is not changed, that should not prevent running the older (correct) version instead of the failed new one.
If the restart fails, the correct agent should still be started once the service manager restarts the agent, either on machine restart or for whatever other reason.
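That ordering argument could look roughly like the sketch below; the helper names mirror the ones visible in this diff but their signatures are assumed, and the stubs exist only to keep the example self-contained. The point is that only a symlink failure aborts the rollback, while the later steps are best-effort.

```go
package upgrade

import (
	"context"
	"log"
)

// rollback restores the previous version. Restoring the symlink is the
// critical step; failures in the later steps are logged but do not abort,
// because the older agent can still come up on the next service restart.
func rollback(ctx context.Context, prevHash string) error {
	// 1. Point the symlink back at the previous version.
	if err := changeSymlink(prevHash); err != nil {
		return err
	}
	// 2. Revert the active commit marker (best-effort).
	if err := updateActiveCommit(prevHash); err != nil {
		log.Printf("rollback: failed to revert active commit: %v", err)
	}
	// 3. Restart; if this fails, the service manager will bring up the
	//    correct (older) agent on the next machine or service restart.
	if err := restartAgent(ctx); err != nil {
		log.Printf("rollback: failed to restart agent: %v", err)
	}
	return nil
}

// Stubs standing in for the PR's helpers; real signatures may differ.
func changeSymlink(hash string) error        { return nil }
func updateActiveCommit(hash string) error   { return nil }
func restartAgent(ctx context.Context) error { return nil }
```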
// Restart
if err := restartAgent(ctx); err != nil {
	return err
I think you need to roll back the symlink and active commit hash on failure.
// Init initializes os dependent properties.
func (ch *CrashChecker) Init(ctx context.Context) error {
	dbusConn, err := dbus.New()
I love the use of DBUS, but I think this is specific to systemd. We need to also support sysv and upstart. I think the service module I used for install/uninstall has the ability to get information about a service in a generalized way. I don't know if it will give the PID, but if so, that would ensure it works on all Linux init systems.
Yes, I was hoping for that as well, but it only provides a status, meaning Running, Stopped, etc.
I will check how I can provide piders for the other service managers.
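One way to keep the crash checker init-system agnostic is an interface like the sketch below, assuming a PID lookup is all the checker needs; the systemd implementation can stay on D-Bus while sysv/upstart variants shell out to their own tooling. The names here are illustrative, not the PR's API.

```go
package crash

import (
	"context"
	"errors"
)

// pider abstracts "give me the PID of the agent service" so the crash
// checker does not depend on systemd directly.
type pider interface {
	PID(ctx context.Context) (int, error)
	Close()
}

// CrashChecker holds one pider per supported init/service manager and uses
// the first one that answers.
type CrashChecker struct {
	piders []pider
}

func (ch *CrashChecker) currentPID(ctx context.Context) (int, error) {
	lastErr := errors.New("no pider configured")
	for _, p := range ch.piders {
		pid, err := p.PID(ctx)
		if err == nil {
			return pid, nil
		}
		lastErr = err
	}
	return 0, lastErr
}
```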
// agent should be included in sudo one but in case it's not
// we're falling back to regular
p.piderFromCmd(ctx, "sudo", "launchctl", "list", install.ServiceName),
p.piderFromCmd(ctx, "launchctl", "list", install.ServiceName),
@blakerouse @michalpristas I've seen the sudo reference; is this really the way to go? Is it possible that it fails?
@ph I didn't notice the sudo. sudo should be removed; the Agent is already running with high-level permissions if it's running under the service manager, which it must be for self-upgrading.
I don't know why, but when I run it without sudo I don't see the elastic-agent service listed; I added it there because I was not able to retrieve the PID with the regular command.
Edit: it works when running as a service. Without sudo it won't work when using a DEV build and trying upgrades, so I will add a DEV check and use sudo only in that case; otherwise the sudoless command.
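That DEV-only sudo decision could be as simple as the sketch below; treating devBuild as a plain variable and returning raw command slices are assumptions for illustration (the real code builds piders via piderFromCmd).

```go
package crash

// devBuild would typically be set at build time (for example via -ldflags);
// a plain variable is an assumption for this sketch.
var devBuild = false

// launchctlCommands returns the PID lookup commands tried in order on macOS.
// A proper service install already runs with the needed permissions, so the
// sudo variant is only added for DEV builds where upgrades are exercised
// outside the service manager.
func launchctlCommands(serviceName string) [][]string {
	cmds := [][]string{{"launchctl", "list", serviceName}}
	if devBuild {
		cmds = append(cmds, []string{"sudo", "launchctl", "list", serviceName})
	}
	return cmds
}
```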
Can you talk to why it takes around 10 minutes after a successful install to see the cleanup, if that is still the case?
@EricDavisX On a successful install we still wait for the grace period, in case something is wrong with Beats or the Agent that doesn't manifest right away.
OK, thanks Michal.
@blakerouse Could you give it another look? I fixed the issues from the comments.
This is really good work. I'm super excited to see this land, and the overall quality of the failure handling is amazing.
I do just have that one question about what happens if it's not able to connect to the control socket. I think that might be the only missing piece here.
Thanks for the fix on the connection handling! Looks great!
[Ingest Manager] Support for upgrade rollback (elastic#22537)
What does this PR do?
This PR introduces a watch subcommand which is started as a subprocess at the point of upgrade.
This subcommand starts various watchers monitoring the status of the agent and its applications, as well as agent crashes.
If nothing bad is reported, the subprocess cleans up the older version of the agent and terminates.
If a FAILED status is reported, or the agent crashes more than 2 times within a minute, the agent is rolled back to the previous version and the new version is removed.
This PR does not report the UPGRADING state to Fleet, as state reporting will be changed in the upcoming days and that change would make this PR unnecessarily more complicated.
The watcher makes decisions based on a marker; if it does not exist, the watcher terminates right away. No more than one watcher is allowed to run, and a file lock is used to ensure that.
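A condensed sketch of that decision flow, with the marker check, single-instance file lock, grace period, and rollback trigger wired together; markerExists, lock/unlock, rollback, and cleanup are stand-ins with assumed signatures, and errCh is whatever channel the status and crash checkers report on.

```go
package watcher

import (
	"context"
	"time"
)

// watch encodes the decision flow described above: no marker means nothing
// to do, the file lock keeps a single watcher, an error within the grace
// period triggers a rollback, and silence until the grace period expires
// triggers cleanup of the old version.
func watch(ctx context.Context, gracePeriod time.Duration, errCh <-chan error) error {
	if !markerExists() {
		return nil // no upgrade in progress
	}
	if err := lock(); err != nil {
		return err // another watcher already holds the file lock
	}
	defer unlock()

	select {
	case err := <-errCh:
		// FAILED status or too many crashes reported by the checkers.
		return rollback(ctx, err)
	case <-time.After(gracePeriod):
		// Nothing bad reported: remove the older version and terminate.
		return cleanup()
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Stubs standing in for the PR's helpers; real signatures may differ.
func markerExists() bool                              { return false }
func lock() error                                     { return nil }
func unlock()                                         {}
func cleanup() error                                  { return nil }
func rollback(ctx context.Context, cause error) error { return nil }
```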
To test it, I built a snapshot version, installed it, and upgraded from Fleet.
For the positive scenario, the user needs to wait 10 minutes for the cleanup to occur.
For the negative scenario, kill -9 {PID} needs to be done several times, where PID is the process ID of the main agent process.
Why is it important?
More solid upgrade process.
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.