[Ingest Manager] Make installer atomic on windows #24253
Pinging @elastic/agent (Team:Agent)
Pinging @elastic/ingest-management (Team:Ingest Management)
/package
Maybe I don't understand how this fixes it, but see my inline comments on what I think this is doing.
I also wonder whether ensuring that it's fully extracted before shutdown is really even worth it. I think we should consider always extracting before executing the program.
That would ensure:
- Elastic Agent is always running what it expects (because the binary was not replaced in the directory).
- Because of the point above, the window of time to inject a bad binary, from extraction to starting the program, is extremely small. (Currently, because we don't re-extract, the binary can be changed, as we only check the .asc and .sha512 of the compressed artifact and not of the binary itself.)
@@ -120,6 +120,11 @@ func (i *Installer) unzip(artifactPath string) error {
	}

	for _, f := range r.File {
		// if we were cancelled in between
		if err := ctx.Err(); err != nil {
			return err
Won't this still leave it broken? If the context is cancelled, then this will stop extracting.
This will propagate the error and prevent the install from continuing.
It also speeds up cancellation, as otherwise it tries to finish the loop.
This will prevent the zip file from being fully extracted. That means that if a SIGTERM occurs, it will cancel the context, and if the context is cancelled while it is in this loop, only part of the zip file will be extracted.
What happens when this returns context.Canceled? Does the zip installer then remove the directory because the context was cancelled during the extraction? If that is the case, then this is okay. If it accepts context.Canceled as an acceptable error, or does not remove the directory, then it would still be left in a bad state.
Yes, that's the case: the atomic installer removes the extracted files because an error was returned.
	i.wg.Add(1)
	defer i.wg.Done()

	return i.installer.Install(ctx, spec, version, installDir)
Because of the context cancelling, won't this still leave it in a bad state? If the context is cancelled, then the waitgroup would still be marked with Done() and the Wait() would not actually wait for this to finish.
When the context is cancelled, we should finish the install and return an error here; only then is the wg marked as done.
If the context is not cancelled, this should work as usual.
Please elaborate a bit more, I'm not seeing what you see at the moment.
@blakerouse there are two things happening in this PR.
How it works at the moment:
How this PR changes the flow:
The second one is the sync of FS operations in the install dir. I'm open to debating a complete remove before the first install, but we need to think it through and solve it for the Endpoint use case as well. I definitely see benefits there, but as this needs to be backported to 7.12 I would rather postpone such a change to x
Thank you for taking the time to explain how this is working. I just wanted to be sure that it worked as you were expecting. Glad to hear it does.
/package
Finding out whether the test failures are related.
Hi - @dikshachauhan-qasource used an early version of the artifact from our GCP Observability builds, tried out the PR against the current 8.0 master cloud build, and reports that they did not see the problem (so that is good, if not a 100% definitive confirmation). Her words: Policies used separately:
It may not fix all of the e2e-testing failures, but it may be worthwhile to push it in and iterate. @michalpristas thanks.
What does this PR do?
This PR fixes an issue on Windows where a restart while Beats are being installed leaves us with partial data.
An awaitable installer was introduced, which forces the app to wait for the installer to finish its job,
and
a sync is forced on Windows after rename is called during install.
Why is it important?
Fixes #24180
Checklist
- CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.