Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ingest Manager] Make installer atomic on windows #24253

Merged
merged 12 commits into from
Mar 2, 2021

Conversation

michalpristas
Copy link
Contributor

@michalpristas michalpristas commented Feb 25, 2021

What does this PR do?

PR fixes issue on windows when on restart while installing beats we end up with partial data.
awaitable installler was introduced which forces app wait for installer finish its job

and

sync is forced for windows after rename is called during install.

Why is it important?

Fixes #24180

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@michalpristas michalpristas added bug needs_backport PR is waiting to be backported to other branches. Team:Ingest Management Team:Elastic-Agent Label for the Agent team labels Feb 25, 2021
@michalpristas michalpristas self-assigned this Feb 25, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Feb 25, 2021
@michalpristas michalpristas changed the title Fix installer [Ingest Manager] Make installer atomic on windows Feb 25, 2021
@ph
Copy link
Contributor

ph commented Feb 25, 2021

/package

@elasticmachine
Copy link
Collaborator

elasticmachine commented Feb 25, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #24253 updated

  • Start Time: 2021-03-02T15:29:13.479+0000

  • Duration: 54 min 4 sec

  • Commit: 57116a1

Test stats 🧪

Test Results
Failed 0
Passed 6548
Skipped 16
Total 6564

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 6548
Skipped 16
Total 6564

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I don't understand how this fixes it, but see my inline comments on what I think this is doing.

I also wonder if all this ensuring that its fully extracted before shutdown is really even worth it. I think we should just think about always extracting before executing the program.

That would ensure:

  • Elastic Agent is always running what it expects (because the binary was not replaced in the directory)
  • Because of the point above the window of time to inject a bad binary from extract to starting the program is a extremely small window. (currently because we don't re-extract the binary can be changed as we only check the .asc and .sha512 of the compessed artifact vs the binary itself).

@@ -120,6 +120,11 @@ func (i *Installer) unzip(artifactPath string) error {
}

for _, f := range r.File {
// if we were cancelled in between
if err := ctx.Err(); err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this still leave it broken? If the context is cancelled, then this will stop extracting.

Copy link
Contributor Author

@michalpristas michalpristas Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will propagate error and prevent install to continue
also speeds up cancellation as otherwise it tries to finish loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will prevent all of the files in the zip file from being extracted. So that mean that if a SIGTERM occurs it will cancel the context. When the context is cancelled and its in this loop then only part of the zip file will be extracted.

What happens when this returns a context.Cancelled? Does the zip installer then remove the directory because the context was cancelled during the extraction? If that is the case, then this is okay. If it accepts context.Cancelled as an acceptable error, or does not remove the directory then it would still be left in a bad state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's the case, atomic installer removes the extracted files because error was returned

i.wg.Add(1)
defer i.wg.Done()

return i.installer.Install(ctx, spec, version, installDir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the context cancelling won't this still leave it in a bad state? Because if the context is cancelled then the waitgroup would still be marked with Done() and the Wait() would not actually wait for this to finish.

Copy link
Contributor Author

@michalpristas michalpristas Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the context is cancelled we should finish install and return error here, then wg should be mark with done.
if context is not cancelled this should work as usuual.
please elaborate a bit more, i'm not seeing what you see atm

@michalpristas
Copy link
Contributor Author

michalpristas commented Feb 26, 2021

@blakerouse there are 2 things happening now at this PR
first one is proper cancellation of the task which does not ensure archive is fully unpacked but other way around, it ensures task is cancelled as soon as we know about cancellation and artifact is removed by atomic installer (because error is propagated during unpack and atomic removes on error). for this we're awaiting.

how it works atm:

  • installer unpacking
  • shutdown is proposed
  • installer unpacking
  • operator shutdown is called, context cancelled, operator stopped
  • installer still installing because it;s not even aware it should stop
  • agent shuts down
  • installer interupted

how this pr changes the flow

  • installer unpacking
  • shutdown proposed
  • installer unpacking
  • operator shutdown called, context cancelled, wait for installer
  • zip installed registers cancelled context, exits loop with err
  • atomic installer sees error, cleans up
  • awaitable installer returns done when cleanup done
  • operator shutdown continues
  • agent exits

the second one is sync of FS operations in Install dir
with unpack we sync file by file
but with remove we need to sync entire directory because when agent exits close to remove, it gets corrupted.
this is done in atomic installer and agent using await makes sure it runs ok before exiting agent

i'm for debating about complete remove before first install but we need to think it through and solve even for endpoint usecase, i see benefits there definitely, but as this needs to be backported to 7.12 i would rather postpone such a change to x

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to explain how this is working. I just wanted to be sure that it worked as you where expecting. Glad to hear it is.

@michalpristas
Copy link
Contributor Author

/package

@michalpristas
Copy link
Contributor Author

/package

@michalpristas
Copy link
Contributor Author

finding out whether test failures are related

@EricDavisX
Copy link
Contributor

Hi - @dikshachauhan-qasource used an early version of the artifact from our GCP Observability builds and tried out the PR against current 8.0 master cloud build, and reports they did not see the problem (so that is good, if not a 100% definitive confirmation): Her words:

Policies used separately:

  1. default having only system
  2. new policy having all core 4 integrations [ system, linux, windows, Endpoint]

It may not fix all of the e2e-testing failures but may be worthwhile to push in and iterate over. @michalpristas thanks.

@michalpristas michalpristas merged commit 09ac584 into elastic:master Mar 2, 2021
michalpristas added a commit to michalpristas/beats that referenced this pull request Mar 2, 2021
[Ingest Manager] Make installer atomic on windows (elastic#24253)
michalpristas added a commit to michalpristas/beats that referenced this pull request Mar 2, 2021
[Ingest Manager] Make installer atomic on windows (elastic#24253)
EricDavisX pushed a commit that referenced this pull request Mar 2, 2021
[Ingest Manager] Make installer atomic on windows (#24253)
michalpristas added a commit that referenced this pull request Mar 3, 2021
Cherry-pick #24253 to 7.x: Make installer atomic on windows (#24297)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs_backport PR is waiting to be backported to other branches. Team:Elastic-Agent Label for the Agent team
Projects
None yet
5 participants