add pid awareness to file locking #33169

Merged
merged 26 commits into from
Oct 12, 2022

Conversation

fearful-symmetry
Contributor

@fearful-symmetry fearful-symmetry commented Sep 22, 2022

What does this PR do?

This PR seeks to add pid-awareness to the lockfile system that's initialized when a beat starts up. When a beat attempts to create the lockfile, it checks for a pre-existing file and, if one exists, checks whether the process listed in the file is still running. If the process does not exist (as in a bad shutdown where the lockfile never got removed), it will attempt to recover and create a new lockfile.
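The recovery flow described here can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `lockInfo` payload, `isStale`, `acquire`, and the injected `alive` callback are all hypothetical names, and the sketch relies only on `O_EXCL` file creation, leaving out OS-level file locking and zombie handling.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// lockInfo is a hypothetical lockfile payload: the PID of the owner.
type lockInfo struct {
	Pid int `json:"pid"`
}

// errLocked is a hypothetical sentinel for "a live process owns the lock".
var errLocked = errors.New("data path is locked by a running process")

// isStale reports whether an existing lockfile can be recovered: its owner
// is either this very PID (a restart that reused the PID) or no longer alive.
func isStale(owner lockInfo, self int, alive func(pid int) bool) bool {
	return owner.Pid == self || !alive(owner.Pid)
}

// acquire creates the lockfile at path, recovering once from a stale one.
// alive stands in for whatever process-existence check the platform offers.
func acquire(path string, self int, alive func(pid int) bool) error {
	for attempt := 0; attempt < 2; attempt++ {
		// O_EXCL makes creation fail if the file already exists.
		f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
		if err == nil {
			defer f.Close()
			return json.NewEncoder(f).Encode(lockInfo{Pid: self})
		}
		if !errors.Is(err, os.ErrExist) {
			return err
		}
		raw, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		var owner lockInfo
		if err := json.Unmarshal(raw, &owner); err != nil {
			return fmt.Errorf("unreadable lockfile: %w", err)
		}
		if !isStale(owner, self, alive) {
			return errLocked
		}
		// The owner is gone: drop the leftover file and try again.
		if err := os.Remove(path); err != nil {
			return err
		}
	}
	return errLocked
}

func main() {
	dir, err := os.MkdirTemp("", "lockdemo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "beat.lock")

	dead := func(int) bool { return false }
	live := func(int) bool { return true }

	fmt.Println(acquire(path, 100, dead)) // first start: no lockfile yet
	fmt.Println(acquire(path, 200, live)) // owner 100 is still running
	fmt.Println(acquire(path, 200, dead)) // owner 100 was killed: recover
}
```

Injecting `alive` keeps the decision logic testable without spawning real processes; in a real beat it would be backed by an actual process lookup.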

Why is it important?

In automated agent systems, the beat can sometimes take too long to shut down, causing agent to hard-kill the process. In this instance, the beat won't restart, as the lockfile is still hanging around. This fixes that.

Checklist

  • Test while running under agent
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-v8.4.0 Automated backport with mergify backport-v8.5.0 Automated backport with mergify labels Sep 22, 2022
@fearful-symmetry fearful-symmetry self-assigned this Sep 22, 2022
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner September 22, 2022 23:23
@fearful-symmetry fearful-symmetry requested review from cmacknz and faec and removed request for a team September 22, 2022 23:23
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Sep 22, 2022
@elasticmachine
Collaborator

elasticmachine commented Sep 23, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-10-11T23:10:15.842+0000

  • Duration: 132 min 34 sec

Test stats 🧪

Test Results
Failed 0
Passed 23656
Skipped 1950
Total 25606

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Member

@cmacknz cmacknz left a comment

Were you able to reproduce the problem? Did we test this manually using the entire beat process in addition to the unit tests?

libbeat/cmd/instance/beat.go (outdated, resolved)
libbeat/cmd/instance/locks/lock.go (outdated, resolved)
libbeat/cmd/instance/locks/lock.go (outdated, resolved)
libbeat/cmd/instance/locks/lock_test.go (outdated, resolved)
// emulate two beats trying to run from the same data path
locker := New(beatName)
// use pid 1 as another beat
_, err := locker.createPidfile(1)
Member

Why can this be an arbitrary PID?

Member

@cmacknz cmacknz Sep 23, 2022

Ah right PID 1 on Linux is the init process that is guaranteed to exist.

Does this still work on Windows, or in general as a cross platform test?

I would also recommend making the 1 a constant for those that either don't know or are slow to remember the significance of PID 1.

Contributor Author

Yah, that was the simplest way I could think of to get a reproducible pid. Might have to change it for windows...
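One possible way to address both points in this thread, sketched under assumptions: name the constant so its meaning is obvious, and, for platforms where PID 1 means nothing special, fall back to a PID that is known to be alive and is not the test process itself. The helper name `otherLivePID` is hypothetical and this is not what the PR's tests do; note also that on Unix the parent PID becomes 1 if the parent has already exited.

```go
package main

import (
	"fmt"
	"os"
)

// initPID is the init process on Linux, which exists for as long as the
// system is up. Naming it spares readers from decoding a bare 1.
const initPID = 1

// otherLivePID returns the PID of a process that is normally running and is
// not the caller: the parent of the current process. os.Getppid is available
// on Windows as well as on Unix-like systems.
func otherLivePID() int {
	return os.Getppid()
}

func main() {
	fmt.Println("init:", initPID, "self:", os.Getpid(), "parent:", otherLivePID())
}
```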

@cmacknz cmacknz requested a review from leehinman September 23, 2022 15:12
// we have internal metrics libraries we can use for this, but all those will use APIs
// dedicated to gathering extended process info and metrics, which can come with extra permissions hurdles,
// making those methods more likely to return an error.
exists, err := process.PidExistsWithContext(context.Background(), int32(pf.Pid))
Member

Do we need to handle zombie processes? It seems the agent is able to leave some of them around: elastic/elastic-agent#1215

Contributor Author

Ah, didn't even think of testing that.
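For context on why zombies matter here: a zombie still occupies a PID, so a plain existence check reports it as present even though it will never clean up its lockfile. One Linux-only way to tell the difference, sketched below under the assumption that procfs is mounted, is to read the state field of `/proc/<pid>/stat`. The function names are hypothetical, and this is not necessarily how the PR or its dependencies detect zombies.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// stateFromStat extracts the one-letter process state from the contents of
// a Linux /proc/<pid>/stat file. The command name is wrapped in parentheses
// and may itself contain spaces or ")", so the state is the first field
// after the last ")".
func stateFromStat(stat string) (byte, error) {
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, fmt.Errorf("malformed stat line: %q", stat)
	}
	fields := strings.Fields(stat[end+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, fmt.Errorf("malformed stat line: %q", stat)
	}
	return fields[0][0], nil
}

// isZombie reports whether the state letter marks a zombie ("Z").
func isZombie(state byte) bool {
	return state == 'Z'
}

func main() {
	// Linux only: inspect our own process as a smoke test.
	raw, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		fmt.Println("procfs not available:", err)
		return
	}
	state, err := stateFromStat(string(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("state=%c zombie=%t\n", state, isZombie(state))
}
```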

libbeat/cmd/instance/locks/lock.go (outdated, resolved)
libbeat/cmd/instance/locks/lock.go (outdated, resolved)

// try to remove the lockfile
// May or may not work, depending on os-specific details with lockfiles
err = os.Remove(lock.fileLock.Path())
Contributor

Could we have at least one retry if the os.Remove() fails? The reason is that on Windows, if a file was locked and the process dies, the OS will release the lock. But it isn't immediate and is subject to resources, so often waiting 1 sec and trying again will work.

Contributor Author

ahhh, interesting, alright.
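The retry suggested in this thread could look roughly like the sketch below. The names `retry`, `removeLockfile`, and `errBusy` are hypothetical, and the single one-second retry mirrors the reviewer's suggestion rather than whatever the PR finally settled on.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// errBusy is a hypothetical error used to simulate a file the OS has not
// released yet.
var errBusy = errors.New("file is still locked")

// retry calls op up to attempts times, sleeping delay between failures.
// It returns nil on the first success, or the last error otherwise.
func retry(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

// removeLockfile deletes path, retrying once after a second. A file that is
// already gone counts as success.
func removeLockfile(path string) error {
	return retry(2, time.Second, func() error {
		err := os.Remove(path)
		if errors.Is(err, os.ErrNotExist) {
			return nil
		}
		return err
	})
}

func main() {
	// Simulate an operation that fails once and then succeeds.
	calls := 0
	err := retry(2, 10*time.Millisecond, func() error {
		calls++
		if calls == 1 {
			return errBusy
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Keeping the retry loop separate from `os.Remove` makes the timing behavior easy to exercise without needing a locked file on Windows.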

@fearful-symmetry
Contributor Author

@cmacknz

Were you able to reproduce the problem? Did we test this manually using the entire beat process in addition to the unit tests?

Yup, tested with metricbeat under the usual cases (remaining lockfile for dead process, two beats trying to start at once)

@fearful-symmetry
Contributor Author

Okay, so, the only issue that isn't dealt with yet is how to handle zombie beats processes. It's definitely a problem, and I'm currently trying to find the most elegant way to handle this cross-platform.

@fearful-symmetry
Contributor Author

This PR now depends on: elastic/elastic-agent-system-metrics#53

@fearful-symmetry
Contributor Author

fearful-symmetry commented Sep 28, 2022

Alright. This now can handle zombie states for processes. Also managed to test most of the major use cases manually (sudden stop & start without removing the lockfile, starting two beats at once, and re-starting while a zombie process remains), which was a bit of a pain. We should be good for the zombie issue now, at least.

@fearful-symmetry
Contributor Author

Alright, so the typecheck errors in CI are not related, but kinda baffling. I assume the issue (based on what I've hit before) is the fact that the github CI thing has cgo disabled, and the build constraints in elastic-agent-system-metrics require darwin&cgo:

//go:build (darwin && cgo) || freebsd || linux || windows || aix
// +build darwin,cgo freebsd linux windows aix

We'll probably need a separate fix for this, as I assume cgo is disabled for a reason.
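To see how this constraint resolves for a darwin build with cgo disabled, it can be evaluated against a few tag sets using the standard library's `go/build/constraint` package. This only illustrates how the expression behaves; the `satisfied` helper is a hypothetical name, and whether this is the real cause of the CI errors remains the assumption stated above.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// satisfied reports whether a //go:build line is met by the given set of
// build tags (GOOS, GOARCH, "cgo", and so on).
func satisfied(line string, tags ...string) (bool, error) {
	expr, err := constraint.Parse(line)
	if err != nil {
		return false, err
	}
	set := make(map[string]bool, len(tags))
	for _, t := range tags {
		set[t] = true
	}
	return expr.Eval(func(tag string) bool { return set[tag] }), nil
}

func main() {
	const line = "//go:build (darwin && cgo) || freebsd || linux || windows || aix"
	for _, tags := range [][]string{{"darwin", "cgo"}, {"darwin"}, {"linux"}} {
		ok, err := satisfied(line, tags...)
		fmt.Println(tags, ok, err)
	}
}
```

When the constraint is not met, the file is left out of the build, so any symbol defined only there is missing for that platform and tag combination.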

@fearful-symmetry
Contributor Author

So, made a few changes based on @andrewkroh 's feedback.

Two things to address still:

  • As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.
  • My hack to deal with github CI not liking cgo still really bugs me, and I'd like to find another way to handle it.

@fearful-symmetry
Contributor Author

Ah, E2E tests are again being weird:

fatal: [18.218.130.4]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host", "unreachable": true}

@joshdover
Contributor

  • As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.

Yeah this makes sense. We should try testing this shortly after getting this in. I think this change is still valuable since we do not know that the queue flush is the only reason the beat is failing to clean this up.

  • My hack to deal with github CI not liking cgo still really bugs me, and I'd like to find another way to handle it.

IMO we could solve this as a follow up after getting the PR in.

@fearful-symmetry
Contributor Author

fearful-symmetry commented Oct 11, 2022

IMO we could solve this as a follow up after getting the PR in.

Agreed, this has been in a PR long enough. Gonna give the E2E tests one last chance to work, then merge.

@cmacknz
Member

cmacknz commented Oct 11, 2022

As @andrewkroh mentioned, we should look into fiddling with the shutdown_timeout settings under agent to make this problem less likely in the first place, although this falls outside the scope of this PR.

As far as I can tell the Beats don't actually wait for the queue to empty by default (and this is what the documentation says as well). At least the ones I looked at do not obviously wait for the queue to drain before shutting down, since ShutdownTimeout defaults to 0.

waitPublished := fb.config.ShutdownTimeout > 0 || *once
if waitPublished {
	// Wait for registrar to finish writing registry
	waitEvents.Add(withLog(wgEvents.Wait,

if eb.config.ShutdownTimeout > 0 {
	eb.log.Infof("Shutdown will wait max %v for the remaining %v events to publish.",
		eb.config.ShutdownTimeout, acker.Active())
	ctx, cancel := context.WithTimeout(context.Background(), eb.config.ShutdownTimeout)
	defer cancel()
	acker.Wait(ctx)
}
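The pattern in the two excerpts above can be reduced to a small standalone sketch: with a zero timeout the shutdown path skips waiting altogether, and with a positive one it waits for the queue to drain or for the deadline, whichever comes first. `waitForDrain` and the `drained` channel are hypothetical stand-ins, not names from the Beats code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForDrain blocks until drained is closed or shutdownTimeout elapses.
// A timeout of zero, the default discussed above, skips the wait entirely.
// It reports whether it waited and saw the queue drain in time.
func waitForDrain(shutdownTimeout time.Duration, drained <-chan struct{}) bool {
	if shutdownTimeout <= 0 {
		return false // default: do not wait for pending events
	}
	ctx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
	defer cancel()
	select {
	case <-drained:
		return true
	case <-ctx.Done():
		return false
	}
}

func main() {
	drained := make(chan struct{})
	close(drained) // pretend the queue is already empty

	fmt.Println(waitForDrain(0, drained))
	fmt.Println(waitForDrain(time.Second, drained))
}
```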

@cmacknz
Member

cmacknz commented Oct 11, 2022

I think at this point in the 8.5 release cycle with only a few days before the release we should target 8.5.1 instead of 8.5.0 with this. This change needs some soak time before release considering the complexity and the chance to create some new unintended failure mode where the beats refuse to start.

This means we should defer merging the backport to the 8.5 branch until after 8.5.0 is released to be safe.

@fearful-symmetry fearful-symmetry removed backport-v8.4.0 Automated backport with mergify backport-v8.5.0 Automated backport with mergify labels Oct 11, 2022
@fearful-symmetry
Contributor Author

This means we should defer merging the backport to the 8.5 branch until after 8.5.0 is released to be safe.

Agreed, this is kind of a scary change, would prefer not to backport anything last-minute.

@mergify
Contributor

mergify bot commented Oct 11, 2022

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@fearful-symmetry fearful-symmetry merged commit 692172c into elastic:main Oct 12, 2022
fearful-symmetry added a commit that referenced this pull request Oct 13, 2022
* Fix errors for non-synth capable instances (#33310)

Fixes #32694 by making sure we use the lightweight wrapper code always when monitors cannot be initialized.

This also fixes an unrelated bug, where errors attached to non-summary events would not be indexed.

* [Automation] Update elastic stack version to 8.6.0-5a8d757d for testing (#33323)

Co-authored-by: apmmachine <[email protected]>

* add pid awareness to file locking (#33169)

* add pid awareness to file locking

* cleanup, logic for handling restarts with the same PID

* add zombie-state awareness

* fix file naming

* add retry for unlock

* was confused by unlock code, fix, cleanup

* update notice

* fix race with file creation, update deps

* clean up tests, spelling

* hack for cgo

* add lic headers

* notice

* try to fix windows issues

* fix typos

* small fixes

* use exclusive locks

* remove feature to start with a specially named pidfile

* clean up some error handling, fix test cleanup

* forgot changelog

* Fix sample config in log rotation docs (#33306)

* Add banner to deprecate functionbeat (#33297)

* fix unchecked stream_id

* packetbeat/protos/dns: clean up package (#33286)

* avoid magic numbers
* fix hashableDNSTuple size and offsets
* avoid use of String and Error methods in formatted print calls
* remove redundant conversions
* quieten linter
* use plugin-owned logp.Logger

* update elastic-agent-libs

* Revert "fix unchecked stream_id"

This reverts commit 26ef6da.

* [Automation] Update elastic stack version to 8.6.0-40086bc7 for testing (#33339)

Co-authored-by: apmmachine <[email protected]>

Co-authored-by: Andrew Cholakian <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: Jaime Soriano Pastor <[email protected]>
Co-authored-by: DeDe Morton <[email protected]>
Co-authored-by: Dan Kortschak <[email protected]>
@lucabelluccini
Contributor

Can we confirm if this will only make it to 8.6.0? From what I see it will not be backported to 8.5.

@amitkanfer
Collaborator

@lucabelluccini i'm pretty sure this goes into 8.5.1. @jlind23 keep me honest

@cmacknz
Member

cmacknz commented Oct 24, 2022

Yes, I had planned to put this in 8.5.1. We are waiting for the 8.5.0 release to occur before backporting, as we are still generating 8.5.0 build candidates due to blockers.

@cmacknz cmacknz added the backport-v8.5.0 Automated backport with mergify label Oct 31, 2022
@cmacknz
Member

cmacknz commented Oct 31, 2022

@fearful-symmetry I've added the label to create the 8.5.0 backport, we are good to backport for inclusion in 8.5.1 now.

mergify bot pushed a commit that referenced this pull request Oct 31, 2022
* add pid awareness to file locking

(cherry picked from commit 692172c)
fearful-symmetry added a commit that referenced this pull request Nov 2, 2022
* add pid awareness to file locking (#33169)

(cherry picked from commit 692172c)

* fighting with CI

Co-authored-by: Alex K <[email protected]>
Co-authored-by: Alex Kristiansen <[email protected]>
cmacknz pushed a commit that referenced this pull request Nov 9, 2022
* Update Metricbeat, Filebeat, libbeat with elastic-agent V2 support (#32673)

* basic framework

* continued tinkering

* move away from ast code, use a struct

* get metricbeat working, starting on filebeat

* add notice update

* add basic config register

* move over processors to individual beats

* remove comments

* start to integrate V2 client changes

* finishing touches

* lint

* cleanup merge

* remove V1 controller

* stil tinkering with linter

* still fixing linter

* plz linter

* fmt x-pack files

* notice update

* fix output test

* refactor stop functions, refactor tests, some misc cleanup

* fix client version string

* add devguide

* linter

* expand filebeat test

* cleanup test

* fix docs, add tests, debuggin

* add signal handler

* fix mutex issue in register

* Fix osquerybeat configuration for V2

* clean up component registration

* spelling

* remove workaround for filebeat types

* try to fix filebeat tests

* add nil checks, fix test, fix unit stop

* continue tinkering with nil type checks

* add test for missing config datastreams, clean up nil handling

* change nil protections, use getter methods

* fix config access in output code

Co-authored-by: Aleksandr Maus <[email protected]>

* V2 packetbeat support (#33041)

* first attempt at auditbeat support

* add license header

* initial packetbeat support

* fix bad branch

* cleanup

* typo in comment

* clean up, move around files

* add new processors to streams

* First pass at auditbeat support (#33026)

* first attempt at auditbeat support

* add license header

* cleanup

* move files around

* Add heartbeat support for V2 (#33157)

* add v2 config

* fix name

* fix doc

* fix go.mod

* fix unchecked stream_id

* fix unchecked stream_id (#33335)

* Update elastic-agent-libs for output panic fix (#33336)

* Fix errors for non-synth capable instances (#33310)

Fixes #32694 by making sure we use the lightweight wrapper code always when monitors cannot be initialized.

This also fixes an unrelated bug, where errors attached to non-summary events would not be indexed.

* [Automation] Update elastic stack version to 8.6.0-5a8d757d for testing (#33323)

Co-authored-by: apmmachine <[email protected]>

* add pid awareness to file locking (#33169)

* add pid awareness to file locking

* cleanup, logic for handling restarts with the same PID

* add zombie-state awareness

* fix file naming

* add retry for unlock

* was confused by unlock code, fix, cleanup

* update notice

* fix race with file creation, update deps

* clean up tests, spelling

* hack for cgo

* add lic headers

* notice

* try to fix windows issues

* fix typos

* small fixes

* use exclusive locks

* remove feature to start with a specially named pidfile

* clean up some error handling, fix test cleanup

* forgot changelog

* Fix sample config in log rotation docs (#33306)

* Add banner to deprecate functionbeat (#33297)

* fix unchecked stream_id

* packetbeat/protos/dns: clean up package (#33286)

* avoid magic numbers
* fix hashableDNSTuple size and offsets
* avoid use of String and Error methods in formatted print calls
* remove redundant conversions
* quieten linter
* use plugin-owned logp.Logger

* update elastic-agent-libs

* Revert "fix unchecked stream_id"

This reverts commit 26ef6da.

* [Automation] Update elastic stack version to 8.6.0-40086bc7 for testing (#33339)

Co-authored-by: apmmachine <[email protected]>

Co-authored-by: Andrew Cholakian <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: Jaime Soriano Pastor <[email protected]>
Co-authored-by: DeDe Morton <[email protected]>
Co-authored-by: Dan Kortschak <[email protected]>

* update elastic-agent-client (#33552)

Co-authored-by: Aleksandr Maus <[email protected]>
Co-authored-by: Andrew Cholakian <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: apmmachine <[email protected]>
Co-authored-by: Jaime Soriano Pastor <[email protected]>
Co-authored-by: DeDe Morton <[email protected]>
Co-authored-by: Dan Kortschak <[email protected]>
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* add pid awareness to file locking

chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* Update Metricbeat, Filebeat, libbeat with elastic-agent V2 support (#32673)

Labels
backport-v8.5.0 Automated backport with mergify bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Development

Successfully merging this pull request may close these issues.

beats locker should be pid aware
9 participants