-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add pid awareness to file locking #33169
add pid awareness to file locking #33169
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were you able to reproduce the problem? Did we test this manually using the entire beat process in addition to the unit tests?
// emulate two beats trying to run from the same data path | ||
locker := New(beatName) | ||
// use pid 1 as another beat | ||
_, err := locker.createPidfile(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can this be an arbitrary PID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right PID 1 on Linux is the init process that is guaranteed to exist.
Does this still work on Windows, or in general as a cross platform test?
I would also recommend making the 1
a constant for those that either don't know or are slow to remember the significance of PID 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah, that was the simplest way I could think of to get a reproducible pid. Might have to change it for windows...
libbeat/cmd/instance/locks/lock.go
Outdated
// we have internal metrics libraries we can use for this, but all those will use APIs | ||
// dedicated to gathering extended process info and metrics, which can come with extra permissions hurdles, | ||
// making those methods more likely to return an error. | ||
exists, err := process.PidExistsWithContext(context.Background(), int32(pf.Pid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle zombie processes? It seems the agent is able to leave some of them around elastic/elastic-agent#1215
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, didn't even think of testing that.
libbeat/cmd/instance/locks/lock.go
Outdated
|
||
// try to remove the lockfile | ||
// May or may not work, depending on os-specific details with lockfiles | ||
err = os.Remove(lock.fileLock.Path()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have at least one retry if the os.Remove()
fails? The reason is that on Windows if a file was locked, and the process dies the OS will release the lock. But it isn't immediate and is subject to resources, so often waiting 1 sec and trying again it will work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahhh, interesting, alright.
Yup, tested with metricbeat under the usual cases (remaining lockfile for dead process, two beats trying to start at once) |
Okay, so, only issue that isn't dealt with is how to deal with zombie beats processes. It's definitely a problem, and I'm currently trying to find the most elegant way to handle this cross-platform. |
This PR now depends on: elastic/elastic-agent-system-metrics#53 |
Alright. This now can handle zombie states for processes. Also managed to test most of the major use cases manually (sudden stop & start without removing the lockfile, starting two beats at once, and re-starting while a zombie process remains), which was a bit of a pain. We should be good for the zombie issue now, at least. |
Alright, so the typecheck errors in CI are not related, but kinda baffling. I assume the issue (based on what I've hit before) is the fact that the github CI thing has cgo disabled, and the build constraints in elastic-agent-system-metrics require darwin&cgo:
We'll probably need a separate fix for this, as I assume cgo is disabled for a reason. |
So, made a few changes based on @andrewkroh 's feedback. Two things to address still:
|
Ah, E2E tests are again being weird:
|
Yeah this makes sense. We should try testing this shortly after getting this in. I think this change is still valuable since we do not know that the queue flush is the only reason the beat is failing to clean this up.
IMO we could solve this as a follow up after getting the PR in. |
Agreed, this has been in a PR long enough. Gonna give the E2E tests one last chance to work, then merge. |
As far as I can tell the Beats don't actually wait for the queue to empty by default (and this is what the documentation says as well). At least the ones I looked at do not obviously wait for the queue to drain before shutting down, since beats/filebeat/beater/filebeat.go Lines 409 to 412 in 27807aa
beats/winlogbeat/beater/winlogbeat.go Lines 170 to 176 in 27807aa
|
I think at this point in the 8.5 release cycle with only a few days before the release we should target 8.5.1 instead of 8.5.0 with this. This change needs some soak time before release considering the complexity and the chance to create some new unintended failure mode where the beats refuse to start. This means we should defer merging the backport to the 8.5 branch until after 8.5.0 is released to be safe. |
Agreed, this is kind of a scary change, would prefer not to backport anything last-minute. |
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
* Fix errors for non-synth capable instances (#33310) Fixes #32694 by making sure we use the lightweight wrapper code always when monitors cannot be initialized. This also fixes an unrelated bug, where errors attached to non-summary events would not be indexed. * [Automation] Update elastic stack version to 8.6.0-5a8d757d for testing (#33323) Co-authored-by: apmmachine <[email protected]> * add pid awareness to file locking (#33169) * add pid awareness to file locking * cleanup, logic for handling restarts with the same PID * add zombie-state awareness * fix file naming * add retry for unlock * was confused by unlock code, fix, cleanup * update notice * fix race with file creation, update deps * clean up tests, spelling * hack for cgo * add lic headers * notice * try to fix windows issues * fix typos * small fixes * use exclusive locks * remove feature to start with a specially named pidfile * clean up some error handling, fix test cleanup * forgot changelog * Fix sample config in log rotation docs (#33306) * Add banner to deprecate functionbeat (#33297) * fix unchecked stream_id * packetbeat/protos/dns: clean up package (#33286) * avoid magic numbers * fix hashableDNSTuple size and offsets * avoid use of String and Error methods in formatted print calls * remove redundant conversions * quieten linter * use plugin-owned logp.Logger * update elastic-agent-libs * Revert "fix unchecked stream_id" This reverts commit 26ef6da. * [Automation] Update elastic stack version to 8.6.0-40086bc7 for testing (#33339) Co-authored-by: apmmachine <[email protected]> Co-authored-by: Andrew Cholakian <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: Jaime Soriano Pastor <[email protected]> Co-authored-by: DeDe Morton <[email protected]> Co-authored-by: Dan Kortschak <[email protected]>
Can we confirm if this will only make to 8.6.0? From what I see it will not be backported to 8.5 |
@lucabelluccini i'm pretty sure this goes into 8.5.1. @jlind23 keep me honest |
Yes, I had planned to put this in 8.5.1. We are waiting for the 8.5.0 release to occur before back porting, as we are still generating 8.5.0 build candidates due to blockers. |
@fearful-symmetry I've added the label to create the 8.5.0 backport, we are good to backport for inclusion in 8.5.1 now. |
* add pid awareness to file locking * cleanup, logic for handling restarts with the same PID * add zombie-state awareness * fix file naming * add retry for unlock * was confused by unlock code, fix, cleanup * update notice * fix race with file creation, update deps * clean up tests, spelling * hack for cgo * add lic headers * notice * try to fix windows issues * fix typos * small fixes * use exclusive locks * remove feature to start with a specially named pidfile * clean up some error handling, fix test cleanup * forgot changelog (cherry picked from commit 692172c)
* add pid awareness to file locking (#33169) * add pid awareness to file locking * cleanup, logic for handling restarts with the same PID * add zombie-state awareness * fix file naming * add retry for unlock * was confused by unlock code, fix, cleanup * update notice * fix race with file creation, update deps * clean up tests, spelling * hack for cgo * add lic headers * notice * try to fix windows issues * fix typos * small fixes * use exclusive locks * remove feature to start with a specially named pidfile * clean up some error handling, fix test cleanup * forgot changelog (cherry picked from commit 692172c) * fighting with CI Co-authored-by: Alex K <[email protected]> Co-authored-by: Alex Kristiansen <[email protected]>
* Update Metricbeat, Filebeat, libbeat with elastic-agent V2 support (#32673) * basic framework * continued tinkering * move away from ast code, use a struct * get metricbeat working, starting on filebeat * add notice update * add basic config register * move over processors to individual beats * remove comments * start to integrate V2 client changes * finishing touches * lint * cleanup merge * remove V1 controller * stil tinkering with linter * still fixing linter * plz linter * fmt x-pack files * notice update * fix output test * refactor stop functions, refactor tests, some misc cleanup * fix client version string * add devguide * linter * expand filebeat test * cleanup test * fix docs, add tests, debuggin * add signal handler * fix mutex issue in register * Fix osquerybeat configuration for V2 * clean up component registration * spelling * remove workaround for filebeat types * try to fix filebeat tests * add nil checks, fix test, fix unit stop * continue tinkering with nil type checks * add test for missing config datastreams, clean up nil handling * change nil protections, use getter methods * fix config access in output code Co-authored-by: Aleksandr Maus <[email protected]> * V2 packetbeat support (#33041) * first attempt at auditbeat support * add license header * initial packetbeat support * fix bad branch * cleanup * typo in comment * clean up, move around files * add new processors to streams * First pass at auditbeat support (#33026) * first attempt at auditbeat support * add license header * cleanup * move files around * Add heartbeat support for V2 (#33157) * add v2 config * fix name * fix doc * fix go.mod * fix unchecked stream_id * fix unchecked stream_id (#33335) * Update elastic-agent-libs for output panic fix (#33336) * Fix errors for non-synth capable instances (#33310) Fixes #32694 by making sure we use the lightweight wrapper code always when monitors cannot be initialized. This also fixes an unrelated bug, where errors attached to non-summary events would not be indexed. * [Automation] Update elastic stack version to 8.6.0-5a8d757d for testing (#33323) Co-authored-by: apmmachine <[email protected]> * add pid awareness to file locking (#33169) * add pid awareness to file locking * cleanup, logic for handling restarts with the same PID * add zombie-state awareness * fix file naming * add retry for unlock * was confused by unlock code, fix, cleanup * update notice * fix race with file creation, update deps * clean up tests, spelling * hack for cgo * add lic headers * notice * try to fix windows issues * fix typos * small fixes * use exclusive locks * remove feature to start with a specially named pidfile * clean up some error handling, fix test cleanup * forgot changelog * Fix sample config in log rotation docs (#33306) * Add banner to deprecate functionbeat (#33297) * fix unchecked stream_id * packetbeat/protos/dns: clean up package (#33286) * avoid magic numbers * fix hashableDNSTuple size and offsets * avoid use of String and Error methods in formatted print calls * remove redundant conversions * quieten linter * use plugin-owned logp.Logger * update elastic-agent-libs * Revert "fix unchecked stream_id" This reverts commit 26ef6da. * [Automation] Update elastic stack version to 8.6.0-40086bc7 for testing (#33339) Co-authored-by: apmmachine <[email protected]> Co-authored-by: Andrew Cholakian <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: Jaime Soriano Pastor <[email protected]> Co-authored-by: DeDe Morton <[email protected]> Co-authored-by: Dan Kortschak <[email protected]> * update elastic-agent-client (#33552) Co-authored-by: Aleksandr Maus <[email protected]> Co-authored-by: Andrew Cholakian <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: Jaime Soriano Pastor <[email protected]> Co-authored-by: DeDe Morton <[email protected]> Co-authored-by: Dan Kortschak <[email protected]>
* add pid awareness to file locking * cleanup, logic for handling restarts with the same PID * add zombie-state awareness * fix file naming * add retry for unlock * was confused by unlock code, fix, cleanup * update notice * fix race with file creation, update deps * clean up tests, spelling * hack for cgo * add lic headers * notice * try to fix windows issues * fix typos * small fixes * use exclusive locks * remove feature to start with a specially named pidfile * clean up some error handling, fix test cleanup * forgot changelog
* Update Metricbeat, Filebeat, libbeat with elastic-agent V2 support (#32673) * basic framework * continued tinkering * move away from ast code, use a struct * get metricbeat working, starting on filebeat * add notice update * add basic config register * move over processors to individual beats * remove comments * start to integrate V2 client changes * finishing touches * lint * cleanup merge * remove V1 controller * stil tinkering with linter * still fixing linter * plz linter * fmt x-pack files * notice update * fix output test * refactor stop functions, refactor tests, some misc cleanup * fix client version string * add devguide * linter * expand filebeat test * cleanup test * fix docs, add tests, debuggin * add signal handler * fix mutex issue in register * Fix osquerybeat configuration for V2 * clean up component registration * spelling * remove workaround for filebeat types * try to fix filebeat tests * add nil checks, fix test, fix unit stop * continue tinkering with nil type checks * add test for missing config datastreams, clean up nil handling * change nil protections, use getter methods * fix config access in output code Co-authored-by: Aleksandr Maus <[email protected]> * V2 packetbeat support (#33041) * first attempt at auditbeat support * add license header * initial packetbeat support * fix bad branch * cleanup * typo in comment * clean up, move around files * add new processors to streams * First pass at auditbeat support (#33026) * first attempt at auditbeat support * add license header * cleanup * move files around * Add heartbeat support for V2 (#33157) * add v2 config * fix name * fix doc * fix go.mod * fix unchecked stream_id * fix unchecked stream_id (#33335) * Update elastic-agent-libs for output panic fix (#33336) * Fix errors for non-synth capable instances (#33310) Fixes #32694 by making sure we use the lightweight wrapper code always when monitors cannot be initialized. This also fixes an unrelated bug, where errors attached to non-summary events would not be indexed. * [Automation] Update elastic stack version to 8.6.0-5a8d757d for testing (#33323) Co-authored-by: apmmachine <[email protected]> * add pid awareness to file locking (#33169) * add pid awareness to file locking * cleanup, logic for handling restarts with the same PID * add zombie-state awareness * fix file naming * add retry for unlock * was confused by unlock code, fix, cleanup * update notice * fix race with file creation, update deps * clean up tests, spelling * hack for cgo * add lic headers * notice * try to fix windows issues * fix typos * small fixes * use exclusive locks * remove feature to start with a specially named pidfile * clean up some error handling, fix test cleanup * forgot changelog * Fix sample config in log rotation docs (#33306) * Add banner to deprecate functionbeat (#33297) * fix unchecked stream_id * packetbeat/protos/dns: clean up package (#33286) * avoid magic numbers * fix hashableDNSTuple size and offsets * avoid use of String and Error methods in formatted print calls * remove redundant conversions * quieten linter * use plugin-owned logp.Logger * update elastic-agent-libs * Revert "fix unchecked stream_id" This reverts commit 26ef6da. * [Automation] Update elastic stack version to 8.6.0-40086bc7 for testing (#33339) Co-authored-by: apmmachine <[email protected]> Co-authored-by: Andrew Cholakian <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: Jaime Soriano Pastor <[email protected]> Co-authored-by: DeDe Morton <[email protected]> Co-authored-by: Dan Kortschak <[email protected]> * update elastic-agent-client (#33552) Co-authored-by: Aleksandr Maus <[email protected]> Co-authored-by: Andrew Cholakian <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: apmmachine <[email protected]> Co-authored-by: Jaime Soriano Pastor <[email protected]> Co-authored-by: DeDe Morton <[email protected]> Co-authored-by: Dan Kortschak <[email protected]>
What does this PR do?
This PR seeks to add pid-awareness to the lockfile system that's initialized when a beat starts up. When beats attempts to create the lockfile, it checks for a pre-existing file, and if it exists, it checks to see if the process listed in the file exists or not. If the process does not exist, (in the case of a bad shutdown where the lockfile never got removed) it will attempt to recover, and create a new lockfile.
Why is it important?
In automated agent systems, the beat can sometimes take to long to shutdown, causing agent to hard-kill the process. In this instance, the beat won't restart, as the lockfile is still hanging around. This fixes that.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.