-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libbeat] Fix / clarify the module reload logic #16440
Conversation
Fix looks good to me. Maybe we can introduce an unit, integration, or system test that makes the old behavior fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a changelog entry?
libbeat/cfgfile/reload.go
Outdated
// Make sure the next run also updates because some runners were not properly loaded | ||
overwriteUpdate = true | ||
if err = list.Reload(configs); err != nil { | ||
logp.Err("Error reloading module configurations: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check if we should really log this on the error level. Could this reload return an error because a harvester was not finished yet? I just want to make sure we don't spam the error logs with this in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the test runs, the results are real errors, but they're also already logged by Reload
itself, so this does end up being noisy and redundant. Removed / commented.
@@ -185,7 +185,11 @@ func (rl *Reloader) Run(runnerFactory RunnerFactory) { | |||
rl.config.Reload.Period = 0 | |||
} | |||
|
|||
overwriteUpdate := true | |||
// If forceReload is set, the configuration should be reloaded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the renaming of the variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and ok to merge for me. I really like the tests you added but I'm a bit concerned that they might become flaky. We had many such issues with timing related tests in the past.
continue | ||
} | ||
configReloads.Add(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if the config reload counter should only increase when it was successful? And have an additional error counter for debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitant, because an error counter here wouldn't be at the right granularity -- we get an error here even if almost everything reloads successfully and there's one invalid file in the search path, and it would be misleading to report that as though the reload didn't happen.
The related monitor vars libbeat.config.module.{starts, stops, running}
are defined in this file also but they aren't used at all. I was planning to enable them in a followup (it requires changing the RunnerList
API and modifying other call sites or I'd have done it in this PR), but it might make sense to add errors
at the same time to track reload failures at module granularity instead of the whole configuration directory.
} | ||
// time interval is slightly more than a second so we don't slightly | ||
// undershoot the first iteration and wait a whole extra second. | ||
time.Sleep(1100 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about tests with Sleep as they tend to get flaky earlier or later. Is there an other way to achieve this "wait"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know a better way, though I'm open to suggestions :-) A spin lock would presumably be even worse. The difficulty is that the behavior being tested is itself timer-based. I did structure the logic here so that the test is only affected by the state changes and not other timing issues -- the only "race condition" here is that the unit test might wait several seconds without the reloader's 1-second-interval timer going off, in which case it has no way to tell whether anything's happening. But, the retry timeout was shorter than I meant it to be, so I just increased it to 10 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm late on this one. Lets see if it shows up in flaky tests in the future ;-)
What we did as workarounds in the past is:
- Check for log messages
- Have an other data point we can check very frequently for a change.
In general I think it's problematic if a single unit tests increases the test suite by seconds and only spends the time waiting. To improve the above, perhaps there is a way to decrease the scan frequency and do checks every 10ms but retry 1000 times.
I finally isolated the problem with the flaky reloader tests in the CI ( Some details for posterity: the tests were waiting for lines like "Starting reload procedure, current runners: 1" to detect that one runner was started, but this is the log line at the beginning of the reload procedure, and so this line will generally not appear unless the configuration is changed after a runner has been started (we will instead see "current runners: 0" followed by "Start list: 1"). When we incorrectly triggered changes every iteration this was ok, and just resulted in a short delay before the target log entry appeared because of a redundant scan, but fixing the reloader prevented this. On the other hand, the test consistently passed locally, because (as mentioned elsewhere in this PR) our glob watcher rounds down to the next earlier second, so with short timer intervals some changes can be detected twice, and so local test runs were still getting the redundant log line they needed to pass -- however, the slower CI runs often didn't hit the next scan within exactly a second, and in that case would not trigger a redundant change. I've changed the tests to instead trigger on the less ambiguous log lines that specify how many runners were started / stopped, which is independent of the exact timer intervals and modification times. |
(cherry picked from commit 2743d51)
(cherry picked from commit 2743d51)
@faec do you have more info about the cases where this happens? I know that this is an issue at least when some processors that keep some connections are used at the module or input level. I have some WIP PRs (#16349) to look for a solution for this. |
Sort of -- that suspicion was based on a user issue that was seeing apparent memory leaks where the number of changed / loaded configurations steadily grew, which is what led to finding this bug. However, we haven't been able to reproduce the problem, because the loaded configurations (after However if there's an additional bug somewhere that makes the hash for a particular loaded configuration unstable, then this bug combined with that one could produce the behavior mentioned. But we don't have any direct evidence that such a bug exists (we're still waiting on more configuration details in the original discussion). |
What does this PR do?
The loop logic for the module reloader had a bug that would reload module configurations even if they hadn't changed. We suspect that in some cases this could cause unbounded memory consumption (we are still investigating the exact scope of the problem). This PR fixes the error and restructures the surrounding code to clarify the intended logic.
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesHow to test this PR locally
With debug output on, any beat with module reloading enabled should output
Number of module configs found: ...
every time it reloads the configuration. Running a beat without this PR should show that this happens at every reload interval, whereas with it it should only reload when a file has changed.