
Fix issues with output reloading #17381

Merged
ycombinator merged 34 commits into elastic:master from the bugfix-output-leaks branch on Apr 21, 2020

Conversation

@ycombinator (Contributor) commented Apr 1, 2020

What does this PR do?

Beats Central Management (today) and Agent (in the future) will need to reload outputs on the fly, whenever users change the Beat's configuration while the Beat is running. The current output reloading implementation has three bugs:

  1. When outputs are reloaded several times in quick succession, not all events are eventually published,

  2. Whenever an output is reloaded, a goroutine is leaked from the publisher pipeline's output handling code, and

  3. If the output being reloaded is the Elasticsearch output, old TCP connections to Elasticsearch are left hanging around.

This PR addresses all three issues.

Why is it important?

  • To ensure that all events will eventually be published, even if outputs are rapidly reloaded multiple times.
  • To prevent leakage of memory (goroutines consume memory).
  • To prevent exhaustion of file descriptors over time.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Start up Elasticsearch and Kibana and set up Beats Central Management.
    1. Enroll Filebeat. This will back up your original filebeat.yml file and create a new one suitable for Beats Central Management.
    2. Create a configuration tag in the Beats Central Management UI in Kibana with input paths containing /tmp/logs/*.log and the Elasticsearch output pointing to localhost:9200 (or wherever your local ES is listening). Make sure to assign this tag to your enrolled Filebeat instance in the UI.
  2. Edit the new filebeat.yml file and set management.period: 2s to speed up output reloading.
  3. Start up Filebeat with the -httpprof localhost:5050 flag. This will let us see if Filebeat is leaking goroutines.
  4. In a new window, check TCP connections used by your Filebeat process. This will let us see if Filebeat is leaking file descriptors.
    watch -n 2 lsof -a -cfilebeat -i4 -i6 -itcp
    
  5. In another new window, make sure some data is constantly being ingested by Filebeat from the input path.
    rm -rf /tmp/logs
    mkdir -p /tmp/logs
    while true; do echo "foo bar $RANDOM" >> /tmp/logs/test.log; sleep 1; done
    
  6. In the Central Management UI, change the configuration for your tag slightly (maybe add an input path then later remove it and so on) to trigger output reloading in Filebeat.
  7. Check the Filebeat log for messages like New configurations retrieved to confirm that your changes in Kibana are being picked up by Filebeat.
  8. Check the output of the lsof window and make sure the list of file descriptors isn't growing over time.
  9. Visit (or refresh) http://localhost:5050/debug/pprof/goroutine?debug=1 in a browser window and make sure the number of goroutines isn't growing over time (a small polling helper is sketched after this list).
  10. Wait a few seconds, then repeat steps 6 onwards a few times.
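For step 9, instead of eyeballing the browser output, a small Go helper (not part of this PR) can poll the endpoint exposed by the -httpprof flag and print the goroutine total every couple of seconds:

    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        for {
            // The debug=1 text format begins with a line like:
            // "goroutine profile: total 42"
            resp, err := http.Get("http://localhost:5050/debug/pprof/goroutine?debug=1")
            if err != nil {
                fmt.Println("fetch error:", err)
            } else {
                line, _ := bufio.NewReader(resp.Body).ReadString('\n')
                resp.Body.Close()
                fmt.Print(time.Now().Format("15:04:05"), " ", line)
            }
            time.Sleep(2 * time.Second)
        }
    }

If the total keeps climbing across reloads, something is leaking.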

Related issues

@@ -407,6 +407,9 @@ func (conn *Connection) execHTTPRequest(req *http.Request) (int, []byte, error)
 		req.Host = host
 	}
 
+	// TODO: workaround for output reloading leaking FDs until context.WithCancel is used on transport dialer instead
+	req.Header.Set("Connection", "close")
@ycombinator (Contributor, Author) commented Apr 1, 2020

Before I made this workaround/hack, I tried two different approaches to fix the FD leak the "right way":

  1. Just as there is an http.NewRequest function (which we currently use in the eslegclient code), there is also an http.NewRequestWithContext function. I replaced our usage with the latter, created a context.WithCancel inside eslegclient.NewConnection, and used the returned cancel function to cancel the context from the eslegclient.(*Connection).Close() method. This is essentially the same fix that @urso tried in Try to close HTTP connections and wait for completion on output reload/close #10599. However, when I used lsof to monitor FDs being used by my Filebeat process, I noticed that this change had no effect! (A minimal sketch of this approach follows below.)

  2. So then, thinking about it some more, I realized the problem wasn't so much about canceling the HTTP requests as about closing the underlying TCP connections that were being established each time the output was reloaded. To fix this the "right way", I tried threading context.Context through the various Dialers (which would allow us to call DialContext instead of Dial in the right places). This turned out to be quite a complicated and large refactoring! So instead, I settled for this workaround. I think we do need to undertake that refactoring at some point, but I'm not sure it's worth blocking progress at the moment.
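For reference, a minimal sketch of approach 1; the Connection type, its fields, and the method names here are simplified stand-ins, not the real eslegclient API:

    package eslegclient

    import (
        "context"
        "net/http"
    )

    // Connection is a simplified stand-in; the real eslegclient.Connection
    // and its methods look different.
    type Connection struct {
        http   *http.Client
        ctx    context.Context
        cancel context.CancelFunc
    }

    func NewConnection() *Connection {
        ctx, cancel := context.WithCancel(context.Background())
        return &Connection{http: &http.Client{}, ctx: ctx, cancel: cancel}
    }

    func (c *Connection) execRequest(method, url string) (*http.Response, error) {
        // Tie every request to the connection-wide context so Close can
        // cancel anything in flight.
        req, err := http.NewRequestWithContext(c.ctx, method, url, nil)
        if err != nil {
            return nil, err
        }
        return c.http.Do(req)
    }

    func (c *Connection) Close() error {
        // Cancels in-flight requests, but, as observed above, does not by
        // itself tear down pooled keep-alive TCP connections.
        c.cancel()
        return nil
    }

As noted above, canceling requests this way did not release the pooled TCP connections, which is why the Connection: close workaround was used instead.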

@urso replied:

This would force the TCP connection to be dropped after each bulk request. Internally, the Go HTTP package uses a connection pool (actually the Transport type does) and keeps the HTTP session alive in order to send more requests. It would be nice if we could keep this.

This makes me wonder whether this is really a leak, or just an open connection that the Go runtime eventually closes after a few minutes. The Transport type has a few settings for controlling the connection pool, for example IdleConnTimeout (which we must configure). On shutdown one can also call CloseIdleConnections.
I think we should always configure a timeout (and expose it as a setting), plus call CloseIdleConnections on close (best effort) to release connections early.

We also have some helpers for working with contexts in this package: https://pkg.go.dev/github.com/elastic/go-concert/ctxtool. This allows us to hook up contexts with custom functions, plain go channels, or merge cancelation from multiple contexts.
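For reference, a minimal sketch of those two knobs with plain net/http; the 60-second value is a placeholder, not a Beats default:

    package main

    import (
        "net/http"
        "time"
    )

    // newPooledClient configures an idle timeout on the Transport's
    // connection pool so unused keep-alive connections are closed
    // automatically, bounding how long their FDs linger.
    func newPooledClient() *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                IdleConnTimeout: 60 * time.Second, // placeholder value
            },
        }
    }

    func main() {
        client := newPooledClient()
        if resp, err := client.Get("http://localhost:9200"); err == nil {
            resp.Body.Close()
        }
        // Best effort on shutdown: drop whatever is idle in the pool now.
        client.CloseIdleConnections()
    }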

@ycombinator (Contributor, Author) replied:

Thanks @urso.

This makes me wonder if this is really a leak, or just an open connection that eventually gets closed in a few minutes by the go runtime.

Yes, good point. I will run my tests for longer and see if the connection FDs eventually get released.

I think we should always configure a timeout (and expose it as a setting) + call CloseIdleConnections (best effort) approach to close connections early.

++ to this approach, will implement.

We also have some helpers for working with contexts in this package: https://pkg.go.dev/github.com/elastic/go-concert/ctxtool. This allows us to hook up contexts with custom functions, plain go channels, or merge cancelation from multiple contexts.

I will look into this as well, but probably as part of a follow-up PR.

A contributor replied:

++ to all of the above that @urso mentioned :)

@ycombinator (Contributor, Author) commented Apr 1, 2020

Implemented a new setting in eslegclient.ConnectionSettings named IdleConnTimeout. The name matches the setting on http.Transport. The default value is 1 minute.

Also implemented calling CloseIdleConnections() from eslegclient.(*Connection).Close().

Net result is that we don't see FDs for TCP connections to the ES node sticking around in the test.
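For illustration, a trimmed-down sketch of the shape of that change; the field and constructor signatures are approximations of the eslegclient code, not the exact implementation:

    package eslegclient

    import (
        "net/http"
        "time"
    )

    // ConnectionSettings here is a trimmed, illustrative version of the
    // real struct; only the new field is shown.
    type ConnectionSettings struct {
        IdleConnTimeout time.Duration // mirrors http.Transport.IdleConnTimeout
    }

    type Connection struct {
        http *http.Client
    }

    func NewConnection(s ConnectionSettings) *Connection {
        if s.IdleConnTimeout == 0 {
            s.IdleConnTimeout = time.Minute // default described above
        }
        return &Connection{
            http: &http.Client{
                Transport: &http.Transport{IdleConnTimeout: s.IdleConnTimeout},
            },
        }
    }

    func (c *Connection) Close() error {
        // Release idle pooled TCP connections immediately rather than
        // waiting for the idle timeout to fire.
        c.http.CloseIdleConnections()
        return nil
    }

With this shape, idle connections are bounded by the timeout during normal operation and released immediately on Close().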

@ycombinator ycombinator requested review from urso and ph April 1, 2020 02:49
@ycombinator ycombinator added the bug, libbeat, Team:Services (Deprecated; label for the former Integrations-Services team), v7.8.0, and v8.0.0 labels Apr 1, 2020
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-services (Team:Services)

@ycombinator ycombinator added the needs_backport (PR is waiting to be backported to other branches) and [zube]: In Review labels Apr 1, 2020
@ycombinator ycombinator force-pushed the bugfix-output-leaks branch from cef687e to 4852b7a Compare April 1, 2020 21:08
@ycombinator ycombinator changed the title Fix publisher goroutine and ES output FD leaks Fix publisher goroutine leaks Apr 2, 2020
@ycombinator ycombinator changed the title Fix publisher goroutine leaks Fix publisher goroutine leaks and timeout idle TCP connections used by ES client Apr 2, 2020
@ph ph added the Ingest Management:alpha1 (group issues for ingest management alpha1) label Apr 2, 2020
@urso commented Apr 2, 2020

Would be nice if we could introduce a unit test that replaces an output and checks that all batches are eventually handled.

@ycombinator (Contributor, Author) replied:

Would be nice if we could introduce a unit test that replaces an output and checks that all batches are eventually handled.

I'm actually going to implement unit tests on the original code first to ensure they keep passing with the current PR. I've done this in #17460. Please take a look when you get a chance.

Once #17460 is reviewed and merged, I'll rebase this PR on it.

@ycombinator (Contributor, Author) commented:

Unit tests (per the previous two comments) have now been added in #17460. I will now resume working on this PR, making sure the same tests continue to pass with the changes here, and also addressing review feedback.

@ycombinator ycombinator force-pushed the bugfix-output-leaks branch from b594a7b to f36d58d Compare April 9, 2020 17:25
@ycombinator (Contributor, Author) commented:

I've addressed @urso's review feedback and also added another unit test in this PR specifically to test output reloading. That test is currently failing sometimes, so I need to investigate why and fix it.

@ycombinator (Contributor, Author) commented Apr 10, 2020

... added another unit test in this PR specifically to test output reloading. That test is currently failing sometimes, so I need to investigate why and fix it.

While investigating this I found that this new unit test isn't passing on master either (i.e. with the original implementation). If the output is reloaded multiple times very quickly, not all events get published! To figure out what's going on on master and resolve it first, I've created a new PR just with this new unit test: #17657.

@ph (Contributor) commented Apr 16, 2020

@ycombinator Any idea how to fix it?

@ycombinator (Contributor, Author) commented Apr 16, 2020

Any idea how to fix it?

@ph Not 100%, but I made some progress in #17657. After going over it with @urso on Tuesday (I was off yesterday), my next steps here are to bring the unit test written in #17657 into this PR and then try to make it pass using the code fixes in this PR, maybe some that were experimented with in #17657, and maybe some others still that we haven't discovered yet.

The general problem seems to be that there's a race condition somewhere in the publisher code that's exposed when the output controller's Set method is called quickly and repeatedly (which could be the case with Agent). This race condition prevents all events from eventually being published, but only sometimes.

@@ -154,7 +154,7 @@ func (c *eventConsumer) loop(consumer queue.Consumer) {
 	}
 
 	paused = c.paused()
-	if !paused && c.out != nil && batch != nil {
+	if c.out != nil && batch != nil {
@ycombinator (Contributor, Author) commented Apr 16, 2020

This turned out to be a key change in getting the test to pass (i.e. ensuring that all batches are eventually published despite rapid reloading of outputs multiple times).

Consider the case where the consumer is paused (so paused == true here) but it still holds a batch of events that it has consumed from the queue but not yet sent to the workqueue. In this scenario, because of the !paused clause here, we would fall into the else clause, which would set out = nil. This would then cause the last select in this method to block (unless, by luck of timing, a signal to unpause the consumer just happened to come along).

Now, if we have a valid output group c.out and a batch, we don't pay attention to the paused state; this ensures that the batch will be sent to that output group's work queue via the final select's final case.
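To make the reasoning concrete, here is a toy, self-contained model of the fixed decision logic. All names (consumer, selectTarget, and so on) are illustrative; the real eventConsumer loop in libbeat is more involved:

    package main

    import "fmt"

    // Toy model of the eventConsumer decision discussed above; names are
    // illustrative rather than libbeat's real API.
    type consumer struct {
        paused bool
        out    chan string // current output work queue; nil when no output
    }

    // selectTarget mirrors the fixed logic: a batch already in hand is
    // forwarded whenever an output group exists, regardless of paused.
    // The buggy variant also required !c.paused, returning nil here and
    // making the send case of the final select block indefinitely.
    func (c *consumer) selectTarget(batch *string) chan string {
        if c.out != nil && batch != nil {
            return c.out
        }
        return nil // a nil channel never becomes ready in a select
    }

    func main() {
        wq := make(chan string, 1)
        c := &consumer{paused: true, out: wq}
        batch := "pending-batch"

        // Even though the consumer is paused, the in-flight batch still
        // reaches the work queue under the fixed logic.
        if target := c.selectTarget(&batch); target != nil {
            target <- batch
        }
        fmt.Println("delivered:", <-wq)
    }

Under the old guard, a paused consumer would get a nil target here, and the batch it already held would be stranded.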

A contributor replied:

Oh wow nice fix!

@@ -34,7 +34,8 @@ type outputController struct {
 	monitors Monitors
 	observer outputObserver
 
-	queue queue.Queue
+	queue     queue.Queue
+	workQueue workQueue
@ycombinator (Contributor, Author) commented Apr 16, 2020

This is another change I made, to simplify things. Now the output controller creates a work queue in its constructor, and the same queue is used by all output workers over time. This removes the need to drain the old work queue into a new one when outputs are reloaded and new output workers are created in the process.
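For illustration, a condensed sketch of the new ownership model; type and function names are approximations, not libbeat's real API:

    package main

    // Condensed model: the controller owns a single work queue for its
    // whole lifetime; successive generations of output workers all read
    // from it.
    type batch struct{ events []string }

    type workQueue chan *batch

    type outputController struct {
        workQueue workQueue
    }

    func newOutputController() *outputController {
        // Created once in the constructor. Reloading outputs no longer
        // requires draining an old queue into a new one; workers come
        // and go around this one channel.
        return &outputController{workQueue: make(workQueue)}
    }

    // startWorker launches one output worker generation; on reload the
    // old worker is stopped and a new one attaches to the same queue.
    func (c *outputController) startWorker(stop chan struct{}) {
        go func() {
            for {
                select {
                case b := <-c.workQueue:
                    _ = b // hand the batch to the active output here
                case <-stop:
                    return // pending batches remain in the shared queue
                }
            }
        }()
    }

    func main() {
        c := newOutputController()
        stop := make(chan struct{})
        c.startWorker(stop)
        c.workQueue <- &batch{events: []string{"event-1"}} // consumed by the worker
        close(stop)
    }

Because the queue outlives any worker generation, a reload just swaps the readers; batches still queued are picked up by the next generation.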

A contributor replied:

good change!

@ycombinator ycombinator force-pushed the bugfix-output-leaks branch from 7c8de5c to 041922b Compare April 20, 2020 03:09
beat: beat,
monitors: monitors,
observer: observer,
queue: b,
A reviewer commented:

That's funny. Why is the parameter named b? Since we've modified this piece of code, let's clean up the naming a little :)

@ycombinator (Contributor, Author) replied:

Fixed in 5ff7219.

@@ -136,12 +137,13 @@ func (c *outputController) Set(outGrp outputs.Group) {
 
 	// restart consumer (potentially blocked by retryer)
 	c.consumer.sigContinue()
+	c.consumer.sigUnWait()
A reviewer commented:

sigUnWait is used by the retryer, no? Why add it here?

@ycombinator (Contributor, Author) replied:

I found that this was needed to make the TestReload test pass. Without it, the retryer never un-waited the consumer by itself, so events stopped being consumed and sent to the output.

That said, I agree that this is something the retryer should automatically call internally rather than "leak" this responsibility to the controller. I'll look into why it's not doing that.

A reviewer replied:

Yeah, this sounds like a potential bug to me.

@ycombinator (Contributor, Author) replied:

Addressed in 6a6680c.


var published atomic.Uint
var batchCount int
A reviewer commented:

I used batchCount for debugging. We can remove it I think.

@ycombinator (Contributor, Author) replied:

Removed in 15a1383.


 	for name, ctor := range tests {
 		t.Run(name, func(t *testing.T) {
 			seedPRNG(t)
 
 			err := quick.Check(func(i uint) bool {
-				numBatches := 1000 + (i % 100) // between 1000 and 1099
+				numBatches := 10000 + (i % 100) // between 1000 and 1099
A reviewer commented:

Oops, I guess that was me. Want to reduce it to 1000 again to cut the running time?

@ycombinator (Contributor, Author) replied:

Reduced in 6e10331.


var publishedFirst atomic.Uint
blockCtrl := make(chan struct{})
var batchCount int
A reviewer commented:

might be a debugging variable that can be removed.

@ycombinator (Contributor, Author) replied:

Removed in 15a1383.

@ph (Contributor) previously approved these changes Apr 20, 2020 and left a comment:

Thanks @ycombinator for adding these tests. This was a good one to track down!


@ph ph dismissed their stale review April 20, 2020 14:42

I meant to comment, not approve :)

@ycombinator (Contributor, Author) commented:

I've addressed the minor "cleanup" points from the last round of review. The major point that remains to be addressed is #17381 (comment). I will do that tomorrow.

@ycombinator (Contributor, Author) commented:

@urso I've addressed all your latest feedback. Please re-review when you get a chance. Thanks!

@ycombinator (Contributor, Author) commented:

CI failures are unrelated. Libbeat builds in both Jenkins and Travis CI are green. Merging.

@ycombinator ycombinator merged commit 5cc5654 into elastic:master Apr 21, 2020
@ycombinator ycombinator deleted the bugfix-output-leaks branch April 21, 2020 15:07
ycombinator added a commit that referenced this pull request Apr 24, 2020
* Refactoring: extracting common fields into worker struct

* More refactoring

* Address goroutine leak in publisher

* Workaround: add Connection: close header to prevent FD leaks

* Adding CHANGELOG entry

* Adding IdleConnTimeout setting

* Close idle connections when ES client is closed

* When closing worker, make sure to cancel in-flight batches

* Cancel batch + guard

* [WIP] Adding output reload test

* More WIP

* Update test

* Try to get test passing for client first

* Make workqueue shared

* Making tests pass

* Clean up

* Moving SeedFlag var to correct place

* Clarifying comments

* Reducing the number of quick iterations

* Reducing quick iterations even more

* Trying just 1 iteration

* Setting out to nil after sending batch if paused

* Restoring old order of operations in Set()

* proposal

* Do not copy mutex

* Remove debugging statements

* Bumping up testing/quick max count on TestOutputReload

* Removing commented out helper function

* Simplifying retryer now that workqueue is used across output loads

* Renaming parameter

* Removing debugging variable

* Reducing lower bound of random # of batches

* Removing sigUnWait from controller

* Removing unused field

Co-authored-by: urso <[email protected]>
Labels
bug, Ingest Management:alpha1 (group issues for ingest management alpha1), libbeat, needs_backport (PR is waiting to be backported to other branches), Team:Services (Deprecated; label for the former Integrations-Services team), v7.8.0, v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Libbeat output reloading causes a goroutine leak and possibly an FD leak
5 participants