Memory queue: cancel in-progress writes on queue closed, not producer closed #38094

faec · 2024-02-21T21:21:42Z

Proposed commit message

Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.

Once a producer sends an insert request to the memory queue, it must wait for the response unless the queue itself is closed; otherwise it can report a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.
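A minimal sketch of the waiting pattern described above, not the actual memqueue code. The names `pushRequest`, `producer`, `publish`, `producerDone`, `queueDone` and `errClosed` are hypothetical and chosen only for illustration. The point it shows: before the request is handed over, producer cancellation may abort the publish; after the handover, only the queue's done signal may.

```go
package main

import (
	"errors"
	"fmt"
)

// pushRequest is a hypothetical stand-in for the queue's insert request.
type pushRequest struct {
	event string
	resp  chan bool // the queue replies here once it has handled the request
}

// producer carries both done signals (hypothetical field names).
type producer struct {
	requests     chan pushRequest
	producerDone chan struct{} // closed when this producer is cancelled
	queueDone    chan struct{} // closed when the whole queue shuts down
}

var errClosed = errors.New("closed before the event was accepted")

func (p *producer) publish(event string) error {
	req := pushRequest{event: event, resp: make(chan bool, 1)}

	// Phase 1: the request has not been handed over yet, so giving up on
	// producer cancellation is safe: the queue never saw the event.
	select {
	case p.requests <- req:
	case <-p.producerDone:
		return errClosed
	case <-p.queueDone:
		return errClosed
	}

	// Phase 2: the queue now owns the request. Only the queue shutting down
	// may interrupt the wait; producerDone is deliberately not a case here.
	select {
	case ok := <-req.resp:
		if !ok {
			return errClosed
		}
		return nil
	case <-p.queueDone:
		return errClosed
	}
}

func main() {
	p := &producer{
		requests:     make(chan pushRequest, 1),
		producerDone: make(chan struct{}),
		queueDone:    make(chan struct{}),
	}
	result := make(chan error, 1)
	go func() { result <- p.publish("event-1") }()

	req := <-p.requests   // the queue picks up the request
	close(p.producerDone) // the producer is cancelled while the request is in flight
	req.resp <- true      // the queue still accepts the event

	fmt.Println("publish error:", <-result)
}
```

In this sketch, cancelling the producer after the handover does not turn an accepted event into a reported failure, which is the false failure the description refers to.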

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding changes to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

This is hard to test reliably, because the original issue was a race condition that was already hard to reproduce. To maximize the chances of seeing the original problem:

Use an input that creates and closes many pipeline clients, like a Filestream input with many small files being created and removed, or Kubernetes autodiscover with many ephemeral inputs. The closing of the client (or harvester) is what triggers the race condition.
Set queue.mem.events: 32 (the minimum) or some similarly small value. The queue needs to be full frequently while still making progress. This way there is a higher chance that an event will be waiting in the queue's input channel buffer when its producer is cancelled, which is what leads to an incorrect event count.

Related issues

Fixes Race in memqueue leads to panic: sync: negative WaitGroup counter #37702

mergify · 2024-02-21T21:22:16Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

elasticmachine · 2024-02-21T21:29:21Z

💚 Build Succeeded

Build stats

Start Time: 2024-02-29T21:44:14.283+0000
Duration: 132 min 50 sec

Test stats 🧪

Test	Results
Failed	0
Passed	29121
Skipped	2046
Total	31167

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

faec · 2024-02-21T21:29:56Z

libbeat/publisher/queue/memqueue/queue_test.go

@@ -138,8 +138,8 @@ func TestProducerDoesNotBlockWhenCancelled(t *testing.T) {
 		time.Millisecond,
 		"the first two events were not successfully published")

-	// Cancel the producer, this should unblock its Publish method
-	p.Cancel()
+	// Close the queue, this should unblock the pending Publish call


This test was checking the wrong thing, for the same reason that the select cases were wrong: the producer shouldn't unblock when closed if its queue publish request has already been sent, because the queue can't tell that the producer has given up, so both the producer and the queue would "own" the cleanup for the event. The correct behavior is to unblock when the queue itself is closed.
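To illustrate the double-ownership problem, here is a small standalone sketch. It assumes, purely for illustration, that in-flight events are tracked with a `sync.WaitGroup` and that both the producer and the queue release the same event; the function name `releaseTwice` is hypothetical and the real accounting in the pipeline may differ. It shows how a second release of one event produces the "negative WaitGroup counter" panic named in #37702.

```go
package main

import (
	"fmt"
	"sync"
)

// releaseTwice registers one in-flight event, then lets two "owners" each
// release it. It reports whether the second release panicked.
func releaseTwice() (panicked bool) {
	var inFlight sync.WaitGroup
	inFlight.Add(1) // one event handed to the queue

	defer func() {
		// The runtime panics with "sync: negative WaitGroup counter";
		// recover here only so the sketch can report it.
		if r := recover(); r != nil {
			panicked = true
		}
	}()

	inFlight.Done() // the producer gives up and cleans up the event
	inFlight.Done() // the queue later finishes the same event
	return false
}

func main() {
	fmt.Println("second release panicked:", releaseTwice())
}
```

With a single owner per event (the queue, once the request has been sent), the second `Done` call never happens.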

This would be a good comment to add to the code

elasticmachine · 2024-02-21T21:30:20Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

rdner

Great job tracking this down!

mergify · 2024-02-23T19:42:50Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b wg-panic-fix upstream/wg-panic-fix
git merge upstream/main
git push upstream wg-panic-fix

cmacknz · 2024-02-27T08:21:23Z

> Use an input that creates and closes many pipeline clients, like a Filestream input with many small files being created and removed, or Kubernetes autodiscover with many ephemeral inputs. The closing of the client (or harvester) is what triggers the race condition.
>
> Set queue.mem.events: 32 (the minimum) or some similarly small value. The queue needs to be full frequently while still making progress. This way there is a higher chance that an event will be waiting in the queue's input channel buffer when its producer is cancelled, which is what leads to an incorrect event count.

Can either of these be isolated into a test case? The first case seems like something that could be caught with a stress test, even if it doesn't reproduce every time. If our CI system could have caught it eventually we would've found this much sooner.

faec · 2024-02-29T20:43:38Z

> Can either of these be isolated into a test case?

With considerable fiddling, yes: I've now added a unit test that will usually detect the error but (if I wrangled the logic right) never give false positives. (Fully deterministic checking isn't possible with the current queue API because the bug involved scheduler behavior on buffered channels.) In practice, the new test failed 100 out of 100 times on main and passed 100 out of 100 times with this PR, so it seems pretty consistent.

elasticmachine · 2024-02-29T21:58:05Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: d6d19ce

Failed CI Steps

:windows:-family/core-windows-2022 Unit Tests

History

💚 Build #3362 succeeded 27f7ebd
💚 Build #3287 succeeded 2506829
💛 Build #3188 was flaky 61ff430

cc @faec

elasticmachine · 2024-02-29T22:00:26Z

💚 Build Succeeded

Buildkite Build
Commit: d6d19ce

History

💚 Build #1726 succeeded 27f7ebd
💚 Build #1651 succeeded 2506829
💚 Build #1552 succeeded 61ff430

cc @faec

elasticmachine · 2024-02-29T22:18:02Z

💚 Build Succeeded

Buildkite Build
Commit: d6d19ce

History

💚 Build #1733 succeeded 27f7ebd
💚 Build #1659 succeeded 2506829
💚 Build #1560 succeeded 61ff430

cc @faec

elasticmachine · 2024-02-29T22:21:10Z

💚 Build Succeeded

Buildkite Build
Commit: d6d19ce

History

💚 Build #889 succeeded 27f7ebd
💚 Build #815 succeeded 2506829
💚 Build #716 succeeded 61ff430

cc @faec

elasticmachine · 2024-02-29T22:31:38Z

💚 Build Succeeded

Buildkite Build
Commit: d6d19ce

History

💚 Build #2022 succeeded 27f7ebd
💚 Build #1948 succeeded 2506829
💚 Build #1849 succeeded 61ff430

cc @faec

elasticmachine · 2024-02-29T22:34:39Z

💚 Build Succeeded

Buildkite Build
Commit: d6d19ce

History

💚 Build #2940 succeeded 27f7ebd
💚 Build #2866 succeeded 2506829
💔 Build #2766 failed 61ff430

cc @faec

rdner

Although I would have liked to see require.Eventually instead of time.Sleep, I understand that there is no indicator we could use for the test.

cmacknz · 2024-03-04T16:52:32Z

Thanks for the test, I also don't love the sleeps but I don't have any quick suggestions for removing them. Would rather have a test with sleeps than no test.

… closed (#38094) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3)

… closed (#38094) (#38178) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3) Co-authored-by: Fae Charlton <[email protected]>

…eue closed, not producer closed (#38177) * Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3) * fix backport --------- Co-authored-by: Fae Charlton <[email protected]>

…eue closed, not producer closed (#38279) * Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3) * fix backport * fix NewQueue call --------- Co-authored-by: Fae Charlton <[email protected]>

Writes in progress should cancel on queue closed, not producer closed

2f522e4

faec added bug Team:Elastic-Agent Label for the Agent team labels Feb 21, 2024

faec self-assigned this Feb 21, 2024

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Feb 21, 2024

update changelog

8c1010c

update test name / comments

61ff430

faec commented Feb 21, 2024

View reviewed changes

faec marked this pull request as ready for review February 21, 2024 21:30

faec requested a review from a team as a code owner February 21, 2024 21:30

faec requested review from ycombinator and rdner February 21, 2024 21:30

cmacknz added backport-v8.12.0 Automated backport with mergify backport-v8.13.0 Automated backport with mergify labels Feb 21, 2024

cmacknz requested a review from belimawr February 21, 2024 21:46

rdner approved these changes Feb 22, 2024

View reviewed changes

Merge branch 'main' into wg-panic-fix

2506829

faec added 3 commits February 29, 2024 15:29

candidate test for wait group panic

461f8e1

fix wg panic test

166b6bd

Merge branch 'wg-panic-fix' of github.com:faec/beats into wg-panic-fix

27f7ebd

add more explanatory comments

d6d19ce

rdner approved these changes Mar 4, 2024

View reviewed changes

rdner requested a review from cmacknz March 4, 2024 16:21

cmacknz approved these changes Mar 4, 2024

View reviewed changes

faec merged commit d23b4d3 into elastic:main Mar 4, 2024
113 checks passed

faec deleted the wg-panic-fix branch March 4, 2024 17:51

mergify bot mentioned this pull request Mar 4, 2024

[8.12](backport #38094) Memory queue: cancel in-progress writes on queue closed, not producer closed #38177

Merged

mergify bot mentioned this pull request Mar 4, 2024

[8.13](backport #38094) Memory queue: cancel in-progress writes on queue closed, not producer closed #38178

Merged

faec added the backport-v8.11.0 Automated backport with mergify label Mar 12, 2024

mergify bot mentioned this pull request Mar 12, 2024

[8.11](backport #38094) Memory queue: cancel in-progress writes on queue closed, not producer closed #38279

Merged
