
Memory queue: cancel in-progress writes on queue closed, not producer closed #38094

Merged

faec merged 8 commits into elastic:main from wg-panic-fix on Mar 4, 2024

Conversation

@faec faec (Contributor) commented Feb 21, 2024

Proposed commit message

Fixes a race condition (#37702) that could lead to incorrect event totals and occasional panics.

Once a producer sends an insert request to the memory queue, it must wait on the response unless the queue itself is closed; otherwise it can report a false failure. The previous code mistakenly waited on the done signal of the current producer rather than that of the queue. This PR adds the queue's done signal to the producer struct and waits on it once the insert request has been sent.
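
To make the fixed ordering concrete, here is a minimal Go sketch of the wait logic (illustrative only; pushRequest, producer, requests, done, and queueDone are hypothetical stand-ins, not the actual memqueue identifiers):

package sketch

// Illustrative sketch of the fix, not the actual Beats memqueue code.

type pushRequest struct {
	event    interface{}
	response chan bool // the queue reports accept/reject here
}

type producer struct {
	requests  chan pushRequest // the queue's input channel
	done      chan struct{}    // closed when this producer is cancelled
	queueDone chan struct{}    // closed when the queue itself shuts down
}

func (p *producer) publish(event interface{}) bool {
	req := pushRequest{event: event, response: make(chan bool, 1)}

	// Before the request reaches the queue, cancelling the producer is
	// still safe: nothing else owns the event yet.
	select {
	case p.requests <- req:
		// The queue now owns the request; fall through and wait.
	case <-p.done:
		return false
	case <-p.queueDone:
		return false
	}

	// Once the request has been sent, only queue shutdown may unblock the
	// wait. Selecting on p.done here (the old behavior) is the race being
	// fixed: the queue may still process the event, so the producer and
	// the queue would each think they own its cleanup.
	select {
	case accepted := <-req.response:
		return accepted
	case <-p.queueDone:
		return false
	}
}

Only the control flow matters here; the real request and response types carry more information than a plain bool.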

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

This is hard to test reliably, because the original issue was a race condition that was already hard to reproduce. To maximize the chances of seeing the original problem:

  • Use an input that creates and closes many pipeline clients, like a Filestream input with many small files being created and removed, or Kubernetes autodiscover with many ephemeral inputs. Closing the client (harvester) is what triggers the race condition.
  • Set queue.mem.events: 32 (the minimum) or some similarly small value (an example configuration is sketched after this list). The queue needs to be full frequently while still making progress; this raises the chance that an event will be waiting in the queue's input channel buffer when its producer is cancelled, which is what leads to an incorrect event count.
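
For example, a minimal Filebeat configuration along these lines exercises that pattern (a sketch only: the input id, paths, and output host are placeholders, and any input that churns through short-lived clients will do):

# Illustrative reproduction config, not a recommended production setup.
filebeat.inputs:
  - type: filestream
    id: queue-race-repro            # hypothetical input id
    paths:
      - /tmp/queue-race/*.log       # create and delete many small files here

queue.mem:
  events: 32                        # minimum queue size, so the queue is frequently full

output.elasticsearch:
  hosts: ["localhost:9200"]         # placeholder output

Running Filebeat with this while a script continuously creates and deletes small log files under the watched path keeps harvesters starting and stopping against a nearly full queue, which is the condition the race needs.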

Related issues

  • #37702: Race in memqueue leads to panic: sync: negative WaitGroup counter

@faec faec added the bug and Team:Elastic-Agent (Label for the Agent team) labels Feb 21, 2024
@faec faec self-assigned this Feb 21, 2024
@botelastic botelastic bot added and then removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Feb 21, 2024
mergify bot (Contributor) commented Feb 21, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR, @faec? 🙏
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, for example:

  • backport-v8.x.0 is the label that automatically backports to the 8.x branch, where x is the minor version digit

@elasticmachine elasticmachine (Collaborator) commented Feb 21, 2024

💚 Build Succeeded

Build stats

  • Start Time: 2024-02-29T21:44:14.283+0000

  • Duration: 132 min 50 sec

Test stats 🧪

Test Results: Failed 0, Passed 29121, Skipped 2046, Total 31167

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@@ -138,8 +138,8 @@ func TestProducerDoesNotBlockWhenCancelled(t *testing.T) {
time.Millisecond,
"the first two events were not successfully published")

// Cancel the producer, this should unblock its Publish method
p.Cancel()
// Close the queue, this should unblock the pending Publish call
@faec faec (Contributor, Author) commented:
This test was checking the wrong thing, for the same reason the select cases were wrong: the producer shouldn't unblock on close once its publish request has been sent to the queue, because the queue can't tell that the producer has given up, so both the producer and the queue would "own" the cleanup for the event. The correct behavior is to unblock only when the queue itself is closed.

A reviewer (Member) replied:

This would be a good comment to add to the code

@faec faec (Contributor, Author) replied:

Done

@faec faec marked this pull request as ready for review February 21, 2024 21:30
@faec faec requested a review from a team as a code owner February 21, 2024 21:30
@faec faec requested review from ycombinator and rdner February 21, 2024 21:30
@elasticmachine elasticmachine (Collaborator) commented:
Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz cmacknz added the backport-v8.12.0 and backport-v8.13.0 (Automated backport with mergify) labels Feb 21, 2024
@cmacknz cmacknz requested a review from belimawr February 21, 2024 21:46
@rdner rdner (Member) left a comment:

Great job tracking this down!

mergify bot (Contributor) commented Feb 23, 2024

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b wg-panic-fix upstream/wg-panic-fix
git merge upstream/main
git push upstream wg-panic-fix

@cmacknz cmacknz (Member) commented Feb 27, 2024

Use an input that creates and closes many pipeline clients, like a Filestream input with many small files being created and removed, or Kubernetes autodiscover with many ephemeral inputs. Closing the client (harvester) is what triggers the race condition.

Set queue.mem.events: 32 (the minimum) or some similarly small value. The queue needs to frequently be full, but still making progress. This way there is a higher chance that an event will be waiting in the queue's input channel buffer when its producer is cancelled, which is what leads to an incorrect event count.

Can either of these be isolated into a test case? The first case seems like something that could be caught with a stress test, even if it doesn't reproduce every time. If our CI system could have caught it eventually we would've found this much sooner.

@faec faec (Contributor, Author) commented Feb 29, 2024

Can either of these be isolated into a test case?

With considerable fiddling, yes: I've now added a unit test that will usually detect the error but (if I wrangled the logic right) never give false positives. (Fully deterministic checking isn't possible with the current queue API because the bug involved scheduler behavior on buffered channels.) In practice, the new test failed 100 out of 100 times on main and passed 100 out of 100 times with this PR, so it seems pretty consistent.

@elasticmachine elasticmachine (Collaborator) commented Feb 29, 2024

💛 Build succeeded, but was flaky


cc @faec

@elasticmachine elasticmachine (Collaborator) commented:

💚 Build Succeeded

cc @faec

@rdner rdner (Member) left a comment:

Although I would have liked to see require.Eventually instead of time.Sleep, I understand that there is no indicator we could use for the test.

@rdner rdner requested a review from cmacknz March 4, 2024 16:21
@cmacknz cmacknz (Member) commented Mar 4, 2024

Thanks for the test. I also don't love the sleeps, but I don't have any quick suggestions for removing them; I'd rather have a test with sleeps than no test.

@faec faec merged commit d23b4d3 into elastic:main Mar 4, 2024
113 checks passed
@faec faec deleted the wg-panic-fix branch March 4, 2024 17:51
mergify bot pushed a commit that referenced this pull request Mar 4, 2024
… closed (#38094) (cherry picked from commit d23b4d3)
mergify bot pushed a commit that referenced this pull request Mar 4, 2024
… closed (#38094) (cherry picked from commit d23b4d3)
faec added a commit that referenced this pull request Mar 5, 2024
… closed (#38094) (#38178) (cherry picked from commit d23b4d3)
Co-authored-by: Fae Charlton <[email protected]>
@faec faec added the backport-v8.11.0 (Automated backport with mergify) label Mar 12, 2024
mergify bot pushed a commit that referenced this pull request Mar 12, 2024
… closed (#38094) (cherry picked from commit d23b4d3)
faec added a commit that referenced this pull request Mar 12, 2024
…eue closed, not producer closed (#38177)
* Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094) (cherry picked from commit d23b4d3)
* fix backport
Co-authored-by: Fae Charlton <[email protected]>
faec added a commit that referenced this pull request Mar 12, 2024
…eue closed, not producer closed (#38279)
* Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094) (cherry picked from commit d23b4d3)
* fix backport
* fix NewQueue call
Co-authored-by: Fae Charlton <[email protected]>
Labels
backport-v8.11.0 (Automated backport with mergify), backport-v8.12.0 (Automated backport with mergify), backport-v8.13.0 (Automated backport with mergify), bug, Team:Elastic-Agent (Label for the Agent team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race in memqueue leads to panic: sync: negative WaitGroup counter
4 participants