Memory queue: cancel in-progress writes on queue closed, not producer closed #38094
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
@@ -138,8 +138,8 @@ func TestProducerDoesNotBlockWhenCancelled(t *testing.T) {
 		time.Millisecond,
 		"the first two events were not successfully published")

-	// Cancel the producer, this should unblock its Publish method
-	p.Cancel()
+	// Close the queue, this should unblock the pending Publish call
This test was checking the wrong thing, for the same reason that the select cases were wrong: the producer shouldn't unblock when closed if its queue publish request has already been sent, because the queue can't tell that the producer has given up, so both the producer and the queue would "own" the cleanup for the event. The correct behavior is to unblock when the queue itself is closed.
This would be a good comment to add to the code
Done
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Great job tracking this down!
This pull request is now in conflicts. Could you fix it? 🙏
Can either of these be isolated into a test case? The first case seems like something that could be caught with a stress test, even if it doesn't reproduce every time. If our CI system could have caught it eventually we would've found this much sooner.
With considerable fiddling, yes, I've now added a unit test that will usually detect the error but (if I wrangled the logic right) never give false positives. (Fully deterministic checking isn't possible with the current queue API because the bug involved scheduler behavior on buffered channels.) However, the new test failed 100 out of 100 times on main, and passed 100 out of 100 times with this PR, so it seems pretty consistent.
💛 Build succeeded, but was flaky
cc @faec
💚 Build Succeeded
cc @faec
Although I would have liked to see require.Eventually instead of time.Sleep, I understand that there is no indicator we could use for the test.
Thanks for the test, I also don't love the sleeps but I don't have any quick suggestions for removing them. Would rather have a test with sleeps than no test.
… closed (#38094) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3)
… closed (#38094) (#38178) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3) Co-authored-by: Fae Charlton
…eue closed, not producer closed (#38177) * Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3) * fix backport --------- Co-authored-by: Fae Charlton
…eue closed, not producer closed (#38279) * Memory queue: cancel in-progress writes on queue closed, not producer closed (#38094) Fixes a race condition that could lead to incorrect event totals and occasional panics #37702. Once a producer sends a get request to the memory queue, it must wait on the response unless the queue itself is closed, otherwise it can return a false failure. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent. (cherry picked from commit d23b4d3) * fix backport * fix NewQueue call --------- Co-authored-by: Fae Charlton
Proposed commit message
Fixes a race condition that could lead to incorrect event totals and occasional panics #37702.
Once a producer sends an insert request to the memory queue, it must wait for the response unless the queue itself is closed; otherwise it can report a failure for an event that the queue goes on to accept. The previous code mistakenly waited on the done signal for the current producer rather than the queue. This PR adds the queue's done signal to the producer struct, and waits on that once the insert request is sent.
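To make the ownership rule above concrete, here is a minimal Go sketch. It is not the Beats implementation: `toyQueue`, `toyProducer`, `pushRequest`, `runCancelScenario` and `runCloseScenario` are hypothetical names invented for illustration, and the real queue does much more. The sketch only assumes what the description states: the request channel is buffered, and after the request is sent the producer waits on the queue's done signal rather than its own.

```go
package main

import (
	"fmt"
	"time"
)

// pushRequest is a hypothetical insert request: an event plus a channel on
// which the queue reports whether it accepted the event.
type pushRequest struct {
	event string
	resp  chan bool
}

// toyQueue is a stand-in for the memory queue (illustration only).
type toyQueue struct {
	pushChan chan pushRequest // buffered: a request can sit here unhandled
	done     chan struct{}    // closed when the queue shuts down
	accepted int
}

// handleOne accepts one buffered request and answers it.
func (q *toyQueue) handleOne() {
	req := <-q.pushChan
	q.accepted++
	req.resp <- true
}

// toyProducer holds its own done signal and a copy of the queue's.
type toyProducer struct {
	pushChan  chan<- pushRequest
	done      chan struct{}
	queueDone <-chan struct{}
}

func (p *toyProducer) Cancel() { close(p.done) }

// Publish reports whether the queue accepted the event.
func (p *toyProducer) Publish(event string) bool {
	req := pushRequest{event: event, resp: make(chan bool, 1)}
	// Before the request is sent, the producer still owns the event,
	// so giving up on its own cancellation is safe.
	select {
	case p.pushChan <- req:
	case <-p.done:
		return false
	case <-p.queueDone:
		return false
	}
	// After the request is sent, the queue owns the event. Only the
	// queue shutting down may end the wait early.
	select {
	case ok := <-req.resp:
		return ok
	case <-p.queueDone:
		return false
	}
}

func newPair() (*toyQueue, *toyProducer) {
	q := &toyQueue{pushChan: make(chan pushRequest, 1), done: make(chan struct{})}
	p := &toyProducer{pushChan: q.pushChan, done: make(chan struct{}), queueDone: q.done}
	return q, p
}

// startPublish runs Publish in the background and returns once the request
// is sitting in the queue's input buffer.
func startPublish(q *toyQueue, p *toyProducer) <-chan bool {
	result := make(chan bool, 1)
	go func() { result <- p.Publish("event") }()
	for len(q.pushChan) == 0 {
		time.Sleep(time.Millisecond)
	}
	return result
}

// runCancelScenario cancels the producer while its request is buffered,
// then lets the queue accept the event.
func runCancelScenario() (published bool, accepted int) {
	q, p := newPair()
	result := startPublish(q, p)
	p.Cancel()
	q.handleOne()
	return <-result, q.accepted
}

// runCloseScenario closes the queue while the request is buffered.
func runCloseScenario() bool {
	q, p := newPair()
	result := startPublish(q, p)
	close(q.done)
	return <-result
}

func main() {
	published, accepted := runCancelScenario()
	fmt.Println(published, accepted) // producer and queue agree on the event
	fmt.Println(runCloseScenario())  // queue shutdown unblocks Publish
}
```

In the cancel scenario the producer's result and the queue's count agree, because cancellation is ignored once the request is in flight. If the second `select` waited on `p.done` instead, `Publish` could return false for an event the queue had already counted, which is the mismatch described above.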
Checklist
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
This is hard to test reliably, because the original issue was a race condition that was already hard to reproduce. To maximize the chances of seeing the original problem:
Set queue.mem.events: 32 (the minimum) or some similarly small value. The queue needs to frequently be full, but still making progress. This way there is a higher chance that an event will be waiting in the queue's input channel buffer when its producer is cancelled, which is what leads to an incorrect event count.
Related issues