ProduceSync returning "context canceled" error when parent context hasn't been cancelled #756
Comments
Do you have debug logs of this happening?
2024-06-18T20:37:48.051Z DEBUG [email protected]/kzap.go:105 sasl expiry limit reached, reauthenticating {"broker": "1"}

You can see that they are now returning a "context deadline exceeded" error instead of a "context canceled" error, because we added a parent context with a timeout of 500ms. It was taking several seconds to get the "context canceled" error, so we added the timeout to fail faster and retry. But the flow is the same.

I can't tell in this scenario whether the produce requests that are failing were going to broker 1 or broker 3. If it's broker 3, then we appear to cancel and not retry produce requests for the original broker connection. See #249 for the change that creates a new connection when SASL authentication fails. If it's broker 1, then it seems more related to the ILLEGAL_SASL_STATE due to the short session.
There is no canceling in the produce path of the codebase:
The cancel calls in

Without looking at your code, my guess is that you are using the same context across multiple Produce calls and are canceling it after some messages return, which causes buffered records to be canceled too. The logs above look normal, minus the SASL problems that we've discussed in a few issues.

If you really want to see where the cancel is coming from, I think you could write a wrapper context.Context interface that calls debug.PrintStack on cancel before calling the inner (original) cancel func. I'd be pretty curious to know what the debug logs say.

I'm not really sure what to do in the client at this point for the SASL problems, considering this is almost the same behavior as the Kafka Java client itself. IMO, AWS needs to have some answer to the SASL problems -- like, propose a solution, or change what they allow. Given what I saw when I implemented the AWS_MSK_IAM auth 3yr ago, the broker side seemed a bit odd at the time.
Let me know if you have a bit more of an idea on this one -- from my audit of the code, nothing should be causing other records to return context.Canceled.
The underlying cause of this was the SASL "session too short" errors (#731). The solution posted there has also resolved this problem. Thanks!
Many times throughout the day the following sequence of logs occurs:
These ProduceSync errors are impacting our system. Is there an internal context that is being cancelled? Is there a reason these particular errors aren't being retried?