Sometimes publishes tens of thousands of duplicate copies of messages #1903
Comments
After much gnashing of teeth, we finally found some diagnostic data for this while using the latest v4.3.3 release. We disabled all publish retries within this library and its dependencies by way of:

```js
{
  gaxOpts: {
    maxRetries: 0,
    retry: {
      retryCodes: [
        // this has to be non-empty to be obeyed, but we don't want it to
        // match on anything
        NaN,
      ],
    },
  }
}
```

This allowed us to see what the error was that (presumably) triggered this retry flood. What we got back was:
The combination of code

Further, we can see in the logs that the messages which triggered this error were in fact sent: our error logging recorded a unique ID from the payload, and we can see that payload being processed by the subscribers 10 seconds (+/- 100ms) before this error was thrown on the publisher side. I infer that, with retries enabled, it would have returned this same or some similar error at high frequency, causing the over-publishing of the single message.

With the retries disabled within PubSub/gax/grpc, we are not seeing the "over-publishing" any more (thankfully our apps have their own retry logic sufficient for these cases), but that is not a desirable situation long term.

To try to pin down more details on these, we have been running with limited gRPC trace logging enabled. We updated that to be:
With this we were able to capture another instance, and extracted these lines as the ones that appeared related to the failing call:
While I don't think there's any way for the client code to magically know that these messages actually were published server-side, it still seems like something strange is happening with the retries: with them enabled, the retries appeared to go into a tight loop, as if the backoff settings were being ignored and it was retrying instantly, re-failing instantly, and so on.
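A minimal sketch of how retry-disabling options like the ones above are typically attached to a publisher in @google-cloud/pubsub, assuming the usual publish-options path; the topic name, ordering key, and payload are placeholders, not values from this thread:

```js
// Sketch: attaching the retry-disabling gaxOpts from the comment above to a
// publisher. Topic name, ordering key, and payload are placeholders.
const {PubSub} = require('@google-cloud/pubsub');

async function main() {
  const pubsub = new PubSub();
  const topic = pubsub.topic('my-topic', {
    messageOrdering: true,
    gaxOpts: {
      maxRetries: 0,
      retry: {
        // non-empty so it is obeyed, but never matches a real status code
        retryCodes: [NaN],
      },
    },
  });

  const messageId = await topic.publishMessage({
    data: Buffer.from(JSON.stringify({hello: 'world'})),
    orderingKey: 'example-key',
  });
  console.log(`Published message ${messageId}`);
}

main().catch(console.error);
```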
Hi - I know that folks have been in touch via internal channels and that work on this is ongoing! Dropping in the comments here to link
Thank you @leahecole. While the gax change in particular seems related to retries, it's not clear to me how it could cause the "retries as fast as possible without any delay between attempts" behavior seen here?
It is not clear to me either - that's what I'm trying to figure out, whether it's correlated or not 🕵🏻♀️ Stay tuned for more! The gax change was mostly related to server streaming retries, but it did touch a couple of conditionals related to unary retries. I'm working with a teammate to validate whether those changes are at play here at all, or whether it's an issue in a lower part of the stack.
My non-update update is that the more I dig, the less I think the gax changes are the root cause here, though they may cause a side effect (still not sure). If I've read through the notes right, this issue occurs when running in GKE; I tried reproducing the error conditions locally (not in GKE) and could not - I never saw that spike in send_message_operation_count. I am wondering if something about GKE is creating the condition in which this happens, especially given that the firestore issue I posted above also sees DEADLINE_EXCEEDED when using GKE. My hope is that the gax PR that updates grpc makes this issue obsolete. I will keep investigating to try to rule out various factors. Sidebar - if it helps, I have used this library in the past for the cases where the timeout doesn't seem to get sent.
Thanks for a great write-up. We have experienced the same thing in our production services as well: when there is a high rate of incoming messages, some tend to get resent over and over.
I'm starting to think that this issue is related: #1213
There is a logic error there in the leaser that will end up auto-dispensing messages that have already expired when the first one in the queue expires.
@feywind That may be the source of some problems, but in my team's specific case described here, I don't think any subscription-side issues were the source of the problem, because:
Altogether, this tells me that the issue described here is purely a publisher-side issue, and indeed could probably happen even if no subscription/subscriber existed. Unless, of course, I'm not understanding the details in the other ticket, and the lease code mentioned is also involved in publishing somehow?
@mgabeler-lee-6rs Yes, you're right - I think I misread the comment I quoted, apologies.
@mgabeler-lee-6rs Quick update: I think we might've found a potential new cause, especially since you're setting retry settings. The library takes in retry settings and replaces the existing ones rather than merging them. This means that if you don't include, e.g., the retry multipliers in the settings, the delays and timeouts end up becoming zero, so it just starts spamming the server with retries. Some others have seen this cause a huge publish of duplicate messages. We have a tracking bug to update the retry settings sample, and the docs, to cover this case, but if it's still being a problem for you, you might try filling out the retry settings object fully, e.g.:

```js
const retrySettings = {
  retryCodes: [
    10, // 'ABORTED'
    1, // 'CANCELLED'
    4, // 'DEADLINE_EXCEEDED'
    13, // 'INTERNAL'
    8, // 'RESOURCE_EXHAUSTED'
    14, // 'UNAVAILABLE'
    2, // 'UNKNOWN'
  ],
  backoffSettings: {
    initialRetryDelayMillis: 100,
    retryDelayMultiplier: 1.3,
    maxRetryDelayMillis: 60000,
    initialRpcTimeoutMillis: 60000,
    rpcTimeoutMultiplier: 1.0,
    maxRpcTimeoutMillis: 60000,
    totalTimeoutMillis: 60000,
  },
};
```
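For readers landing here later, a minimal sketch (an illustration, not part of the comment above) of where such a fully populated retry object plugs in: it goes through the publisher's gaxOpts, the same path the partial settings elsewhere in this thread were passed through. The topic name is a placeholder.

```js
// Sketch: plugging a fully populated retry object into the publisher via
// gaxOpts. 'my-topic' is a placeholder; every backoffSettings field is
// spelled out because a partial retry object replaces the library defaults
// instead of being merged with them.
const {PubSub} = require('@google-cloud/pubsub');

const retrySettings = {
  retryCodes: [1, 2, 4, 8, 10, 13, 14], // numeric gRPC codes, as listed above
  backoffSettings: {
    initialRetryDelayMillis: 100,
    retryDelayMultiplier: 1.3,
    maxRetryDelayMillis: 60000,
    initialRpcTimeoutMillis: 60000,
    rpcTimeoutMultiplier: 1.0,
    maxRpcTimeoutMillis: 60000,
    totalTimeoutMillis: 60000,
  },
};

const pubsub = new PubSub();
const topic = pubsub.topic('my-topic', {
  gaxOpts: {retry: retrySettings},
});
```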
Yes, that could explain our situation. Previously we were passing:

```js
{
  gaxOpts: {
    retry: {
      backoffSettings: {
        totalTimeoutMillis: 2 * 60_000,
      },
    },
  },
}
```

Our workaround for this was to fully disable retries in the Google libraries with the retry-disabling settings from my earlier comment. That said, I hadn't noticed the delay/timeout parameters being zero when I was single-stepping through the gax code, so there may be a little more to this, not sure. Either way, since we have to have robust retry logic at our application layer (no level of retries in this library can guarantee success, esp. when k8s may
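To make the zero-parameter theory concrete, here is a small illustration (not gax's actual retry loop, just the arithmetic it is described as using) of how exponential backoff degenerates into an instant retry loop when the initial delay and multiplier end up as zero:

```js
// Illustration only: classic exponential backoff, and what happens to it
// when the delay fields are wiped to zero by a partial retry object
// replacing the defaults.
function backoffDelays(initialMs, multiplier, maxMs, attempts) {
  const delays = [];
  let delay = initialMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, maxMs));
    delay *= multiplier;
  }
  return delays;
}

// Fully populated settings: ~100, 130, 169, 220, 286 ms between attempts.
console.log(backoffDelays(100, 1.3, 60000, 5));

// Zeroed-out settings: every delay is 0 ms, i.e. the tight retry loop
// described earlier in the thread.
console.log(backoffDelays(0, 0, 60000, 5));
```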
Node google-gax internally is ... messy. 😹 I hope it can get cleaned up more, but in the meantime, I'm glad you found a good way forward. I'm going to go ahead and close this, and if anyone sees similar problems pop up, please comment.
My team has started observing intermittent (but frequent) instances where an individual message publish call we make to this library results in tens of thousands of duplicate copies being sent to the PubSub servers.
This is symptomatically similar to old issue #847, but it is happening to us even at very low publish rates (single digit messages per second).
In one particular incident that we were able to map through our trace logs, we confirmed that a single call to the publish method led to over 15 thousand copies being sent. The client library eventually reported a send timeout in this case, though other cases do not always end in a send timeout. We're confident that these are copies being sent by the library, and not duplicate deliveries of some sort, because we have captured sufficient logging to confirm that the duplicates all have different PubSub message IDs despite identical payloads, and because we can observe the network traffic from our applications (running in GKE) spiking to 10x or more its normal level in perfect correlation with these events and with corresponding spikes in the Stackdriver publish requests/operations counters.
We have been attempting to make a small reproducer application, but have been unsuccessful so far, despite this happening several times an hour across our production deployments. We have also been unsuccessful at reproducing this in staging deployments.
There is some soft evidence that network or pubsub server conditions have an influence on this, as we have many independent GKE deployments (same cluster, same app images, but different customers, no shared data between them), and there is a significant correlation across these deployments of when this happens -- we will get clusters of time when this is impacting many deployments, and then periods when it is impacting almost none of them.
If it is relevant, our applications make heavy use of ordered publishing. They are almost never sending more than one message at a time per ordering key, but may have low tens of ordering keys in flight at the same time. We may occasionally have a hundred or so ordering keys in flight, but that is rare, and we see this happening regularly with single-digit numbers concurrently in flight.
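For reference, a sketch of the ordered-publishing pattern described above, assuming the usual messageOrdering publish option and the ordering-key resume call; the topic name, key, and payload are placeholders, not the reporter's actual code:

```js
// Sketch: ordered publishing with roughly one in-flight message per ordering
// key. Names and payloads are placeholders.
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const topic = pubsub.topic('my-topic', {messageOrdering: true});

async function publishOrdered(orderingKey, payload) {
  try {
    return await topic.publishMessage({
      data: Buffer.from(JSON.stringify(payload)),
      orderingKey,
    });
  } catch (err) {
    // After a failed publish, the library pauses ordering for this key until
    // it is explicitly resumed.
    topic.resumePublishing(orderingKey);
    throw err;
  }
}

publishOrdered('customer-123', {event: 'example'}).catch(console.error);
```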
There is a high, but not perfect, correlation with publish requests experiencing this duplication eventually ending in a `Total timeout of API google.pubsub.v1.Publisher exceeded 120000 milliseconds before any response was received` error.

Tracing through older versions that are still deployed, we can find no instances of this happening when we were using version 3.0.1. We don't have much data for anything between that and v4.0.7. We ~~are trying out the latest v4.3.3 now, but nothing jumps out in the changelog as being likely related to this issue so far~~ have successfully reproduced this issue with v4.3.3 too.

We have some limited gRPC trace logs captured from these incidents, but they didn't capture any reasons for the duplicate sends, only more confirmation that it is coming from the client library. The trace logs were captured with `GRPC_VERBOSITY=DEBUG` and `GRPC_TRACE=channel,subchannel,call_stream`, and show a thundering horde of `createRetryingCall`, `createLoadBalancingCall`, `createResolvingCall` trios to `method="/google.pubsub.v1.Publisher/Publish"`, but nothing else for the publishes. We can update the `GRPC_TRACE` setting if there are suggestions for additional elements that will help diagnose this without slowing down the app too much with excessive logging.

While most of our apps both publish & subscribe (streaming pull), this has been confirmed in at least one app that only publishes -- the "15k" example above happened to come from that app.
We have support case 50200427 open for this, but are filing an issue here because it seems very likely at this point that this is a bug somewhere in the pubsub + gax + grpc stack.

Environment details

@google-cloud/pubsub version: 3.0.1, 4.0.7, 4.3.3

Steps to reproduce

Monitor the `pubsub.googleapis.com/topic/send_message_operation_count` metric to see if suddenly hundreds of messages per second are being published instead of less than ten.
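A sketch of one way to watch that metric programmatically, assuming the @google-cloud/monitoring client (the reporter does not say how they monitor it; the project ID is a placeholder):

```js
// Sketch: polling the send_message_operation_count metric named in the repro
// steps to spot a sudden publish spike. Project ID is a placeholder, and the
// pubsub_topic resource labels are assumed from the metric's documentation.
const monitoring = require('@google-cloud/monitoring');

async function checkPublishRate(projectId) {
  const client = new monitoring.MetricServiceClient();
  const nowSeconds = Math.floor(Date.now() / 1000);

  const [series] = await client.listTimeSeries({
    name: client.projectPath(projectId),
    filter:
      'metric.type="pubsub.googleapis.com/topic/send_message_operation_count"',
    interval: {
      startTime: {seconds: nowSeconds - 10 * 60}, // last 10 minutes
      endTime: {seconds: nowSeconds},
    },
  });

  for (const ts of series) {
    // A healthy publisher in the scenario above should stay in the single
    // digits per second; a retry flood shows up as a large spike here.
    console.log(ts.resource.labels.topic_id, ts.points.length, 'points');
  }
}

checkPublishRate('my-project-id').catch(console.error);
```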