pubsub: inconsistently receiving less messages after v0.24.0 #1177
@fastest963 You mentioned:
Just to be clear, are both your MaxExtension and AckDeadline 15s? I ask because we changed in v0.26.0 how we use AckDeadline. It may be that if you were setting this value quite high, you would now be sending many more ModAck RPCs (we now calculate how frequently to send based on your processing time, and in your case it would be the lower limit of 10s). I'm going to try deploying v0.24.0 of the library and v0.30.0 of the library to see whether we're seeing a lot more RPCs from the latter, which might explain your rate limiting. |
Yes, both of the values are the same. Would you recommend setting the MaxExtension to something higher or just disabling it completely (setting to less than 0)? |
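For readers following along, here is a minimal sketch of where those two knobs live in the Go client. The 15s value and the negative-MaxExtension option mirror what is discussed above; the project and subscription IDs are placeholders, not the reporter's.

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// Placeholder project/subscription IDs.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription")

	// AckDeadline is a property of the subscription itself (set at creation
	// time via pubsub.SubscriptionConfig); MaxExtension is a client-side
	// receive setting. The thread above uses 15s for both.
	sub.ReceiveSettings.MaxExtension = 15 * time.Second
	// To disable automatic deadline extension entirely, as discussed above,
	// set a negative duration instead:
	// sub.ReceiveSettings.MaxExtension = -1

	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```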
The latest in this weird issue... I deployed to 2 servers (out of 8), purple and pink, at 18:22 UTC, and at 19:23 UTC they both dropped to receiving 0 messages. The other (v0.24.0 and 1.13.0) servers were unaffected this time. Hopefully this doesn't mean I've run into #1163 :/ Not sure if this helps, but if I zoom in on the opencensus metrics when the two dropped to 0 and then the rest briefly dropped and spiked back up, you can see a spike in retries at both times. The first retry spike is from the pink/purple servers (when they dropped to 0) and the second retry spike is from the rest of the servers (when they briefly nosedived). I was able to get a goroutine dump:
|
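The dump itself isn't reproduced here. For anyone wanting to capture a similar dump, one common approach is to expose the standard pprof endpoint; this is a minimal sketch, not necessarily how the dump above was obtained.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve pprof on localhost only. A full goroutine dump can then be
	// fetched with: curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real subscriber workload
}
```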
Bizarre... That still looks like a gRPC issue (noting that picker_wrapper seems stuck on StreamingPull as well as a number of other RPCs). Could you try |
@jadekler can do. Also want to make sure that this didn't get lost:
|
Ah, apologies. I think your settings look fine; I don't recall the semantics of MaxExtension changing. I'll keep an eye out for it whilst I go digging, though. Thanks again for sticking with this. |
Just to follow up, I've deployed an app at HEAD and an app at v0.24.0 and I'm seeing roughly the same number of RPCs being sent. So, that's a dead end as to why you might have been rate limited. I'll keep running the app to see if I experience the deadlock and await further word about whether (orange is v0.24.0, blue is HEAD) |
I deployed to 2 servers at 20:00 UTC: There haven't been any issues yet and it's been a bit over 90 minutes. I'm still confused how last night the old instances were somehow affected by the new instances, so that's still bizarre to me, but at least we might have narrowed it down to a grpc-go issue. I'll leave these 2 servers on these commits for a while longer and report if anything happens. |
Right on. That solidifies that it's likely a grpc-go problem. Thanks for helping us look into this. I think I have a repro, though my app only seems to go down for 5 minutes or so at a time. Anyways, we'll start looking in grpc to see what we can find. Will update here when I know more. |
@fastest963 Updates:
We're also discussing how we can help users diagnose problems like this more easily by adding logging and tracing options. Finally, if you are at all able to provide the holy grail - a reproduction - we'd love that, too! |
@jadekler I can try to. It takes a while for the problem to happen so it might be a while until I can narrow it down. Not sure what I'm doing different. The only thing I can think of is that the jobs take anywhere from 300-500ms to ack so maybe try sleeping inside your test to see if that somehow is the cause? |
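As a rough illustration of that suggestion, the repro's Receive callback could simulate 300-500ms of work with a jittered sleep before acking. This is a sketch under those assumptions (placeholder project and subscription IDs), not the reporter's actual handler.

```go
package main

import (
	"context"
	"log"
	"math/rand"
	"time"

	"cloud.google.com/go/pubsub"
)

// receiveWithSimulatedWork acks each message after a jittered 300-500ms
// sleep, mimicking the per-message processing time described above.
func receiveWithSimulatedWork(ctx context.Context, sub *pubsub.Subscription) error {
	return sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		delay := 300*time.Millisecond + time.Duration(rand.Intn(200))*time.Millisecond
		time.Sleep(delay)
		m.Ack()
	})
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	if err := receiveWithSimulatedWork(ctx, client.Subscription("my-subscription")); err != nil {
		log.Fatal(err)
	}
}
```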
FYI, keeping |
Yeah I noticed that. Hopefully I can narrow down the issue tonight and report back what commit it breaks on. Just deployed |
Thanks, much appreciated, y'all. I can roll that commit back if we need to buy more time at an older version. |
Yes, a rollback while this bug gets sorted out would be nice. Thanks. |
Done. Thanks for mentioning. |
The preliminary results are showing that |
Great to hear - thanks for doing that. Look forward to hearing the results. |
I made a small program to try and simulate what we're doing and I ran it with https://gist.github.com/fastest963/e9ff2774926f864d06e3be1f9edd4000 With
And the Stackdriver metrics for the streaming pull ack message count: But with
I'd expect it to stay around 150k a minute (except for the slow ramp-up), but with the newer commit it fluctuates a lot and falls below 20k (even 1k at one point). I didn't run either of them for long enough to fully prove that the issue is between those commits, but at least it seems like something that could reproduce what we're seeing. I thought I'd at least get something up for you to try to reproduce yourself as well. I'll continue running our regular subscribers with the older commit overnight and switch to the newer one in the morning. |
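A simple way to watch for that kind of fall-off is to count received messages with an atomic counter and log the rate once a minute. The following is a sketch in that spirit, not the gist itself; the project and subscription IDs are placeholders.

```go
package main

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	var received int64

	// Log the per-minute receive rate so a drop like the one described
	// above (~150k/min falling below 20k) is easy to spot.
	go func() {
		for range time.Tick(time.Minute) {
			log.Printf("received %d messages in the last minute", atomic.SwapInt64(&received, 0))
		}
	}()

	sub := client.Subscription("my-subscription") // placeholder subscription ID
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		atomic.AddInt64(&received, 1)
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```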
Here's a goroutine dump from the latest
|
@fastest963 Thank you for providing that repro. I've played with it today, and added instrumentation in pubsub to record:
Here are my results: I've only run this for about 20m, but in that time I noticed very little difference between the before and the after. I have it set up to run for a few hours now to see if I notice a degradation over time. As you've suggested, I think you're running into an issue that only occurs in a rate limiting situation. It's possible Pub/Sub sends a GOAWAY in a rate limiting situation and the grpc-go logic for handling GOAWAYs is funky in that commit. I'd love to narrow in on this a bit further, though. To check whether you're being rate limited, would you mind emailing your project ID, subscription ID, and topic ID to me at [email protected]? With that I can ask the Pub/Sub team to check out what's going on. Meanwhile, I'm waiting to hear back from their team about what rate limiting looks like, and how I can simulate it. If you have energy to try another thing, would you mind pulling down my copies of the instrumented client libraries and sending me the dump they produce against:
When running with these branches, the Pub/Sub client will spit out 3 extra log lines a minute in the same manner as your repro above. |
Sent over that info. I'll get another test started in the morning with those branches listed above and respond with the results. Thanks! |
Here's the log from
It dropped right at 3:58 PM UTC, just around 60 minutes after the test started. It looks like |
I noticed with the last test above that publishes were also affected at the same time that receiving was affected. Here's with
I stopped it after 80 minutes because it wasn't showing any signs of slowing down. |
Thanks. This is valuable information. Here's a graph of those two for posterity: (edit: the commit SHAs are off in this diagram, but the data is good) The newer code seems less consistently smooth than the old, but that's a weak signal [1]. For next steps, here's what I'd like to suggest:
Thanks for your continued help; we very much appreciate it. 1: We have internal performance benchmarks that run continuously, and they show no change. This is likely just network jitter: note that the "old" line is smooth for only 2/3 of the time, and equally jagged for the last third. |
@jadekler The newer commit dropped at 3:58 PM UTC from over 100k a minute to less than 1,000 per minute and the topic was backing up. I think that's represented in your graph by it going to 0? It's definitely not much different until the point at which it seems to just completely fail and barely receive any messages. It also fails to publish messages as quickly. You'll notice the To clarify, I'm not worried about the jitter and relatively similar performance in the beginning. Something happens with the new code after an hour and it just stops doing anything, and that's the part I'm concerned about. You'll see it in the graphs around the 60-minute mark; the red line is where the issue is. I'll start a test with your logging branch and HEAD of everything else and I'll post the results. Regarding the quotas, I don't see any that have been hit in the last 7 days. Are there any internal quotas that are not available for me to see? |
Right, yes, absolutely the drop is the problem. I neglected to mention that because there's not much to say about it that hasn't been said, but I should have, in hindsight. The jitter was new information, so I thought it would be good to briefly comment on; but, as mentioned, it's probably nothing. I still have no idea what causes the drop-off, but loosely speaking I'm looking for:
Not that I'm aware of, but I'm still waiting for a response from the Pub/Sub SREs. |
Ah okay! Just wanted to make sure I wasn't being misunderstood! Hopefully this helps:
I deduplicated where the same log was on every line (denoted with Update: Forgot to paste the logs from the beginning. Updated the snippet above with the logs from the start of the script. |
This is very, very useful. Thank you. We'll pore over this tomorrow. I presume these logs are around the time that you experienced the drop? |
Yup! To make it easier, here are just the counts:
You can see the drop started around 12:42am UTC and the |
Thank you very much. And just to check, could you post your commit SHAs of google-api-go-client, google-cloud-go, and grpc-go? |
Woops, forgot to post that earlier!
|
@fastest963 Those dumps were extremely enlightening. It appears as though onClose is either deadlocked [1] or it's being allowed to skip resetting the transport [2]. It gets logged without a corresponding transition to TRANSIENT_FAILURE, which is very smelly. I've added some very verbose logging around all these flows: if you have any energy today or tomorrow, please give If you're not able to, no worries: we will continue to pore over that code and your recently-provided logs. 1: Your earlier goroutine dumps don't indicate it, so I suspect not. |
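The instrumented branch itself isn't reproduced here. As a complementary and more generic diagnostic, grpc-go's own verbose internal logging can be turned up, either via the GRPC_GO_LOG_SEVERITY_LEVEL and GRPC_GO_LOG_VERBOSITY_LEVEL environment variables or programmatically. A minimal sketch of the programmatic route (not the instrumented branch's logging):

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func init() {
	// Route grpc-go's internal logs to stdout/stderr at high verbosity.
	// This surfaces transport events (GOAWAYs, connectivity state changes,
	// transport resets) relevant to the behavior discussed above.
	// Must run before any gRPC activity, hence init().
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stdout, os.Stdout, os.Stderr, 99))
}

func main() {
	// ... normal Pub/Sub client setup and Receive loop ...
}
```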
Okay re-ran with the latest commit on your branch
Full log is available here: https://gobin.io/eLqf Also, goroutine dump:
|
Thanks, we've found the problem. Working on a fix now. |
Tracking fix in grpc/grpc-go#2391. |
I believe that as of Also @fastest963, we plan on adding a special thanks for your help in the next grpc-go release as well as in the next google-cloud-go release. Furthermore, if you email me your address and shirt size, I think we may be able to send you a hoodie, too. We greatly appreciate all your help. Keeping this issue open for now until someone can confirm that the problem is no longer present with ff2aa05958775030998dbe2f9bccbe2af324adf4. |
Running a test with that commit now. Crossing my fingers! Thanks for the swag and the thanks! I'll send you an email later with my details. |
Ran the test for over 4 hours and there was no dropping off 👍 |
Great! Let us know how it's looking after the weekend 👍 |
The grpc-go fix is looking great! We deployed it across all of our servers and things have been stable all day so far. I think this is good to close. |
Fantastic news! |
I can confirm that HEAD across the board has been looking good for me too. |
Woot, thanks for the confirmation @dsymonds. |
User @fastest963 reports that using v0.24.0 works as expected, but bumping to HEAD causes what appears to be rate limiting.
See: #1163 (comment)
Continuing the conversation from that thread here, since the original issue there seems quite different.