All firestore queries hang for almost exactly 1 hour #2495
Hey @beckjake,
(1) I presume your code snippets meant to use
(2) What kind of environment are you running in? As far as I remember, the application default credentials are valid for 1 hour, so I think something might be wrong with the token. Have you tried following the instructions here: https://firebase.google.com/docs/admin/setup#initialize-sdk ?
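For reference (not the reporter's actual code), a minimal initialization along the lines of those setup instructions looks roughly like this:

```ts
import { applicationDefault, initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

// Initialize the Admin SDK with Application Default Credentials, as the
// linked setup docs describe. In GKE these credentials come from the
// workload's or node's service account.
const app = initializeApp({ credential: applicationDefault() });

// Firestore client used for the reads discussed in this thread.
const db = getFirestore(app);
```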
Sorry, yes, that got lost in my translation as the result of
We're running in a pod within Google Kubernetes Engine. We call
The application works for long periods of time before, periodically, one pod (or multiple pods) in our cluster gets stuck in this state. Today we've been trying to reproduce the issue and have only been able to reproduce ~1-minute drop-outs that have a lot of messages like this:
But we also have times where we see those messages and don't experience any service interruptions, so I'm not sure how much to read into the TRANSIENT_FAILUREs.
We're experiencing a real hang right this moment (
The log about
This hang took >1 hr. We attached a debugger and the process died, so unfortunately we don't know if it would have eventually self-corrected.
Hmmm... I see a lot of "deadline exceeded" errors. What is your usage pattern like? Are you reaching the Maximum Writes per second?
We're definitely not making API requests >10 MB in the cases that are hanging. I suppose it is possible, albeit unlikely, that something else in our infrastructure is making such requests at the same time; would that trigger this behavior on all subsequent reads? We shouldn't be making transactions >270s, but we can check on that. If we are, they wouldn't be on the machine in question that's hitting the deadlines.
One thing that I think is different about our situation vs. the linked one is that we don't see DEADLINE_EXCEEDED on reads; they just hang forever. Is that expected? Is there some way we can at least detect this from the read side instead of having our application just hang?
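Not from this thread, but one generic way to surface a hang like this from the read side is to race the read against a timeout rather than awaiting it directly; the helper name and the 15-second limit below are made up for illustration:

```ts
import { getFirestore } from "firebase-admin/firestore";

// Hypothetical guard: fail fast instead of hanging forever if a single-document
// read takes longer than `ms`. The 15s default is arbitrary, not a recommendation.
async function getDocWithTimeout(path: string, ms = 15_000) {
  const read = getFirestore().doc(path).get();
  const timeout = new Promise<never>((_, reject) => {
    const t = setTimeout(
      () => reject(new Error(`Firestore read of ${path} exceeded ${ms}ms`)),
      ms,
    );
    // Don't keep the process alive just for this timer.
    t.unref?.();
  });
  return Promise.race([read, timeout]);
}
```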
Just adding to Jacob's comments above: is there a better way to pinpoint what may be hitting a platform limit? Some kind of telemetry or log? We are able to recreate this every few hours on only one of our 6 Kubernetes replicas. Also, there are some reasons I don't think it's related to the other issues you mentioned, Ehsan
Any further ideas...?
I'd like to bump the priority of this; it is blocking for our team. Can we get some help on this?
Closing, as it was related to the ticket linked above.
We are seeing the same issue; how did you resolve it, @michaelAtCoalesce?
@ehsannas could you re-open this, please? I believe it's an active issue.
The grpc library 1.10.2 is broken; make sure that your dependencies are using any version other than that.
Strange, my Firestore was exhibiting the same problems with 1.10.1:
I don't know when the issue was introduced, but 1.10.3 worked for us.
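For anyone else landing here: `npm ls @grpc/grpc-js` shows which version your dependency tree actually resolves. A runtime check is also possible; the sketch below is not from this thread and assumes the package's `package.json` is directly requirable (i.e. no restrictive `exports` map):

```ts
import { createRequire } from "node:module";

// Sketch: warn at startup if the transitive @grpc/grpc-js dependency resolves
// to the 1.10.2 release reported as broken above.
const require = createRequire(import.meta.url);
const { version } = require("@grpc/grpc-js/package.json") as { version: string };

if (version === "1.10.2") {
  console.warn(
    `@grpc/grpc-js ${version} is installed; this release was reported to cause ` +
      "hanging Firestore requests. Consider pinning 1.10.3 or later.",
  );
}
```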
googleapis/gax-nodejs#1576 is getting updated. Once that's released, we'll update the
Then shouldn't this issue stay open until it's resolved?
Yes! (Also tracked internally as b/331664537.)
The gax-nodejs release (googleapis/gax-nodejs#1581) is in progress.
Is there reason to believe that release will address the issue?
Hi folks, please update your npm dependencies and make sure you're getting
Yes, we depend on the grpc library, which had reported issues in the old versions (such as grpc/grpc-node#2690). The Firestore SDK does attempt to retry requests where possible, so that might have manifested as slow requests/responses.
@ehsannas is there a way we could verify that with debugging or logging that shows the retries?
Hey @lox, I believe setting the following environment variables will provide additional grpc debug logs:
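The variable names were trimmed from the quoted reply above; the debug knobs grpc-node itself documents are GRPC_VERBOSITY and GRPC_TRACE, normally exported in the shell or the pod spec. A sketch of setting them in-process instead (they have to be in place before anything loads @grpc/grpc-js, which firebase-admin pulls in transitively):

```ts
// Sketch only. These are grpc-node's documented debug variables; they are
// usually set in the environment, e.g.
//   GRPC_VERBOSITY=DEBUG GRPC_TRACE=all node server.js
// Setting them in-process appears to work too, as long as it happens before
// @grpc/grpc-js is first imported.
process.env.GRPC_VERBOSITY = "DEBUG";
process.env.GRPC_TRACE = "all";

// Use a dynamic import so the assignments above run first (static imports are hoisted).
const { initializeApp } = await import("firebase-admin/app");
initializeApp();
```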
The dependencies have been updated in the latest release. Thanks, @lahirumaramba.
[REQUIRED] Step 2: Describe your environment
[REQUIRED] Step 3: Describe the problem
Periodically, reads from Firestore using the Admin SDK will hang for an extended period. We've isolated the hang to reading a single, fairly small document out of a particular collection, which happens on every request. The document simply doesn't return for an extended period. Once a process gets in this state, it remains "stuck" for either 55 minutes or an hour, almost exactly. Firestore auth succeeds, but subsequent reads from Firestore hang. Then, with no user intervention required, it just starts working properly again.
Steps to reproduce:
We've been unable to identify consistent reproduction steps, but we believe the trigger is simply doing concurrent auths, database reads, and potentially a small number of database writes (to different docs) for multiple users at the same time.
We were able to reproduce this with the following environment variables set (though we still don't really know what exactly reproduced it):
I notice that in the bad state, all queries end up logging
load_balancing_call | [8682] Pick result: QUEUE subchannel: null status: undefined undefined
whereas in the happy state, we see Pick result: COMPLETE, a subchannel ID, and an IP address.
I've uploaded two log files: one is from shortly before the bad times started to shortly after it started, and the second one is from shortly before things started to correct to shortly after. I found the timings very suspicious; exactly 1 hour seems too good to be true. There's a lot of logging going on, and pulling it all from GCP + removing our app logs isn't a great experience, so I didn't upload all our output for the hour, but these log files should be fairly representative.
Last successful read: 2024-03-13T20:42:24.614Z (first file)
First hanging read: 2024-03-13T20:42:25.901Z (first file)
Change in behavior: 2024-03-13T21:37:12.490Z (second file) - a bunch of what looks like connection cleanup happens here
Things fully resolve: ~2024-03-13T21:42:???? (observed application acting healthy by this point, but it definitely could have been 5 minutes earlier, as we'd stopped paying careful attention at this point)
bad-times-begin.txt
bad-times-end.txt
Another thing we've noticed, and which I have a hard time believing is more than a coincidence, is that this generally happens at roughly 45 minutes past the hour.
When this occurs, we see a DEADLINE_EXCEEDED from any writes we're doing concurrently (creating a doc in a collection).
Relevant Code:
This is with some abstractions removed, but the flow of it is:
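The snippet itself isn't reproduced here; a rough sketch of the flow being described, with made-up collection and field names, is:

```ts
import { getFirestore } from "firebase-admin/firestore";

// Hypothetical shape of the flow: each incoming request reads one small
// document, then writes a small event doc to a different collection.
async function handleRequest(tenantId: string) {
  const db = getFirestore();

  // The per-request read that hangs indefinitely once a process is in the bad state.
  const config = await db.collection("tenantConfigs").doc(tenantId).get();

  // The concurrent write that surfaces DEADLINE_EXCEEDED while reads hang.
  await db.collection("events").add({ tenantId, at: new Date() });

  return config.data();
}
```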
When it's in the hanging state, this will trigger a DEADLINE_EXCEEDED.