All firestore queries hang for almost exactly 1 hour #2495
Hey @beckjake,
(1) I presume your code snippets meant to use
(2) What kind of environment are you running in? As far as I remember, the application default credentials are valid for 1 hour, so I think something might be wrong with the token. Have you tried following the instructions here: https://firebase.google.com/docs/admin/setup#initialize-sdk ?
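For reference (not the reporter's actual code), a minimal initialization along the lines of those setup instructions looks roughly like this:

```ts
import { applicationDefault, initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

// Initialize the Admin SDK with Application Default Credentials, as the
// linked setup docs describe. In GKE these credentials come from the
// workload's or node's service account.
const app = initializeApp({ credential: applicationDefault() });

// Firestore client used for the reads discussed in this thread.
const db = getFirestore(app);
```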
Sorry, yes, that got lost in my translation as the result of
We're running in a pod within Google Kubernetes Engine. We call
The application works for long periods of time before, periodically, one pod (or multiple pods) in our cluster gets stuck in this state. Today we've been trying to reproduce the issue and have only been able to reproduce ~1-minute drop-outs that have a lot of messages like this:
But we also have times where we see those messages and don't experience any service interruptions, so I'm not sure how much to read into the TRANSIENT_FAILUREs.
We're experiencing a real hang right this moment (
The log about
This hang took >1 hr. We attached a debugger and the process died, so unfortunately we don't know if it would have eventually self-corrected.
Hmmm... I see a lot of "deadline exceeded" errors. What is your usage pattern like? Are you reaching the Maximum Writes per second?
We're definitely not making API requests >10 MB in the cases that are hanging. I suppose it is possible, albeit unlikely, that something else in our infrastructure is making such requests at the same time; would that trigger this behavior on all subsequent reads? We shouldn't be making transactions >270s, but we can check on that. If we are, they wouldn't be on the machine in question that's hitting the deadlines.
One thing that I think is different about our situation vs. the linked one is that we don't see DEADLINE_EXCEEDED on reads; they just hang forever. Is that expected? Is there some way we can at least detect this from the read side instead of having our application just hang?
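Not from this thread, but one generic way to surface a hang like this from the read side is to race the read against a timeout rather than awaiting it directly; the helper name and the 15-second limit below are made up for illustration:

```ts
import { getFirestore } from "firebase-admin/firestore";

// Hypothetical guard: fail fast instead of hanging forever if a single-document
// read takes longer than `ms`. The 15s default is arbitrary, not a recommendation.
async function getDocWithTimeout(path: string, ms = 15_000) {
  const read = getFirestore().doc(path).get();
  const timeout = new Promise<never>((_, reject) => {
    const t = setTimeout(
      () => reject(new Error(`Firestore read of ${path} exceeded ${ms}ms`)),
      ms,
    );
    // Don't keep the process alive just for this timer.
    t.unref?.();
  });
  return Promise.race([read, timeout]);
}
```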
Just adding to Jacob's comments above: is there a better way to pinpoint what may be hitting a platform limit? Some kind of telemetry or log? We are able to recreate this every few hours on only one of our 6 Kubernetes replicas. Also, there are some reasons I don't think it's related to the other issues you mentioned, Ehsan
Any further ideas...?
I'd like to bump the priority of this; it is blocking for our team. Can we get some help on this?
Closing, as it was related to the ticket linked above.
We are seeing the same issue; how did you resolve it, @michaelAtCoalesce?
@ehsannas could you re-open this, please? I believe it's an active issue.
The grpc library 1.10.2 is broken; make sure that your dependencies are using any version other than that.
Strange, my Firestore was exhibiting the same problems with 1.10.1:
I don't know when the issue was introduced, but 1.10.3 worked for us.
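For anyone else landing here: `npm ls @grpc/grpc-js` shows which version your dependency tree actually resolves. A runtime check is also possible; the sketch below is not from this thread and assumes the package's `package.json` is directly requirable (i.e. no restrictive `exports` map):

```ts
import { createRequire } from "node:module";

// Sketch: warn at startup if the transitive @grpc/grpc-js dependency resolves
// to the 1.10.2 release reported as broken above.
const require = createRequire(import.meta.url);
const { version } = require("@grpc/grpc-js/package.json") as { version: string };

if (version === "1.10.2") {
  console.warn(
    `@grpc/grpc-js ${version} is installed; this release was reported to cause ` +
      "hanging Firestore requests. Consider pinning 1.10.3 or later.",
  );
}
```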
googleapis/gax-nodejs#1576 is getting updated. Once that's released, we'll update the
Then shouldn't this issue stay open until it's resolved?
Yes! (Also tracked internally as b/331664537.)
The gax-nodejs release (googleapis/gax-nodejs#1581) is in progress.
Is there reason to believe that release will address the issue?
Hi folks, please update your npm dependencies and make sure you're getting
Yes, we depend on the grpc library, which had reported issues in the old versions (such as grpc/grpc-node#2690). The Firestore SDK does attempt to retry requests where possible, so that might have manifested as slow requests/responses.
@ehsannas is there a way we could verify that with debugging or logging that shows the retries?
Hey @lox, I believe setting the following environment variables will provide additional grpc debug logs:
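The variable names were trimmed from the quoted reply above; the debug knobs grpc-node itself documents are GRPC_VERBOSITY and GRPC_TRACE, normally exported in the shell or the pod spec. A sketch of setting them in-process instead (they have to be in place before anything loads @grpc/grpc-js, which firebase-admin pulls in transitively):

```ts
// Sketch only. These are grpc-node's documented debug variables; they are
// usually set in the environment, e.g.
//   GRPC_VERBOSITY=DEBUG GRPC_TRACE=all node server.js
// Setting them in-process appears to work too, as long as it happens before
// @grpc/grpc-js is first imported.
process.env.GRPC_VERBOSITY = "DEBUG";
process.env.GRPC_TRACE = "all";

// Use a dynamic import so the assignments above run first (static imports are hoisted).
const { initializeApp } = await import("firebase-admin/app");
initializeApp();
```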
The dependencies have been updated in the latest release. Thanks, @lahirumaramba.
[REQUIRED] Step 2: Describe your environment
[REQUIRED] Step 3: Describe the problem
Periodically, reads from Firestore using the Admin SDK will hang for an extended period. We've isolated the hang to reading a single, fairly small document out of a particular collection, which happens on every request. The document simply doesn't return for an extended period. Once a process gets in this state, it remains "stuck" for either 55 minutes or an hour, almost exactly. Firestore auth succeeds, but subsequent reads from Firestore hang. Then, with no user intervention required, it just starts working properly again.
Steps to reproduce:
We've been unable to identify consistent reproduction steps, but we believe the trigger is simply doing concurrent auths, database reads, and potentially a small number of database writes (to different docs) for multiple users at the same time.
We were able to reproduce this with the following environment variables set (though we still don't really know what exactly reproduced it):
I notice that in the bad state, all queries end up logging
load_balancing_call | [8682] Pick result: QUEUE subchannel: null status: undefined undefined
whereas in the happy state, we see Pick result: COMPLETE, a subchannel ID, and an IP address.
I've uploaded two log files: one is from shortly before the bad times started to shortly after it started, and the second one is from shortly before things started to correct to shortly after. I found the timings very suspicious; exactly 1 hour seems too good to be true. There's a lot of logging going on, and pulling it all from GCP + removing our app logs isn't a great experience, so I didn't upload all our output for the hour, but these log files should be fairly representative.
Last successful read: 2024-03-13T20:42:24.614Z (first file)
First hanging read: 2024-03-13T20:42:25.901Z (first file)
Change in behavior: 2024-03-13T21:37:12.490Z (second file) - a bunch of what looks like connection cleanup happens here
Things fully resolve: ~2024-03-13T21:42:???? (observed application acting healthy by this point, but it definitely could have been 5 minutes earlier, as we'd stopped paying careful attention at this point)
bad-times-begin.txt
bad-times-end.txt
Another thing we've noticed, and which I have a hard time believing is more than a coincidence, is that this generally happens at roughly 45 minutes past the hour.
When this occurs, we see a DEADLINE_EXCEEDED from any writes we're doing concurrently (creating a doc in a collection).
Relevant Code:
This is with some abstractions removed, but the flow of it is:
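The snippet itself isn't reproduced here; a rough sketch of the flow being described, with made-up collection and field names, is:

```ts
import { getFirestore } from "firebase-admin/firestore";

// Hypothetical shape of the flow: each incoming request reads one small
// document, then writes a small event doc to a different collection.
async function handleRequest(tenantId: string) {
  const db = getFirestore();

  // The per-request read that hangs indefinitely once a process is in the bad state.
  const config = await db.collection("tenantConfigs").doc(tenantId).get();

  // The concurrent write that surfaces DEADLINE_EXCEEDED while reads hang.
  await db.collection("events").add({ tenantId, at: new Date() });

  return config.data();
}
```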
When it's in the hanging state, this will trigger a DEADLINE_EXCEEDED.