"timeout exceeded when trying to connect" spike after upgrading to 8.2.1 #2262
We're having the same issue - it's causing us some headaches. Just looking at 8.2.2 to see if that fixed it.
Looking at the changes in 8.2.2 (https://github.com/brianc/node-postgres/compare/[email protected]@8.2.2) I'm not sure those are going to resolve my issue, since I'm not using the code path that was changed.
@mriedem The fix applies to any long message.
Hmm, sorry this is happening - I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random? Those issues are hard to debug. I'm gonna check my app logs & see if I can see any of this - I usually run the latest on our systems and they're under pretty heavy load all day.
It's definitely reproducible / persistent, to the point that we had to revert the package upgrade in our app because of the regression. If you have a debug patch or something in mind we could apply in our staging environment, I'm sure we'd hit it there and get more data. I can't say there is a specific query that fails 100% of the time though. I'm not sure if it helps, but quite a few of the errors were happening on queries that fetch data from tables with json columns.
@jsatsvat did moving to 8.2.2 resolve the issue for you?
No, it didn't unfortunately.
FYI - We don't have any JSON columns as far as I am aware.
Yeah, I mean, is there a way this can be reproduced in an isolated script so I could run it locally and analyze what's going on?
Are y'all using an ORM (knex, typeorm, etc) or using pg directly?
We're not using an ORM, just pg.
Okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?
We're using slonik. And yes, at least for us, it happens randomly all of a sudden, with a bunch of timeouts at once.
Dang - do you have numbers around how long it takes before it happens? What's the recovery you do? Restart the app?
For now we've drastically increased the pool size (previously set to 100), which temporarily takes some pressure off it; but as a quick recovery measure we do indeed restart the affected service. As it seems to happen randomly, I'm not sure how helpful numbers would be, since we can't really determine a certain interval or similar.
I would have to dig into this. In our production cluster we have the app running in 20 replicas and each has a pool configured for 150 connections. Our readiness probe is set to hit an API which does a query against the DB. I think I can say that when it hits we get a bunch of timeouts, which would probably explain why the waiting count per pod (in the graph in the issue description) spikes, because presumably something is blocking in the pool and other requests are waiting until a timeout occurs.
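A minimal sketch of the kind of readiness endpoint described above, assuming Express and pg; the route path, pool size, and timeout values are illustrative, not the app's actual configuration:

```js
const express = require('express')
const { Pool } = require('pg')

const app = express()
// Pool sized per the comment above (150 clients per replica); other values are assumptions.
const pool = new Pool({ max: 150, connectionTimeoutMillis: 10000 })

// Readiness probe: run a trivial query. If every client in the pool is checked
// out and none frees up within connectionTimeoutMillis, pool.query rejects with
// "timeout exceeded when trying to connect" and the probe fails.
app.get('/readyz', async (req, res) => {
  try {
    await pool.query('SELECT 1')
    res.status(200).send('ok')
  } catch (err) {
    res.status(503).send('db not ready')
  }
})

app.listen(3000)
```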
K, thanks for the info, this is helpful. One other question... are you all using SSL for the connection? I'm wondering if there's a weird edge case in the startup / request ssl packet handling.
Sorry... more questions. I'm trying to repro locally... no luck yet. What cloud are y'all running in? I wonder if it's networking related to k8s or the like.
Yeah, we're setting ssl in the connection config. We're in the IBM cloud, using the K8S service there to run the app and IBM Cloud Databases for PostgreSQL for the PG backend.
Alright... I left a script running all night that was connecting like this:
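A minimal sketch of a long-running connect-and-query loop along those lines; the interval, query, and pool settings here are assumptions, not the original script:

```js
const { Pool } = require('pg')

// Assumed settings: a small pool with SSL, similar to the managed-Postgres setups discussed above.
const pool = new Pool({
  max: 10,
  connectionTimeoutMillis: 10000,
  idleTimeoutMillis: 10000,
  ssl: { rejectUnauthorized: false },
})

async function main() {
  // Query in a tight loop for hours, logging any connect timeouts that show up.
  for (;;) {
    try {
      const { rows } = await pool.query('SELECT now()')
      console.log('ok', rows[0].now)
    } catch (err) {
      console.error('query failed:', err.message)
    }
    await new Promise((resolve) => setTimeout(resolve, 100))
  }
}

main()
```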
Sorry this is hard to track down... I'll go ahead and add it. What do you think?
Thanks for the detailed feedback and ideas of things to try. We can try (1) when that's released. Doing (2) would also be pretty easy, but I'd need to tinker a bit with figuring out our slowest query times to set a proper timeout. Thankfully we do wrap pool.query. I don't think I'll be able to get to this today, but we should be able to try some of this early next week. Thanks again.
Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info, we can provide it; we can try anything you suggest as well.
@brianc re your earlier comment on things to try: for (3), we have a wrapper on pool.query.
That was from July 1 when we upgraded to [email protected] and excludes any logs from ... I'm not sure about (2) though. Sorry if this is a dumb question, but I'm not sure how to do this:
I was thinking maybe there would be a property on the pool or client I could use.
In my connection file I created a wrapper for our pool.query function to calculate durations, and then updated every pool.query call in the rest of the project to use pool.myQuery. Doing this we were able to find that we did have an extremely long-running query that was hitting the DB very frequently. It turned out there was a trigger on this table that we had to disable to get the long-running query taken care of. That query was consuming our connection pool very quickly, so clients weren't released in time for other queries to use. Hopefully this code might help @mriedem find a way to log the query execution times.
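A minimal sketch of that kind of wrapper, reusing the pool.myQuery name mentioned above; the 1-second threshold is an assumption:

```js
const { Pool } = require('pg')
const pool = new Pool()

// Wrap pool.query so every call is timed; callers use pool.myQuery instead.
pool.myQuery = async function (text, params) {
  const start = Date.now()
  try {
    return await pool.query(text, params)
  } finally {
    const durationMs = Date.now() - start
    if (durationMs > 1000) {
      // Surfaces the long-running queries that tie up pool clients.
      console.warn('slow query', { text, durationMs })
    }
  }
}

module.exports = pool
```

Switching call sites from pool.query to pool.myQuery is then a mechanical change across the project.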
We have something similar - everything goes through a function that wraps pool.query. I've posted some code for that today in our ...
We're starting to play around with
This is the definition of that table:
In our production DB there are currently 2250262 records in that table and the
We're now actively trimming this table and running
Hi,
What issue are you having? Can you describe it a bit more? Did you upgrade from an older version without the issue? I don't know of any changes which may have caused the issue. There have been several reports on this thread, but nothing conclusive yet.
Thanks for the info, and it sounds like you cleared up your own issue? Did the issue only show up w/ pg@8 and not pg@7? As an aside: I have a separate pool in our app for long-running queries (we have a few that take several seconds) with longer connection timeouts and so on. It can help.
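A minimal sketch of that separate-pool idea; the pool sizes, timeouts, and function name are assumptions:

```js
const { Pool } = require('pg')

// Default pool for latency-sensitive request handlers.
const fastPool = new Pool({ max: 20, connectionTimeoutMillis: 5000 })

// Smaller pool with a longer connection timeout, reserved for the handful of
// multi-second queries so they can't starve the main pool.
const slowPool = new Pool({ max: 5, connectionTimeoutMillis: 60000 })

// Reporting-style queries go through the slow pool explicitly.
async function runReport(sql, params) {
  return slowPool.query(sql, params)
}

module.exports = { fastPool, slowPool, runReport }
```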
@briangonzalez Not yet, no. We're still on [email protected] since I've been dragging my feet on doing the upgrade. We have instrumented some stuff in our code for logging warnings on slow queries and such, to hopefully help us identify / alert on any performance regression when we do upgrade.
@brianc Is there some kind of easy timeout for queries, so that the pool is available for waiting clients when a query takes a long time? For now my connections keep building up more and more, but out of the blue they stop being released and finally the application hangs. This is happening in production for me.
Have you tried setting statement_timeout or query_timeout? See the Client constructor config, which you can pass through the Pool constructor. statement_timeout is passed through to postgres; query_timeout is enforced client-side.
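A minimal sketch of passing those options through the Pool constructor; the values below are illustrative:

```js
const { Pool } = require('pg')

const pool = new Pool({
  connectionTimeoutMillis: 20000, // how long to wait for a client from the pool
  statement_timeout: 30000,       // applied per connection, enforced by Postgres
  query_timeout: 35000,           // enforced client-side by pg
})

// A query that runs longer than statement_timeout is cancelled by the server,
// so its client is returned to the pool instead of being held indefinitely.
module.exports = pool
```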
hi @mriedem,
@mriedem any news on that? Using [email protected] with [email protected] and facing the same issue under load in production.
@mriedem did you solve that? I use "pg": "8.9.0", same issue here.
Having the same issue on 8.8
Are y'all able to prove that you aren't being hit by long event loop blocks during connection acquisition?
Getting the same error with version 8.11.3, and we are sure there is no event loop lag (we measure it and would log it). Our service is also running on GCP/GKE, and we are using the @google-cloud/cloud-sql-connector together with the Pool to connect to a Cloud SQL for PostgreSQL instance. It's only affecting a small fraction of queries.
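The comment doesn't show how the lag is measured; one common approach (an assumption here, not necessarily what they use) is Node's perf_hooks.monitorEventLoopDelay:

```js
const { monitorEventLoopDelay } = require('perf_hooks')

const h = monitorEventLoopDelay({ resolution: 20 })
h.enable()

// Log event loop delay every 10s; sustained high values would explain
// connection-acquisition timeouts, consistently low values rule them out.
setInterval(() => {
  console.log({
    meanMs: h.mean / 1e6,          // histogram values are in nanoseconds
    maxMs: h.max / 1e6,
    p99Ms: h.percentile(99) / 1e6,
  })
  h.reset()
}, 10000)
```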
Dang! Sorry about this! Do you have any code or way to reproduce this at all? I'd love to help fix this for you.
### **Issue Summary**
I am encountering a "timeout exceeded when trying to connect" error when querying Postgres through a connection pool.

### **Connection Pool Configuration**
Here is the configuration being used:

const pool = new Pool({
user: process.env.DATABASE_USER,
password: process.env.DATABASE_PASSWORD,
host: process.env.DATABASE_HOST,
port: process.env.DATABASE_PORT,
database: process.env.DATABASE_NAME,
max: 10, // Maximum number of clients in the pool
idleTimeoutMillis: 30000, // Idle timeout
connectionTimeoutMillis: 20000, // Connection timeout
keepAlive: true,
});
### **Error Details**
Error executing query Error: timeout exceeded when trying to connect
at C:\products\ATMS\scita_atms_backend\node_modules\pg-pool\index.js:45:11
at runNextTicks (node:internal/process/task_queues:60:5)
at listOnTimeout (node:internal/timers:540:9)
at process.processTimers (node:internal/timers:514:7)
at async updatealertEventTypeAllDuration (C:\products\ATMS\scita_atms_backend\db_updater.js:433:22)
node:internal/process/promises:289
triggerUncaughtException(err, true /* fromPromise */);
^
Error: timeout exceeded when trying to connect
at C:\products\ATMS\scita_atms_backend\node_modules\pg-pool\index.js:45:11
at runNextTicks (node:internal/process/task_queues:60:5)
at listOnTimeout (node:internal/timers:540:9)
at process.processTimers (node:internal/timers:514:7)
at async updatealertEventTypeAllDuration (C:\products\ATMS\scita_atms_backend\db_updater.js:433:22)
Node.js v20.10.0
Yesterday we upgraded from [email protected] (with [email protected]) to [email protected] (with [email protected]), specifically from our package-lock.json:
We started seeing a spike in "timeout exceeded when trying to connect" errors with this stacktrace:

This is a pretty basic express app with a postgres 12 backend running on node 12.
We report metrics on the connection pool max/total/idle/waiting count values and there is an obvious spike in the wait count from the time the 8.2.1 upgrade was deployed (around 9am CT yesterday) and then the drop when we reverted that change (about 6am CT today):
That corresponds with our API request/response/error rates (again, just a simple express app over a pg db):
We're not sure how to debug this. These are the relevant values we're using related to the Pool config:
We have a staging environment where this showed up as well but we didn't have an alert setup for it (we do now). So if there is something we can do to help debug this and provide information back we can probably do that in our staging environment.
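A minimal sketch of sampling the pool counters behind the max/total/idle/waiting metrics mentioned above; the reporting interval and console logging are assumptions, not the app's actual metrics pipeline:

```js
const { Pool } = require('pg')

const pool = new Pool({ max: 150 }) // max taken from the pool sizes discussed in this thread

// pg's Pool exposes counters that can be shipped to a metrics backend.
setInterval(() => {
  console.log({
    total: pool.totalCount,     // clients currently created
    idle: pool.idleCount,       // clients sitting idle in the pool
    waiting: pool.waitingCount, // callers queued for a free client
  })
}, 10000)
```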