
Add keepalives params to postgres driver config #3860

Merged — 1 commit merged into master on Jan 18, 2021

Conversation

@mnacos (Contributor) commented on Jan 15, 2021

Context

We get many postgres errors/logs about dropped connections and new connections being established. The number of such logs went up recently when we reduced the number of availability checks (!), which exercise the database. This coincided with degraded service performance / Rails response times.

[image]

This makes me think idle database connections are being silently dropped by Azure after 4 mins, which seems to be a default setting on the platform. This would explain the degraded performance: Rails is unaware the connections have been dropped, attempts to send statements to the database, errors, and then has to re-establish the database connection before eventually succeeding. Having frequent availability checks that exercise the database connection means fewer of these connections are dropped, so more healthy database connections are available in the pool.

Changes proposed in this pull request

Add explicit client-side keepalives configuration so that Rails db connections to Postgres send keepalive probes every minute. If this dropped-connections theory is correct, the number of dropped postgres connection events and connection re-establishment events should drop close to zero.
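Concretely, the change amounts to something along these lines in config/database.yml (a sketch, not the exact diff; the adapter line is assumed, and the values reflect what is discussed in this PR: first probe after 60s idle, retransmitted every 10s, 3 lost probes before the connection is considered dead):

```yaml
# Sketch of the proposed config/database.yml change (values per the PR discussion)
default: &default
  adapter: postgresql          # assumed, not shown in the diff
  host: <%= ENV['DB_HOSTNAME'] %>
  port: <%= ENV['DB_PORT'] %>
  database: <%= ENV['DB_DATABASE'] %>
  # Top-level entries are passed through to the libpq connection string
  keepalives: 1                # enable client-side TCP keepalives
  keepalives_idle: 60          # seconds of inactivity before the first probe
  keepalives_interval: 10      # seconds between unacknowledged probes
  keepalives_count: 3          # lost probes before the connection is declared dead
```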

UPDATE

@vigneshmsft has deployed this branch to the devops environment and the db connections profile seems to have changed, possibly improved. The number of closed/re-established connections is not zero but there are scheduled background processes working and availability checks are on.

New connections:

[image]

Closed connections:

[image]

Active db connections:

[image]

(red line is min active db connections, blue line is max)

Sidekiq logs:

[image]

Guidance to review

I've tested this locally with tcpdump and I can see the additional probes.

For configuring these values, I started adding environment variables, but decided it may not be worth it, as the keepalive settings should be harmless in all environments.

Relevant PostgreSQL documentation: https://www.postgresql.org/docs/9.3/libpq-connect.html#LIBPQ-CONNECT-OPTIONS

In particular,

keepalives Controls whether client-side TCP keepalives are used. The default value is 1, meaning on, but you can change this to 0, meaning off, if keepalives are not wanted. This parameter is ignored for connections made via a Unix-domain socket.

keepalives_idle Controls the number of seconds of inactivity after which TCP should send a keepalive message to the server. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPIDLE or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_interval Controls the number of seconds after which a TCP keepalive message that is not acknowledged by the server should be retransmitted. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPINTVL or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_count Controls the number of TCP keepalives that can be lost before the client's connection to the server is considered dead. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPCNT or an equivalent socket option is available; on other systems, it has no effect.

Are the numbers chosen for this PR reasonable?

Link to Trello card

No Trello card, responding to an incident

Things to check

  • This code does not rely on migrations in the same Pull Request
  • If this code includes a migration adding or changing columns, it also backfills existing records for consistency
  • API release notes have been updated if necessary
  • New environment variables have been added to the Azure config

@mnacos mnacos added the Core Shared issue between candidate/vendor/support/API label Jan 15, 2021
@tijmenb tijmenb temporarily deployed to apply-for-te-add-keepal-kqwl13 January 15, 2021 13:32 Inactive
@mnacos mnacos force-pushed the add-keepalives-to-postgres-config branch from 9340abe to ee7a1e0 Compare January 15, 2021 17:14
@mnacos mnacos temporarily deployed to apply-for-te-add-keepal-kqwl13 January 15, 2021 17:15 Inactive
@tijmenb tijmenb temporarily deployed to apply-for-te-add-keepal-2lwwal January 15, 2021 17:18 Inactive
@duncanjbrown (Contributor)

The graphs certainly look better, especially the one showing connections living longer. I suppose it makes sense that the number of blue connections is now closer to 10 (i.e. RAILS_MAX_THREADS) if we're not wasting connections.

I had a look at the system values on QA:

/app # cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
/app # cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
/app # cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

So is this a fair summary of the 4-minute hypothesis?

Azure kills the connection after 240 seconds (4 * 60), so the keepalive at 7200s would never make it. Dead connections hang around in the pool until they're picked up by Rails, which chucks them ("client forcibly closed connection") and reconnects

And in the new world, TCP keepalives are sent every 60s. If one fails we retry 3 times over the course of 30s before declaring the connection dead? This means many fewer dead connections in the pool.
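That summary can be checked with some back-of-the-envelope arithmetic (the 60s/10s/3 values are assumed from this PR's discussion, not authoritative):

```ruby
# Hypothetical keepalive values from this PR's discussion:
# first probe after 60s idle, retransmit every 10s, give up after 3 lost probes.
idle     = 60
interval = 10
count    = 3

# Worst case: a connection dies right after a probe is acknowledged, so the
# next probe waits the full idle period, then all retries must fail.
worst_case_detection = idle + interval * count   # 60 + 30 = 90 seconds
azure_idle_timeout   = 4 * 60                    # 240 seconds

puts worst_case_detection                        # prints 90
puts worst_case_detection < azure_idle_timeout   # prints true
```

So a dead connection is noticed well inside Azure's 4-minute idle window, and healthy-but-idle connections are kept alive by the probes themselves.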

tl;dr I think this change makes sense.

One q: should we change the TCP values here or in the container config? Are there other things (Logit?) that could benefit?

@mnacos (Contributor, Author) commented on Jan 18, 2021

So is this a fair summary of the 4-minute hypothesis?

Azure kills the connection after 240 seconds (4 * 60), so the keepalive at 7200s would never make it. Dead connections hang around in the pool until they're picked up by Rails, which chucks them ("client forcibly closed connection") and reconnects

And in the new world, TCP keepalives are sent every 60s. If one fails we retry 3 times over the course of 30s before declaring the connection dead? This means many fewer dead connections in the pool.

Precisely

One q: should we change the TCP values here or in the container config? Are there other things (Logit?) that could benefit?

That's a very good point: we could end up losing logs again if we stop frequent availability checks (see dwbutler/logstash-logger#156 as a reminder). There are two other kinds of long-standing TCP connections from the app to consider:

  • logstash TCP connection to Logit
  • sidekiq worker connections to Redis (assuming queueing tasks from web threads and Clockwork uses new connections, but we should probably verify)

It should be possible to do something similar from the client for Redis: redis/redis-rb@5885967
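For illustration, what the linked redis-rb change does at the transport layer can be sketched with a plain Ruby socket. This is a sketch, not the redis-rb implementation: the helper name is made up, the values mirror this PR's choices, and the TCP_KEEP* constants are guarded because they are platform-specific (Linux exposes them; other systems may not):

```ruby
require "socket"

# Hypothetical helper: enable client-side TCP keepalives on a socket,
# mirroring libpq's keepalives_idle / keepalives_interval / keepalives_count.
def enable_keepalive(sock, idle: 60, interval: 10, count: 3)
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  # Platform-specific options; present on Linux, guarded for portability.
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE,  idle)     if Socket.const_defined?(:TCP_KEEPIDLE)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, interval) if Socket.const_defined?(:TCP_KEEPINTVL)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT,   count)    if Socket.const_defined?(:TCP_KEEPCNT)
  sock
end

# Demo against a throwaway local listener.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])
enable_keepalive(client)
puts client.getsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE).bool  # prints true
client.close
server.close
```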

Setting these values at the container level would be preferable, of course, if possible. There's some interplay between Docker host and container that could get in the way.
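If we did go the container route, it might look something like this (a docker-compose-style sketch; the service name is hypothetical, and whether the platform honours per-container sysctls would need checking):

```yaml
# Hypothetical per-container sysctl sketch. Namespaced sysctls such as
# net.ipv4.* can usually be set per container; host-level settings cannot.
services:
  web:                                    # hypothetical service name
    sysctls:
      net.ipv4.tcp_keepalive_time: 60     # currently 7200 on QA
      net.ipv4.tcp_keepalive_intvl: 10    # currently 75
      net.ipv4.tcp_keepalive_probes: 3    # currently 9
```

Unlike the libpq change, this would cover every TCP connection the container makes (Logit, Redis, Postgres) in one place.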

@duncanjbrown (Contributor)

Setting these values at the container level would be preferable, of course, if possible. There's some interplay between Docker host and container that could get in the way.

Good point. Let's configure the clients — this can be the first.

@@ -9,6 +9,10 @@ default: &default
host: <%= ENV['DB_HOSTNAME'] %>
port: <%= ENV['DB_PORT'] %>
database: <%= ENV['DB_DATABASE'] %>
keepalives: 1
Reviewer (Contributor) commented on this line:
I assume this is ok bc it made a change in devops, but I'd expect these to be nested under variables?

@mnacos (Contributor, Author) replied:

So, variables is a Rails thing, see rails/rails@97d06e8

Anything defined there will be set for the session via SQL, after the connection is established. Any top-level items, on the other hand, will be part of the connection string (ruby-pg implementation).
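For contrast, a sketch of the two placements (the statement_timeout value is purely illustrative):

```yaml
default: &default
  # Top-level entries become part of the libpq connection string:
  keepalives: 1
  variables:
    # Entries nested under `variables` are applied via SET after connect:
    statement_timeout: 5000
```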

@mnacos mnacos merged commit 4eb67bd into master Jan 18, 2021
@mnacos mnacos deleted the add-keepalives-to-postgres-config branch January 18, 2021 15:39