
Add keepalives params to postgres driver config #3860

Merged — 1 commit merged into master on Jan 18, 2021

Conversation

@mnacos (Contributor) commented on Jan 15, 2021

Context

We get many postgres errors/logs about dropped connections and new connections being established. The number of such logs went up recently when we reduced the number of availability checks (!), which exercise the database. This coincided with degraded service performance / Rails response times.

[image]

This makes me think idle database connections are being silently dropped by Azure after 4 mins, which seems to be a default setting on the platform. This would explain the degraded performance: Rails is unaware the connections have been dropped, attempts to send statements to the database, errors, and then has to re-establish the database connection before eventually succeeding. Having frequent availability checks that exercise the database connection means fewer of these connections are dropped, so more healthy database connections are available in the pool.

Changes proposed in this pull request

Add explicit client-side keepalives configuration so that Rails db connections to Postgres send keepalive probes every minute. If this dropped-connections theory is correct, the number of dropped postgres connection events and connection re-establishment events should drop close to zero.
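Concretely, the change amounts to something along these lines in config/database.yml (a sketch, not the exact diff; the adapter line is assumed, and the values reflect what is discussed in this PR: first probe after 60s idle, retransmitted every 10s, 3 lost probes before the connection is considered dead):

```yaml
# Sketch of the proposed config/database.yml change (values per the PR discussion)
default: &default
  adapter: postgresql          # assumed, not shown in the diff
  host: <%= ENV['DB_HOSTNAME'] %>
  port: <%= ENV['DB_PORT'] %>
  database: <%= ENV['DB_DATABASE'] %>
  # Top-level entries are passed through to the libpq connection string
  keepalives: 1                # enable client-side TCP keepalives
  keepalives_idle: 60          # seconds of inactivity before the first probe
  keepalives_interval: 10      # seconds between unacknowledged probes
  keepalives_count: 3          # lost probes before the connection is declared dead
```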

UPDATE

@vigneshmsft has deployed this branch to the devops environment and the db connections profile seems to have changed, possibly improved. The number of closed/re-established connections is not zero but there are scheduled background processes working and availability checks are on.

New connections:

[image]

Closed connections:

[image]

Active db connections:

[image]

(red line is min active db connections, blue line is max)

Sidekiq logs:

[image]

Guidance to review

I've tested this locally with tcpdump and I can see the additional probes.

For configuring these values, I started adding environment variables, but decided it may not be worth it, as the keepalive settings should be harmless in all environments.

Relevant PostgreSQL documentation: https://www.postgresql.org/docs/9.3/libpq-connect.html#LIBPQ-CONNECT-OPTIONS

In particular,

keepalives Controls whether client-side TCP keepalives are used. The default value is 1, meaning on, but you can change this to 0, meaning off, if keepalives are not wanted. This parameter is ignored for connections made via a Unix-domain socket.

keepalives_idle Controls the number of seconds of inactivity after which TCP should send a keepalive message to the server. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPIDLE or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_interval Controls the number of seconds after which a TCP keepalive message that is not acknowledged by the server should be retransmitted. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPINTVL or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_count Controls the number of TCP keepalives that can be lost before the client's connection to the server is considered dead. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPCNT or an equivalent socket option is available; on other systems, it has no effect.

Are the numbers chosen for this PR reasonable?

Link to Trello card

No Trello card, responding to an incident

Things to check

  • This code does not rely on migrations in the same Pull Request
  • If this code includes a migration adding or changing columns, it also backfills existing records for consistency
  • API release notes have been updated if necessary
  • New environment variables have been added to the Azure config

@mnacos mnacos added the Core Shared issue between candidate/vendor/support/API label Jan 15, 2021
@tijmenb tijmenb temporarily deployed to apply-for-te-add-keepal-kqwl13 January 15, 2021 13:32 Inactive
@mnacos mnacos force-pushed the add-keepalives-to-postgres-config branch from 9340abe to ee7a1e0 Compare January 15, 2021 17:14
@mnacos mnacos temporarily deployed to apply-for-te-add-keepal-kqwl13 January 15, 2021 17:15 Inactive
@tijmenb tijmenb temporarily deployed to apply-for-te-add-keepal-2lwwal January 15, 2021 17:18 Inactive
@duncanjbrown (Contributor)

The graphs certainly look better, especially the one showing connections living longer. I suppose it makes sense that the number of blue connections is now closer to 10 (i.e. RAILS_MAX_THREADS) if we're not wasting connections.

I had a look at the system values on QA:

/app # cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
/app # cat /proc/sys/net/ipv4/tcp_keepalive_intvl
75
/app # cat /proc/sys/net/ipv4/tcp_keepalive_probes
9

So is this a fair summary of the 4-minute hypothesis?

Azure kills the connection after 240 seconds (4 * 60), so the keepalive at 7200s would never make it. Dead connections hang around in the pool until they're picked up by Rails, which chucks them ("client forcibly closed connection") and reconnects

And in the new world, TCP keepalives are sent every 60s. If one fails we retry 3 times over the course of 30s before declaring the connection dead? This means many fewer dead connections in the pool.
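That summary can be checked with some back-of-the-envelope arithmetic (the 60s/10s/3 values are assumed from this PR's discussion, not authoritative):

```ruby
# Hypothetical keepalive values from this PR's discussion:
# first probe after 60s idle, retransmit every 10s, give up after 3 lost probes.
idle     = 60
interval = 10
count    = 3

# Worst case: a connection dies right after a probe is acknowledged, so the
# next probe waits the full idle period, then all retries must fail.
worst_case_detection = idle + interval * count   # 60 + 30 = 90 seconds
azure_idle_timeout   = 4 * 60                    # 240 seconds

puts worst_case_detection                        # prints 90
puts worst_case_detection < azure_idle_timeout   # prints true
```

So a dead connection is noticed well inside Azure's 4-minute idle window, and healthy-but-idle connections are kept alive by the probes themselves.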

tl;dr I think this change makes sense.

One q: should we change the TCP values here or in the container config? Are there other things (Logit?) that could benefit?

@mnacos (Contributor, Author) commented on Jan 18, 2021

So is this a fair summary of the 4-minute hypothesis?

Azure kills the connection after 240 seconds (4 * 60), so the keepalive at 7200s would never make it. Dead connections hang around in the pool until they're picked up by Rails, which chucks them ("client forcibly closed connection") and reconnects

And in the new world, TCP keepalives are sent every 60s. If one fails we retry 3 times over the course of 30s before declaring the connection dead? This means many fewer dead connections in the pool.

Precisely

One q: should we change the TCP values here or in the container config? Are there other things (Logit?) that could benefit?

That's a very good point: we could end up losing logs again if we stop frequent availability checks (see dwbutler/logstash-logger#156 as a reminder). There are two other kinds of long-standing TCP connections from the app to consider:

  • logstash TCP connection to Logit
  • sidekiq worker connections to Redis (assuming queueing tasks from web threads and Clockwork uses new connections, but we should probably verify)

It should be possible to do something similar from the client for Redis: redis/redis-rb@5885967
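For illustration, what the linked redis-rb change does at the transport layer can be sketched with a plain Ruby socket. This is a sketch, not the redis-rb implementation: the helper name is made up, the values mirror this PR's choices, and the TCP_KEEP* constants are guarded because they are platform-specific (Linux exposes them; other systems may not):

```ruby
require "socket"

# Hypothetical helper: enable client-side TCP keepalives on a socket,
# mirroring libpq's keepalives_idle / keepalives_interval / keepalives_count.
def enable_keepalive(sock, idle: 60, interval: 10, count: 3)
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  # Platform-specific options; present on Linux, guarded for portability.
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE,  idle)     if Socket.const_defined?(:TCP_KEEPIDLE)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, interval) if Socket.const_defined?(:TCP_KEEPINTVL)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT,   count)    if Socket.const_defined?(:TCP_KEEPCNT)
  sock
end

# Demo against a throwaway local listener.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.addr[1])
enable_keepalive(client)
puts client.getsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE).bool  # prints true
client.close
server.close
```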

Setting these values at the container level would be preferable, of course, if possible. There's some interplay between Docker host and container that could get in the way.
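If we did go the container route, it might look something like this (a docker-compose-style sketch; the service name is hypothetical, and whether the platform honours per-container sysctls would need checking):

```yaml
# Hypothetical per-container sysctl sketch. Namespaced sysctls such as
# net.ipv4.* can usually be set per container; host-level settings cannot.
services:
  web:                                    # hypothetical service name
    sysctls:
      net.ipv4.tcp_keepalive_time: 60     # currently 7200 on QA
      net.ipv4.tcp_keepalive_intvl: 10    # currently 75
      net.ipv4.tcp_keepalive_probes: 3    # currently 9
```

Unlike the libpq change, this would cover every TCP connection the container makes (Logit, Redis, Postgres) in one place.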

@duncanjbrown (Contributor)

Setting these values at the container level would be preferable, of course, if possible. There's some interplay between Docker host and container that could get in the way.

Good point. Let's configure the clients — this can be the first.

@@ -9,6 +9,10 @@ default: &default
host: <%= ENV['DB_HOSTNAME'] %>
port: <%= ENV['DB_PORT'] %>
database: <%= ENV['DB_DATABASE'] %>
keepalives: 1
Reviewer (Contributor) commented on this line:
I assume this is ok bc it made a change in devops, but I'd expect these to be nested under variables?

@mnacos (Contributor, Author) replied:

So, variables is a Rails thing, see rails/rails@97d06e8

Anything defined there will be set for the session via SQL, after the connection is established. Any top-level items, on the other hand, will be part of the connection string (ruby-pg implementation).
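For contrast, a sketch of the two placements (the statement_timeout value is purely illustrative):

```yaml
default: &default
  # Top-level entries become part of the libpq connection string:
  keepalives: 1
  variables:
    # Entries nested under `variables` are applied via SET after connect:
    statement_timeout: 5000
```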

@mnacos mnacos merged commit 4eb67bd into master Jan 18, 2021
@mnacos mnacos deleted the add-keepalives-to-postgres-config branch January 18, 2021 15:39