-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add keepalives params to postgres driver config #3860
Conversation
9340abe
to
ee7a1e0
Compare
The graphs certainly look better, esp the one showing connections living longer. I suppose it makes sense that that the number of blue connections is now closer to 10 (ie RAILS_MAX_THREADS) if we're not wasting connections. I had a look at the system values on QA:
So is this a fair summary of the 4-minute hypothesis? Azure kills the connection after 240 seconds (4 * 60), so the keepalive at 7200s would never make it. Dead connections hang around in the pool until they're picked up by Rails, which chucks them ("client forcibly closed connection") and reconnects And in the new world, TCP keepalives are sent every 60s. If one fails we retry 3 times over the course of 30s before declaring the connection dead? This means many fewer dead connections in the pool. tl;dr I think this change makes sense. One q: should we change the TCP values here or in the container config? Are there other things (Logit?) that could benefit? |
Precisely
That's a very good point, we could end up losing logs again if we stop frequent availability checks (e.g. dwbutler/logstash-logger#156 for reminder). There are two other kinds of longstanding TCP connections from the app to consider:
It should be possible to do something similar from the client for Redis: redis/redis-rb@5885967 Setting these values at the container level would be preferable, of course, if possible. There's some interplay between Docker host and container that could get in the way. |
Good point. Let's configure the clients — this can be the first |
@@ -9,6 +9,10 @@ default: &default | |||
host: <%= ENV['DB_HOSTNAME'] %> | |||
port: <%= ENV['DB_PORT'] %> | |||
database: <%= ENV['DB_DATABASE'] %> | |||
keepalives: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is ok bc it made a change in devops, but I'd expect these to be nested under variables
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pg
gem will append all options as part of the query string, see https://github.com/ged/ruby-pg/blob/master/lib/pg/connection.rb#L79 and https://github.com/ged/ruby-pg/blob/fb465855ce1dd12cf7eb69c92af222052aa956ce/spec/pg/connection_spec.rb#L160
Not sure where variables
are being referenced!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, variables
is a Rails thing, see rails/rails@97d06e8
Anything defined there will be set for the session via SQL, after the connection is established. Any top-level items, on the other hand, will be part of the connection string (ruby-pg
implementation).
Context
We get many postgres errors/logs about dropped connections and new connections being established. The number of such logs went up recently when we reduced the number of availability checks (!), which exercise the database. This was matched with degraded service performance / rails response times.
This makes me think idle database connections are being silently dropped by Azure after 4 mins, which seems to be a default setting on the platform. This would explain the degraded performance, as Rails is unaware the connections have been dropped, attempts to send statements to the database, errors, then has to re-establish the database connection and eventually succeeds. Having frequent availability checks that exercise the database connection means fewer of these connections are dropped / more healthy database connections are available in the pool.
Changes proposed in this pull request
Add explicit client-side
keepalives
configuration so that Rails db connections to Postgres send keepalive probes every minute. If this dropped db connections theory is correct, the number of dropped postgres connection events and connection re-establishing events will drop close to zero.UPDATE
@vigneshmsft has deployed this branch to the devops environment and the db connections profile seems to have changed, possibly improved. The number of closed/re-established connections is not zero but there are scheduled background processes working and availability checks are on.
New connections:
Closed connections:
Active db connections:
(red line is min active db connections, blue line is max)
Sidekiq logs:
Guidance to review
I've tested this locally with
tcpdump
and I can see the additional probes.For configuring these values, I started adding an environment variables, but decided it may not be worth it, as the keepalive settings should be harmless on all environments.
Relevant Postgresql documentation: https://www.postgresql.org/docs/9.3/libpq-connect.html#LIBPQ-CONNECT-OPTIONS
In particular,
keepalives
Controls whether client-side TCP keepalives are used. The default value is 1, meaning on, but you can change this to 0, meaning off, if keepalives are not wanted. This parameter is ignored for connections made via a Unix-domain socket.keepalives_idle
Controls the number of seconds of inactivity after which TCP should send a keepalive message to the server. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPIDLE or an equivalent socket option is available, and on Windows; on other systems, it has no effect.keepalives_interval
Controls the number of seconds after which a TCP keepalive message that is not acknowledged by the server should be retransmitted. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPINTVL or an equivalent socket option is available, and on Windows; on other systems, it has no effect.keepalives_count
Controls the number of TCP keepalives that can be lost before the client's connection to the server is considered dead. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPCNT or an equivalent socket option is available; on other systems, it has no effect.Are the numbers chosen for this PR reasonable?
Link to Trello card
No Trello card, responding to an incident
Things to check