Establish a new connection instead of reconnect! #18010
Note, it looks like we do this in the messaging gem too:
ActiveRecord::Base.connection.reconnect!
# Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection("primary"))
Do we need to explicitly specify the primary connection name or will this work without a parameter?
Yeah, it needs to disconnect the pool and drop the key for the specification name from the `owner_to_pool` hash.
But I don't think that's the `remove_connection` we call.
irb(main):001:0> ActiveRecord::Base.method(:remove_connection).source_location
=> ["/home/ncarboni/.gem/ruby/2.4.3/gems/activerecord-5.0.7/lib/active_record/connection_handling.rb", 136]
Or is `@connection_specification_name` not "primary" here?
Yeah, that should work too. Good point.
Note, I chose the hardcoded "primary" because these places were explicitly calling `reconnect!` on `AR::Base` and not on a model/subclass. I guess calling `connection_specification_name` on `AR::Base` is nicer than the hardcoded "primary". I'd rather not rely on `connection_handler.retrieve_connection_pool(name)` returning "primary" when the name is nil, though.
What do you think?
spec_name = ActiveRecord::Base.connection_specification_name
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
Yeah, that looks good.
It does seem a bit strange that they chose to go straight to the instance variable in `remove_connection` rather than using the logic in the `connection_specification_name` method. I'm sure there are Reasons for the way it's done, but your version looks more like the behavior we want here.
The version you wrote up in the comment, that is... Just realized that was a bit ambiguous, as both versions are "yours".
😆
Note, I was going to put it on one line to avoid adding more lines, but 3 `AR::Base` on one line is one too many.
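The name-resolution point raised in this thread can be illustrated with a small stand-in. This is a simplified model of the lookup as described above, not the ActiveRecord source: `FakeBase`, `FakeModel`, `OtherDbModel`, and `raw_specification_name` are hypothetical names used only for illustration. It shows why asking the method for the name gives "primary" on the base class, while the raw instance variable may simply be unset.

```ruby
# Simplified stand-in for the lookup discussed above; NOT ActiveRecord itself.
class FakeBase
  class << self
    attr_writer :connection_specification_name

    # Falls back to "primary" on the root class; subclasses ask their parent.
    def connection_specification_name
      if !defined?(@connection_specification_name) || @connection_specification_name.nil?
        return self == FakeBase ? "primary" : superclass.connection_specification_name
      end
      @connection_specification_name
    end

    # Hypothetical helper: reads the raw instance variable with no fallback.
    def raw_specification_name
      defined?(@connection_specification_name) ? @connection_specification_name : nil
    end
  end
end

class FakeModel < FakeBase; end
class OtherDbModel < FakeBase; end
OtherDbModel.connection_specification_name = "other"

puts FakeBase.connection_specification_name      # primary
puts FakeBase.raw_specification_name.inspect     # nil
puts FakeModel.connection_specification_name     # primary
puts OtherDbModel.connection_specification_name  # other
```

Under this model, passing the resolved name explicitly avoids depending on how a nil name is treated further down the call chain.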
Great work reproducing this @jrafanie!
https://bugzilla.redhat.com/show_bug.cgi?id=1626005

It was found that `reconnect!` against a postgresql SSL connection could occur while postgresql was stopping or starting, even before the "database system is ready to accept connections." If this is timed correctly, this connection could fail SSL handshaking, and the client could "give up" trying SSL connections and only ever try non-SSL. The workaround, until this commit, was to restart the client process, such as the worker or evm server.

This was evident in the postgresql log, where you'd see a client, 172.168.1.99, initiate an SSL connection:

2018-08-30 11:23:52 EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG: connection authorized: user=root database=vmdb_production SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)

Then, sometime later, postgresql restarted, this connection timed it "right", and all future connections were attempted with SSL off and failed:

2018-08-30 11:23:52 EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL: no pg_hba.conf entry for host "172.168.1.99", user "root", database "vmdb_production", SSL off

Somewhere deep in pg/libpq/openssl, the client code failed to initialize the SSL handshaking code and will continually only attempt non-SSL connections until you restart the process.

Until we can dig deep into the pg/libpq/openssl code, this code change forces the re-initialization of the SSL handshaking code in the client by discarding the existing connection and establishing a new one.

To test this, I first recreated this solely in an ActiveRecord environment, by restarting postgresql in one ssh session on an appliance and running this in rails console in another:

```
# Assumes conn was assigned beforehand, e.g. conn = ActiveRecord::Base.connection
500.times do
  begin
    puts conn.select_value("select count(*) from users;")
  rescue PG::Error, ActiveRecord::StatementInvalid
    begin
      conn.reconnect!
    rescue
      retry
    end
  end
  sleep 0.001
end
```

Then, I verified this code change in a full appliance server/worker environment by first recreating this (after many iterations) and then confirming it did not recreate with this change applied:

```
for x in `seq 1 500`
do
  # restart postgresql is too fast, so I needed to first stop it, sleep
  # a little, then start it, sleep a bit longer, then keep doing that in
  # a loop, while the workers/servers would try to reconnect!.
  small_rand=`ruby -e 'puts rand(4)'`
  rand=`ruby -e 'puts rand(10) + 10'`
  echo $small_rand
  echo $rand
  systemctl stop rh-postgresql95-postgresql
  sleep $small_rand
  systemctl start rh-postgresql95-postgresql
  sleep $rand
done
```
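The "discard and establish a new one" step can be sketched with a toy handler. This is a minimal model under stated assumptions, not ActiveRecord's implementation: `ToyPool` and `ToyHandler` are hypothetical classes, and the config hash (including `sslmode: "require"`) is illustrative only. It shows why `establish_connection(remove_connection(name))` can be chained: removing the pool disconnects it, drops its key from the `owner_to_pool` hash, and hands back the configuration needed to build a fresh pool.

```ruby
# Toy model of the remove-then-establish pattern; hypothetical classes only.
class ToyPool
  attr_reader :config

  def initialize(config)
    @config = config
    @disconnected = false
  end

  def disconnect!
    @disconnected = true
  end

  def disconnected?
    @disconnected
  end
end

class ToyHandler
  def initialize
    @owner_to_pool = {}
  end

  # Builds a brand-new pool object for the given specification name.
  def establish_connection(name, config)
    @owner_to_pool[name] = ToyPool.new(config)
  end

  # Disconnects the pool, drops its key, and returns its config (nil if absent).
  def remove_connection(name)
    pool = @owner_to_pool.delete(name)
    return nil unless pool
    pool.disconnect!
    pool.config
  end

  def pool_for(name)
    @owner_to_pool[name]
  end
end

handler = ToyHandler.new
handler.establish_connection("primary", { adapter: "postgresql", sslmode: "require" })
old_pool = handler.pool_for("primary")

# The workaround: throw the old pool away and build a new one from its config.
config = handler.remove_connection("primary")
handler.establish_connection("primary", config)
new_pool = handler.pool_for("primary")

puts old_pool.disconnected?     # true
puts old_pool.equal?(new_pool)  # false
```

The design point is that nothing from the old pool object survives except its configuration, which is what distinguishes this from reconnecting through the existing connection.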
For reference, this is an example where postgresql accepts a connection before it's ready to accept them... leading to this problem:
ca3d193 to 2f236f6
OK, updated with @carbonin's suggestions, thanks!
Checked commit jrafanie@2f236f6 with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
Note, using the above LOVELY
Looking good billy rae