From 2f236f684f4d76a884d307e03fc3e80a0beb2bed Mon Sep 17 00:00:00 2001 From: Joe Rafaniello Date: Tue, 18 Sep 2018 16:04:52 -0400 Subject: [PATCH] Establish a new connection instead of reconnect! https://bugzilla.redhat.com/show_bug.cgi?id=1626005 It was found that reconnect! against a postgresql SSL connection could occur while postgresql was stopping or starting, even before the "database system is ready to accept connections." If this is timed correctly, this connection could fail SSL handshaking, and the client could "give" up trying SSL connections and only ever try non-SSL. The workaround until this commit, was to restart the client process such as the worker or evm server. This was evident in the postgresql log where you'd see a client, 172.168.1.99, initiate a SSL connection: 2018-08-30 11:23:52 EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG: connection authorized: user=root database=vmdb_production SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off) Then, sometime later postgresql restarts and this connection timed it "right" and all future connections were attempted with SSL off and fail: 2018-08-30 11:23:52 EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL: no pg_hba.conf entry for host "172.168.1.99", user "root", database "vmdb_production", SSL off Somewhere deep in pg/libpq/openssl, the client code failed to initialize the SSL handshaking code and will continually only attempt non-SSL connections until you restart the process. Until we can dig deep into the pg/libpq/openssl code, this code change forces the re-initialization of the SSL handshaking code in the client by discarding the existing connection and establishing a new one. To test this, I first recreated this solely in an ActiveRecord environment, by restarting postgresql in one ssh session on an appliance and running this in rails console in another: ``` 500.times do begin puts conn.select_value("select count(*) from users;") rescue PG::Error, ActiveRecord::StatementInvalid begin conn.reconnect! rescue retry end end sleep 0.001 end ``` Then, I verified this code change in a full appliance server/worker environment by first recreating this (after many iterations) and then confirming it did not recreate with this change applied: ``` for x in `seq 1 500` do # restart postgresql is too fast, so I needed to first stop it, sleep # a little, then start it, sleep a bit longer, then keep doing that in # a loop, while the workers/servers would try to reconnect!. small_rand=`ruby -e 'puts rand(4)'` rand=`ruby -e 'puts rand(10) + 10'` echo $small_rand echo $rand systemctl stop rh-postgresql95-postgresql sleep $small_rand systemctl start rh-postgresql95-postgresql sleep $rand done ``` --- app/models/miq_queue_worker_base/runner.rb | 5 ++++- app/models/miq_server.rb | 4 +++- app/models/miq_server/environment_management.rb | 4 +++- app/models/miq_server/queue_management.rb | 6 ++++-- 4 files changed, 14 insertions(+), 5 deletions(-) diff --git a/app/models/miq_queue_worker_base/runner.rb b/app/models/miq_queue_worker_base/runner.rb index bcd61062247..1c638c61fb7 100644 --- a/app/models/miq_queue_worker_base/runner.rb +++ b/app/models/miq_queue_worker_base/runner.rb @@ -106,7 +106,10 @@ def deliver_queue_message(msg) if status == MiqQueue::STATUS_TIMEOUT begin _log.info("#{log_prefix} Reconnecting to DB after timeout error during queue deliver") - ActiveRecord::Base.connection.reconnect! + + # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections + spec_name = ActiveRecord::Base.connection_specification_name + ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name)) @worker.update_spid! rescue => err do_exit("Exiting worker due to timeout error that could not be recovered from...error: #{err.class.name}: #{err.message}", 1) diff --git a/app/models/miq_server.rb b/app/models/miq_server.rb index b0caf89e9ba..af2296a6b49 100644 --- a/app/models/miq_server.rb +++ b/app/models/miq_server.rb @@ -349,7 +349,9 @@ def monitor begin _log.info("Reconnecting to database after error...") - ActiveRecord::Base.connection.reconnect! + # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections + spec_name = ActiveRecord::Base.connection_specification_name + ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name)) rescue Exception => err _log.error("#{err.message}, during reconnect!") else diff --git a/app/models/miq_server/environment_management.rb b/app/models/miq_server/environment_management.rb index ed38c76fc5a..1505d098b9c 100644 --- a/app/models/miq_server/environment_management.rb +++ b/app/models/miq_server/environment_management.rb @@ -55,7 +55,9 @@ def get_network_information end def validate_database - ActiveRecord::Base.connection.reconnect! + # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections + spec_name = ActiveRecord::Base.connection_specification_name + ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name)) # Log the Versions _log.info("Database Adapter: [#{ActiveRecord::Base.connection.adapter_name}], version: [#{ActiveRecord::Base.connection.database_version}]") if ActiveRecord::Base.connection.respond_to?(:database_version) diff --git a/app/models/miq_server/queue_management.rb b/app/models/miq_server/queue_management.rb index fb013296978..8d93b438fc2 100644 --- a/app/models/miq_server/queue_management.rb +++ b/app/models/miq_server/queue_management.rb @@ -21,9 +21,11 @@ def process_miq_queue if status == "timeout" begin _log.info("Reconnecting to DB after timeout error during queue deliver") - ActiveRecord::Base.connection.reconnect! + # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections + spec_name = ActiveRecord::Base.connection_specification_name + ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name)) rescue => err - _log.error("Error encountered during error:#{err.class.name}: #{err.message}") + _log.error("Error encountered reconnecting to the database, error:#{err.class.name}: #{err.message}") end end