Establish a new connection instead of reconnect!

https://bugzilla.redhat.com/show_bug.cgi?id=1626005 It was found that reconnect! against a postgresql SSL connection could occur while postgresql was stopping or starting, even before the "database system is ready to accept connections." If this is timed correctly, this connection could fail SSL handshaking, and the client could "give" up trying SSL connections and only ever try non-SSL. The workaround until this commit, was to restart the client process such as the worker or evm server. This was evident in the postgresql log where you'd see a client, 172.168.1.99, initiate a SSL connection: 2018-08-30 11:23:52 EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG: connection authorized: user=root database=vmdb_production SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off) Then, sometime later postgresql restarts and this connection timed it "right" and all future connections were attempted with SSL off and fail: 2018-08-30 11:23:52 EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL: no pg_hba.conf entry for host "172.168.1.99", user "root", database "vmdb_production", SSL off Somewhere deep in pg/libpq/openssl, the client code failed to initialize the SSL handshaking code and will continually only attempt non-SSL connections until you restart the process. Until we can dig deep into the pg/libpq/openssl code, this code change forces the re-initialization of the SSL handshaking code in the client by discarding the existing connection and establishing a new one. To test this, I first recreated this solely in an ActiveRecord environment, by restarting postgresql in one ssh session on an appliance and running this in rails console in another: ``` 500.times do begin puts conn.select_value("select count(*) from users;") rescue PG::Error, ActiveRecord::StatementInvalid begin conn.reconnect! rescue retry end end sleep 0.001 end ``` Then, I verified this code change in a full appliance server/worker environment by first recreating this (after many iterations) and then confirming it did not recreate with this change applied: ``` for x in `seq 1 500` do # restart postgresql is too fast, so I needed to first stop it, sleep # a little, then start it, sleep a bit longer, then keep doing that in # a loop, while the workers/servers would try to reconnect!. small_rand=`ruby -e 'puts rand(4)'` rand=`ruby -e 'puts rand(10) + 10'` echo $small_rand echo $rand systemctl stop rh-postgresql95-postgresql sleep $small_rand systemctl start rh-postgresql95-postgresql sleep $rand done ```
jrafanie · Sep 21, 2018 · 2f236f6 · 2f236f6
1 parent 156cd6b
commit 2f236f6
Show file tree

Hide file tree

Showing 4 changed files with 14 additions and 5 deletions.
diff --git a/app/models/miq_queue_worker_base/runner.rb b/app/models/miq_queue_worker_base/runner.rb
@@ -106,7 +106,10 @@ def deliver_queue_message(msg)
       if status == MiqQueue::STATUS_TIMEOUT
         begin
           _log.info("#{log_prefix} Reconnecting to DB after timeout error during queue deliver")
-          ActiveRecord::Base.connection.reconnect!
+
+          # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
+          spec_name = ActiveRecord::Base.connection_specification_name
+          ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
           @worker.update_spid!
         rescue => err
           do_exit("Exiting worker due to timeout error that could not be recovered from...error: #{err.class.name}: #{err.message}", 1)

diff --git a/app/models/miq_server.rb b/app/models/miq_server.rb
@@ -349,7 +349,9 @@ def monitor
 
     begin
       _log.info("Reconnecting to database after error...")
-      ActiveRecord::Base.connection.reconnect!
+      # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
+      spec_name = ActiveRecord::Base.connection_specification_name
+      ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
     rescue Exception => err
       _log.error("#{err.message}, during reconnect!")
     else

diff --git a/app/models/miq_server/environment_management.rb b/app/models/miq_server/environment_management.rb
@@ -55,7 +55,9 @@ def get_network_information
     end
 
     def validate_database
-      ActiveRecord::Base.connection.reconnect!
+      # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
+      spec_name = ActiveRecord::Base.connection_specification_name
+      ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
 
       # Log the Versions
       _log.info("Database Adapter: [#{ActiveRecord::Base.connection.adapter_name}], version: [#{ActiveRecord::Base.connection.database_version}]")                   if ActiveRecord::Base.connection.respond_to?(:database_version)

diff --git a/app/models/miq_server/queue_management.rb b/app/models/miq_server/queue_management.rb
@@ -21,9 +21,11 @@ def process_miq_queue
       if status == "timeout"
         begin
           _log.info("Reconnecting to DB after timeout error during queue deliver")
-          ActiveRecord::Base.connection.reconnect!
+          # Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
+          spec_name = ActiveRecord::Base.connection_specification_name
+          ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
         rescue => err
-          _log.error("Error encountered during <ActiveRecord::Base.connection.reconnect!> error:#{err.class.name}: #{err.message}")
+          _log.error("Error encountered reconnecting to the database, error:#{err.class.name}: #{err.message}")
         end
       end