Skip to content

Commit

Permalink
Establish a new connection instead of reconnect!
Browse files Browse the repository at this point in the history
https://bugzilla.redhat.com/show_bug.cgi?id=1626005

It was found that reconnect! against a postgresql SSL connection could
occur while postgresql was stopping or starting, even before the
"database system is ready to accept connections."  If this is timed
correctly, this connection could fail SSL handshaking, and the
client could "give" up trying SSL connections and only ever try non-SSL.

The workaround until this commit, was to restart the client process such
as the worker or evm server.

This was evident in the postgresql log where you'd see a client,
172.168.1.99, initiate a SSL connection:

2018-08-30 11:23:52
EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG:
connection authorized: user=root database=vmdb_production SSL enabled
(protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)

Then, sometime later postgresql restarts and this connection timed it
"right" and all future connections were attempted with SSL off and fail:

2018-08-30 11:23:52
EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL:
no pg_hba.conf entry for host "172.168.1.99", user "root", database
"vmdb_production", SSL off

Somewhere deep in pg/libpq/openssl, the client code failed to initialize
the SSL handshaking code and will continually only attempt non-SSL
connections until you restart the process.

Until we can dig deep into the pg/libpq/openssl code, this code change
forces the re-initialization of the SSL handshaking code in the client
by discarding the existing connection and establishing a new one.

To test this, I first recreated this solely in an ActiveRecord
environment, by restarting postgresql in one ssh session on an appliance
and running this in rails console in another:

```
500.times do
  begin
    puts conn.select_value("select count(*) from users;")
  rescue PG::Error, ActiveRecord::StatementInvalid
    begin
      conn.reconnect!
    rescue
      retry
    end
  end
  sleep 0.001
end
```

Then, I verified this code change in a full appliance server/worker
environment by first recreating this (after many iterations) and then
confirming it did not recreate with this change applied:

```
for x in `seq 1 500`
do
  # restart postgresql is too fast, so I needed to first stop it, sleep
  # a little, then start it, sleep a bit longer, then keep doing that in
  # a loop, while the workers/servers would try to reconnect!.
  small_rand=`ruby -e 'puts rand(4)'`
  rand=`ruby -e 'puts rand(10) + 10'`
  echo $small_rand
  echo $rand
  systemctl stop rh-postgresql95-postgresql
  sleep $small_rand
  systemctl start rh-postgresql95-postgresql
  sleep $rand
done
```
  • Loading branch information
jrafanie committed Sep 21, 2018
1 parent 156cd6b commit 2f236f6
Show file tree
Hide file tree
Showing 4 changed files with 14 additions and 5 deletions.
5 changes: 4 additions & 1 deletion app/models/miq_queue_worker_base/runner.rb
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,10 @@ def deliver_queue_message(msg)
if status == MiqQueue::STATUS_TIMEOUT
begin
_log.info("#{log_prefix} Reconnecting to DB after timeout error during queue deliver")
ActiveRecord::Base.connection.reconnect!

# Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
spec_name = ActiveRecord::Base.connection_specification_name
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
@worker.update_spid!
rescue => err
do_exit("Exiting worker due to timeout error that could not be recovered from...error: #{err.class.name}: #{err.message}", 1)
Expand Down
4 changes: 3 additions & 1 deletion app/models/miq_server.rb
Original file line number Diff line number Diff line change
Expand Up @@ -349,7 +349,9 @@ def monitor

begin
_log.info("Reconnecting to database after error...")
ActiveRecord::Base.connection.reconnect!
# Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
spec_name = ActiveRecord::Base.connection_specification_name
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
rescue Exception => err
_log.error("#{err.message}, during reconnect!")
else
Expand Down
4 changes: 3 additions & 1 deletion app/models/miq_server/environment_management.rb
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,9 @@ def get_network_information
end

def validate_database
ActiveRecord::Base.connection.reconnect!
# Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
spec_name = ActiveRecord::Base.connection_specification_name
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))

# Log the Versions
_log.info("Database Adapter: [#{ActiveRecord::Base.connection.adapter_name}], version: [#{ActiveRecord::Base.connection.database_version}]") if ActiveRecord::Base.connection.respond_to?(:database_version)
Expand Down
6 changes: 4 additions & 2 deletions app/models/miq_server/queue_management.rb
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,11 @@ def process_miq_queue
if status == "timeout"
begin
_log.info("Reconnecting to DB after timeout error during queue deliver")
ActiveRecord::Base.connection.reconnect!
# Remove the connection and establish a new one since reconnect! doesn't always play nice with SSL postgresql connections
spec_name = ActiveRecord::Base.connection_specification_name
ActiveRecord::Base.establish_connection(ActiveRecord::Base.remove_connection(spec_name))
rescue => err
_log.error("Error encountered during <ActiveRecord::Base.connection.reconnect!> error:#{err.class.name}: #{err.message}")
_log.error("Error encountered reconnecting to the database, error:#{err.class.name}: #{err.message}")
end
end

Expand Down

0 comments on commit 2f236f6

Please sign in to comment.