Commit
Establish a new connection instead of reconnect!
https://bugzilla.redhat.com/show_bug.cgi?id=1626005

It was found that a reconnect! against a postgresql SSL connection could occur while postgresql was stopping or starting, even before it reported "database system is ready to accept connections." If the timing was right, this connection could fail SSL handshaking, and the client could give up on SSL connections and only ever try non-SSL from then on. Until this commit, the workaround was to restart the client process, such as the worker or evm server.

This was evident in the postgresql log, where you'd see a client, 172.168.1.99, initiate an SSL connection:

2018-08-30 11:23:52 EDT:172.168.1.99(56988):5b880c08.200c:root@vmdb_production:[8204]:LOG: connection authorized: user=root database=vmdb_production SSL enabled (protocol=TLSv1.2, cipher=ECDHE-RSA-AES256-GCM-SHA384, compression=off)

Then, sometime later, postgresql restarted, this connection timed its reconnect "right," and all future connections were attempted with SSL off and failed:

2018-08-30 11:23:52 EDT:172.168.1.99(56996):5b880c08.2218:root@vmdb_production:[8728]:FATAL: no pg_hba.conf entry for host "172.168.1.99", user "root", database "vmdb_production", SSL off

Somewhere deep in pg/libpq/openssl, the client code fails to initialize the SSL handshaking code and then attempts only non-SSL connections until you restart the process.

Until we can dig deep into the pg/libpq/openssl code, this code change forces the re-initialization of the SSL handshaking code in the client by discarding the existing connection and establishing a new one.

To test this, I first recreated the problem solely in an ActiveRecord environment by restarting postgresql in one ssh session on an appliance and running this in a rails console in another:

```
# Assumption: conn is the current ActiveRecord connection; the original
# snippet does not show how it was obtained.
conn = ActiveRecord::Base.connection

500.times do
  begin
    puts conn.select_value("select count(*) from users;")
  rescue PG::Error, ActiveRecord::StatementInvalid
    begin
      conn.reconnect!
    rescue
      # Keep retrying reconnect! until it succeeds.
      retry
    end
  end
  sleep 0.001
end
```

Then I verified this code change in a full appliance server/worker environment by first recreating the problem (after many iterations) and then confirming that it no longer occurred with this change applied:

```
for x in `seq 1 500`
do
  # A plain restart of postgresql is too fast, so I needed to first stop it,
  # sleep a little, then start it, sleep a bit longer, then keep doing that
  # in a loop, while the workers/servers would try to reconnect!.
  small_rand=`ruby -e 'puts rand(4)'`
  rand=`ruby -e 'puts rand(10) + 10'`
  echo $small_rand
  echo $rand
  systemctl stop rh-postgresql95-postgresql
  sleep $small_rand
  systemctl start rh-postgresql95-postgresql
  sleep $rand
done
```
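The difference between reconnecting in place and establishing a new connection can be illustrated with a toy model. This is only a sketch under stated assumptions: `FakeServer`, `FakeClient`, `FakeConnectionError`, `reconnect_in_place`, and `establish_new` are hypothetical names invented for illustration, and the "sticky" SSL flag merely imitates the behavior described above. It is not the actual code change, nor how pg, libpq, or ActiveRecord work internally.

```ruby
# Toy model only. Every class and method here is a hypothetical stand-in and
# does not reflect the real pg gem, libpq, or ActiveRecord internals.
class FakeConnectionError < StandardError; end

class FakeServer
  attr_accessor :ready

  def initialize
    @ready = true
  end
end

class FakeClient
  attr_reader :ssl_enabled

  def initialize(server)
    @server = server
    @ssl_enabled = true # a fresh client always starts by trying SSL
    connect
  end

  # Reuses this object, so state left behind by a failed handshake remains.
  def reconnect!
    connect
  end

  private

  def connect
    unless @server.ready
      # Assumed "sticky" failure: this object never attempts SSL again.
      @ssl_enabled = false
      raise FakeConnectionError, "handshake failed while server was restarting"
    end
    raise FakeConnectionError, "no pg_hba.conf entry, SSL off" unless @ssl_enabled
    true
  end
end

# Approach before the change: keep calling reconnect! on the same object.
def reconnect_in_place(client)
  client.reconnect!
  client
rescue FakeConnectionError
  nil
end

# Approach after the change: discard the old object and build a new one.
def establish_new(server)
  FakeClient.new(server)
rescue FakeConnectionError
  nil
end
```

In this model, a client that calls `reconnect!` while the server is down keeps failing even after the server is ready again, whereas `establish_new` returns a working client because the fresh object starts with clean SSL state.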