This repository has been archived by the owner on Sep 24, 2022. It is now read-only.

Fixed error handling when getting a connection. #115

Closed
wants to merge 1 commit

Conversation

jchumnanvech
Contributor

Right now we are encountering a bug when someone resets the postgres-db service: the connections become stale. When a psycopg2 Error is thrown, calling set_exc_info with the error raises another exception, which causes the connection to not be put back in the connection pool (line 560 is not executed because line 559 blows up). It's in the Tornado concurrent exception traceback logger.

It tries to pass the error into self.formatted_tb = traceback.format_exception(*exc_info), but exc_info there is only 1 parameter while the function expects 3. This is in concurrent.py line 113.

Changing it to set_exception seems to make it work.

@haizaar
Collaborator

haizaar commented Aug 13, 2015

Nice catch. It's a bug indeed, but of a different kind. The thing is that set_exception does not preserve tracebacks in Python 2 (see the relevant docs).

That's why I use set_exc_info where it makes sense. So instead of converting to set_exception, can you try this:

ping_future.set_exc_info(sys.exc_info())
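
For context, here is a minimal standalone sketch of the difference (the ping_failed helper and the exception are illustrative, not momoko's code, and it assumes the Tornado 4-era Future class that momoko used at the time):

import sys

from tornado.concurrent import Future


def ping_failed():
    ping_future = Future()
    try:
        raise RuntimeError("connection is stale")
    except RuntimeError as error:
        # ping_future.set_exc_info(error)        # broken: not a (type, value, traceback) triple
        # ping_future.set_exception(error)       # works, but drops the traceback on Python 2
        ping_future.set_exc_info(sys.exc_info())  # preserves the traceback
    return ping_future


print(ping_failed().exception())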

I also strive to cover all bugs with unittests. Can you please write one for this issue? I guess it would be enough to close the conn's file descriptor to simulate a server restart.

Thanks.

@friedcell
Contributor

Explored this for a bit and it seems a bit more complicated.

If Connection.ping dies in Pool._operate, it will do a Pool.putconn in all cases but one - conn = fut.result() - which can't happen from _ping_future_connection anyway, since the connection is passed in and fut is a fake future with a set result.

If a file descriptor is closed, another error happens - AssertionError: Tried to release non-busy connection - because Pool.putconn in Pool._ping_future_connection is called on an error that Pool._operate has already handled with its own Pool.putconn.

Checking how restarts work I found psycopg#196, but restarting the server still doesn't mark the connection as closed properly (at least on OS X using psycopg2 2.6.1). Once that happens, connections are never recycled.

Connections to the host also fail when retrying: Connection._io_callback fails, the retry loop is not triggered because the exception is added to the future, and the query fails - this time the connection is correctly marked as closed and is recycled.

Am I missing something?

@haizaar
Collaborator

haizaar commented Aug 26, 2015

OK, let's take it one thing at a time. @jchumnanvech's original bug is an issue with an incorrect API call. It's trivial and should be fixed. But the bug itself surfaced another issue - the unittests do not properly verify the handling of connectivity issues.

Regarding psycopg#196: it was fixed back when I complained about it and still works as of psycopg2 2.6.1 on my Ubuntu 14.04. It's probably an OS X-specific issue. @friedcell, I suggest you reopen the ticket on the psycopg2 tracker.

Now, @friedcell discovered another bug:
On connectivity errors, putconn is called twice on the same connection and hits an assertion error. We need to open another bug for that and write a proper reproduction.

@friedcell, regarding your last paragraph - I fail to understand what you are trying to say (that's probably me :) ). Can you please open another issue for this and we'll continue to analyze it there? (Some code that captures the concrete failure scenarios would really help.)

@friedcell
Contributor

I'll open another issue for reconnection issues - I have tests for that but need to do some more testing on Ubuntu.

I don't yet have tests for double putconn and the set_exc_info issue...

@haizaar
Collaborator

haizaar commented Aug 26, 2015

Looks like server restarts can be simulated pretty well with the following:

>>> import psycopg2
>>> conn = psycopg2.connect("user=foo password=bar host=127.0.0.1 port=5432")
>>> import socket
>>> s = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
>>> s.shutdown(socket.SHUT_WR)
>>> cur = conn.cursor(); cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: SSL connection has been closed unexpectedly

>>> conn.closed
2
>>> 

The real server restart causes the following:

>>> cur = conn.cursor(); cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: terminating connection due to administrator command
SSL connection has been closed unexpectedly

The next step is to incorporate connection killing into unittests, see where it fails and continue from there.
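
A rough sketch of such a test (not the actual suite) could look like the following; it assumes momoko's future-based Pool API (Pool(dsn=..., size=..., ioloop=...), connect(), execute()) and that the pool keeps its connections in db.conns.free / db.conns.busy - only .busy is confirmed in this thread, .free is an assumption:

import socket

import momoko
import psycopg2
from tornado.testing import AsyncTestCase, gen_test


class ServerRestartTest(AsyncTestCase):
    dsn = "user=foo password=bar host=127.0.0.1 port=5432"  # adjust to your setup

    @gen_test
    def test_killed_connection(self):
        db = momoko.Pool(dsn=self.dsn, size=1, ioloop=self.io_loop)
        yield db.connect()
        yield db.execute("SELECT 1")

        # Kill the single pooled connection, mimicking the interactive session above.
        conn = (list(db.conns.free) + list(db.conns.busy))[0]
        sock = socket.fromfd(conn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
        sock.shutdown(socket.SHUT_RDWR)

        # Before the fix this path hit the set_exc_info and double-putconn bugs;
        # afterwards the query should either be retried or fail cleanly.
        try:
            yield db.execute("SELECT 1")
        except psycopg2.Error as error:
            print("query on killed connection failed: %s" % error)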

@haizaar
Collaborator

haizaar commented Aug 26, 2015

I don't yet have tests for double putconn and the set_exc_info issue...

I think I can tinker with it.

@friedcell
Contributor

I got different errors on OS X - my biggest problem was the fact that psycopg2 does not report the connections as closed. I solved the double putconn with a simple if conn in self.conns.busy: check in _ping_future_connection.
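
For reference, a minimal sketch of that guard (the surrounding error path is paraphrased, not momoko's actual _ping_future_connection body):

# Before: the ping error path released the connection unconditionally, clashing
# with the putconn already performed inside Pool._operate's own error handling:
# self.putconn(conn)

# After: only release the connection if the pool still tracks it as busy, which
# avoids the "Tried to release non-busy connection" assertion on a double release.
if conn in self.conns.busy:
    self.putconn(conn)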

@haizaar
Collaborator

haizaar commented Aug 26, 2015

@friedcell, I finally see what you are saying - the retry code runs only when cursor.execute fails synchronously. Unfortunately, when the server restarts, it fails asynchronously later on in _io_callback and the exception is propagated straight to the caller.

Something to work on...

Thanks for discovering this.

@haizaar
Collaborator

haizaar commented Aug 26, 2015

With async code, shutting the socket down for writes does not set conn.closed, but shutting it down for reads does the trick. However, DatabaseError rather than OperationalError is thrown, so we need to catch both of them for retries:

from __future__ import print_function

import psycopg2
import select
import socket
import traceback


def wait(conn):
    while 1:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            break
        elif state == psycopg2.extensions.POLL_WRITE:
            select.select([], [conn.fileno()], [])
        elif state == psycopg2.extensions.POLL_READ:
            select.select([conn.fileno()], [], [])
        else:
            raise psycopg2.OperationalError("poll() returned %s" % state)


aconn = psycopg2.connect(dsn="user=foo password=bar host=127.0.0.1 port=5432", async=1)
wait(aconn)

s = socket.fromfd(aconn.fileno(), socket.AF_INET, socket.SOCK_STREAM)
s.shutdown(socket.SHUT_RD)

try:
    acur = aconn.cursor()
    acur.execute("SELECT 1")
    wait(aconn)
except Exception as err:
    print("Execution failed: %s" % err)
    traceback.print_exc()
    print("aconn.closed = %s" % aconn.closed)
else:
    print(acur.fetchone())

And the result is:

$ python asynclose.py 
Execution failed: SSL SYSCALL error: EOF detected

Traceback (most recent call last):
  File "asynclose.py", line 35, in <module>
    wait(aconn)
  File "asynclose.py", line 12, in wait
    state = conn.poll()
OperationalError: SSL SYSCALL error: EOF detected

aconn.closed = 2
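
As a side note on catching both: psycopg2.OperationalError is a subclass of psycopg2.DatabaseError, so the except clause in the script above can simply be broadened to the parent class and still cover both failure shapes:

try:
    acur = aconn.cursor()
    acur.execute("SELECT 1")
    wait(aconn)
except psycopg2.DatabaseError as err:  # also catches OperationalError
    print("Execution failed: %s" % err)
    print("aconn.closed = %s" % aconn.closed)
else:
    print(acur.fetchone())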

@haizaar
Collaborator

haizaar commented Aug 26, 2015

Hopefully I fixed all of it in my master branch. Guys, can you try this version: https://github.com/haizaar/momoko?
Both the ping issue and reconnects should work.

In the end, shutting sockets down did not work - it somehow mangles psycopg2's state in a weird way, such that most of the time everything works, but about one in every 30 test runs fails with a wonky error. I spent a lot of time trying to understand whether it was a problem with my code, and concluded that it wasn't. After switching to a third-party TCP proxy, everything works smoothly.

If you run the tests, make sure to run make -C tcproxy first.

@haizaar
Collaborator

haizaar commented Aug 26, 2015

Guys, does anyone have experience with Travis CI? For some reason my proxy tests fail when run on Travis - I can't force psycopg2 into connecting through TCP to my proxy.

Here is the failing build - https://travis-ci.org/haizaar/momoko/builds/77378307 (pulls from my master).

@haizaar haizaar closed this in 5e83907 Aug 27, 2015
@haizaar
Collaborator

haizaar commented Aug 27, 2015

Travis fixed. I've merged the code to master. Feedback is welcome.

@friedcell
Contributor

Tested in all the scenarios that were failing for me before - it works perfectly now. Sorry for the delay, I fell into a zone where normal things don't happen very often.

@haizaar
Collaborator

haizaar commented Sep 1, 2015

Excellent. I'll release 2.2.0 later this week.
