Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PQsetClientEncoding() called without releasing the GVL during initialization #245

Closed
ged opened this issue Aug 4, 2016 · 11 comments
Closed

Comments

@ged
Copy link
Owner

ged commented Aug 4, 2016

Original report by danielnc (Bitbucket: danielnc, GitHub: danielnc).


I think I've spotted a bug with 0.18.4

We are scaling out our database and using several sidekiq workers.

We are currently using postgresql 9.4 with pgbouncer to allow more light-weight connections

Rails 3.2 + Latest version of Sidekiq + pg 0.18.4

Whenever we exhaust pgbouncer real connections Sidekiq just hangs and wont even respond to TTIN commands. I have to force kill sidekiq process with -9

​This happens on our production environment whenever we reach our 250 real connections and hit our database really heavy​

An easy way to reproduce locally is to create 1 sidekiq process with 25 threads and 1 real(server connection) pgbouncer connection(50000 client connections allowed)

​Here is my local gdb output​(osx) our production environments(centos) behave the same way

https://gist.github.com/danielnc/81e27b8b4012a41bc795e01a5b36db99

​Here is my pgbouncer.ini config to make it easy to reproduce

https://gist.github.com/danielnc/3ce76c8a8957e8a52f65c9224839e670​

I've created a minimal rails 4.2 app with sidekiq and I'm able to reproduce the same behaviour so this is doesn't seems to be a rails specific version issue

Thread 10 (Thread 0x2103 of process 40990):
0  0x00007fff94b892a2 in poll () from /usr/lib/system/libsystem_kernel.dylib
1  0x00000001035c892d in pqSocketCheck () from /usr/local/opt/postgresql/lib/libpq.5.dylib
2  0x00000001035c87f9 in pqWaitTimed () from /usr/local/opt/postgresql/lib/libpq.5.dylib
3  0x00000001035c5fd8 in PQgetResult () from /usr/local/opt/postgresql/lib/libpq.5.dylib
4  0x00000001035c62be in PQexecFinish () from /usr/local/opt/postgresql/lib/libpq.5.dylib
5  0x00000001035c31bc in PQsetClientEncoding () from /usr/local/opt/postgresql/lib/libpq.5.dylib
6  0x000000010359d2a7 in pgconn_set_default_encoding () from /Users/user/.rbenv/versions/2.2.4/gemsets/caremessage-v1.1/extensions/x86_64-darwin-15/2.2.0-static/pg-0.18.4/pg_ext.bundle
7  0x00000001035990b7 in pgconn_init () from /Users/user/.rbenv/versions/2.2.4/gemsets/caremessage-v1.1/extensions/x86_64-darwin-15/2.2.0-static/pg-0.18.4/pg_ext.bundle
8  0x0000000102daa929 in vm_call0_body ()
9  0x0000000102da9d82 in rb_call0 ()
10 0x0000000102cbc969 in rb_class_new_instance ()

pg is not releasing the GVL before making a network call.

PQsetClientEncoding() is currently called without releasing the GVL while the PG::Connection initialisation. It should either be GVL aware or use the asynchronous libpq interface.

Thanks

@ged
Copy link
Owner Author

ged commented Aug 4, 2016

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).


You're right, PQsetClientEncoding() is currently called without releasing the GVL while the PG::Connection initialisation. It should either be GVL aware or use the asynchronous libpq interface.

I'll fix this.

@ged
Copy link
Owner Author

ged commented Aug 15, 2016

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).


This should be fixed with commit 41f81b72476 . I'll release another pre-release soon. Could you please try it out?

@ged
Copy link
Owner Author

ged commented Aug 15, 2016

Original comment by danielnc (Bitbucket: danielnc, GitHub: danielnc).


Hi @larskanis

I'm not too familiar with bitbucket(I mostly use GitHub)

Is there an easy way to install a gem directly from you commit or branch?

Thanks

@ged
Copy link
Owner Author

ged commented Aug 19, 2016

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).


@danielnc You can use the latest gem prerelease either by:

gem inst pg --pre

or with bundler:

gem "pg", "=0.19.0.pre20160817083826" 

It's also possible now, to fetch and build the gem from a git repository:

gem "pg", git: "https://github.com/ged/ruby-pg"

@ged
Copy link
Owner Author

ged commented Aug 24, 2016

Original comment by danielnc (Bitbucket: danielnc, GitHub: danielnc).


Hi @larskanis

Our process is still stuck

Not sure if related to this issue or not

Here is GDB thread dump

Any ideas what might be causing this?

https://gist.github.com/danielnc/58734c5673fa0da8d3d02eddf2184c36

@ged
Copy link
Owner Author

ged commented Aug 24, 2016

Original comment by Michael Granger (Bitbucket: ged, GitHub: ged).


@danielnc it looks like it's now giving up the GVL in both threads that are trying to connect, so when you say "Our process is still stuck" do you mean that it still blocks other Ruby threads? If you need the connect to be asynchronous from Ruby, you can use PG::Connection.connect_start + conn.connect_poll instead of PG::Connection.new, but new blocks until it can connect (or until it times out) by design.

You can also set your own connection timeout using the connect_timeout option so connections fail faster as well.

@ged
Copy link
Owner Author

ged commented Aug 24, 2016

Original comment by danielnc (Bitbucket: danielnc, GitHub: danielnc).


@ged when I say our process is still stuck it still block other Ruby Threads.

We don't need it to be async because we are using PgBouncer as a middleman

Clients are able to connect but are stuck in the middle of running queries so I'm not 100% if this is another issue with pg

@ged
Copy link
Owner Author

ged commented Aug 24, 2016

Original comment by Chris Bandy (Bitbucket: cbandy, GitHub: cbandy).


I see a pgconn_exec in thread 18. Possibly related content here? https://bitbucket.org/ged/ruby-pg/issues/240/unrepsonsive-process#comment-30002868

@ged
Copy link
Owner Author

ged commented Aug 24, 2016

Original comment by Michael Granger (Bitbucket: ged, GitHub: ged).


Yes, but it's followed by a rb_thread_call_without_gvl. The issue you referenced is about signal-handling, not blocking the interpreter.

@ged
Copy link
Owner Author

ged commented Sep 4, 2016

Original comment by Lars Kanis (Bitbucket: larskanis, GitHub: larskanis).


@danielnc I've changed the part for setting the default_encoding to use the asynchronous libpq interface: a646a1394e

Therefore if a thread opens a new connection and blocks at "SET client_encoding", it will still respond to signals (Process.kill) and exception events (Thread#raise). Previously the blocking threads did not respond to such events (but all other threads did respond).

You should also make sure, that when working with signals, you're using PG::Connection#async_exec instead of #exec for all queries. As @cbandy pointed out, there is a single call to #exec in your log, which could be the blocking piece in your application. As an alternative you can use the PG::Connection#send/block/get_last_result sequence.

I've pushed another pre-release on rubygems.org with the latest change. Could you please try it out?

@ged
Copy link
Owner Author

ged commented Sep 6, 2016

Original comment by danielnc (Bitbucket: danielnc, GitHub: danielnc).


@larskanis this did not fix the issue where our connections just lingers on pgbouncer(although threads are now responsive to signals)

I've opened an issue(pgbouncer/pgbouncer#143) to try to understand why I'm getting FATAL and WARNING messages. I'm assuming that this might be the cause of clients being stuck while waiting to receive results from pgbouncer

I think we can close this issue since at least now threads are responsive

I'll have to try to investigate further since we cannot find a reason why our clients cannot work well with pgbouncer

Thanks

@ged ged closed this as completed Sep 6, 2016
@ged ged added this to the Pending milestone Oct 8, 2019
cbandy pushed a commit to cbandy/ruby-pg that referenced this issue Jan 10, 2020
PQsetClientEncoding() is also called within PG::Connection.new
which could block due to network delay.

This fixes bitbucket issue ged#245 .
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant