Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query hangs when using async_exec #583

Open
jordan-thoms opened this issue Aug 16, 2024 · 3 comments
Open

Query hangs when using async_exec #583

jordan-thoms opened this issue Aug 16, 2024 · 3 comments

Comments

@jordan-thoms
Copy link

I've been debugging an issue where we have intermittent hangs of our ruby processes when running queries on CockroachDB using the ruby-pg driver. This is with with ruby-3.3.4, libpq-dev 15.8-0+deb12u1, ruby-pg 1.5.7, and Linux 6.8.0-1012-gcp however I have been able to reproduce the issue on other versions also.

We are able to reproduce the issue ~80% of the time with a simple query like this:

conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)

This will hang forever, tying up a ruby thread in our sidekiq workers until the worker is shutdown, even days later.

If I disable async_exec, the same query always works, i.e. this works 100% of the time:

PG::Connection.async_api = false
conn = PG.connect(ENV['COCKROACH_DATABASE_URL'])
result = conn.exec(<THE QUERY>)

The triggering of this is related to the content returned from the query - if I leave out certain columns, or remove the ORDER BY, then the issue doesn't show up.

Here's the gdb trace when the query is hung:

(gdb) where
#0  0x000070e98a319256 in __ppoll (fds=fds@entry=0x7fff46646eb8, nfds=nfds@entry=1, timeout=<optimized out>,
    sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:42
#1  0x000070e98a878390 in rb_thread_wait_for_single_fd (fd=6, events=1, timeout=timeout@entry=0x0)
    at /usr/src/ruby/hrtime.h:158
#2  0x000070e98a71dcd9 in rb_io_wait (timeout=4, events=3, io=<optimized out>)
    at ./include/ruby/internal/arithmetic/int.h:172
#3  rb_io_wait (io=<optimized out>, events=events@entry=3, timeout=timeout@entry=4) at io.c:1422
#4  0x000070e96ebab84a in wait_socket_readable (ptimeout=0x0, is_readable=<optimized out>, self=124147919560760)
    at pg_connection.c:2441
#5  pgconn_async_get_last_result (self=self@entry=124147919560760) at pg_connection.c:3139
#6  0x000070e96ebac939 in pgconn_async_exec (argc=1, argv=0x70e98a09c368, self=124147919560760) at pg_connection.c:3288
#7  0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=1,
    calling=<optimized out>, reg_cfp=0x70e98a19bbb0, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#8  vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bbb0, calling=<optimized out>)
    at /usr/src/ruby/vm_insnhelper.c:3530
#9  0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
    cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#10 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#11 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#12 0x000070e98a8c6e94 in eval_string_with_scope (line=9, file=<optimized out>, src=124147872936320,
    scope=124147909484040) at /usr/src/ruby/vm_eval.c:1761
#13 rb_f_eval (argc=<optimized out>, argv=<optimized out>, self=<optimized out>) at /usr/src/ruby/vm_eval.c:1804
#14 0x000070e98a8ae8cc in vm_call_cfunc_with_frame_ (stack_bottom=<optimized out>, argv=<optimized out>, argc=4,
    calling=<optimized out>, reg_cfp=0x70e98a19bc20, ec=0x62de88a5aac0) at /usr/src/ruby/vm_insnhelper.c:3502
#15 vm_call_cfunc_with_frame (ec=0x62de88a5aac0, reg_cfp=0x70e98a19bc20, calling=<optimized out>)
    at /usr/src/ruby/vm_insnhelper.c:3530
#16 0x000070e98a8c0289 in vm_sendish (method_explorer=<optimized out>, block_handler=<optimized out>,
    cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at /usr/src/ruby/vm_callinfo.h:403
#17 vm_exec_core (ec=0x7fff46646eb8, ec@entry=0x62de88a5aac0) at /usr/src/ruby/insns.def:834
#18 0x000070e98a8c6149 in rb_vm_exec (ec=ec@entry=0x62de88a5aac0) at vm.c:2486
#19 0x000070e98a8d7320 in invoke_block (captured=0x70e98a19bed8, opt_pc=<optimized out>, type=<optimized out>, cref=0x0

It seems the issue is that wait_socket_readable is getting called when there is nothing more to read from the socket, and as pgconn_async_get_last_result is passing NULL as the timeout value ( https://github.com/ged/ruby-pg/blob/master/ext/pg_connection.c#L3139 ), the process just waits forever. I compiled ruby-pg with this change to hardcode a wait_timeout, and afterwards I would see a hang of a few seconds when the issue was triggered, rather than an infinite one:

diff --git a/ext/pg_connection.c b/ext/pg_connection.c
index e4194a3..1d5e276 100644
--- a/ext/pg_connection.c
+++ b/ext/pg_connection.c
@@ -2438,7 +2438,7 @@ wait_socket_readable( VALUE self, struct timeval *ptimeout, void *(*is_readable)

                        socket_io = pgconn_socket_io(self);
                        /* Wait for the socket to become readable before checking again */
-                       ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), wait_timeout);
+                       ret = pg_rb_io_wait(socket_io, RB_INT2NUM(PG_RUBY_IO_READABLE), 5);
                } else {
                        ret = Qfalse;
                }

This issue seems related to #325 (comment) , and some investigation there suggested it was a result of the frame sizes.

@larskanis
Copy link
Collaborator

larskanis commented Aug 16, 2024

Is it possible to create a dockerfile with a repro?

@jordan-thoms
Copy link
Author

The ruby side is easily replicable with docker, but requires a specific state on the CockroachDB side to trigger this, which I haven't been able to recreate so far - I'll try to do that.

Is it the expected behaviour that pgconn_async_get_last_result would ever block for a significant period? From the comment here it sounds like it shouldn't block:

ruby-pg/ext/pg_connection.c

Lines 3138 to 3139 in beafa09

/* wait for input (without blocking) before reading each result */
wait_socket_readable(self, NULL, get_result_readable);

However, since it calls wait_socket_readable without specifying a timeout, it can actually end up permanently blocked waiting for the socket to be readable...

@larskanis
Copy link
Collaborator

Is it the expected behaviour that pgconn_async_get_last_result would ever block for a significant period? From the comment here it sounds like it shouldn't block:

Yes it is intended that it blocks with no timeout. I improved the comment in 7ced092 . I could imagine that there's a bad interaction between OpenSSL and libpq as described in #325 (comment) . Is it possible for you to test it without SSL?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants