vtexplain race condition causes failures on slow machines #5474

aquarapid · 2019-11-27T02:11:32Z

On slower hosts, especially where /tmp is on a real filesystem as opposed to faster tmpfs, the following can happen:

$ time /vt/bin/vtexplain -schema-file schema.sql -vschema-file vschema.json -replication-mode "ROW" -output-mode text -sql "SELECT * from xxx where yyy = 5;" -shards 2 
----------------------------------------------------------------------
SELECT * from xxx where yyy = 5

1 ks1/-80: select * from xxx where yyy = 5 limit 10001

----------------------------------------------------------------------
E1127 02:01:05.085814   23037 conn.go:724] Error reading packet from client 4 (@): read unix /tmp/fakesqldb097001394/fakesqldb.sock->@: use of closed network connection
io.ReadFull(header size) failed

There are some other error variations like:

E1127 01:29:18.801368   21746 conn.go:722] Error reading packet from client 4 (@): read unix /tmp/fakesqldb792609684/fakesqldb.sock->@: use of closed network connection

The errors are somewhat non-deterministic, since the host might just be fast enough for these not to be seen during every invocation.

Hosts where /tmp is a real filesystem (on spinning disk) and not tmpfs as in most Linux distros are particularly susceptible (e.g. the default config of Ubuntu 16.04 on a GCP GCE VM). /tmp is significant because this is where vtexplain stores its Unix socket files for communication between the fakesqldb instances it starts and the (many) child processes it launches during execution.

The cause is a race during the shutdown of the fakesqldb tabletserver instances. If we insert a check that waits until the instances are properly shut down (with a timeout, just in case), vtexplain works as expected. PR to follow.
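To illustrate the idea of waiting for shutdown with a timeout, here is a minimal standalone sketch. It is not the code from the PR: `waitForShutdown`, `errShutdownTimeout`, and the `done` channel are hypothetical names made up for this example, and the goroutine merely stands in for a fakesqldb tabletserver instance tearing down.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errShutdownTimeout is a hypothetical sentinel error used only in this sketch.
var errShutdownTimeout = errors.New("timed out waiting for shutdown")

// waitForShutdown blocks until done is closed or the timeout elapses,
// whichever comes first. (Hypothetical helper, not a Vitess function.)
func waitForShutdown(done <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-done:
		return nil
	case <-timer.C:
		return errShutdownTimeout
	}
}

func main() {
	done := make(chan struct{})

	// Stand-in for an instance that shuts down asynchronously.
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated slow teardown
		close(done)
	}()

	// Wait for the teardown to finish instead of returning immediately,
	// but give up after a bounded time so the caller can never hang.
	if err := waitForShutdown(done, 2*time.Second); err != nil {
		fmt.Println("shutdown did not finish:", err)
		return
	}
	fmt.Println("instance shut down cleanly")
}
```

The timeout is the "just in case" part: if an instance never signals completion, the caller reports the problem rather than blocking forever.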


aquarapid added a commit to planetscale/vitess that referenced this issue Nov 27, 2019

Fix vtexplain race by waiting around for the fakesqldb tabletserver instances to exit cleanly. Fixes vitessio#5474 Signed-off-by: Jacques Grove <[email protected]>

5d5e150

aquarapid mentioned this issue Nov 27, 2019

Fix vtexplain race by waiting around for the fakesqldb tabletserver #5476

Merged

demmer closed this as completed in #5476 Nov 27, 2019
