vtexplain race condition causes failures on slow machines #5474

Closed
aquarapid opened this issue Nov 27, 2019 · 0 comments · Fixed by #5476

Comments

@aquarapid
Contributor

On slower hosts, especially where /tmp is on a real filesystem as opposed to faster tmpfs, the following can happen:

$ time /vt/bin/vtexplain -schema-file schema.sql -vschema-file vschema.json -replication-mode "ROW" -output-mode text -sql "SELECT * from xxx where yyy = 5;" -shards 2 
----------------------------------------------------------------------
SELECT * from xxx where yyy = 5

1 ks1/-80: select * from xxx where yyy = 5 limit 10001

----------------------------------------------------------------------
E1127 02:01:05.085814   23037 conn.go:724] Error reading packet from client 4 (@): read unix /tmp/fakesqldb097001394/fakesqldb.sock->@: use of closed network connection
io.ReadFull(header size) failed

There are some other error variations like:

E1127 01:29:18.801368   21746 conn.go:722] Error reading packet from client 4 (@): read unix /tmp/fakesqldb792609684/fakesqldb.sock->@: use of closed network connection

The errors are somewhat non-deterministic, since the host might just be fast enough for these not to be seen during every invocation.

Hosts where /tmp is a real filesystem (on spinning disk) and not tmpfs as in most Linux distros are particularly susceptible (e.g. the default config of Ubuntu 16.04 on a GCP GCE VM). /tmp is significant because this is where vtexplain stores its Unix socket files for communication between the fakesqldb instances it starts and the (many) child processes it launches during execution.

The cause is a race during the shutdown of the fakesqldb tabletserver instances. If we insert a check that waits until the instances have properly shut down (with a timeout, just in case), vtexplain works as expected. PR to follow.

aquarapid added a commit to planetscale/vitess that referenced this issue Nov 27, 2019
instances to exit cleanly.  Fixes vitessio#5474

Signed-off-by: Jacques Grove <[email protected]>