-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli/sqlshell: starting SQL shell hangs when SQL leases range is stalled #81673
Comments
The larger goal here is to allow using tools like |
Are you aware of the |
@knz That's really cool to know, I didn't know that existed. It seems to be partly what we'd want regarding not showing the database and transaction status in the prompt, but it still appears to query for the server version which I believe stalled out during the last outage. Perhaps then the fix here is to make |
server version / Maybe you asked the shell to connect to the default database (@ajwerner @maryliag can you confirm that a simple SELECT to a no-kv vtable will not incur KV traffic? i.e. no table lease, no stall waiting to store statement statistics, etc?) |
Yes this would be great, thanks. |
I've reproduced it successfully again. This is what I see in the sqlcli (you can ignore the bad connection, I stopped the server, it was stalled on fetching the node_build_info): Here's the stack trace of the goroutine (that I believe) was running
This is stack trace was taken with my (updated) branch at https://github.com/1lann/cockroach/tree/1lann/stalled-sql-leases-range by starting a single node with I've also confirmed the call to |
The particular stack trace you've found is most definitely not the one that is serving crdb_internal.node_build_info. What is possible however, is that we've inadvertently built logic that accesses other tables as a side-effect of running a query. Perhaps to store statement statistics or something? We need a full goroutine dump and possibly server-side tracing. |
okay I found what's going on. I will post more details in a minute. |
So the proximate issue at hand is that an unqualified crdb_internal query resolves to However note that this will likely not solve the larger problem -- if the lease table is unavailable, chances are the connection will not even succeed authentication: the We do have a special case in the code when the client connects with a TLS client cert for user |
@joshimhoff @1lann in light of the above, please advise how you would like to move forward. If you think it's time to make progress on #44134, we will want to touch base with a PM. |
(btw @1lann - a simple stack trace in this case was not particularly useful. To debug this, I used the cluster settings |
81988: ui: fix when search contains * r=maryliag a=maryliag Previously, when searching for * on statement and transaction pages, the pages would crash. This commit fixes this issue by using the literal value * during the regex for highlighting the text. Fixes #81695 Release note (bug fix): Statement and Transaction pages no longer crash when search term includes '*'. 82014: cli/sql: avoid taking a lease on the target db upon connect r=otan a=knz Informs #81673. cc `@joshimhoff` `@1lann` This makes `--debug-sql-cli` slightly more useful if something is wrong with the database in the URL string or if there's something wrong in the leasing subsystem. Release note: None 82032: docs: fix broken link in range merge RFC r=knz a=justinj Co-authored-by: Marylia Gutierrez <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Justin Jaffray <[email protected]>
Nice, Jason, re: repro!
For this use case, we intend to use the root user already, as we noticed the special return authenticated path in the auth code. So with your fix, I think we are all set. Thanks for it. The main thing we want to do here is send KV troubleshooting queries, e.g. to |
@joshimhoff I don't mind doing that but in light of the conversation in Slack, is it worth the effort to go about writing a test for this if it's so difficult to predict the different failure states and if such failure states should even be supported? |
Agreed. Let's leave this as is for now. Thanks for the fix, @knz! |
Describe the problem
We recently had a problem where the SQL leases range (range ID 7) was unavailable due to #81561
We noticed that
cockroach sql
hangs when the SQL leases range is stalled (i.e. unavailable) when running in interactive mode. It is still possible to execute some SQL queries that don't depend on the leases range in non-interactive mode with-e
. This appears to be because when running in interactive mode, numerous SQL queries that are dependent on the leases range are executed to provide rich feedback to the user, such as printing the server version/metadata, the currently used database, and the current transaction status. We confirmed that retrieving the server metadata on start is one such cause as when we modified code to skip over the call to retrieve server metadata, the client made further progress on establishing a shell.It would be nice if
cockroach sql
could handle this failure mode more gracefully in interactive mode and run in a degraded state that still allows the user to attempt to execute queries without having to use-e
.cc @joshimhoff
To Reproduce
See diff/branch here: https://github.com/cockroachdb/cockroach/compare/master...1lann:1lann/stalled-sql-leases-range?expand=1
When a SIGHUP signal is sent to the
cockroach
process built from this modified branch, attempting to access range 7 will stall (note that you may need to wait until for the lease to completely expire to observe the full effects of this, or signal SIGHUP before establishing a connection to CRDB). I used this branch to play around with the effects of range 7 being unavailable and to reproduce the screenshot below.Expected behavior
For
cockroach sql
to start and not stall.Jira issue: CRDB-16021
The text was updated successfully, but these errors were encountered: