Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli/sqlshell: starting SQL shell hangs when SQL leases range is stalled #81673

Closed
1lann opened this issue May 23, 2022 · 15 comments
Closed

cli/sqlshell: starting SQL shell hangs when SQL leases range is stalled #81673

1lann opened this issue May 23, 2022 · 15 comments
Labels
A-cli-client CLI commands that pertain to using SQL features C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-postmortem Originated from a Postmortem action item. O-sre For issues SRE opened or otherwise cares about tracking.

Comments

@1lann
Copy link
Contributor

1lann commented May 23, 2022

Describe the problem

We recently had a problem where the SQL leases range (range ID 7) was unavailable due to #81561

We noticed that cockroach sql hangs when the SQL leases range is stalled (i.e. unavailable) when running in interactive mode. It is still possible to execute some SQL queries that don't depend on the leases range in non-interactive mode with -e. This appears to be because when running in interactive mode, numerous SQL queries that are dependent on the leases range are executed to provide rich feedback to the user, such as printing the server version/metadata, the currently used database, and the current transaction status. We confirmed that retrieving the server metadata on start is one such cause as when we modified code to skip over the call to retrieve server metadata, the client made further progress on establishing a shell.

It would be nice if cockroach sql could handle this failure mode more gracefully in interactive mode and run in a degraded state that still allows the user to attempt to execute queries without having to use -e.

cc @joshimhoff

To Reproduce

See diff/branch here: https://github.com/cockroachdb/cockroach/compare/master...1lann:1lann/stalled-sql-leases-range?expand=1

When a SIGHUP signal is sent to the cockroach process built from this modified branch, attempting to access range 7 will stall (note that you may need to wait until for the lease to completely expire to observe the full effects of this, or signal SIGHUP before establishing a connection to CRDB). I used this branch to play around with the effects of range 7 being unavailable and to reproduce the screenshot below.

image

Expected behavior

For cockroach sql to start and not stall.

Jira issue: CRDB-16021

@1lann 1lann added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-cli-client CLI commands that pertain to using SQL features labels May 23, 2022
@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label May 23, 2022
@joshimhoff
Copy link
Collaborator

joshimhoff commented May 23, 2022

It would be nice if cockroach sql could handle this failure mode more gracefully in interactive mode and run in a degraded state that still allows the user to attempt to execute queries without having to use -e.

The larger goal here is to allow using tools like SELECT * FROM crdb_internal.probe_ranges to understand various KV outages well enough to mitigate them via node restarts & similar. The theory is that these tools being available increases the chances SRE can mitigate the outage without the need for KV L2, which reduces both time to fix & load on KV L2. Of course, we can also teach SREs the -e workaround.

CC @nvanbenschoten

@1lann 1lann added the O-postmortem Originated from a Postmortem action item. label May 25, 2022
@knz
Copy link
Contributor

knz commented May 26, 2022

Are you aware of the --debug-sql-cli flag, which already does what is requested here? Would it have helped in this case?

@1lann
Copy link
Contributor Author

1lann commented May 26, 2022

@knz That's really cool to know, I didn't know that existed. It seems to be partly what we'd want regarding not showing the database and transaction status in the prompt, but it still appears to query for the server version which I believe stalled out during the last outage. Perhaps then the fix here is to make SELECT * FROM crdb_internal.node_build_info not depend on SQL leases range being available, or I could be reproducing the problem wrong. I can try reproducing the problem again tomorrow and seeing what happens.

@knz
Copy link
Contributor

knz commented May 26, 2022

it still appears to query for the server version

server version / node_build_info can't stall, it's an in-memory operation.

Maybe you asked the shell to connect to the default database defaultdb and that may stall? Might need to say you want to request "no database" instead.

(@ajwerner @maryliag can you confirm that a simple SELECT to a no-kv vtable will not incur KV traffic? i.e. no table lease, no stall waiting to store statement statistics, etc?)

@knz
Copy link
Contributor

knz commented May 26, 2022

@1lann

I can try reproducing the problem again tomorrow and seeing what happens.

Yes this would be great, thanks.

@1lann
Copy link
Contributor Author

1lann commented May 28, 2022

I've reproduced it successfully again. This is what I see in the sqlcli (you can ignore the bad connection, I stopped the server, it was stalled on fetching the node_build_info):
image

Here's the stack trace of the goroutine (that I believe) was running SELECT * FROM crdb_internal.node_build_info and stalled out:

github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithoutRangeID(0xc0023d6800, {0x9c07fd8, 0xc0012c6b40}, 0xc000a12700)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:151 +0xd1
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:108
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(_, {_, _}, {{{0x16f337fdf1ee0710, 0x0, 0x0}, 0x0, {0x1, 0x1, 0x1, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:197 +0x725
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x1, 0x1, 0x1, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:191 +0x1c8
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1({0x9c07fd8?, 0xc0012c6a80?})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:1006 +0x328
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc000d24c60, {0x9c07fd8, 0xc0012c6a80}, {0x81a1040?, 0xc8ae7f8?}, 0xc003869a88)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:344 +0xd1
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc0002a09f8?, {0x9c07fd8?, 0xc0012c6a80?}, {0x4011027?}, 0xc003869b58?)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:989 +0xfd
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc0002a0900, {0x9c07fd8, 0xc0012c69f0}, 0xc000a12600)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:1058 +0xfc
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:618
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(_, {_, _}, _, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:209 +0x15e
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x1, 0x1, 0x1, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:191 +0x12b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2062 +0x104e
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1608 +0xc6b
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1210 +0x426
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:831 +0x705
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x270
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x118
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:129 +0x725
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:242 +0x263
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:177 +0x4b6
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:290 +0x286
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x9d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:232 +0x4e8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:532 +0x567
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:986 +0x14d
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:1124 +0x21b
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill({0x9c07f30, 0xc001daa240}, 0x17c7?, 0xc00294b080)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:831 +0xfc
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run(0xc002e7cb00, {0x9c07f30, 0xc001daa240}, 0x9c07f30?)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:696 +0x74
github.com/cockroachdb/cockroach/pkg/sql.(*tableWriterBase).finalize(0xc0023ab228, {0x9c07f30, 0xc001daa240})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/tablewriter.go:226 +0x13d
github.com/cockroachdb/cockroach/pkg/sql.(*insertNode).BatchedNext(0xc0023ab200, {{0x9c07f30, 0xc001daa240}, 0xc002f23c00, 0xc002b20d08})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/insert.go:260 +0x22b
github.com/cockroachdb/cockroach/pkg/sql.(*rowCountNode).startExec(0xc000cabcc8, {{0x9c07f30?, 0xc001daa240?}, 0xc002f23c00?, 0xc002b20d08?})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_batch.go:173 +0xce
github.com/cockroachdb/cockroach/pkg/sql.startExec.func2({0xc00386dd90?, 0x9c09d28?}, {0x9c0a310, 0xc000cabcc8})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:517 +0x143
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal.func1()
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:112 +0x3e
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visitInternal(0xc00386dd90, {0x9c0a310?, 0xc000cabcc8?}, {0x88a3378?, 0x5?})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:297 +0x14a9
github.com/cockroachdb/cockroach/pkg/sql.(*planVisitor).visit(0xc00386dd90, {0x9c0a310, 0xc000cabcc8})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:79 +0xf7
github.com/cockroachdb/cockroach/pkg/sql.walkPlan(...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/walk.go:43
github.com/cockroachdb/cockroach/pkg/sql.startExec({{0x9c07f30?, 0xc001daa240?}, 0xc002f23c00?, 0xc002b20d08?}, {0x9c0a310, 0xc000cabcc8})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/plan.go:520 +0x119
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Start(0xc003122000, {0x9c07f30?, 0xc001daa240?})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:147 +0x87
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*FlowCoordinator).Start.func1()
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:122 +0x34
github.com/cockroachdb/cockroach/pkg/sql/colexecerror.CatchVectorizedRuntimeError(0x0?)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecerror/error.go:91 +0x62
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*FlowCoordinator).Start(0xc0028aab00, {0x9c07f30?, 0xc001daa240?})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/flow_coordinator.go:121 +0x6b
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBaseNoHelper).Run(0xc0028aab00, {0x9c07f30?, 0xc001daa240?})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:707 +0x3a
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run(0xc0011f1860, {0x9c07f30, 0xc001daa240}, 0xc0028aab00?)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:475 +0x258
github.com/cockroachdb/cockroach/pkg/sql/colflow.(*vectorizedFlow).Run(0xc002b20fc8?, {0x9c07f30?, 0xc001daa240?}, 0xc002e7cb00?)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/colflow/vectorized_flow.go:249 +0x205
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc001eee8c0, {0x9c07fd8, 0xc0012c6570}, 0xc001789500, 0xc002e7cb00, 0xc000bd9000, 0xc0028a0700, 0xc002b20fc8, 0x0)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:582 +0xc56
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0x9c07fd8?, {0x9c07fd8, 0xc0012c6570}, 0xc002b20d08?, 0xc001789500, 0xc001522960?, {{0x9c0a310?, 0xc000cabcc8?}, 0x0?}, 0xc0028a0700)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:1430 +0x1af
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc002b20600, {0x9c07fd8, 0xc0012c6570}, 0xc002b20d08, 0xc0012c6570?, {0x9c51ce8, 0xc001522960}, 0x0?, 0xc000e53d98)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1472 +0x5f5
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc002b20600, {0x9c07fd8, 0xc001522a20}, 0xc002b20d08, {0x9c51ce8, 0xc001522960})
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1148 +0xa0c
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc002b20600, {0x9c07f30?, 0xc000c6fa40?}, {{0x9c2e868, 0xc00167e190}, {0xc002901180, 0x7a}, 0x0, 0x1}, 0x0, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:688 +0x1fd1
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc002b20600, {0x9c07f30, 0xc000c6fa40}, {{0x9c2e868, 0xc00167e190}, {0xc002901180, 0x7a}, 0x0, 0x1}, 0x0, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:145 +0x5b1
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1({{{0x9c2e868, 0xc00167e190}, {0xc002901180, 0x7a}, 0x0, 0x1}, {0x8cde8, 0xeda23d64c, 0x0}, {0x8cde8, ...}, ...}, ...)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1886 +0x305
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc002b20600)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1890 +0xb48
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc002b20600, {0x9c07fd8, 0xc001522420}, 0x6fdd7c0?, {0x0, 0x0, 0x0, 0x0, 0x0}, 0x0)
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1812 +0x265
github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).initConnEx.func1()
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:206 +0xa5
created by github.com/cockroachdb/cockroach/pkg/sql.(*InternalExecutor).initConnEx
	/Users/jason/.go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:205 +0x5ec

This is stack trace was taken with my (updated) branch at https://github.com/1lann/cockroach/tree/1lann/stalled-sql-leases-range by starting a single node with ./cockroach start-single-node --listen-addr 127.0.0.1 --insecure, running killall -SIGHUP cockroach and trying to connect to it with cockroach sql --debug-sql-cli --insecure.

I've also confirmed the call to (*tableWriterBase).finalize in the stack trace, is indeed finalizing the lease table via printing a debug stack trace there as well.

@knz
Copy link
Contributor

knz commented May 28, 2022

The particular stack trace you've found is most definitely not the one that is serving crdb_internal.node_build_info.
I'm also mostly convinced that no lease is needed to query that virtual table.

What is possible however, is that we've inadvertently built logic that accesses other tables as a side-effect of running a query. Perhaps to store statement statistics or something? We need a full goroutine dump and possibly server-side tracing.

@knz
Copy link
Contributor

knz commented May 28, 2022

okay I found what's going on. I will post more details in a minute.

@knz
Copy link
Contributor

knz commented May 28, 2022

So the proximate issue at hand is that an unqualified crdb_internal query resolves to select defaultdb.crdb_internal.xxx and this in turn takes a lease against the defaultdb.
The solution is to ensure the query is run against an empty database name (see PR #82014).

However note that this will likely not solve the larger problem -- if the lease table is unavailable, chances are the connection will not even succeed authentication: the system.users, system.role_members, system.role_options and other things will not be available either.

We do have a special case in the code when the client connects with a TLS client cert for user root, in which case we bypass many SQL things, but that special case does not (yet) work for other users or when using password authentication. For that, we would need to make progress on always-available users (see RFC in PR #44134).

@knz
Copy link
Contributor

knz commented May 28, 2022

@joshimhoff @1lann in light of the above, please advise how you would like to move forward. If you think it's time to make progress on #44134, we will want to touch base with a PM.

@knz
Copy link
Contributor

knz commented May 28, 2022

(btw @1lann - a simple stack trace in this case was not particularly useful. To debug this, I used the cluster settings sql.trace.txn.enable_threshold and sql.trace.log_statement_execute).

craig bot pushed a commit that referenced this issue May 30, 2022
81988: ui: fix when search contains * r=maryliag a=maryliag

Previously, when searching for * on statement
and transaction pages, the pages would crash.
This commit fixes this issue by using the literal
value * during the regex for highlighting the text.

Fixes #81695

Release note (bug fix): Statement and Transaction pages no longer
crash when search term includes '*'.

82014: cli/sql: avoid taking a lease on the target db upon connect r=otan a=knz

Informs #81673.   cc `@joshimhoff` `@1lann` 

This makes `--debug-sql-cli` slightly more useful if something is
wrong with the database in the URL string or if there's something
wrong in the leasing subsystem.

Release note: None

82032: docs: fix broken link in range merge RFC r=knz a=justinj



Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
@joshimhoff
Copy link
Collaborator

joshimhoff commented May 31, 2022

Nice, Jason, re: repro!

We do have a special case in the code when the client connects with a TLS client cert for user root, in which case we bypass many SQL things, but that special case does not (yet) work for other users or when using password authentication. For that, we would need to make progress on always-available users (see RFC in PR #44134).

For this use case, we intend to use the root user already, as we noticed the special return authenticated path in the auth code. So with your fix, I think we are all set. Thanks for it.

The main thing we want to do here is send KV troubleshooting queries, e.g. to crdb_internal.probe_ranges, in case the proximate cause of a hard down cluster is a borked system range. We've tested that such queries are available in face of a hard SQL down cluster. This came up in a recent CC incident, where a cluster was hard down because the sql lease table was unavailable.

@joshimhoff
Copy link
Collaborator

The only additional thing I can think to do is to write an integration test that ensures we can keep accessing a root SQL shell in face of a hard down cluster. Wdyt about that, @knz? I wonder if @1lann wants to do it. I can give him pointers; I've written similar tests.

@1lann
Copy link
Contributor Author

1lann commented May 31, 2022

@joshimhoff I don't mind doing that but in light of the conversation in Slack, is it worth the effort to go about writing a test for this if it's so difficult to predict the different failure states and if such failure states should even be supported?

@joshimhoff
Copy link
Collaborator

Agreed. Let's leave this as is for now. Thanks for the fix, @knz!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cli-client CLI commands that pertain to using SQL features C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-postmortem Originated from a Postmortem action item. O-sre For issues SRE opened or otherwise cares about tracking.
Projects
None yet
Development

No branches or pull requests

3 participants