-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli: Enable profiles and other debug info for tenants #76539
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! A few nits/questions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)
-- commits, line 7 at r1:
Any chance these can be split into 2 separate commits? I think it would make it easier to reason about which diffs are relevant.
Code quote:
also addresses a bug
-- commits, line 10 at r1:
nit: typo
Code quote:
p;d
pkg/cli/zip_cluster_wide.go, line 201 at r1 (raw file):
nodesResponse, err := zc.status.Nodes(ctx, &serverpb.NodesRequest{}) nodesList := &serverpb.NodesListResponse{} if code := status.Code(errors.Cause(err)); code == codes.Unimplemented {
nit: Should this also include some kind of err != nil
check?
Code quote:
if code := status.Code(errors.Cause(err)); code == codes.Unimplemented {
pkg/server/status_util.go, line 31 at r1 (raw file):
// This file contains helper functions used by tenantStatusServer and statusServer. func profileLocal( ctx context.Context, req *serverpb.ProfileRequest, st *cluster.Settings,
nit: For this fn and the 2 below, would it make sense to try and decouple the params list from the request objects? It might help keep them flexible to be used elsewhere in the future - or do we want them to be coupled? 🤔
Code quote:
req *serverpb.ProfileRequest
pkg/server/status_util.go, line 97 at r1 (raw file):
func getLocalFiles( req *serverpb.GetFilesRequest, heapProfileDirName string,
Yay! This means we should get active query dumps in tenant debug.zip's for each SQL server instance. 🎉
Code quote:
heapProfileDirName string,
pkg/server/tenant_status.go, line 810 at r1 (raw file):
} instanceID, local, err := t.parseInstanceID(request.NodeId)
nit: consider updating the docstring comments on the request proto message to note that this can also be an instance ID
Code quote:
request.NodeId
pkg/server/tenant_status.go, line 838 at r1 (raw file):
} instanceID, local, err := t.parseInstanceID(request.NodeId)
nit: same as above
Code quote:
request.NodeId
pkg/server/tenant_status.go, line 1100 at r1 (raw file):
} instanceID, local, err := t.parseInstanceID(req.NodeId)
nit: ditto above
Code quote:
req.NodeId
9101acb
to
cb27d5f
Compare
cb27d5f
to
13c7cf5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the overall structure. I'll let your team judge on local correctness.
Reviewed 15 of 15 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @rimadeodhar, and @srosenberg)
pkg/server/status_util.go, line 1 at r2 (raw file):
// Copyright 2021 The Cockroach Authors.
file name: status_local_file_retrieval.go
(The name _util.go
suggests this should become a dumping ground. Let's avoid creating that impression.)
pkg/server/status_util.go, line 1 at r2 (raw file):
// Copyright 2021 The Cockroach Authors.
nit: 2022, not 2021
pkg/server/status_util.go, line 30 at r2 (raw file):
) // This file contains helper functions used by tenantStatusServer and statusServer.
Here and below: explain whether each function returns a gRPC error or not.
pkg/server/status_util.go, line 47 at r2 (raw file):
} if err := pprof.StartCPUProfile(&buf); err != nil { return err
what about this err return
pkg/server/status_util.go, line 57 at r2 (raw file):
} }); err != nil { return nil, err
what about this err return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @rimadeodhar, and @srosenberg)
-- commits, line 11 at r3:
nit: storage nodes instead of kv?
-- commits, line 19 at r3:
From reading this commit message I don't gain any understanding about what the problem was. Can you provide some more info here on what the issue is and why the NodesList
API is inadequate while the Nodes
API is.
pkg/cli/zip.go, line 99 at r3 (raw file):
ctx context.Context, ni nodesInfo, fn func(ctx context.Context, nodeDetails serverpb.NodeDetails, nodeStatus *statuspb.NodeStatus) error,
Consider adding a comment here that nodeStatus
should be expected to be nil for valid SQL-only nodes.
pkg/cli/zip.go, line 105 at r3 (raw file):
return errors.AssertionFailedf("nodes list is empty") } if ni.nodeStatusResponse != nil && len(ni.nodeStatusResponse.Nodes) != len(ni.nodeListResponse.Nodes) {
Are these guaranteed to match? Just wondering if, for instance, one lists recently decommissioned nodes and another does not. It's possible the count may not match in exactly those situations where we take a debug.zip so just want to check.
pkg/cli/zip_cluster_wide.go, line 118 at r3 (raw file):
// SQL only servers will only return nodeDetails for all SQL nodes. // Storage servers will return both nodeDetails as well as nodeStatus // for all KV nodes.
Nothing below is named nodeDetails
. Can you clarify?
pkg/cli/zip_cluster_wide.go, line 121 at r3 (raw file):
type nodesInfo struct { nodeStatusResponse *serverpb.NodesResponse nodeListResponse *serverpb.NodesListResponse
nit: should naming match type? plural nodesList...
and nodesStatus...
?
pkg/cli/zip_cluster_wide.go, line 204 at r3 (raw file):
if code := status.Code(errors.Cause(err)); code == codes.Unimplemented { // Likely a SQL only server; try the NodesList endpoint. nodesList, err = zc.status.NodesList(ctx, &serverpb.NodesListRequest{})
if we know this is a SQL-only server should we just return from here? I got a bit confused about why the if clause below was overwriting the nodesList
but I get that it's for storage/kv nodes only now.
pkg/cli/zip_cluster_wide.go, line 206 at r3 (raw file):
nodesList, err = zc.status.NodesList(ctx, &serverpb.NodesListRequest{}) } if err != nil {
this err
gets overwritten sometimes by the NodesList
call above...any way we can still catch the error messages from the Nodes
call?
pkg/server/tenant_status.go, line 830 at r2 (raw file):
func (t *tenantStatusServer) Stacks( ctx context.Context, request *serverpb.StacksRequest, ) (*serverpb.JSONResponse, error) {
should this one have an Admin permission check>?
13c7cf5
to
8227fde
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, @knz, and @srosenberg)
Previously, abarganier (Alex Barganier) wrote…
Any chance these can be split into 2 separate commits? I think it would make it easier to reason about which diffs are relevant.
Done.
Previously, dhartunian (David Hartunian) wrote…
From reading this commit message I don't gain any understanding about what the problem was. Can you provide some more info here on what the issue is and why the
NodesList
API is inadequate while theNodes
API is.
Done.
pkg/cli/zip.go, line 99 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Consider adding a comment here that
nodeStatus
should be expected to be nil for valid SQL-only nodes.
Done.
pkg/cli/zip.go, line 105 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Are these guaranteed to match? Just wondering if, for instance, one lists recently decommissioned nodes and another does not. It's possible the count may not match in exactly those situations where we take a debug.zip so just want to check.
Yes, it is guaranteed to match since the nodeListResponse is constructed from the nodeStatusResponse on storage servers.
pkg/cli/zip_cluster_wide.go, line 201 at r1 (raw file):
Previously, abarganier (Alex Barganier) wrote…
nit: Should this also include some kind of
err != nil
check?
The nil check does happen subsequently. So, we first check to see if the error is an unimplemented method error and if so, we try the new API. If not, then we perform the nil check below. The original error will get overwritten by the error returned by NodesListRequest
in case the initial error was an unimplemented method error but that is intentional.
pkg/cli/zip_cluster_wide.go, line 118 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Nothing below is named
nodeDetails
. Can you clarify?
Done.
pkg/cli/zip_cluster_wide.go, line 204 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
if we know this is a SQL-only server should we just return from here? I got a bit confused about why the if clause below was overwriting the
nodesList
but I get that it's for storage/kv nodes only now.
There is no way to know at this point whether the server is SQL only or not. So, we allow the method to complete execution.
pkg/cli/zip_cluster_wide.go, line 206 at r3 (raw file):
Previously, dhartunian (David Hartunian) wrote…
this
err
gets overwritten sometimes by theNodesList
call above...any way we can still catch the error messages from theNodes
call?
Thats intentional. The only time the error will get overwritten if the original call to NodesRequest
returns an unimplemented error indicating this is a SQL only server.
pkg/server/tenant_status.go, line 830 at r2 (raw file):
Previously, dhartunian (David Hartunian) wrote…
should this one have an Admin permission check>?
Done.
pkg/server/status_util.go, line 31 at r1 (raw file):
Previously, abarganier (Alex Barganier) wrote…
nit: For this fn and the 2 below, would it make sense to try and decouple the params list from the request objects? It might help keep them flexible to be used elsewhere in the future - or do we want them to be coupled? 🤔
I have intentionally couple it with the request types as these methods are meant to be helper methods to both the tenant and storage status servers. I don't want these methods to be used as general purpose util methods.
pkg/server/status_util.go, line 97 at r1 (raw file):
Previously, abarganier (Alex Barganier) wrote…
Yay! This means we should get active query dumps in tenant debug.zip's for each SQL server instance. 🎉
Done.
pkg/server/status_util.go, line 1 at r2 (raw file):
Previously, knz (kena) wrote…
file name:
status_local_file_retrieval.go
(The name
_util.go
suggests this should become a dumping ground. Let's avoid creating that impression.)
Done.
pkg/server/status_util.go, line 30 at r2 (raw file):
Previously, knz (kena) wrote…
Here and below: explain whether each function returns a gRPC error or not.
Done.
pkg/server/status_util.go, line 47 at r2 (raw file):
Previously, knz (kena) wrote…
what about this err return
Done.
pkg/server/status_util.go, line 57 at r2 (raw file):
Previously, knz (kena) wrote…
what about this err return
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, @knz, @rimadeodhar, and @srosenberg)
pkg/cli/zip_cluster_wide.go, line 201 at r1 (raw file):
Previously, rimadeodhar (Rima Deodhar) wrote…
The nil check does happen subsequently. So, we first check to see if the error is an unimplemented method error and if so, we try the new API. If not, then we perform the nil check below. The original error will get overwritten by the error returned by
NodesListRequest
in case the initial error was an unimplemented method error but that is intentional.
I see, thanks for clarifying. SGTM!
pkg/server/status_util.go, line 31 at r1 (raw file):
Previously, rimadeodhar (Rima Deodhar) wrote…
I have intentionally couple it with the request types as these methods are meant to be helper methods to both the tenant and storage status servers. I don't want these methods to be used as general purpose util methods.
👍 Sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 10 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, @knz, @rimadeodhar, and @srosenberg)
TFTR! bors r+ |
8227fde
to
4d66a1b
Compare
This commit updates debug.zip functionality to collect goroutine stacks and profiles for all active SQL instances for a tenant. Release note: None
This commit addresses a bug where the nodes.json and status.json data for storage servers was not getting populated correctly. Due to a recent switch to the NodesList API, the nodes and status json files were missing information such as store metrics, store status etc. for storage nodes. This bug has been addressed by reverting back to using the Nodes API when the debug zip command is run against a storage server. We will continue to use NodesList API for SQL only servers since we want to filter out storage specific information for SQL only servers. Release note: None
4d66a1b
to
fd6c580
Compare
Retrying merge. bors r+ |
Build succeeded: |
This PR updates debug.zip functionality to
collect goroutine stacks and profiles for all
active SQL instances for a tenant.
This PR also addresses a bug where the nodes.json
and status.json data was not getting populated
correctly due to the switch to the
NodesList
API.This bug has been addressed by using the p;d
Nodes
API when the debug zip command is run againsta storage server.
Release note: None