Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: Enable profiles and other debug info for tenants #76539

Merged
merged 2 commits into from
Feb 25, 2022

Conversation

rimadeodhar
Copy link
Collaborator

@rimadeodhar rimadeodhar commented Feb 14, 2022

This PR updates debug.zip functionality to
collect goroutine stacks and profiles for all
active SQL instances for a tenant.
This PR also addresses a bug where the nodes.json
and status.json data was not getting populated
correctly due to the switch to the NodesList API.
This bug has been addressed by using the p;d
Nodes API when the debug zip command is run against
a storage server.

Release note: None

@rimadeodhar rimadeodhar requested review from knz and a team February 14, 2022 21:37
@rimadeodhar rimadeodhar requested a review from a team as a code owner February 14, 2022 21:37
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! A few nits/questions.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @rimadeodhar)


-- commits, line 7 at r1:
Any chance these can be split into 2 separate commits? I think it would make it easier to reason about which diffs are relevant.

Code quote:

also addresses a bug

-- commits, line 10 at r1:
nit: typo

Code quote:

p;d

pkg/cli/zip_cluster_wide.go, line 201 at r1 (raw file):

	nodesResponse, err := zc.status.Nodes(ctx, &serverpb.NodesRequest{})
	nodesList := &serverpb.NodesListResponse{}
	if code := status.Code(errors.Cause(err)); code == codes.Unimplemented {

nit: Should this also include some kind of err != nil check?

Code quote:

if code := status.Code(errors.Cause(err)); code == codes.Unimplemented {

pkg/server/status_util.go, line 31 at r1 (raw file):

// This file contains helper functions used by tenantStatusServer and statusServer.
func profileLocal(
	ctx context.Context, req *serverpb.ProfileRequest, st *cluster.Settings,

nit: For this fn and the 2 below, would it make sense to try and decouple the params list from the request objects? It might help keep them flexible to be used elsewhere in the future - or do we want them to be coupled? 🤔

Code quote:

req *serverpb.ProfileRequest

pkg/server/status_util.go, line 97 at r1 (raw file):

func getLocalFiles(
	req *serverpb.GetFilesRequest,
	heapProfileDirName string,

Yay! This means we should get active query dumps in tenant debug.zip's for each SQL server instance. 🎉

Code quote:

	heapProfileDirName string,

pkg/server/tenant_status.go, line 810 at r1 (raw file):

	}

	instanceID, local, err := t.parseInstanceID(request.NodeId)

nit: consider updating the docstring comments on the request proto message to note that this can also be an instance ID

Code quote:

request.NodeId

pkg/server/tenant_status.go, line 838 at r1 (raw file):

	}

	instanceID, local, err := t.parseInstanceID(request.NodeId)

nit: same as above

Code quote:

request.NodeId

pkg/server/tenant_status.go, line 1100 at r1 (raw file):

	}

	instanceID, local, err := t.parseInstanceID(req.NodeId)

nit: ditto above

Code quote:

req.NodeId

@rimadeodhar rimadeodhar requested review from a team as code owners February 16, 2022 17:28
@rimadeodhar rimadeodhar requested review from srosenberg and abarganier and removed request for a team February 16, 2022 17:28
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: on the overall structure. I'll let your team judge on local correctness.

Reviewed 15 of 15 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @rimadeodhar, and @srosenberg)


pkg/server/status_util.go, line 1 at r2 (raw file):

// Copyright 2021 The Cockroach Authors.

file name: status_local_file_retrieval.go

(The name _util.go suggests this should become a dumping ground. Let's avoid creating that impression.)


pkg/server/status_util.go, line 1 at r2 (raw file):

// Copyright 2021 The Cockroach Authors.

nit: 2022, not 2021


pkg/server/status_util.go, line 30 at r2 (raw file):

)

// This file contains helper functions used by tenantStatusServer and statusServer.

Here and below: explain whether each function returns a gRPC error or not.


pkg/server/status_util.go, line 47 at r2 (raw file):

			}
			if err := pprof.StartCPUProfile(&buf); err != nil {
				return err

what about this err return


pkg/server/status_util.go, line 57 at r2 (raw file):

			}
		}); err != nil {
			return nil, err

what about this err return

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 15 of 15 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @rimadeodhar, and @srosenberg)


-- commits, line 11 at r3:
nit: storage nodes instead of kv?


-- commits, line 19 at r3:
From reading this commit message I don't gain any understanding about what the problem was. Can you provide some more info here on what the issue is and why the NodesList API is inadequate while the Nodes API is.


pkg/cli/zip.go, line 99 at r3 (raw file):

	ctx context.Context,
	ni nodesInfo,
	fn func(ctx context.Context, nodeDetails serverpb.NodeDetails, nodeStatus *statuspb.NodeStatus) error,

Consider adding a comment here that nodeStatus should be expected to be nil for valid SQL-only nodes.


pkg/cli/zip.go, line 105 at r3 (raw file):

		return errors.AssertionFailedf("nodes list is empty")
	}
	if ni.nodeStatusResponse != nil && len(ni.nodeStatusResponse.Nodes) != len(ni.nodeListResponse.Nodes) {

Are these guaranteed to match? Just wondering if, for instance, one lists recently decommissioned nodes and another does not. It's possible the count may not match in exactly those situations where we take a debug.zip so just want to check.


pkg/cli/zip_cluster_wide.go, line 118 at r3 (raw file):

// SQL only servers will only return nodeDetails for all SQL nodes.
// Storage servers will return both nodeDetails as well as nodeStatus
// for all KV nodes.

Nothing below is named nodeDetails. Can you clarify?


pkg/cli/zip_cluster_wide.go, line 121 at r3 (raw file):

type nodesInfo struct {
	nodeStatusResponse *serverpb.NodesResponse
	nodeListResponse   *serverpb.NodesListResponse

nit: should naming match type? plural nodesList... and nodesStatus...?


pkg/cli/zip_cluster_wide.go, line 204 at r3 (raw file):

	if code := status.Code(errors.Cause(err)); code == codes.Unimplemented {
		// Likely a SQL only server; try the NodesList endpoint.
		nodesList, err = zc.status.NodesList(ctx, &serverpb.NodesListRequest{})

if we know this is a SQL-only server should we just return from here? I got a bit confused about why the if clause below was overwriting the nodesList but I get that it's for storage/kv nodes only now.


pkg/cli/zip_cluster_wide.go, line 206 at r3 (raw file):

		nodesList, err = zc.status.NodesList(ctx, &serverpb.NodesListRequest{})
	}
	if err != nil {

this err gets overwritten sometimes by the NodesList call above...any way we can still catch the error messages from the Nodes call?


pkg/server/tenant_status.go, line 830 at r2 (raw file):

func (t *tenantStatusServer) Stacks(
	ctx context.Context, request *serverpb.StacksRequest,
) (*serverpb.JSONResponse, error) {

should this one have an Admin permission check>?

Copy link
Collaborator Author

@rimadeodhar rimadeodhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, @knz, and @srosenberg)


-- commits, line 7 at r1:

Previously, abarganier (Alex Barganier) wrote…

Any chance these can be split into 2 separate commits? I think it would make it easier to reason about which diffs are relevant.

Done.


-- commits, line 19 at r3:

Previously, dhartunian (David Hartunian) wrote…

From reading this commit message I don't gain any understanding about what the problem was. Can you provide some more info here on what the issue is and why the NodesList API is inadequate while the Nodes API is.

Done.


pkg/cli/zip.go, line 99 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Consider adding a comment here that nodeStatus should be expected to be nil for valid SQL-only nodes.

Done.


pkg/cli/zip.go, line 105 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Are these guaranteed to match? Just wondering if, for instance, one lists recently decommissioned nodes and another does not. It's possible the count may not match in exactly those situations where we take a debug.zip so just want to check.

Yes, it is guaranteed to match since the nodeListResponse is constructed from the nodeStatusResponse on storage servers.


pkg/cli/zip_cluster_wide.go, line 201 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

nit: Should this also include some kind of err != nil check?

The nil check does happen subsequently. So, we first check to see if the error is an unimplemented method error and if so, we try the new API. If not, then we perform the nil check below. The original error will get overwritten by the error returned by NodesListRequest in case the initial error was an unimplemented method error but that is intentional.


pkg/cli/zip_cluster_wide.go, line 118 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Nothing below is named nodeDetails. Can you clarify?

Done.


pkg/cli/zip_cluster_wide.go, line 204 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

if we know this is a SQL-only server should we just return from here? I got a bit confused about why the if clause below was overwriting the nodesList but I get that it's for storage/kv nodes only now.

There is no way to know at this point whether the server is SQL only or not. So, we allow the method to complete execution.


pkg/cli/zip_cluster_wide.go, line 206 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

this err gets overwritten sometimes by the NodesList call above...any way we can still catch the error messages from the Nodes call?

Thats intentional. The only time the error will get overwritten if the original call to NodesRequest returns an unimplemented error indicating this is a SQL only server.


pkg/server/tenant_status.go, line 830 at r2 (raw file):

Previously, dhartunian (David Hartunian) wrote…

should this one have an Admin permission check>?

Done.


pkg/server/status_util.go, line 31 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

nit: For this fn and the 2 below, would it make sense to try and decouple the params list from the request objects? It might help keep them flexible to be used elsewhere in the future - or do we want them to be coupled? 🤔

I have intentionally couple it with the request types as these methods are meant to be helper methods to both the tenant and storage status servers. I don't want these methods to be used as general purpose util methods.


pkg/server/status_util.go, line 97 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Yay! This means we should get active query dumps in tenant debug.zip's for each SQL server instance. 🎉

Done.


pkg/server/status_util.go, line 1 at r2 (raw file):

Previously, knz (kena) wrote…

file name: status_local_file_retrieval.go

(The name _util.go suggests this should become a dumping ground. Let's avoid creating that impression.)

Done.


pkg/server/status_util.go, line 30 at r2 (raw file):

Previously, knz (kena) wrote…

Here and below: explain whether each function returns a gRPC error or not.

Done.


pkg/server/status_util.go, line 47 at r2 (raw file):

Previously, knz (kena) wrote…

what about this err return

Done.


pkg/server/status_util.go, line 57 at r2 (raw file):

Previously, knz (kena) wrote…

what about this err return

Done.

Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, @knz, @rimadeodhar, and @srosenberg)


pkg/cli/zip_cluster_wide.go, line 201 at r1 (raw file):

Previously, rimadeodhar (Rima Deodhar) wrote…

The nil check does happen subsequently. So, we first check to see if the error is an unimplemented method error and if so, we try the new API. If not, then we perform the nil check below. The original error will get overwritten by the error returned by NodesListRequest in case the initial error was an unimplemented method error but that is intentional.

I see, thanks for clarifying. SGTM!


pkg/server/status_util.go, line 31 at r1 (raw file):

Previously, rimadeodhar (Rima Deodhar) wrote…

I have intentionally couple it with the request types as these methods are meant to be helper methods to both the tenant and storage status servers. I don't want these methods to be used as general purpose util methods.

👍 Sounds good to me.

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 10 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier, @dhartunian, @knz, @rimadeodhar, and @srosenberg)

@rimadeodhar
Copy link
Collaborator Author

TFTR!

bors r+

This commit updates debug.zip functionality to
collect goroutine stacks and profiles for all
active SQL instances for a tenant.

Release note: None
This commit addresses a bug where the nodes.json
and status.json data for storage servers was not
getting populated correctly. Due to a recent switch
to the NodesList API, the nodes and status json files
were missing information such as store metrics,
store status etc. for storage nodes.
This bug has been addressed by reverting back to using
the Nodes API when the debug zip command is run against
a storage server.
We will continue to use NodesList API for SQL only
servers since we want to filter out storage specific
information for SQL only servers.

Release note: None
@rimadeodhar
Copy link
Collaborator Author

Retrying merge.

bors r+

@craig craig bot merged commit 2826900 into cockroachdb:master Feb 25, 2022
@craig
Copy link
Contributor

craig bot commented Feb 25, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants