-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: new statement details endpoint #76592
Conversation
a7a01e6
to
f2c6834
Compare
f2c6834
to
e56b979
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @maryliag)
-- commits, line 9 at r1:
Nit: we don't really have the concept of execution timestamp
since that information is already lost by the time we collect those stmt stats. Do you mean aggregation timestamp
?
pkg/server/serverpb/status.proto, line 1326 at r1 (raw file):
// query, failed, implicitTxn and database. So we don't need to add them // to the request. uint64 fingerprint_id = 1;
Should we also add in the transaction fingerprint ID here? Since the same statement stats can be part of multiple different transactions
pkg/server/serverpb/status.proto, line 1330 at r1 (raw file):
string full_scan = 3; string distSQL = 4; string vec = 5;
full_scan
, distSQL
and vec
are essentially the details information of a query plan, which is encoded in the plan_hash
. I feel that since we are already loading multiple plans for the statement details page, perhaps we don't really need these three fields?
pkg/server/serverpb/status.proto, line 1333 at r1 (raw file):
// Unix time range for aggregated statements. int64 start = 6 [(gogoproto.nullable) = true]; int64 end = 7 [(gogoproto.nullable) = true];
Most of the protobuf in CRDB define the "time" type using google.protobuf.Timestamp
type with some gogoproto
override. Perhaps we can use that here instead of int64
.
E.g.
google.protobuf.Timestamp created_at = 5 [(gogoproto.stdtime) = true];
pkg/server/serverpb/status.proto, line 1350 at r1 (raw file):
(gogoproto.stdduration) = true]; google.protobuf.Timestamp aggregated_ts = 3 [(gogoproto.nullable) = false, (gogoproto.stdtime) = true]; string explain_plan = 4;
Hmm is this field meant to be used to store the sampled plans? I think historically we have been using the ExplainTreePlanNode
for this field.
pkg/server/serverpb/status.proto, line 1351 at r1 (raw file):
google.protobuf.Timestamp aggregated_ts = 3 [(gogoproto.nullable) = false, (gogoproto.stdtime) = true]; string explain_plan = 4; uint64 plan_hash = 5;
also might worth a comment here noting what it means when this field is null.
pkg/server/serverpb/status.proto, line 1358 at r1 (raw file):
CollectedStatement statement = 1 [(gogoproto.nullable) = false]; repeated CollectedStatementGrouped statements_per_aggregated_ts = 2 [(gogoproto.nullable) = false]; repeated CollectedStatementGrouped statements_per_plan_hash = 3 [(gogoproto.nullable) = false];
Hmm is this field populated for "every distinct plan hash" or "every distinct plan hash and every distinct aggregated_ts" ? Might worth a comment here.
pkg/server/serverpb/status.proto, line 1361 at r1 (raw file):
// If set and non-empty, indicates the prefix to application_name // used for statements/queries issued internally by CockroachDB. string internal_app_name_prefix = 4;
hmm do we want to allow this to be nullable
?
58e0193
to
dbbdc1d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)
Previously, Azhng (Archer Zhang) wrote…
Nit: we don't really have the concept of
execution timestamp
since that information is already lost by the time we collect those stmt stats. Do you meanaggregation timestamp
?
Yes, I meant the aggregation. Made the changes
pkg/server/serverpb/status.proto, line 1326 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Should we also add in the transaction fingerprint ID here? Since the same statement stats can be part of multiple different transactions
On the statement page we don't care about the txn fingerprint, only on the transaction details page is that we split
pkg/server/serverpb/status.proto, line 1330 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
full_scan
,distSQL
andvec
are essentially the details information of a query plan, which is encoded in theplan_hash
. I feel that since we are already loading multiple plans for the statement details page, perhaps we don't really need these three fields?
oh I was not aware those were included on plan_hash, so that makes it easier and I can remove it from here, np!
pkg/server/serverpb/status.proto, line 1333 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Most of the protobuf in CRDB define the "time" type using
google.protobuf.Timestamp
type with somegogoproto
override. Perhaps we can use that here instead ofint64
.E.g.
google.protobuf.Timestamp created_at = 5 [(gogoproto.stdtime) = true];
I'm following the value used on the combinedstats endpoint, because what I'm doing is getting the same value from there to use here. If we keep each one with one format it would be messy to align. We can reconsider on a future PR, but I'll keep this one as is.
pkg/server/serverpb/status.proto, line 1350 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Hmm is this field meant to be used to store the sampled plans? I think historically we have been using the
ExplainTreePlanNode
for this field.
That is correct for plans saved on our stats, where you select the plan and it's returned in one line as a json that you can decode to tree, but this one here comes form the crdb_internal.decode_plan_gist()
, which returns a row per line, with them being strings, they're not rendered
pkg/server/serverpb/status.proto, line 1351 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
also might worth a comment here noting what it means when this field is null.
I split into two formats here, to make it more clear
pkg/server/serverpb/status.proto, line 1358 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Hmm is this field populated for "every distinct plan hash" or "every distinct plan hash and every distinct aggregated_ts" ? Might worth a comment here.
Improved the comments
pkg/server/serverpb/status.proto, line 1361 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
hmm do we want to allow this to be
nullable
?
It can be nullable, which is also the pattern for all other internal_app_name_prefix usages in proto files.
dbbdc1d
to
7721082
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I'll slack a separate question about transaction details 🙂.
I don't see an existing combined_statement_stats_test.go
; do we not usually test endpoints (perhaps those that are simple enough)? Or did we just happen to not test the combined statements endpoint? If it's the latter, I think the new details endpoint should be tested, whether now or soon after the release.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The frontend code has some parts to make testing endpoint easier, which is where we have the tests for the combined statements, so I added tests for the endpoint on my next PR
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @maryliag, and @matthewtodd)
pkg/server/combined_statement_stats.go, line 351 at r2 (raw file):
} func getStatementDetailsQueryClausesAndArgs(
nit: maybe move this function below getStatementDetails
for reading flow?
pkg/server/combined_statement_stats.go, line 357 at r2 (raw file):
buffer.WriteString(testingKnobs.GetAOSTClause()) buffer.WriteString(fmt.Sprintf(" WHERE encode(fingerprint_id, 'hex') = $%d", len(args)+1))
Can you say more about the len(args)+1
here and below? Is this code building a query "template" with embedded $1
placeholders that will be filled in at execution time? If so, it's not clear to me why the same value is used throughout. EDIT: I see now that this function is also modifying args
as it goes along, and that behavior surprised me. I hope there's a plainer way to write the code (I haven't taken a moment to think yet), but if not it's worth a comment highlighting the mechanism.
pkg/server/combined_statement_stats.go, line 372 at r2 (raw file):
appNames := strings.Split(req.AppName, ",") if strings.Contains(req.AppName, "(unset)") { appNames = append(appNames, "")
Does this mean we'll pass along both (unset)
and `` (the empty string)? That feels suspicious to me.
pkg/server/combined_statement_stats.go, line 376 at r2 (raw file):
buffer.WriteString(" AND (") for i, app := range appNames { if i != 0 {
Can this be done with a strings.Join
or something? Except the modifying args in-place prevents that. Hmm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. Do you know how long does it take to fetch this endpoint on the cluster with a lot of data?
Reviewed 2 of 5 files at r1, 2 of 3 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng, @maryliag, and @matthewtodd)
pkg/server/combined_statement_stats.go, line 384 at r2 (raw file):
} } buffer.WriteString(" )")
How about using the IN
operator here.
WHERE app_name IN (<val1>, <val2>, <val3>)
pkg/server/combined_statement_stats.go, line 459 at r2 (raw file):
var statement serverpb.StatementDetailsResponse_CollectedStatement it, err := ie.QueryIteratorEx(ctx, "combined-stmts-details-total", nil,
If we are only querying for one row, then we can use QueryRowEx
here instead.
pkg/server/combined_statement_stats.go, line 479 at r2 (raw file):
var row tree.Datums if row = it.Cur(); row == nil { return statement, errors.New("unexpected null row")
probably should be returning serverError
here and below too as well
pkg/server/combined_statement_stats.go, line 525 at r2 (raw file):
}, AppNames: appNames, AggregationInterval: time.Duration(aggInterval.Nanos()),
Simply assigning the aggInterval
directly here
pkg/server/combined_statement_stats.go, line 527 at r2 (raw file):
AggregationInterval: time.Duration(aggInterval.Nanos()), Stats: metadata.Stats, }
nit: override the metadata.Key.Query = queryPrettify
first, then you can simplify the assignment to be
statement = serverpb.StatementDetailsResponse_CollectedStatement{
KeyData: metadata.Key,
AppNames: appNames,
AggregationInterval: time.Duration(aggInterval.Nanos()),
Stats: metadata.Stats,
}
also, the metadata
object here in fact also contains stats, might as well just call it stats
. (I know we have been calling it metadata
in other places of the code base, but that's really a mistake)
pkg/server/combined_statement_stats.go, line 632 at r2 (raw file):
func getExplainPlanFromGist(ctx context.Context, ie *sql.InternalExecutor, planGist string) string { planError := "Error collecting Explain Plan." var args []interface{}
no need to use a slice here, since we know we only have 1 arg.
pkg/server/serverpb/status.proto, line 1326 at r2 (raw file):
// query, failed, implicitTxn and database. So we don't need to add them // to the request. string fingerprint_id = 1;
nit: use the gogoproto.customname
to override the Golang's field name to be FingerprintID
, otherwise protoc will generate field name as FingerprintId
pkg/server/serverpb/status.proto, line 1327 at r2 (raw file):
// to the request. string fingerprint_id = 1; string app_name = 2;
Just realized that this is not a single app_name
, but a list of app_name
s sent from the frontend. My would much prefer to have repeated string app_names = 2;
here.
String manipulation and the special logic to deal with the (unset)
case is exclusively frontend concept. Leaking frontend concept into the backend code is only going to make the code harder to maintain going forward.
pkg/server/serverpb/status.proto, line 1330 at r2 (raw file):
// Unix time range for aggregated statements. int64 start = 6 [(gogoproto.nullable) = true]; int64 end = 7 [(gogoproto.nullable) = true];
nit: the field value should be 3
and 4
respectively here.
pkg/server/serverpb/status.proto, line 1334 at r2 (raw file):
message StatementDetailsResponse { message CollectedStatement {
(optional) nit: just as a personal preference, how about CollectedStatementSummary
?
7721082
to
bacb557
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's similar to the overview statements, if the period selected has small amount of data it's fast, otherwise it might take several minutes
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @matthewtodd)
pkg/server/combined_statement_stats.go, line 351 at r2 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
nit: maybe move this function below
getStatementDetails
for reading flow?
Done
pkg/server/combined_statement_stats.go, line 357 at r2 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Can you say more about the
len(args)+1
here and below? Is this code building a query "template" with embedded$1
placeholders that will be filled in at execution time? If so, it's not clear to me why the same value is used throughout. EDIT: I see now that this function is also modifyingargs
as it goes along, and that behavior surprised me. I hope there's a plainer way to write the code (I haven't taken a moment to think yet), but if not it's worth a comment highlighting the mechanism.
Added a comment on the function description, explaining the format this is creating, also changed the order to add the arg first, so the length can used directly without the +1
pkg/server/combined_statement_stats.go, line 372 at r2 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Does this mean we'll pass along both
(unset)
and `` (the empty string)? That feels suspicious to me.
My idea here was that we treat `` as unset
in the ui, so the ui might send `unset` if selected, and in this case we want to search for statements without app name, but in the very unlikely case that the actual name of the app is `unset` I decided to still keep it here.
I decided to change and make that decision on the ui side now, so I'll be replacing by empty string before calling the endpoint
pkg/server/combined_statement_stats.go, line 376 at r2 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Can this be done with a
strings.Join
or something? Except the modifying args in-place prevents that. Hmm.
The reason for not using the join is exactly because of value on $1
or $2
and so on, as you mentioned.
pkg/server/combined_statement_stats.go, line 384 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
How about using the
IN
operator here.WHERE app_name IN (<val1>, <val2>, <val3>)
That was actually my original solution, but during the replacement, instead of doing app_name IN ('a', 'b', 'c')
it was creating app_name IN (a, b, c)
(which would not return the correct value), even though it was specified as array of strings, I tried with different approaches of replacements and still didn't work, so I went with this options instead.
pkg/server/combined_statement_stats.go, line 459 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
If we are only querying for one row, then we can use
QueryRowEx
here instead.
Done
pkg/server/combined_statement_stats.go, line 479 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
probably should be returning
serverError
here and below too as well
Done
pkg/server/combined_statement_stats.go, line 525 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Simply assigning the
aggInterval
directly here
aggInterval
is duration.Duration
, the value here requires Time.Duration, so this conversion is required
pkg/server/combined_statement_stats.go, line 527 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
nit: override the
metadata.Key.Query = queryPrettify
first, then you can simplify the assignment to bestatement = serverpb.StatementDetailsResponse_CollectedStatement{ KeyData: metadata.Key, AppNames: appNames, AggregationInterval: time.Duration(aggInterval.Nanos()), Stats: metadata.Stats, }also, the
metadata
object here in fact also contains stats, might as well just call itstats
. (I know we have been calling itmetadata
in other places of the code base, but that's really a mistake)
Done
pkg/server/combined_statement_stats.go, line 632 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
no need to use a slice here, since we know we only have 1 arg.
The QueryIteratorEx requires to be an array, so I would have to transform the planGist to an array anyway
pkg/server/serverpb/status.proto, line 1326 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
nit: use the
gogoproto.customname
to override the Golang's field name to beFingerprintID
, otherwise protoc will generate field name asFingerprintId
If the value is part of the query params , we do the custom name, but when is part of the path itself (you can see other examples on this same file: DiagnosticsRequest
, StoresRequest
, RangeRequest
) we don't to because the gateway doesn't read customNames (there is also a comment on this in this file on the node_id usage for example)
pkg/server/serverpb/status.proto, line 1327 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Just realized that this is not a single
app_name
, but a list ofapp_name
s sent from the frontend. My would much prefer to haverepeated string app_names = 2;
here.String manipulation and the special logic to deal with the
(unset)
case is exclusively frontend concept. Leaking frontend concept into the backend code is only going to make the code harder to maintain going forward.
Done
pkg/server/serverpb/status.proto, line 1330 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
nit: the field value should be
3
and4
respectively here.
Done
pkg/server/serverpb/status.proto, line 1334 at r2 (raw file):
Previously, Azhng (Archer Zhang) wrote…
(optional) nit: just as a personal preference, how about
CollectedStatementSummary
?
Done
d845092
to
d5a82fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Maybe quickly confirm with @Azhng on his ExplainTreePlanNode
question?)
Reviewed 2 of 5 files at r1, 3 of 3 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @Azhng and @maryliag)
-- commits, line 6 at r4:
nit: s/Aggregates/Aggregated/
d5a82fe
to
f3f1ad3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minus a few nits. Also do we need the release justification now?
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @maryliag)
pkg/server/combined_statement_stats.go, line 480 at r5 (raw file):
if err = sqlstatsutil.DecodeStmtStatsMetadataJSON(metadataJSON, &statistics); err != nil { return statement, err }
nit: more serverError
wrapping for the remaining of this function?
pkg/server/serverpb/status.proto, line 1350 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
That is correct for plans saved on our stats, where you select the plan and it's returned in one line as a json that you can decode to tree, but this one here comes form the
crdb_internal.decode_plan_gist()
, which returns a row per line, with them being strings, they're not rendered
Ah I see. Might worth it call them plan_gist
for unifying terminology? I'm also fine with it leaving as is.
pkg/server/serverpb/status.proto, line 1337 at r5 (raw file):
cockroach.sql.StatementStatisticsKey key_data = 1 [(gogoproto.nullable) = false]; // Formatted query is the return of the key_data.query after prettify_statement. // The value from the key_data cannot be replaced by the fomratted value, because is used as is for
nit: s/forratted/formatted/
Also I know we are on a crunch right now, but we should improve the test coverage of these status endpoint during stability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm also, is this endpoint omitted from tenant_status.go
on purpose?
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @maryliag)
f3f1ad3
to
1e9016f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for release justification yet.
The tests are coming on the next PR and what do you mean it's omitted on tenant_status? It's here: https://github.com/cockroachdb/cockroach/pull/76592/files#diff-3e9c68308c7fc990de93cc07b0e2a7c2428f507efd59df9037e11027f83da6fdR551-R567
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @matthewtodd)
pkg/server/combined_statement_stats.go, line 480 at r5 (raw file):
Previously, Azhng (Archer Zhang) wrote…
nit: more
serverError
wrapping for the remaining of this function?
Done (for the rest of the file, not just this function)
pkg/server/serverpb/status.proto, line 1350 at r1 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Ah I see. Might worth it call them
plan_gist
for unifying terminology? I'm also fine with it leaving as is.
It is an explain_plan, not a plan_gist. So the naming is using the right terminology, is just the format that is not JSON or ExplainTreePlanNode but a string instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I added release justification
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @matthewtodd)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, interesting. Somehow it's not being rendered on Reviewable 🤔
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @matthewtodd)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @matthewtodd)
New endpoint for Statement Details. The endpoint returns: - Statement: the total stats for that statement fingerprint - StatementsPerAggregatedTs: returns the statement from the period selected but each aggregated timestamp is an entry. This list can be used on the overview page on the Statement Details for the creation of charts by aggregation timestamp. - StatementsPerPlanHash: returns the statement from the period selected but each plan hash is an entry. This list can be used on the Explain Plan tab of the Statement Details, to show the execution stats per Plan. Partially addresses cockroachdb#72129 Release note (api change): New Statement Details endpoint that returns the details for a selected statement, and its execution separated by aggregation timestamp (statementsPerAggregatedTs) and by plan hash (statementsPerPlanHash).
1e9016f
to
caf2ede
Compare
bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Seems like this and #74377 skew with each-other in terms of generated code. |
I can see you cancelled the other PR, thank you for looking into it |
Build succeeded: |
New endpoint for Statement Details. The endpoint returns:
selected but each aggregated timestamp is an entry. This list
can be used on the overview page on the Statement Details for
the creation of charts by execution timestamp.
selected but each plan hash is an entry. This list can be used
on the Explain Plan tab of the Statement Details, to show the
execution stats per Plan.
Partially addresses #72129
Release note (api change): New Statement Details endpoint that returns
the details for a selected statement, and its execution separated by
aggregation timestamp (statementsPerAggregatedTs) and by plan hash
(statementsPerPlanHash).
Release Justification: Low risk, high benefit change