
adapter: record result_size in bytes in the activity log #30547

Merged
merged 11 commits into MaterializeInc:main from result-size-activity-log on Nov 27, 2024

Conversation

@bosconi (Member) commented Nov 18, 2024

We should record result_size in bytes along with rows_returned in the activity log.

Motivation

  • This PR adds a known-desirable feature.

https://github.com/MaterializeInc/database-issues/issues/8064

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@bosconi bosconi self-assigned this Nov 18, 2024
@ParkMyCar ParkMyCar force-pushed the result-size-activity-log branch from 7973460 to 5a1e3ea Compare November 25, 2024 17:54
@bosconi bosconi marked this pull request as ready for review November 26, 2024 01:09
@bosconi bosconi requested review from a team as code owners November 26, 2024 01:09
@bosconi bosconi requested review from ParkMyCar and jkosh44 November 26, 2024 01:09
@teskje (Contributor) commented Nov 26, 2024

Just checking: Are we able to handle changes like this through persist schema migrations nowadays, or some other mechanism? Or are we going to drop all existing statement log data on upgrade? And if it's the former, could we write an upgrade test?

@jkosh44 (Contributor) commented Nov 26, 2024

Just checking: Are we able to handle changes like this through persist schema migrations nowadays, or some other mechanism? Or are we going to drop all existing statement log data on upgrade?

For now we still drop all existing data on an upgrade. It's possible that the pieces are in place to use persist schema migrations, but that hasn't been hooked up to the builtin items yet.

Comment on lines 495 to 499
let result_size = rows
.clone()
.into_row_iter()
.map(|r| u64::cast_from(r.byte_len()))
.sum();
Contributor:

I'm not super familiar with SortedRowCollectionIter, is cloning and iterating through it expensive? If so, it feels a little bad to do that just to get the result size.

Member:

This is for FastPath peeks, so I don't anticipate the number of rows here being that large. But! Thanks for pushing back because this made me realize we already iterate through all of the rows and get the total response size to check against for the max_query_result_size session var.

Just pushed up a new commit that reuses the calculation we already do for that.
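
For context, here is a minimal, self-contained sketch of that idea: measure the total encoded size of the result in the same pass that enforces the max_query_result_size limit, so nothing extra is cloned or iterated just to log result_size. The names used here (EncodedRow, check_and_measure, ResultSizeError) are hypothetical stand-ins, not Materialize's actual API.

```rust
// Sketch only (not the actual Materialize code): accumulate the encoded byte
// length of each row once, then reuse that single total both for the
// max_query_result_size limit check and for the result_size activity-log field.

struct EncodedRow {
    bytes: Vec<u8>,
}

impl EncodedRow {
    fn byte_len(&self) -> usize {
        self.bytes.len()
    }
}

#[derive(Debug)]
struct ResultSizeError {
    limit: u64,
    actual: u64,
}

/// Walks the result rows once, returning the total encoded size in bytes if
/// it fits under the session limit.
fn check_and_measure(
    rows: &[EncodedRow],
    max_query_result_size: u64,
) -> Result<u64, ResultSizeError> {
    let mut total: u64 = 0;
    for row in rows {
        total += row.byte_len() as u64;
        if total > max_query_result_size {
            return Err(ResultSizeError {
                limit: max_query_result_size,
                actual: total,
            });
        }
    }
    // The same total can now be logged as `result_size` alongside
    // `rows_returned`, without a second pass over the rows.
    Ok(total)
}

fn main() {
    let rows = vec![
        EncodedRow { bytes: vec![0; 16] },
        EncodedRow { bytes: vec![0; 32] },
    ];
    match check_and_measure(&rows, 1024) {
        Ok(result_size) => println!("result_size = {result_size} bytes"),
        Err(e) => println!("result too large: {} > {}", e.actual, e.limit),
    }
}
```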

@@ -34,7 +34,7 @@ pub struct RowCollection {
     /// Contiguous blob of encoded Rows.
     encoded: Bytes,
     /// Metadata about an individual Row in the blob.
-    metadata: Vec<EncodedRowMetadata>,
+    metadata: Arc<[EncodedRowMetadata]>,
Contributor:

What's going on here, why'd we change this?

Member:

This makes it cheaper to clone a RowCollection. We don't really need the mutability of a Vec, so an Arc<[_]> works just as well, and cloning becomes just a ref-count increment.
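
As a standalone illustration of that trade-off (Meta is a hypothetical stand-in for EncodedRowMetadata, not the real type): cloning a Vec copies every element, while cloning an Arc<[_]> only bumps an atomic reference count.

```rust
use std::sync::Arc;

// Illustration only, not the actual RowCollection code.

#[derive(Clone, Debug)]
struct Meta {
    offset: usize,
    len: usize,
}

fn main() {
    let as_vec: Vec<Meta> = (0..4usize).map(|i| Meta { offset: i * 8, len: 8 }).collect();

    // Deep clone: allocates a new buffer and copies all elements.
    let vec_clone = as_vec.clone();

    // Converting once into Arc<[Meta]> gives up mutability...
    let as_arc: Arc<[Meta]> = as_vec.into();

    // ...but every subsequent clone is just an atomic ref-count increment;
    // the underlying metadata is shared rather than duplicated.
    let arc_clone = Arc::clone(&as_arc);

    assert_eq!(vec_clone.len(), arc_clone.len());
    println!("both views see {} metadata entries", arc_clone.len());
}
```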

@jkosh44 (Contributor) left a comment:

LGTM!

bosconi and others added 11 commits November 27, 2024 14:23
* make RowIterator clone-able so we can use it to calculate total result size
* Use Arc<[_]> in a few places so it's cheaper to clone
* bin/fmt
In the switch from CatalogItemId to GlobalId I accidentally broke the legacy builtin item
update path, and this is the first time since then we've needed it. For old storage collections
we switched to passing CatalogItemIds and then eventually looking up their GlobalIds from the
Catalog, but these old items never get inserted into the Catalog so the lookup fails. This
commit changes the legacy builtin item migrations to track old collections via GlobalIds.

FWIW, the original design for CatalogItemIds called out that GlobalIds would somewhat become
CollectionIds, and here we're tracking storage_collections_to_drop, so the switch back to
GlobalId in this codepath makes sense IMO.
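
A toy sketch of the failure mode and the fix described in that commit message, using hypothetical stand-in types (the Catalog, CatalogItemId, and GlobalId here are simplified placeholders, not Materialize's real definitions):

```rust
use std::collections::HashMap;

// If a migration records CatalogItemIds and later resolves them to GlobalIds
// through the catalog, legacy builtin items that were never inserted into the
// catalog make the lookup fail. Recording the GlobalIds directly avoids the
// lookup entirely.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CatalogItemId(u64);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct GlobalId(u64);

struct Catalog {
    items: HashMap<CatalogItemId, GlobalId>,
}

impl Catalog {
    /// Returns None for legacy builtin items that were never inserted.
    fn global_id(&self, id: CatalogItemId) -> Option<GlobalId> {
        self.items.get(&id).copied()
    }
}

fn main() {
    let catalog = Catalog { items: HashMap::new() };
    let legacy_item = CatalogItemId(42);

    // Old approach: track the CatalogItemId and resolve it later -- the
    // lookup fails because the legacy item is not in the catalog.
    assert!(catalog.global_id(legacy_item).is_none());

    // New approach: the migration tracks the storage collection to drop by
    // its GlobalId from the start, so no catalog lookup is needed.
    let storage_collections_to_drop: Vec<GlobalId> = vec![GlobalId(42)];
    println!("dropping {} legacy collection(s)", storage_collections_to_drop.len());
}
```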
@ParkMyCar ParkMyCar force-pushed the result-size-activity-log branch from 7d68b6f to 0c04312 Compare November 27, 2024 19:31
@ParkMyCar ParkMyCar merged commit 4619e12 into MaterializeInc:main Nov 27, 2024
83 checks passed