Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdcevent: only fetch relevant columns in the rowfetcher #89329

Merged
merged 1 commit into from
Oct 13, 2022

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented Oct 4, 2022

Previously, a rowfetcher would be configured to fetch all public columns in a
table, including virtual columns. Then, if we only wanted to read columns
belonging to a particular family, we would index into this row. This change
updates row fetchers to only fetch relevant data, including only the primary
key columns and columns in a specified family.

This change also updates the row fetcher such that it does not fetch
virtual columns. Instead, they are added on by the event descriptor.
If the descriptor specifies a column outside of the physical row,
the row iterator will fill in a null value upon iteration, representing
a virtual column.

Release note (enterprise change): When a changefeed is run with the option
virtual_columns = "null", the virtual column will be ordered last in each
row.

Informs: #80976
Epic: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava
Copy link
Contributor Author

jayshrivastava commented Oct 4, 2022

Starting with this small PR to work towards #80976.

Next steps (https://github.com/jayshrivastava/cockroach/commits/rowfetcher-2):

  • ensure they key-only envelope means the rowfetcher only fetches primary keys
  • don't invalidate cache for one family when a user-defined type in another family changes
  • remove rf.IgnoreUnexpectedNulls done

@jayshrivastava jayshrivastava marked this pull request as ready for review October 4, 2022 21:06
@jayshrivastava jayshrivastava requested a review from a team as a code owner October 4, 2022 21:06
@jayshrivastava jayshrivastava changed the title cdcevent: only fetch relevant columns in the rowfetcher cdcevent: filter out non-target column families and virtual columns in the rowfetcher Oct 4, 2022
@jayshrivastava jayshrivastava changed the title cdcevent: filter out non-target column families and virtual columns in the rowfetcher cdcevent: only fetch relevant columns in the rowfetcher Oct 10, 2022
Copy link
Contributor

@HonoreDB HonoreDB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple of minor line comments. Also, please add a few tests in changefeed_test.go with tables with multiple column families and undergoing schema changes in ways that could change column ordering.

Reviewed 2 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/cdcevent/event.go line 284 at r1 (raw file):

		addColumn(col, last)
		sd.valueCols = append(sd.valueCols, last)
		last++

Maybe instead of giving these real-looking ordinals, they should have an ordinal of 2^32-1 which you can check for explicitly instead of looking for out-of-bounds indices.


pkg/ccl/changefeedccl/cdcevent/event.go line 663 at r1 (raw file):

	columns := tableDesc.GetPrimaryIndex().CollectKeyColumnIDs()
	for _, colID := range familyDesc.ColumnIDs {
		col, err := tableDesc.FindColumnWithID(colID)

Nit: FindColumnWithID is linear in the number of columns, so this method is O(n^2). Probably better to build another column set from tableDesc.PublicColumnIDs, intersect it with familyDesc.ColumnIDs, and then union the result with the key column ids.

Previously, a rowfetcher would be configured to fetch all public columns in a
table, including virtual columns. Then, if we only wanted to read columns
belonging to a particular family, we would index into this row. This change
updates row fetchers to only fetch relevant data, including only the primary
key columns and columns in a specified family.

This change also updates the row fetcher such that it does not fetch
virtual columns. Instead, they are added on by the event descriptor.
If the descriptor specifies a column outside of the physical row,
the row iterator will fill in a null value upon iteration, representing
a virtual column.

Release note (enterprise change): When a changefeed is run with the option
virtual_columns = "null", the virtual column will be ordered last in each
row.
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 edit: sorry I accidentally messed up my code quotes in reviewable.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB)


pkg/ccl/changefeedccl/cdcevent/event.go line 663 at r1 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Nit: FindColumnWithID is linear in the number of columns, so this method is O(n^2). Probably better to build another column set from tableDesc.PublicColumnIDs, intersect it with familyDesc.ColumnIDs, and then union the result with the key column ids.

I considered doing these set operations, but I realized it sometimes returns the wrong result.

Firstly, during a schema change, it's possible for familyDesc to have a column which is not public. Since the familyDesc does not have information about public columns, you have to check the tableDesc.
Secondly, when you change the type of a column (ie. change its position), calling tableDesc.PublicColumnIDs still keeps the column in its original position (although with a different ID, sorting by PGAttributeNum internally). If we use tableColSet.Ordered, it will sort by columnID.

To preserve correctness during schema changes and also the behavior of what happens in master (which uses tableDesc.PublicColumnIDs), I think what I have now is the best way to go forward.


pkg/ccl/changefeedccl/cdcevent/event.go line 133 at r2 (raw file):

// forEachColumn is a helper which invokes fn for reach column in the ordColumn list.
func (r Row) forEachDatum(fn DatumFn, colIndexes []int) error {

forEachDatum and NewEventDescriptor now use virtualColOrd to put virtual cols in the correct spot instead of sticking them on the end of the row. This is closer to the behavior of master.

Code quote:

forEachDatum

pkg/ccl/changefeedccl/cdcevent/event_test.go line 393 at r2 (raw file):

				"INSERT INTO foo (i,j,a,b) VALUES (1,2,'a1','b1')",
			},
			expectMainFamily: []decodeExpectation{

This behavior aligns with master. My previous code did not.

Copy link
Contributor

@HonoreDB HonoreDB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@jayshrivastava
Copy link
Contributor Author

TYFR!

@jayshrivastava
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 13, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 13, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants