-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cdcevent: only fetch relevant columns in the rowfetcher #89329
Conversation
dd08b3d
to
4f77618
Compare
Starting with this small PR to work towards #80976. Next steps (https://github.com/jayshrivastava/cockroach/commits/rowfetcher-2):
|
4f77618
to
fb45570
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! A couple of minor line comments. Also, please add a few tests in changefeed_test.go with tables with multiple column families and undergoing schema changes in ways that could change column ordering.
Reviewed 2 of 3 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/cdcevent/event.go
line 284 at r1 (raw file):
addColumn(col, last) sd.valueCols = append(sd.valueCols, last) last++
Maybe instead of giving these real-looking ordinals, they should have an ordinal of 2^32-1 which you can check for explicitly instead of looking for out-of-bounds indices.
pkg/ccl/changefeedccl/cdcevent/event.go
line 663 at r1 (raw file):
columns := tableDesc.GetPrimaryIndex().CollectKeyColumnIDs() for _, colID := range familyDesc.ColumnIDs { col, err := tableDesc.FindColumnWithID(colID)
Nit: FindColumnWithID
is linear in the number of columns, so this method is O(n^2). Probably better to build another column set from tableDesc.PublicColumnIDs, intersect it with familyDesc.ColumnIDs, and then union the result with the key column ids.
fb45570
to
9d4d5c5
Compare
Previously, a rowfetcher would be configured to fetch all public columns in a table, including virtual columns. Then, if we only wanted to read columns belonging to a particular family, we would index into this row. This change updates row fetchers to only fetch relevant data, including only the primary key columns and columns in a specified family. This change also updates the row fetcher such that it does not fetch virtual columns. Instead, they are added on by the event descriptor. If the descriptor specifies a column outside of the physical row, the row iterator will fill in a null value upon iteration, representing a virtual column. Release note (enterprise change): When a changefeed is run with the option virtual_columns = "null", the virtual column will be ordered last in each row.
9d4d5c5
to
8029ba7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 edit: sorry I accidentally messed up my code quotes in reviewable.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB)
pkg/ccl/changefeedccl/cdcevent/event.go
line 663 at r1 (raw file):
Previously, HonoreDB (Aaron Zinger) wrote…
Nit:
FindColumnWithID
is linear in the number of columns, so this method is O(n^2). Probably better to build another column set from tableDesc.PublicColumnIDs, intersect it with familyDesc.ColumnIDs, and then union the result with the key column ids.
I considered doing these set operations, but I realized it sometimes returns the wrong result.
Firstly, during a schema change, it's possible for familyDesc
to have a column which is not public. Since the familyDesc does not have information about public columns, you have to check the tableDesc.
Secondly, when you change the type of a column (ie. change its position), calling tableDesc.PublicColumnIDs
still keeps the column in its original position (although with a different ID, sorting by PGAttributeNum internally). If we use tableColSet.Ordered
, it will sort by columnID.
To preserve correctness during schema changes and also the behavior of what happens in master (which uses tableDesc.PublicColumnIDs
), I think what I have now is the best way to go forward.
pkg/ccl/changefeedccl/cdcevent/event.go
line 133 at r2 (raw file):
// forEachColumn is a helper which invokes fn for reach column in the ordColumn list. func (r Row) forEachDatum(fn DatumFn, colIndexes []int) error {
forEachDatum
and NewEventDescriptor
now use virtualColOrd
to put virtual cols in the correct spot instead of sticking them on the end of the row. This is closer to the behavior of master.
Code quote:
forEachDatum
pkg/ccl/changefeedccl/cdcevent/event_test.go
line 393 at r2 (raw file):
"INSERT INTO foo (i,j,a,b) VALUES (1,2,'a1','b1')", }, expectMainFamily: []decodeExpectation{
This behavior aligns with master. My previous code did not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
TYFR! |
bors r+ |
Build failed (retrying...): |
Build succeeded: |
Previously, a rowfetcher would be configured to fetch all public columns in a
table, including virtual columns. Then, if we only wanted to read columns
belonging to a particular family, we would index into this row. This change
updates row fetchers to only fetch relevant data, including only the primary
key columns and columns in a specified family.
This change also updates the row fetcher such that it does not fetch
virtual columns. Instead, they are added on by the event descriptor.
If the descriptor specifies a column outside of the physical row,
the row iterator will fill in a null value upon iteration, representing
a virtual column.
Release note (enterprise change): When a changefeed is run with the option
virtual_columns = "null", the virtual column will be ordered last in each
row.
Informs: #80976
Epic: none