-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
workload: fix --splits regression introduced in #35349 #37401
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I'd be straight-up lying if I claimed that I really knew what was going on here. I assume you'll be removing the hack not too far down the road, looking forward to more details on this then.
No plans to remove the hack. It would introduce quite a bit of complexity that we don't need until someone tries to use a BYTES column in a primary key of some workload. |
What will happen when someone does use a bytes column? Will it fail in a hard-to-diagnose way? I'm fine punting this work, but it'd be good to know that the failure mode will be somewhat actionable (will they be told they're in unchartered territory, or at least be able to find your message, basically) If you have the time, add a bit of color to the commit message before you merge because there's very little context here -- I know something fails around splits, I assume it's because a string portion of a primary key got printed as |
workload's Table schemas are SQL schemas, but cockroachdb#35349 switched the initial data to be returned as a coldata.Batch, which has a more limited set of types. (Or, in the case of simple workloads that return a []interface{}, it's roundtripped through coldata.Batch by the `Tuples` helper.) Notably, this means a SQL STRING column is represented the same as a BYTES column (ditto UUID, etc). This caused a regression in splits, which received some []byte data for a column tried to hand it to SPLIT as a SQL BYTES datum. This didn't work for the UUID column in tpcc's history table nor the VARCHAR in ycsb's usertable. Happily, a STRING works for both of these. It also seems to work for BYTES columns, so it seems like the ambiguity is fine in this case. When/if someone wants to add a workload that splits a BYTES primary key column containing non-utf8 data, we'll may need to revisit. A more principled fix would be to get the fidelity back by parsing the SQL schema, which in fact we do in `importccl.makeDatumFromColOffset`. However, at the moment, this hack works and avoids the complexity and the undesirable pkg/sql/parser dep. Closes cockroachdb#37383 Closes cockroachdb#37382 Closes cockroachdb#37381 Closes cockroachdb#37380 Closes cockroachdb#37379 Closes cockroachdb#37378 Closes cockroachdb#37377 Closes cockroachdb#37393 Release note: None
Thanks for pushing me on this, I just took another swing at it that I think is better. RFAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, appreciate the second pass!
TFTR! bors r=tbg |
37401: workload: fix --splits regression introduced in #35349 r=tbg a=danhhz workload's Table schemas are SQL schemas, but #35349 switched the initial data to be returned as a coldata.Batch, which has a more limited set of types. (Or, in the case of simple workloads that return a []interface{}, it's roundtripped through coldata.Batch by the `Tuples` helper.) Notably, this means a SQL STRING column is represented the same as a BYTES column (ditto UUID, etc). This caused a regression in splits, which received some []byte data for a column tried to hand it to SPLIT as a SQL BYTES datum. This didn't work for the UUID column in tpcc's history table nor the VARCHAR in ycsb's usertable. Happily, a STRING works for both of these. It also seems to work for BYTES columns, so it seems like the ambiguity is fine in this case. When/if someone wants to add a workload that splits a BYTES primary key column containing non-utf8 data, we'll may need to revisit. A more principled fix would be to get the fidelity back by parsing the SQL schema, which in fact we do in `importccl.makeDatumFromColOffset`. However, at the moment, this hack works and avoids the complexity and the undesirable pkg/sql/parser dep. Closes #37383 Closes #37382 Closes #37381 Closes #37380 Closes #37379 Closes #37378 Closes #37377 Closes #37393 Release note: None Co-authored-by: Daniel Harrison <[email protected]>
Build succeeded |
workload's Table schemas are SQL schemas, but #35349 switched the
initial data to be returned as a coldata.Batch, which has a more limited
set of types. (Or, in the case of simple workloads that return a
[]interface{}, it's roundtripped through coldata.Batch by the
Tuples
helper.) Notably, this means a SQL STRING column is represented the same
as a BYTES column (ditto UUID, etc).
This caused a regression in splits, which received some []byte data for
a column tried to hand it to SPLIT as a SQL BYTES datum. This didn't
work for the UUID column in tpcc's history table nor the VARCHAR in
ycsb's usertable. Happily, a STRING works for both of these. It also
seems to work for BYTES columns, so it seems like the ambiguity is fine
in this case. When/if someone wants to add a workload that splits a
BYTES primary key column containing non-utf8 data, we'll may need to
revisit.
A more principled fix would be to get the fidelity back by parsing the
SQL schema, which in fact we do in
importccl.makeDatumFromColOffset
.However, at the moment, this hack works and avoids the complexity and
the undesirable pkg/sql/parser dep.
Closes #37383
Closes #37382
Closes #37381
Closes #37380
Closes #37379
Closes #37378
Closes #37377
Closes #37393
Release note: None