importer/kv: add ImportEpoch field to MVCCValueHeader and write to it during IMPORT INTO #85138
Conversation
Force-pushed 5d48309 to dd1843b.
This is ready for a look!
Looks good at a high level. Wonder if any of the @cockroachdb/storage peeps or @nvanbenschoten have opinions on Also, we need to coordinate this with @gh-casper's work on tenant streaming. I believe the tenant streaming will also strip out the
Might make sense with a couple of commits: one to plumb through
Only for non-empty headers though, right? So in the common case this won't add on anything at all?
Force-pushed dd1843b to f941a9e.
@erikgrinaker Responding to a few points you made:
Done.
Done.
This prints out:
Did I miss something in how I defined the proto message? Does it have something to do with this?
Ah, I was thinking more about this bit here, which just omits the entire header if it doesn't contain anything: So that seems fine.
I think you're missing the encoding of the embedded messages. Let's have a look at the encoding here:

```go
MVCCValueHeader{
	LocalTimestamp: hlc.ClockTimestamp{WallTime: 9},
}
```

For reference, the Protobuf schemas are:

```protobuf
message MVCCValueHeader {
  util.hlc.Timestamp local_timestamp = 1;
  ClientMeta client_meta = 2;
}

message ClientMeta {
  int64 import_job_id = 1;
}

message Timestamp {
  int64 wall_time = 1;
  int32 logical = 2;
  bool synthetic = 3;
}
```

The encoded value before we added the new field:
Ok, so far so good. We see that it omitted fields 2 and 3 of Timestamp. Now let's add an empty ClientMeta:
Ok, so the first 4 bytes are the same as before. The 2 added bytes are:
Ok, so that makes sense -- it's saying that there's an embedded message of zero length. But in any case, this isn't terrible, because it's still omitting the actual message contents. So even if we're paying a fixed cost of 2 bytes for the message, at least we can keep adding on fields to ClientMeta. FWIW, we can also see this with the
wow, thanks for really digging into this! I learned a bunch from this write up. My takeaways are:
Force-pushed f941a9e to 38ee242.
Would be good to specify the overhead here, specifically the typical number of bytes added per key, as well as the per-value encoding CPU/latency overhead.
Force-pushed 38ee242 to a391869.
I think I addressed everything your latest round of feedback except for reporting benchmark results. Should I report those in the commit message or in the docstring? As expected, the latency percent increase for Encode/DecodeMVCC value varies depending on the amount of data already in the MVCC value. When I benchmarked EncodedMVCCValue for example, a header with a JobId was 22% slower than an empty header with a short
encoded MVCCValue.value, and 4% slower with a long
encoded MVCCValue.value.
Force-pushed a391869 to 15147b2.
When making changes like these, it's usually nice to explain the cost in the commit message. What the reader/reviewer really wants to know is how expensive this is going to be. How much additional data are we storing? Are imports now 1% slower? 10% slower? Not really slower at all? Numbers like these help paint a picture of the cost. 22% slower encoding might matter, but how much of the per-key processing cost does the encoding make up? The absolute value per encoding might be helpful to understand which order of magnitude we're talking about (i.e. nanoseconds vs milliseconds), although I'm guessing we're talking O(100 ns) per key here.
Force-pushed 15147b2 to ac85d19.
Given yesterday's meeting, here's the updated design:
Eh, I say merge this one as-is now and add follow-ups to follow-ups. It's easier to review that way and we'll start getting some nightly runs.
Looks great! Could you run this through the mixed-version roachtest before merging?
```go
"jobID":                  {ImportEpoch: 3},
"local walltime":         {LocalTimestamp: hlc.ClockTimestamp{WallTime: 1643550788737652545}},
"local walltime+logical": {LocalTimestamp: hlc.ClockTimestamp{WallTime: 1643550788737652545, Logical: 4096}},
"omit in rangefeeds":     {OmitInRangefeeds: true},
```
could you add the bench results to the commit msg?
Force-pushed f7e396a to b8ae5f0.
LGTM! I can't approve because of github shenanigans.
Sorry, still making my way through the queue, will get to this in the morning.
Force-pushed b8ae5f0 to 1ec8a18.
LGTM once the minor comments below are addressed. I've only superficially reviewed this, particularly the DR parts.
Reviewed 4 of 19 files at r6, 7 of 8 files at r7, 2 of 11 files at r8, 15 of 15 files at r17, 10 of 10 files at r18, 3 of 3 files at r19, 7 of 7 files at r20, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @msbutler)
pkg/ccl/backupccl/backup_processor.go
line 485 at r20 (raw file):
```go
for _, span := range spans {
	for len(span.span.Key) != 0 {
		includeMVCCValueHeader := clusterSettings.Version.IsActive(ctx, clusterversion.V24_1)
```
Is it ok to flip this in the middle of a backup? Can we get into weird states where part of the backup has value headers but the other part doesn't? Can we race with an import starting to write import epochs, but the backup processor not seeing them for part of the exports?
Actually, why do we even need to version gate this? A 23.2 node will ignore the field, but also won't ever export any non-empty value headers. A 24.1 node may have non-empty value headers after the 24.1 gate has switched and an import has started writing them, but at that point all binaries are prepared to handle them anyway. So I think we can just set this unconditionally, and avoid any races altogether?
pkg/ccl/backupccl/restore_data_processor.go
line 591 at r19 (raw file):
```go
// ... InitChecksum don't copy/reallocate the slice they were given.
if err := batcher.AddMVCCKey(ctx, key, valueScratch); err != nil {
```
It's not obvious at all here that the checksum rewriting above modifies valueScratch; it's worth spelling this out.
pkg/kv/kvserver/kvserverbase/bulk_adder.go
line 77 at r18 (raw file):
```go
// ... Adder's SSTBatcher will write the import epoch to each versioned
// value's metadata.
ImportEpoch uint32
```
Is it worth calling out here that this must only be used with 24.1 clusters, since it's the caller's responsibility to enforce the version gate?
pkg/storage/mvcc_value.go
line 144 at r17 (raw file):
```go
// Consider a fast path, where only the roachpb.Value gets exported.
// Currently, this only occurs if the value was not imported.
if mvccValue.ImportEpoch == 0 {
```
This seems a bit brittle, in that we could accidentally omit fields that are added later. Should we invert the fast path such that we first strip fields (i.e. the LocalTimestamp) and then compare the stripped value header with MVCCValueHeader{} to omit the encode?
```go
func EncodeMVCCValueForExport(mvccValue MVCCValue) ([]byte, error) {
	// Consider a fast path, where only the roachpb.Value gets exported.
	// Currently, this only occurs if the value was not imported.
	if mvccValue.ImportEpoch == 0 {
```
It is weird to me that we get the whole header or none of it just depending on this one field?
If we're going to condition on this field, it sorta feels like this should be the only field you get, or, more generally, you should only get fields that are in the condition?
Thanks for pointing this out. I think when this was written, ImportEpoch and Timestamp did represent the entire value header. It looks like EncodeMVCCValue already has a special case for the empty case. So I think we can just have this function zero out fields not appropriate for export and then pass it directly to EncodeMVCCValue.
Edit: Although, perhaps for now the blast radius is smaller if we really do just give you back something with the ImportEpoch so we don't find out what happens if we start importing keys with OmitInRangefeeds set.
We now only export fields in the condition. I've left the condition in place because the benchmark shows it really does help in the most common case.
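For illustration, the strip-then-compare inversion discussed in this thread could look roughly like the sketch below. The types are minimal stand-ins, not the real CockroachDB ones, and the non-fast-path branch is a placeholder rather than the actual header encoding:

```go
package main

import "fmt"

// Minimal stand-ins for the types under discussion (hypothetical shapes).
type ClockTimestamp struct{ WallTime int64 }

type MVCCValueHeader struct {
	LocalTimestamp ClockTimestamp
	ImportEpoch    uint32
}

type MVCCValue struct {
	Header   MVCCValueHeader
	RawBytes []byte
}

// encodeForExport strips fields that must never leave the cluster, then
// compares the remainder against the zero header to decide whether any
// header needs encoding. Fields added later are exported by default
// instead of being silently dropped.
func encodeForExport(v MVCCValue) []byte {
	v.Header.LocalTimestamp = ClockTimestamp{} // never exported
	if v.Header == (MVCCValueHeader{}) {
		return v.RawBytes // fast path: a plain roachpb.Value, no header
	}
	// Placeholder for the real header+value encoding.
	return append([]byte{byte(v.Header.ImportEpoch)}, v.RawBytes...)
}

func main() {
	local := MVCCValue{Header: MVCCValueHeader{LocalTimestamp: ClockTimestamp{WallTime: 9}}, RawBytes: []byte("v")}
	imported := MVCCValue{Header: MVCCValueHeader{ImportEpoch: 3}, RawBytes: []byte("v")}
	fmt.Println(len(encodeForExport(local)), len(encodeForExport(imported))) // 1 2
}
```

The trade-off against the merged single-field condition is future-proofing versus blast radius: the inversion exports every new header field automatically, which is exactly what the PR chose to avoid for now.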
pkg/sql/importer/import_processor.go
Outdated
@@ -389,6 +390,14 @@ func ingestKvs(
```go
	// will hog memory as it tries to grow more aggressively.
	minBufferSize, maxBufferSize := importBufferConfigSizes(flowCtx.Cfg.Settings,
		true /* isPKAdder */)

	var bulkAdderImportEpoch uint32
	if flowCtx.Cfg.Settings.Version.IsActive(ctx, clusterversion.V24_1) && len(spec.Tables) == 1 {
```
The len()==1 makes me nervous.

What about, in the loop body:

```go
if bulkAdderImportEpoch == 0 {
	bulkAdderImportEpoch = v.Desc.ImportEpoch
} else if bulkAdderImportEpoch != v.Desc.ImportEpoch {
	return errors.Assertion...
```
I like it. Done.
I also moved the version check. Now, we only bump the import epoch on 24.1 and then can here just rely on the default value being zero.
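The agreed-on check could be factored out as below. TableDesc and the function name are hypothetical stand-ins for the import spec types, not the actual CockroachDB identifiers:

```go
package main

import (
	"errors"
	"fmt"
)

// TableDesc stands in for the table descriptor carried in the import spec.
type TableDesc struct{ ImportEpoch uint32 }

// singleImportEpoch returns the one ImportEpoch shared by all tables in the
// spec, erroring out if they disagree (the assertion suggested in review).
// A zero epoch simply means no epoch was bumped (e.g. pre-24.1 imports).
func singleImportEpoch(tables []TableDesc) (uint32, error) {
	var epoch uint32
	for _, t := range tables {
		if epoch == 0 {
			epoch = t.ImportEpoch
		} else if epoch != t.ImportEpoch {
			return 0, errors.New("tables in import spec disagree on ImportEpoch")
		}
	}
	return epoch, nil
}

func main() {
	epoch, err := singleImportEpoch([]TableDesc{{ImportEpoch: 3}, {ImportEpoch: 3}})
	fmt.Println(epoch, err)
	_, err = singleImportEpoch([]TableDesc{{ImportEpoch: 3}, {ImportEpoch: 4}})
	fmt.Println(err != nil) // true
}
```

Relying on the default zero value, as the final revision does, means the same code path works whether or not the 24.1 gate bumped the epoch.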
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @michae2, and @msbutler)
pkg/ccl/backupccl/backup_processor.go
line 485 at r20 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is it ok to flip this in the middle of a backup? Can we get into weird states where part of the backup has value headers but the other part doesn't? Can we race with an import starting to write import epochs, but the backup processor not seeing them for part of the exports?
Actually, why do we even need to version gate this? A 23.2 node will ignore the field, but also won't ever export any non-empty value headers. A 24.1 node may have non-empty value headers after the 24.1 gate has switched and an import has started writing them, but at that point all binaries are prepared to handle them anyway. So I think we can just set this unconditionally, and avoid any races altogether?
I think we do need the version check here, but you are right that we don't want it changing through the life of the backup. I'll add some state to our spec.
We need the version because if our version is 23.2 we expect to be able to restore the artifact on 23.2 nodes, but 23.2 nodes won't be able to read our SSTs.
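Pinning the decision in the spec, as described here, might look like this sketch (the names are assumed, not the actual backup processor types):

```go
package main

import "fmt"

// BackupDataSpec stands in for the processor spec; the planner records the
// header decision once so it cannot flip mid-backup.
type BackupDataSpec struct {
	IncludeMVCCValueHeader bool
}

// planBackup evaluates the version gate exactly once, at planning time.
func planBackup(gateActive func() bool) BackupDataSpec {
	return BackupDataSpec{IncludeMVCCValueHeader: gateActive()}
}

// exportSpan uses the pinned decision rather than re-reading the gate,
// so every span in the backup is exported consistently.
func exportSpan(spec BackupDataSpec, span string) string {
	if spec.IncludeMVCCValueHeader {
		return span + ": export with MVCCValueHeader"
	}
	return span + ": export stripped"
}

func main() {
	active := false
	spec := planBackup(func() bool { return active })
	active = true // the gate flipping mid-backup no longer affects this backup
	fmt.Println(exportSpan(spec, "a-b")) // a-b: export stripped
}
```

This removes the race Erik described: the gate can activate while spans are still being exported without producing a backup where only some SSTs carry value headers.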
Force-pushed 1ec8a18 to ae9a188.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @michae2, and @msbutler)
pkg/ccl/backupccl/restore_data_processor.go
line 591 at r19 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
It's not obvious at all here that the checksum rewriting above modifies valueScratch; it's worth spelling this out.
Added a comment.
pkg/storage/mvcc_value.go
line 144 at r17 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This seems a bit brittle, in that we could accidentally omit fields that are added later. Should we invert the fast path such that we first strip fields (i.e. the LocalTimestamp) and then compare the stripped value header with MVCCValueHeader{} to omit the encode?
I keep going back and forth on this. I've changed the code so that you only get this single field exported. But I do see the potential value in exporting everything.
Force-pushed ae9a188 to 0fa4f2e.
Reviewed 8 of 9 files at r21, 19 of 19 files at r23, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, @miraradeva, and @msbutler)
pkg/ccl/backupccl/backup_processor.go
line 485 at r20 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I think we do need the version check here, but you are right that we don't want it changing through the life of the backup. I'll add some state to our spec.
We need the version because if our version is 23.2 we expect to be able to restore the artifact on 23.2 nodes, but 23.2 nodes won't be able to read our SSTs.
Right, makes sense.
pkg/storage/mvcc_value.go
line 144 at r17 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I keep going back and forth on this. I've changed the code so that you only get this single field exported. But I do see the potential value in exporting everything.
Ok. What about omit_in_rangefeeds, should those not be preserved across a backup/restore? I haven't been involved with that work, so I don't know what semantics we want.
cc @miraradeva
David and I discussed this in chat and for now I think we'll exclude it. We can perhaps revisit in a follow-up later this week.
bors r=dt,erikgrinaker

bors cancel

Canceled.
This patch adds an ImportEpoch field to an MVCCValue's MVCCValueHeader, which allows KV clients (namely the sst_batcher in an IMPORT INTO) to write the importing table's ImportEpoch to the metadata of each ingested MVCCValue.

Unlike the MVCCValueHeader.LocalTimestamp field, the ImportEpoch field should be exported to other clusters (e.g. via ExportRequests from BACKUP/RESTORE and streaming). Consequently, this PR relaxes the invariant that the MVCCValueHeader field must be stripped in an ExportRequest and must be empty in an AddSSTable request. Now, ExportRequest emits MVCCValueHeaders with ImportEpoch set if it was set in the original value, and AddSSTable only requires the LocalTimestamp to be empty.

Epic: none
Release note: none

Co-authored-by: Steven Danna <[email protected]>
This patch makes IMPORT INTO on a non-empty table write the table's ImportEpoch to each ingested MVCC value, via the SSTBatcher. In a future PR, the ImportEpoch will be used to roll back an IMPORT INTO in some cases. This additional information will allow IMPORTing tables to be backed up and restored.

As part of this change we now also assume we might see an MVCCValueHeader during restore.

Version Gating

Previously, callers could (and did) assume that the values present in the SSTs returned by ExportRequest could be interpreted directly as roachpb.Value objects, using code like:

```go
roachpb.Value{RawBytes: valBytes}
```

For MVCCValueHeaders to be exported by ExportRequest, all callers need to be updated:

1. ExportRequest on system.descriptors in sql/catalog/lease
2. ExportRequest on system.descriptors in ccl/changefeedccl/schemafeed
3. ExportRequest used by `FINGERPRINT`
4. ExportRequest used by old binaries in a mixed-version cluster.

(1) and (2) will be easy to update and likely don't matter in practice, as those tables do not include values with exportable value headers at the moment.

(3) will be easy to update, but we still need an option to exclude value headers (a) until value headers are included in rangefeeds and (b) so long as we want to compare fingerprints with 23.2 versions.

(4) is impossible to update, so if we want BACKUP/RESTORE to round-trip in a mixed-version cluster we must version gate including them in backups until the cluster is on a single version.

To account for this, we only increment ImportEpoch during IMPORTs that start on 24.1 or greater, and we only request MVCCValueHeaders on BACKUPs that start on 24.1 or greater. The first condition is important to ensure that we can later detect a table that can be fully rolled back using the new rollback method.

Note that this also marks a hard backward incompatibility for backup artifacts: backups from 24.1 cannot be restored on 23.2 or older. This was already the case by policy. 23.2 backups should still work fine on 24.1, since all roachpb.Values should properly decode as MVCCValues.

Informs cockroachdb#76722

Release note: None

Co-authored-by: Steven Danna <[email protected]>
Force-pushed 0fa4f2e to 2d4b601.
bors r=dt,erikgrinaker
Build succeeded.
storage: add ImportEpoch field to MVCCValueHeader
This patch adds the ImportEpoch field to an MVCCValue's MVCCValueHeader,
which allows kv clients (namely the sst_batcher in an IMPORT INTO) to write
the importing table's ImportEpoch to the metadata of each ingesting MVCCValue.
Unlike the MVCCValueHeader.LocalTimestamp field, the ImportEpoch field should
be exported to other clusters (e.g. via ExportRequests from BACKUP/RESTORE and
streaming). Consequently, this PR relaxes the invariant that the
MVCCValueHeader field must be stripped in an Export Request and must be empty
in an AddSSTable Request. Now, Export Request only strips the
MVCCValueHeader.LocalTimestamp field and AddSSTable will only require the
LocalTimestamp to be empty.
Release note: none
bulk/kv: write the table's ImportEpoch to each MVCCValue during IMPORT
This patch makes IMPORT INTO on a non-empty table write the table's ImportEpoch
to each ingested MVCC Value, via the SSTBatcher. In a future PR, the
ImportEpoch will be used to track and roll back an IMPORT INTO. This additional
information will allow IMPORTing tables to be backed up and restored, as
described in this RFC.
Informs #76722
Release note: None