Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importer: add ImportEpoch table descriptor field #85692

Closed

Conversation

msbutler
Copy link
Collaborator

@msbutler msbutler commented Aug 5, 2022

This patch introduces the new ImportEpoch table descriptor field which
provides a count for the number of IMPORT INTO jobs that have been attempted on
the table after it already had data, including any in-progress jobs.

This work is apart of a larger project to roll back imports without MVCC
timestamps. In a future pr (#85138), this epoch will get written to the
imported MVCCValueHeader metadata, which in turn allows import rollbacks to
use the epoch to determine which keys to delete in the table.

We considered binding the import job id to the table descriptor, and in turn
to each imported KV's metadata, however, the expected proto encoded size of a
job ID was between 7 - 12 bytes, which seemed uncessary. Meanwhile the 4 byte
uint32 epoch will encode to 1 byte for the first 256 IMPORT INTO jobs on the
table. 2,147,483,648 IMPORT INTO jobs must occur on the table before we worry
about integer overflow -- a problem for a future release :D.

Informs #76722

Release note: none

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@msbutler
Copy link
Collaborator Author

msbutler commented Aug 5, 2022

I realize the unit test gets called within the bulk data driven driver. Once I address the failures in #85637, I'll move the unit test to pkg/ccl/importccl.

@msbutler msbutler force-pushed the butler-import-descriptor-epoch branch from a2186c3 to 786692e Compare August 6, 2022 17:30
@msbutler msbutler force-pushed the butler-import-descriptor-epoch branch 2 times, most recently from 7b637fa to 7ca826c Compare August 8, 2022 13:38
@msbutler msbutler marked this pull request as ready for review August 8, 2022 13:38
@msbutler msbutler requested review from a team, stevendanna, dt and erikgrinaker and removed request for stevendanna August 8, 2022 13:38
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good at a high level, but this is sufficiently far away from KV that someone more familiar with it should review this too.

I'm assuming this will piggyback on the version gate in #85138, so there are no further cross-version concerns here?

@msbutler msbutler marked this pull request as draft August 8, 2022 16:13
@msbutler
Copy link
Collaborator Author

msbutler commented Aug 8, 2022

Note: i think I need to rework this PR such that two import epoch's are stored in the table descriptor. Will update later today.

@dt
Copy link
Member

dt commented Aug 8, 2022

NextImportEpoch -- counter that increments every time to made sure epochs are unique
CurrentImportEpoch -- value of the epoch currently being used to import, or zero if not currently importing (i.e. only ever non-zero when table is offline and was taken from online to offline because it existed prior to import)

@msbutler msbutler force-pushed the butler-import-descriptor-epoch branch from 7ca826c to 0813fbc Compare August 8, 2022 18:47
@msbutler msbutler marked this pull request as ready for review August 8, 2022 18:48
@msbutler msbutler requested a review from a team as a code owner August 8, 2022 18:48
@msbutler
Copy link
Collaborator Author

msbutler commented Aug 8, 2022

RFAL

  • added the cluster version to this PR
  • @dt I didn't quite go with your suggestion, but my approach achieves the same purpose.

@msbutler msbutler force-pushed the butler-import-descriptor-epoch branch from 0813fbc to bb6705f Compare August 8, 2022 21:22
…ields

This patch introduces the new ImportEpoch table descriptor field which
provides a count for the number of IMPORT jobs that have been attempted on the
table and the ImportTypeInProgress enum which indicates the kind of Import
occurring.

This work is apart of a larger project to roll back imports without MVCC
timestamps. In a future pr (cockroachdb#85138), this epoch will get written to the
imported MVCCValueHeader metadata, which in turn allows import rollbacks to
use the epoch to determine which keys to delete in the table.

We considered binding the import job id to the table descriptor, and in turn
to each imported KV's metadata, however, the expected proto encoded size of a
job ID was between 7 - 12 bytes, which seemed uncessary. Meanwhile the 4 byte
uint32 epoch will encode to 1 byte for the first 256 IMPORT jobs on the
table. 2,147,483,648 IMPORT jobs must occur on the table before we worry
about integer overflow -- a problem for a future release :D.

Informs cockroachdb#85138

Release note: none
@msbutler msbutler force-pushed the butler-import-descriptor-epoch branch from bb6705f to 98afb06 Compare August 8, 2022 21:27
@msbutler
Copy link
Collaborator Author

msbutler commented Aug 8, 2022

sorry for the noise -- RFAL!

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll still defer to Bulk/Schema here, but this seems fine as far as I can tell.

I'm assuming we already have some sort of mutual exclusion protocol here to prevent two import jobs from clobbering each others import state?

@@ -1180,7 +1180,23 @@ message TableDescriptor {
// this table, in which case the global setting is used.
optional bool forecast_stats = 52 [(gogoproto.nullable) = true, (gogoproto.customname) = "ForecastStats"];

// Next ID: 54
// ImportIntoEpoch is the count of IMPORT INTO jobs that have been attempted on this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ImportIntoEpoch is the count of IMPORT INTO jobs that have been attempted on this
// ImportEpoch is the count of IMPORT INTO jobs that have been attempted on this

// field gets incremented while preparing the table for ingestion.
optional uint32 import_epoch = 54 [(gogoproto.nullable) = false, (gogoproto.customname) = "ImportEpoch"];

// ImportType indicates the kind of import currently occurring on the table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could explain the differences. Why is IMPORT a separate thing?

@msbutler
Copy link
Collaborator Author

msbutler commented Aug 9, 2022

closing in favor of #85852

For 22.2, we decided to continue rolling back imports with timestamps, so the ImportEpoch field is no longer necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants