
sql: wait to start index GC job until schema changer is done #46929

Merged (1 commit, Apr 16, 2020)

Conversation

thoszhang (Contributor)

Previously, we were creating and starting the index GC job in a separate
transaction from the one used in PublishMultiple() in
SchemaChanger.done(), meaning that the index GC job could be run
before the original transaction to finalize all schema changes on the
table descriptor had been committed. This led to unpredictable behavior
while testing. It could also potentially cause multiple GC jobs to be
created if the PublishMultiple() closure were retried.

In this PR, the GC job is now created as a StartableJob that is
started only after the table descriptor in the finalized state has been
published.

Release justification: Bug fix for new feature.

Release note (bug fix): Fixed a bug introduced in beta.3 that could
cause multiple index GC jobs to be created for the same schema change in
rare cases.
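The ordering described above can be sketched generically. The snippet below is a hypothetical stand-in, not the actual CockroachDB jobs API; it only illustrates the fix's core idea: the job record is created inside the publishing transaction, and the job is started only after that transaction commits.

```go
package main

import "fmt"

// StartableJob is a hypothetical stand-in for CockroachDB's StartableJob:
// its record is written inside a transaction, but it only runs once Start
// is called after that transaction commits.
type StartableJob struct {
	ID      int
	Started bool
}

func (j *StartableJob) Start() { j.Started = true }

// publishThenStart sketches the corrected ordering: the txn closure
// creates the job record together with the descriptor update, and the
// job is started only after the transaction has committed successfully.
func publishThenStart(runTxn func(create func(id int) *StartableJob) (*StartableJob, error)) (*StartableJob, error) {
	job, err := runTxn(func(id int) *StartableJob {
		return &StartableJob{ID: id} // created, but not started, in the txn
	})
	if err != nil {
		return nil, err // txn failed: the job record was never committed
	}
	job.Start() // safe: the finalized descriptor is already published
	return job, nil
}

func main() {
	job, err := publishThenStart(func(create func(int) *StartableJob) (*StartableJob, error) {
		return create(42), nil // simulate a committed publish transaction
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("job %d started: %v\n", job.ID, job.Started)
}
```

Starting the job outside the transaction is what prevents the GC job from racing ahead of, or being duplicated by, a retried publish closure.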

@thoszhang thoszhang requested review from pbardea and ajwerner April 2, 2020 16:25
@cockroach-teamcity (Member)

This change is Reviewable

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @lucy-zhang, and @pbardea)


pkg/sql/schema_changer.go, line 981 at r1 (raw file):

		)
	})
	if err != nil {

can you add:

```go
if indexGCJob != nil {
    if rollbackErr := indexGCJob.CleanupOnRollback(ctx); rollbackErr != nil {
        log.Warningf(ctx, "failed to cleanup job: %v", rollbackErr)
    }
}
```

pkg/sql/schema_changer.go, line 984 at r1 (raw file):

		return nil, err
	}
	// TODO (lucy): Can we do the same thing with the PK drop index job?

Same thing meaning...? I get that this may make sense in the context of this PR, but in general this TODO isn't adequately detailed.

@thoszhang thoszhang left a comment

I screwed up and treated the index GC job as a single job, when in fact there can be multiple indexes per transaction to be dropped. Now it's a slice of jobs.
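With several indexes dropped per transaction, the single job becomes a slice, and the start step loops over it. A minimal sketch with hypothetical types (not the real schema changer code):

```go
package main

import "fmt"

// gcJob is a hypothetical stand-in for one index GC StartableJob.
type gcJob struct {
	index   string
	started bool
}

func (j *gcJob) Start() { j.started = true }

// startAll starts every pending GC job once the publish transaction has
// committed; because multiple indexes can be dropped in one transaction,
// the caller carries a slice of jobs rather than a single job.
func startAll(jobs []*gcJob) {
	for _, j := range jobs {
		j.Start()
	}
}

func main() {
	jobs := []*gcJob{{index: "idx_a"}, {index: "idx_b"}}
	startAll(jobs)
	for _, j := range jobs {
		fmt.Printf("%s started: %v\n", j.index, j.started)
	}
}
```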

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @pbardea)


pkg/sql/schema_changer.go, line 981 at r1 (raw file):

Previously, ajwerner wrote…

can you add:

```go
if indexGCJob != nil {
    if rollbackErr := indexGCJob.CleanupOnRollback(ctx); rollbackErr != nil {
        log.Warningf(ctx, "failed to cleanup job: %v", rollbackErr)
    }
}
```

Done. Unfortunately we still won't run the cleanup if Publish() has to retry, but I guess this is a best-effort thing anyway.


pkg/sql/schema_changer.go, line 984 at r1 (raw file):

Previously, ajwerner wrote…

same thing meaning... I get that this may make sense in the context of this PR but in general this TODO isn't adequately detailed.

Done. (I moved it to where we actually create that index drop job.)

@ajwerner ajwerner left a comment

:lgtm:

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pbardea)


pkg/sql/schema_changer.go, line 981 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Done. Unfortunately we still won't run the cleanup if Publish() has to retry, but I guess this is a best-effort thing anyway.

Yeah... about that... I've had that on my TODO list for a bit now.

The non-idempotent part of all of this is the fact that we create a new job id every time. If we created the ID above the call to create then it'd all be good. I'll go refactor this eventually. Until then it's potentially a very slow leak that does make me sad.

Or in this case, I think that would look like having a slice of IDs which we allocate on the first pass through and then reuse on retries. That's a bit nasty, but I don't have a better answer.
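A rough sketch of that retry-safe allocation (hypothetical helper names, not the real jobs registry): the IDs are allocated on the first pass through the closure and reused verbatim on every retry, so a retried closure never mints fresh job IDs.

```go
package main

import "fmt"

// lastID emulates a monotonically increasing job ID allocator; in the
// real system IDs would come from the jobs registry.
var lastID = 1000

func nextID() int {
	lastID++
	return lastID
}

// ensureJobIDs returns the previously allocated IDs on retries, and only
// allocates fresh IDs on the first pass. Reusing the slice keeps the
// retried closure idempotent with respect to job creation.
func ensureJobIDs(ids []int, n int) []int {
	if ids != nil {
		return ids // retry path: reuse the first-pass IDs
	}
	ids = make([]int, n)
	for i := range ids {
		ids[i] = nextID()
	}
	return ids
}

func main() {
	var ids []int
	for attempt := 1; attempt <= 3; attempt++ { // simulate Publish() retries
		ids = ensureJobIDs(ids, 2)
		fmt.Println("attempt", attempt, "ids:", ids)
	}
}
```

With stable IDs, a retry that re-creates the job records overwrites the same rows instead of leaking orphaned jobs.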

@pbardea pbardea left a comment

:lgtm:

I left a comment, but if you think it would be best to fix that in a separate PR I'm happy with that too.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @lucy-zhang)


pkg/sql/schema_changer.go, line 1415 at r2 (raw file):

			descriptorIDs = append(descriptorIDs, table.ID)
		}
	}

I think we may have a missing case here to properly handle the empty database case. Something like:

```go
} else if details.ParentID != sqlbase.InvalidID {
	descriptorIDs = []sqlbase.ID{details.ParentID}
}
```

so that jobs dropping empty databases aren't cleaned up accidentally.

Also happy to dig into the drop empty database case in a separate PR.

@thoszhang thoszhang closed this Apr 15, 2020
@thoszhang thoszhang deleted the index-gc-job branch April 15, 2020 01:22
@thoszhang thoszhang restored the index-gc-job branch April 15, 2020 02:23
@thoszhang thoszhang reopened this Apr 15, 2020
@thoszhang thoszhang force-pushed the index-gc-job branch 2 times, most recently from 9d9e89b to e27c69b Compare April 15, 2020 02:40
Release note (bug fix): Fixed a bug introduced in 20.1 that could cause
multiple index GC jobs to be created for the same schema change in rare
cases.
@thoszhang thoszhang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajwerner and @pbardea)


pkg/sql/schema_changer.go, line 1415 at r2 (raw file):

Previously, pbardea (Paul Bardea) wrote…

I think we may have a missing case here to properly handle the empty database case. Something like:

```go
} else if details.ParentID != sqlbase.InvalidID {
	descriptorIDs = []sqlbase.ID{details.ParentID}
}
```

so that jobs dropping empty databases aren't cleaned up accidentally.

Also happy to dig into the drop empty database case in a separate PR.

Dropping an empty database doesn't cause a GC job to be created (either in 19.2 or now; in 19.2, there's no job at all).

I think an earlier iteration of this PR might have accidentally changed this (or maybe that was on my local branch, I don't remember). But in any case I've now added a test to ensure there's no GC job for an empty database.

blathers-crl bot commented Apr 15, 2020

❌ The GitHub CI (Cockroach) build has failed on f8582f1e.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@thoszhang (Contributor, Author)

Hmm. I managed to also repro the TestProtectedTimestampsDuringBackup CI flake on master, on my laptop, but the results were a bit weird: in one attempt it took >20k runs over 6 hours (I left it running overnight) and in another attempt my computer went to sleep in the middle, which might have affected the timeouts or something. I'm going to try again on roachprod-stress and open an issue once I see what happens in a more controlled environment. (This failure is different from #45932.)

@thoszhang (Contributor, Author)

Filed #47522 for the TestProtectedTimestampsDuringBackup flake (and #47532, which I encountered in testing), merging now.

bors r+

@craig (Contributor)

craig bot commented Apr 15, 2020

Build failed

@thoszhang (Contributor, Author)

The flake was #47546.

bors r+

@craig (Contributor)

craig bot commented Apr 16, 2020

Build failed

@thoszhang (Contributor, Author)

This TestScrubFKConstraintFKNulls flake is persistent. I'll revisit this later.

@thoszhang (Contributor, Author)

bors r+

@thoszhang (Contributor, Author)

???

bors r+

@craig (Contributor)

craig bot commented Apr 16, 2020

Build succeeded
