
sql: wait to start index GC job until schema changer is done #46929

Merged (1 commit, Apr 16, 2020)

Conversation

thoszhang (Contributor)

Previously, we were creating and starting the index GC job in a separate
transaction from the one used in PublishMultiple() in
SchemaChanger.done(), meaning that the index GC job could be run
before the original transaction to finalize all schema changes on the
table descriptor had been committed. This led to unpredictable behavior
while testing. It could also potentially cause multiple GC jobs to be
created if the PublishMultiple() closure were retried.

In this PR, the GC job is now created as a StartableJob that is
started only after the table descriptor in the finalized state has been
published.

Release justification: Bug fix for new feature.

Release note (bug fix): Fixed a bug introduced in beta.3 that could
cause multiple index GC jobs to be created for the same schema change in
rare cases.
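The ordering described above can be sketched generically. The snippet below is a hypothetical stand-in, not the actual CockroachDB jobs API; it only illustrates the fix's core idea: the job record is created inside the publishing transaction, and the job is started only after that transaction commits.

```go
package main

import "fmt"

// StartableJob is a hypothetical stand-in for CockroachDB's StartableJob:
// its record is written inside a transaction, but it only runs once Start
// is called after that transaction commits.
type StartableJob struct {
	ID      int
	Started bool
}

func (j *StartableJob) Start() { j.Started = true }

// publishThenStart sketches the corrected ordering: the txn closure
// creates the job record together with the descriptor update, and the
// job is started only after the transaction has committed successfully.
func publishThenStart(runTxn func(create func(id int) *StartableJob) (*StartableJob, error)) (*StartableJob, error) {
	job, err := runTxn(func(id int) *StartableJob {
		return &StartableJob{ID: id} // created, but not started, in the txn
	})
	if err != nil {
		return nil, err // txn failed: the job record was never committed
	}
	job.Start() // safe: the finalized descriptor is already published
	return job, nil
}

func main() {
	job, err := publishThenStart(func(create func(int) *StartableJob) (*StartableJob, error) {
		return create(42), nil // simulate a committed publish transaction
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("job %d started: %v\n", job.ID, job.Started)
}
```

Starting the job outside the transaction is what prevents the GC job from racing ahead of, or being duplicated by, a retried publish closure.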

@thoszhang thoszhang requested review from pbardea and ajwerner April 2, 2020 16:25
@cockroach-teamcity (Member)

This change is Reviewable

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @lucy-zhang, and @pbardea)


pkg/sql/schema_changer.go, line 981 at r1 (raw file):

		)
	})
	if err != nil {

can you add:

```go
if indexGCJob != nil {
    if rollbackErr := indexGCJob.CleanupOnRollback(ctx); rollbackErr != nil {
        log.Warningf(ctx, "failed to cleanup job: %v", rollbackErr)
    }
}
```

pkg/sql/schema_changer.go, line 984 at r1 (raw file):

		return nil, err
	}
	// TODO (lucy): Can we do the same thing with the PK drop index job?

Same thing meaning...? I get that this may make sense in the context of this PR, but in general this TODO isn't adequately detailed.

@thoszhang thoszhang left a comment

I screwed up and treated the index GC job as a single job, when in fact there can be multiple indexes per transaction to be dropped. Now it's a slice of jobs.
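With several indexes dropped per transaction, the single job becomes a slice, and the start step loops over it. A minimal sketch with hypothetical types (not the real schema changer code):

```go
package main

import "fmt"

// gcJob is a hypothetical stand-in for one index GC StartableJob.
type gcJob struct {
	index   string
	started bool
}

func (j *gcJob) Start() { j.started = true }

// startAll starts every pending GC job once the publish transaction has
// committed; because multiple indexes can be dropped in one transaction,
// the caller carries a slice of jobs rather than a single job.
func startAll(jobs []*gcJob) {
	for _, j := range jobs {
		j.Start()
	}
}

func main() {
	jobs := []*gcJob{{index: "idx_a"}, {index: "idx_b"}}
	startAll(jobs)
	for _, j := range jobs {
		fmt.Printf("%s started: %v\n", j.index, j.started)
	}
}
```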

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @pbardea)


pkg/sql/schema_changer.go, line 981 at r1 (raw file):

Previously, ajwerner wrote…

can you add:

```go
if indexGCJob != nil {
    if rollbackErr := indexGCJob.CleanupOnRollback(ctx); rollbackErr != nil {
        log.Warningf(ctx, "failed to cleanup job: %v", rollbackErr)
    }
}
```

Done. Unfortunately we still won't run the cleanup if Publish() has to retry, but I guess this is a best-effort thing anyway.


pkg/sql/schema_changer.go, line 984 at r1 (raw file):

Previously, ajwerner wrote…

same thing meaning... I get that this may make sense in the context of this PR but in general this TODO isn't adequately detailed.

Done. (I moved it to where we actually create that index drop job.)

@ajwerner ajwerner left a comment

:lgtm:

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @pbardea)


pkg/sql/schema_changer.go, line 981 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Done. Unfortunately we still won't run the cleanup if Publish() has to retry, but I guess this is a best-effort thing anyway.

Yeah... about that... I've had that on my TODO list for a bit now.

The non-idempotent part of all of this is the fact that we create a new job id every time. If we created the ID above the call to create then it'd all be good. I'll go refactor this eventually. Until then it's potentially a very slow leak that does make me sad.

Or in this case, I think that would look like having a slice of IDs which we allocate on the first pass through and then reuse on retries. That's a bit nasty, but I don't have a better answer.
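A rough sketch of that retry-safe allocation (hypothetical helper names, not the real jobs registry): the IDs are allocated on the first pass through the closure and reused verbatim on every retry, so a retried closure never mints fresh job IDs.

```go
package main

import "fmt"

// lastID emulates a monotonically increasing job ID allocator; in the
// real system IDs would come from the jobs registry.
var lastID = 1000

func nextID() int {
	lastID++
	return lastID
}

// ensureJobIDs returns the previously allocated IDs on retries, and only
// allocates fresh IDs on the first pass. Reusing the slice keeps the
// retried closure idempotent with respect to job creation.
func ensureJobIDs(ids []int, n int) []int {
	if ids != nil {
		return ids // retry path: reuse the first-pass IDs
	}
	ids = make([]int, n)
	for i := range ids {
		ids[i] = nextID()
	}
	return ids
}

func main() {
	var ids []int
	for attempt := 1; attempt <= 3; attempt++ { // simulate Publish() retries
		ids = ensureJobIDs(ids, 2)
		fmt.Println("attempt", attempt, "ids:", ids)
	}
}
```

With stable IDs, a retry that re-creates the job records overwrites the same rows instead of leaking orphaned jobs.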

@pbardea pbardea left a comment

:lgtm:

I left a comment, but if you think it would be best to fix that in a separate PR I'm happy with that too.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @lucy-zhang)


pkg/sql/schema_changer.go, line 1415 at r2 (raw file):

			descriptorIDs = append(descriptorIDs, table.ID)
		}
	}

I think we may have a missing case here to properly handle the empty database case. Something like:

```go
} else if details.ParentID != sqlbase.InvalidID {
	descriptorIDs = []sqlbase.ID{details.ParentID}
}
```

so that jobs dropping empty databases aren't cleaned up accidentally.

Also happy to dig into the drop empty database case in a separate PR.

@thoszhang thoszhang closed this Apr 15, 2020
@thoszhang thoszhang deleted the index-gc-job branch April 15, 2020 01:22
@thoszhang thoszhang restored the index-gc-job branch April 15, 2020 02:23
@thoszhang thoszhang reopened this Apr 15, 2020
@thoszhang thoszhang force-pushed the index-gc-job branch 2 times, most recently from 9d9e89b to e27c69b Compare April 15, 2020 02:40
Release note (bug fix): Fixed a bug introduced in 20.1 that could cause
multiple index GC jobs to be created for the same schema change in rare
cases.
@thoszhang thoszhang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajwerner and @pbardea)


pkg/sql/schema_changer.go, line 1415 at r2 (raw file):

Previously, pbardea (Paul Bardea) wrote…

I think we may have a missing case here to properly handle the empty database case. Something like:

```go
} else if details.ParentID != sqlbase.InvalidID {
	descriptorIDs = []sqlbase.ID{details.ParentID}
}
```

so that jobs dropping empty databases aren't cleaned up accidentally.

Also happy to dig into the drop empty database case in a separate PR.

Dropping an empty database doesn't cause a GC job to be created (either in 19.2 or now; in 19.2, there's no job at all).

I think an earlier iteration of this PR might have accidentally changed this (or maybe that was on my local branch, I don't remember). But in any case I've now added a test to ensure there's no GC job for an empty database.

blathers-crl bot commented Apr 15, 2020

❌ The GitHub CI (Cockroach) build has failed on f8582f1e.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@thoszhang (Contributor, Author)

Hmm. I managed to also repro the TestProtectedTimestampsDuringBackup CI flake on master, on my laptop, but the results were a bit weird: in one attempt it took >20k runs over 6 hours (I left it running overnight) and in another attempt my computer went to sleep in the middle, which might have affected the timeouts or something. I'm going to try again on roachprod-stress and open an issue once I see what happens in a more controlled environment. (This failure is different from #45932.)

@thoszhang (Contributor, Author)

Filed #47522 for the TestProtectedTimestampsDuringBackup flake (and #47532, which I encountered in testing), merging now.

bors r+

@craig (Contributor)

craig bot commented Apr 15, 2020

Build failed

@thoszhang (Contributor, Author)

The flake was #47546.

bors r+

@craig (Contributor)

craig bot commented Apr 16, 2020

Build failed

@thoszhang (Contributor, Author)

This TestScrubFKConstraintFKNulls flake is persistent. I'll revisit this later.

@thoszhang (Contributor, Author)

bors r+

@thoszhang (Contributor, Author)

???

bors r+

@craig (Contributor)

craig bot commented Apr 16, 2020

Build succeeded
