storage: fix and test a bogus source of replica divergence errors #37668

Merged: 5 commits, May 21, 2019

@tbg tbg commented May 21, 2019

An incompatibility in the consistency checks was introduced between v2.1 and v19.1.
See individual commit messages and #37425 for details.

Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see <store
directory>/auxiliary/checkpoints).

Release note (bug fix): Fixed a case in which ./cockroach quit would
return success even though the server process was still running in a
severely degraded state.

tbg added 3 commits May 21, 2019 14:36
There are "known unknown" problems with shutting down gRPC or the
stopper that can leave the process dangling while its sockets are
already closed. The UX is poor, and the root cause is likely very
annoying to find, fix, and regression-test. Luckily, all we want to
achieve is that the process is dead soon after the client disconnects,
and we can do that.

If I were to rewrite this code, I would probably not even bother with
stopping the stopper or grpc, but just call `os.Exit` straight away.
I'm not doing this right now to minimize fallout since this change
will be backported to release-19.1.

Release note (bug fix): Fixed a case in which `./cockroach quit` would
return success even though the server process was still running in a
severely degraded state.

In cockroachdb#35861, I made changes to the consistency checksum computation that
were not backwards-compatible. When a 19.1 node asks a 2.1 node for a
fast SHA, the 2.1 node would run a full computation and return a
corresponding SHA which wouldn't match the leaseholder's.

Bump ReplicaChecksumVersion to make sure that we don't attempt to
compare SHAs across these two releases.
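A minimal sketch of why bumping a checksum version avoids the false positive, assuming a scheme in which each replica reports the version alongside its SHA. The types, the function, and the constant value below are hypothetical and only illustrate the gating logic; they are not the actual CockroachDB definitions.

```go
package main

import (
	"bytes"
	"fmt"
)

// checksumVersion stands in for a constant such as ReplicaChecksumVersion.
// The value is arbitrary and chosen only for this example.
const checksumVersion uint32 = 4

// checksumResult is a hypothetical response carrying the version of the
// algorithm that produced the SHA.
type checksumResult struct {
	Version uint32
	SHA     []byte
}

type verdict int

const (
	skipped    verdict = iota // versions differ, SHAs are not comparable
	consistent                // same version, same SHA
	diverged                  // same version, different SHA
)

// compareChecksums only compares SHAs computed by the same algorithm version.
// A mismatch across versions says nothing about the underlying data.
func compareChecksums(local, remote checksumResult) verdict {
	if local.Version != remote.Version {
		return skipped
	}
	if bytes.Equal(local.SHA, remote.SHA) {
		return consistent
	}
	return diverged
}

func main() {
	newNode := checksumResult{Version: checksumVersion, SHA: []byte{0xaa}}
	oldNode := checksumResult{Version: checksumVersion - 1, SHA: []byte{0xbb}}
	fmt.Println("cross-version check skipped:", compareChecksums(newNode, oldNode) == skipped)
}
```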

Fixes cockroachdb#37425.

Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see <store
directory>/auxiliary/checkpoints).

The mixed version test was always verifying the first node by accident.

Release note: None

@tbg tbg requested a review from a team May 21, 2019 12:48
@cockroach-teamcity

This change is Reviewable

@tbg tbg requested a review from a team May 21, 2019 12:49
@tbg tbg force-pushed the fix/conscheck-version branch from 572b8c9 to 326a213 Compare May 21, 2019 13:24
@tbg tbg requested a review from nvanbenschoten May 21, 2019 15:23

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 1 of 1 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/cmd/roachtest/cluster.go, line 1058 at r4 (raw file):

		// TODO(tbg): the checks can fail for silly reasons like missing gossiped
		// descriptors, etc. -- not worth failing the test for. Ideally this would
		// be rock solid.

Is it still worth logging?


pkg/cmd/roachtest/cluster.go, line 1090 at r4 (raw file):

	}

	var db *gosql.DB

Comment that you're trying to find a live node and that this isn't actually the consistency check.

tbg added 2 commits May 21, 2019 22:02
This regression tests cockroachdb#37425, which exposed an incompatibility between
v19.1 and v2.1.

`./bin/roachtest run --local version/mixed/nodes=3` ran successfully
after these changes.

I took the opportunity to address a TODO in FailOnReplicaDivergence.

Release note: None

@tbg tbg force-pushed the fix/conscheck-version branch from 326a213 to e3ae436 Compare May 21, 2019 20:03

@tbg tbg left a comment

Comments addressed, TFTR!

bors r=nvanbenschoten

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)

craig bot pushed a commit that referenced this pull request May 21, 2019
37668: storage: fix and test a bogus source of replica divergence errors r=nvanbenschoten a=tbg

37701: workloadcccl: fix two regressions in fixtures make/load r=nvanbenschoten a=danhhz

The SQL database for all the tables in the BACKUPs created by `fixtures
make` used to be "csv" (an artifact of the way we made them), but as
of #37343 it's the name of the generator. This seems better, so change
`fixtures load` to match.

The same PR also (accidentally) started adding foreign keys in the
BACKUPs, but since there's one table per BACKUP (another artifact of the
way we used to make fixtures), we can't restore the foreign keys. It'd
be nice to switch to one BACKUP with all tables and get the foreign
keys, but the UX of the postLoad hook becomes tricky and I don't have
time right now to sort it all out. So, revert to the previous behavior
(no fks in fixtures) for now.

Release note: None

Co-authored-by: Tobias Schottdorf
Co-authored-by: Daniel Harrison

craig bot commented May 21, 2019

Build succeeded
