backport-19.1: storage: fix and test a bogus source of replica divergence errors #37722

Merged on May 22, 2019 (3 commits)

@tbg tbg commented May 22, 2019

Backport 3/5 commits from #37668.

/cc @cockroachdb/release


An incompatibility in the consistency checks was introduced between v2.1 and v19.1.
See individual commit messages and #37425 for details.

Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see
<store directory>/auxiliary/checkpoints).

Release note (bug fix): Fixed a case in which `./cockroach quit` would
return success even though the server process was still running in a
severely degraded state.

tbg added 2 commits May 22, 2019 14:56
In cockroachdb#35861, I made changes to the consistency checksum computation that
were not backwards-compatible. When a 19.1 node asks a 2.1 node for a
fast SHA, the 2.1 node runs a full computation instead and returns the
corresponding SHA, which does not match the leaseholder's.

Bump ReplicaChecksumVersion to make sure that we don't attempt to
compare SHAs across these two releases.

Fixes cockroachdb#37425.

Release note (bug fix): Fixed a potential source of (faux) replica
inconsistencies that can be reported while running a mixed v19.1 / v2.1
cluster. This error (in that situation only) is benign and can be
resolved by upgrading to the latest v19.1 patch release. Every time this
error occurs a "checkpoint" is created which will occupy a large amount
of disk space and which needs to be removed manually (see <store
directory>/auxiliary/checkpoints).
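
As a rough illustration of why bumping a checksum version avoids the bogus divergence reports, here is a minimal sketch of a version gate. It is not the CockroachDB implementation: `checksumVersion` (and its value), `checksumResponse`, `verdict`, and `compareChecksums` are hypothetical names made up for this example. The only idea taken from the commit message is that SHAs computed under different versions must not be compared.

```go
package main

import (
	"bytes"
	"fmt"
)

// checksumVersion is a hypothetical stand-in for ReplicaChecksumVersion.
// The value is arbitrary; it is bumped whenever the checksum computation
// changes in a way that older nodes cannot reproduce.
const checksumVersion uint32 = 4

// checksumResponse is a hypothetical reply from another replica.
type checksumResponse struct {
	Version uint32 // version of the computation that produced SHA
	SHA     []byte
}

type verdict int

const (
	consistent verdict = iota
	inconsistent
	skipped // versions differ, so the SHAs are not comparable
)

// compareChecksums only compares SHAs that were computed under the same
// version. A mismatch in version yields skipped instead of inconsistent.
func compareChecksums(localSHA []byte, resp checksumResponse) verdict {
	if resp.Version != checksumVersion {
		return skipped
	}
	if bytes.Equal(localSHA, resp.SHA) {
		return consistent
	}
	return inconsistent
}

func main() {
	local := []byte{0xde, 0xad}
	// Simulates a reply from a node that still runs the old computation.
	old := checksumResponse{Version: checksumVersion - 1, SHA: []byte{0xbe, 0xef}}
	if compareChecksums(local, old) == skipped {
		fmt.Println("versions differ; not reporting an inconsistency")
	}
}
```

In this sketch, differing SHAs from an older version are never reported as divergence, which is the behavior the version bump is meant to guarantee in a mixed-version cluster.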
@tbg tbg requested review from nvanbenschoten and a team May 22, 2019 12:56

There are "known unknown" problems with shutting down gRPC or the
stopper, which can leave the process dangling while its sockets are
already closed. The UX is poor, and finding the root cause, fixing it,
and adding a regression test would likely be very tedious. Luckily, all
we want to achieve is that the process is dead soon after the client
disconnects, and we can do that.

If I were to rewrite this code, I would probably not even bother with
stopping the stopper or gRPC, but just call `os.Exit` straight away.
I'm not doing this right now to minimize fallout since this change
will be backported to release-19.1.

Release note (bug fix): Fixed a case in which `./cockroach quit` would
return success even though the server process was still running in a
severely degraded state.
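
The approach described in this commit message (attempt a graceful shutdown, but make sure the process dies soon afterwards regardless) can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `quit`, its parameters, the timeout, and the exit codes are all hypothetical. The `exit` function is injected so the behavior can be exercised without killing the process; a real server would pass `os.Exit`.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// quit runs graceful in the background and guarantees that exit is called:
// with code 0 if graceful returns within timeout, and with code 1 otherwise.
// The exit codes are an assumption made for this sketch.
func quit(graceful func(), timeout time.Duration, exit func(code int)) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		graceful() // e.g. stop the stopper and the gRPC server
	}()

	select {
	case <-done:
		exit(0)
	case <-time.After(timeout):
		// The graceful path is stuck; do not leave a dangling process.
		fmt.Fprintln(os.Stderr, "graceful shutdown timed out; forcing exit")
		exit(1)
	}
}

func main() {
	// Simulates a shutdown that completes quickly. Replacing drain with a
	// function that blocks forever would trigger the forced exit instead.
	drain := func() { time.Sleep(10 * time.Millisecond) }
	quit(drain, time.Second, os.Exit)
}
```

The design choice here is that the timeout bounds how long a hung shutdown can keep the process alive, so the client never sees success while a degraded server lingers indefinitely.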

tbg commented May 22, 2019

Forgot the fix for ./cockroach quit (commit added).
