closedts: raft closed timestamp regression #70894

Open
andreimatei opened this issue Sep 29, 2021 · 26 comments · May be fixed by #75298
Labels
A-kv-closed-timestamps Relating to closed timestamps C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@andreimatei
Contributor

andreimatei commented Sep 29, 2021

Sentry is getting crash-loop reports from a handful of clusters about closed timestamp regressions.
Example:

store_raft.go:524: log.Fatal: ???: raft closed timestamp regression in cmd: "_3\"*\x83\xe6\xd2s" (term: 23, index: 415262); batch state: 1626099041.832244381,0, command: 1626092389.971770216,0, lease: repl=(n2,s2):3 seq=17 start=1626091869.496126607,0 exp=1626099053.832244381,4775 pro=1626099044.832244381,4775, req: ConditionalPut [×,×), EndTxn(commit:true modified-span (node-liveness)) [×], [txn: fa86e74a], [can-forward-ts], applying at LAI: 183761.

The assertion includes the tail of the Raft log, which should be very valuable in understanding what is going wrong, but that's redacted in Sentry.

Until version 21.1.4, it was possible to hit this assertion on lease acquisitions, but that was fixed (and backported) through #66512 (comment).
Indeed, some crashes from older versions appear to be on lease acquisitions. But we also have crashes on newer versions coming from commands proposed by other types of requests.

I've looked at reports from a couple of clusters, and the only commonality I see, which is also weird in itself, is that in multi-node clusters only 2 nodes report the crash, when you'd expect all (3) replicas to crash just the same. But, to make it extra confusing, there are also single-node clusters where this happens, so it both looks and doesn't look like a replication issue.

For analyzing the Sentry reports, one thing that helps is going to the very first set of crashes from a cluster (i.e. the first crash from every node). Out of those, one crash will correspond to the leaseholder, and that one will have extra information in it: the request that corresponds to the command with the regression. Getting to that first crash in Sentry is tricky: you have to get to a paginated view of all the events for one cluster (example) and then go to the last page by guessing the "cursor" URL argument. The argument is encoded as 0%3A followed by the actual index.
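For example (with a hypothetical index of 1500, not taken from an actual report), the last page would be reached with a URL argument like cursor=0%3A1500.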

Looking at a few of these requests causing the regression, I don't see a pattern. I've seen EndTxn and I've seen a batch of Merges.

One thing I do see and find curious: in each of the few cases I've looked at, the regression was for hours. I don't know what to make of that. I'm thinking that a clock jump perhaps has something to do with it, although I don't see how.

Jira issue: CRDB-10274

@andreimatei andreimatei added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 29, 2021
@sentry-io

sentry-io bot commented Sep 29, 2021

Sentry issue: COCKROACHDB-3HX

@andreimatei andreimatei added the A-kv-closed-timestamps Relating to closed timestamps label Sep 29, 2021
@tbg
Member

tbg commented Jan 17, 2022

I just managed to reproduce this by running this file:

#!/bin/bash

set -euo pipefail

make buildshort
roachprod destroy local || true
roachprod create -n 3 local

roachprod put local cockroach
roachprod start local:1-3
roachprod run local:1 -- ./cockroach workload init kv
roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

roachprod stop local
env COCKROACH_CLOCK_JUMP=8h roachprod start local

roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

roachprod stop local
roachprod start local

roachprod run local:1 -- ./cockroach workload run kv --max-rate 100 --read-percent 0 --duration=10s {pgurl:1-3}

and this diff

diff --git a/pkg/util/hlc/hlc.go b/pkg/util/hlc/hlc.go
index 075b461ecb..34abfd342d 100644
--- a/pkg/util/hlc/hlc.go
+++ b/pkg/util/hlc/hlc.go
@@ -15,6 +15,7 @@ import (
 	"sync/atomic"
 	"time"
 
+	"github.com/cockroachdb/cockroach/pkg/util/envutil"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
 	"github.com/cockroachdb/cockroach/pkg/util/syncutil"
 	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
@@ -186,11 +187,13 @@ func (m *HybridManualClock) Resume() {
 	m.mu.Unlock()
 }
 
+var hackClockJump = envutil.EnvOrDefaultDuration("COCKROACH_CLOCK_JUMP", 0)
+
 // UnixNano returns the local machine's physical nanosecond
 // unix epoch timestamp as a convenience to create a HLC via
 // c := hlc.NewClock(hlc.UnixNano, ...).
 func UnixNano() int64 {
-	return timeutil.Now().UnixNano()
+	return timeutil.Now().UnixNano() + hackClockJump.Nanoseconds()
 }
 
 // NewClock creates a new hybrid logical clock associated with the given

on top of e84001d.

‹/System/{NodeLive…-tsd}›,raft] 43 ???: raft closed timestamp regression in cmd: ">\xa3\x9a\x17#pN\xf9" (term: 8, index: 122); batch state: 164>

@tbg
Member

tbg commented Jan 17, 2022

Here's n1's log: https://gist.github.com/tbg/1dfc6ee1d324a2971307e3165448275d. The other two nodes survived.

@tbg
Member

tbg commented Jan 17, 2022

Got this again, so I think this repro is pretty good. In fact, I'm going to have to disable the assertion so that I can get on with what I was actually looking into (#74909) :-)

@tbg
Member

tbg commented Jan 17, 2022

Note: when I disable the assertions, it still crashes:

F220117 14:07:42.887954 249 kv/kvserver/replica.go:1269 ⋮ [n1,s1,r3/1:‹/System/{NodeLive…-tsd}›] 142 on-disk and in-memory state diverged: [Raft
ClosedTimestamp.WallTime: 1642428459884897696 != 1642457252063853439]

In other words, something pretty bad is going on.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 21, 2022
Before this patch, the following scenario was possible:
- a node is stopped
- the clock jumps backwards
- the node is restarted
- a replica from the node proposes a command with a closed timestamp
  lower than timestamps closed before the restart. This causes an
  assertion to fire on application.

The problem is that, after the restart, the propBuf, which is in charge
of assigning closed timestamps to proposals, doesn't have info on what
had been closed prior to the restart. The propBuf maintains the
b.assignedClosedTimestamp field, which is supposed to be in advance of
r.mu.state.RaftClosedTimestamp on leaseholder replicas, but nobody
initializes that field on restart. For ranges with epoch-based leases, I
believe we don't have a problem because the range will need to acquire a
new lease after restart before proposing any commands - and lease
acquisitions initialize b.assignedClosedTimestamp. But for
expiration-based leases, I believe a lease from before the restart
might be used(*).

(*) This is probably a bad thing which we should discuss separately. I
think we don't want leases to be used after a restart for multiple
reasons and we have the r.minLeaseProposedTS[1] guard that is supposed to
protect against using them (in addition to the epoch protection we have
for epoch-based leases, I think). But I believe this protection can be
elided by a backwards clock jump - we refuse to use leases acquired
*before* minLeaseProposedTS, and minLPTS is assigned to time.Now() on
start; if the clock went backwards, the leases will appear to be good.

[1] https://github.com/cockroachdb/cockroach/blob/6664d0c34df0fea61de4fff1e97987b7de609b9e/pkg/kv/kvserver/replica.go#L468

Release note: A bug causing nodes to repeatedly crash with a "raft
closed timestamp regression" was fixed. The bug occurred when the node
in question had been stopped and the machine clock jumped backwards
before the node was restarted.

Fixes cockroachdb#70894
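To illustrate the (*) point above, here is a minimal sketch (simplified, hypothetical types and names; not the actual replica code) of how a lease-validity guard keyed off a startup timestamp can be defeated by a backwards clock jump:

package kvserver

import "time"

// replicaSketch models just the guard described above: a lease proposed
// before minLeaseProposedTS is considered unusable after a restart.
type replicaSketch struct {
	// minLeaseProposedTS is assigned from the local clock at node startup.
	minLeaseProposedTS time.Time
}

// leaseUsable reports whether a lease proposed at proposedTS may still be
// used. If the machine clock jumped backwards while the node was down,
// proposedTS (taken before the stop) can still compare as not-before the
// new, earlier minLeaseProposedTS, so a pre-restart lease appears good.
func (r *replicaSketch) leaseUsable(proposedTS time.Time) bool {
	return !proposedTS.Before(r.minLeaseProposedTS)
}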
@yuzefovich
Member

Pinging this issue since we had a few more Sentry reports of this error.

@andreimatei
Contributor Author

I think #75298 has something to say about this. Tobi, can I pass it to you? :P

@pav-kv
Collaborator

pav-kv commented Oct 31, 2022

Tried the repro from #70894 (comment) a few times; no luck catching it so far. Also tried with a negative time shift.

The description of #75298 mentions that only expiration-based leases are affected, so I also tried making (almost) all leases expiration-based to increase the chances of the repro catching this. No luck either.

Looking closer at the failure report #90682, the lease is epoch-based, so maybe both types are affected.

@sentry-io

sentry-io bot commented Jan 30, 2023

Sentry issue: COCKROACHDB-6BJ

@tbg
Member

tbg commented Feb 15, 2023

I can reproduce this using #97173, looking into it.

@tbg
Member

tbg commented Feb 27, 2023

I ended up instead sorting the reproposal slice by (LAI, ClosedTimestamp, Lease), which I thought would avoid this issue. I never saw it again during my testing after that, until last night, when I randomly got it while repro'ing for #97287 (with the sorting). So I don't seem to have solved it completely, and there's something yet to be understood. Possibly this will then also answer how we're seeing this in the wild.

Looking at what I wrote there again, I don't know why that would avoid the issue. We can still have the same problem:

  • assign LAI 12345, CT=1000 to log position 100
  • assign LAI 99999, CT=500 to log position 101
  • boom

So I don't know what shifted to hide this problem in my reproductions, but it should still exist (at least when invalid LAIs are introduced).

@tbg
Member

tbg commented Feb 27, 2023

Ah, I know. By sorting by closed timestamp, the LAI=1234 block would now pick the smallest possible closed timestamp for slot LAI=1234. One of them had to be smaller than whatever we assigned to LAI=1235 (or higher). So we avoided the assertion, but not for any sound reason - we just reduced the chances of it a ton. Which explains why I haven't been seeing it any more, except that one time. I should've looked into that one repro more but didn't: it might've been a different mode, perhaps more relevant to the issue at hand.

@pav-kv
Collaborator

pav-kv commented Feb 28, 2023

@tbg How about trying to catch the (MaxLeaseIndex, ClosedTimestamp) inversion early (by adding more assertions 😬 of the invariant mli1 <= mli2 ==> ct1 <= ct2), e.g. right after Sort has reordered the proposals, and/or at Flush time? Maybe your repro will then catch it earlier in the pipeline (rather than only at application time, after it got through Raft).
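A minimal sketch of such a check (hypothetical types and field names, not the actual propBuf code), which could run right after Sort and again at Flush time:

package kvserver

import "fmt"

// proposalSketch carries only the two fields relevant to the invariant.
type proposalSketch struct {
	MaxLeaseIndex   uint64
	ClosedTimestamp int64 // wall time in nanoseconds
}

// assertClosedTSMonotonic checks the invariant mli1 <= mli2 ==> ct1 <= ct2
// over a slice already sorted by MaxLeaseIndex: closed timestamps must be
// non-decreasing, or a later LAI carries an earlier closed timestamp and
// the application-time assertion will eventually fire.
func assertClosedTSMonotonic(sorted []proposalSketch) error {
	for i := 1; i < len(sorted); i++ {
		prev, cur := sorted[i-1], sorted[i]
		if prev.MaxLeaseIndex <= cur.MaxLeaseIndex && prev.ClosedTimestamp > cur.ClosedTimestamp {
			return fmt.Errorf(
				"closed timestamp inversion: (mli=%d, ct=%d) ordered before (mli=%d, ct=%d)",
				prev.MaxLeaseIndex, prev.ClosedTimestamp, cur.MaxLeaseIndex, cur.ClosedTimestamp)
		}
	}
	return nil
}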

Is closed timestamp assigned once, during the first proposal flush time? Or is it reassigned on: a) reproposal, b) reproposal with a higher MaxLeaseIndex?

@tbg tbg removed their assignment Mar 21, 2023
@tbg
Member

tbg commented Mar 21, 2023

Earlier assertions sound good.

Is closed timestamp assigned once, during the first proposal flush time? Or is it reassigned on: a) reproposal, b) reproposal with a higher MaxLeaseIndex?

It's currently not reassigned, but only because that ended up being slightly more convenient in the current code, not because there's a reason it couldn't be. (There also isn't a reason to reassign it; the less change across proposal reuse, the better.)

I'm going to unassign for now, we'll pick this back up in https://cockroachlabs.atlassian.net/browse/CRDB-25287.

@tbg
Member

tbg commented Jul 18, 2023

We didn't get around to getting to the bottom of this. The comment [1] explains the understanding I have: there is a monotonicity requirement between MLAI and closed timestamp assignment - the assigned (MLAI, CT) pairs need to be totally ordered by the component-wise order (i.e. if MLAI > MLAI', then CT > CT'). This wasn't true in the experiment I ran (I was artificially introducing backward jumps in MLAI), and I suspect it may not always be true in practice either, though I'm hard-pressed to find an example. One thing to look into could be special commands like lease requests, etc., which don't fully use the LAI mechanism. Could these cause this reversion? I don't think so, but it's still worth double-checking. Similarly, I don't think rejected commands (illegal lease index, invalid proposer lease) can trigger closed timestamp updates.

Pavel's suggestion above has merit: we can verify the total order before commands enter raft, and this could give us an idea of where the problem arises.

It's also totally possible that the problem occurs in practice only due to some corruption in the system ("Ex falso sequitur quodlibet"), or that we've since "accidentally" fixed it in the course of simplifying tryReproposeWithNewLeaseIndex [2], though I don't recall reproposals being involved at all (refreshes were involved in my experiment).

One immediate thing we might do is to adopt must [3] once it's available and report the error to Sentry without triggering the crash in production builds [4]; a rough sketch of that idea follows after the footnotes. We don't even have to wait for must. It is pretty clear that users don't benefit from the assertion; we'll usually just tell them to disable it and go on with their lives.

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/70894#issuecomment-1433244880

  2. https://cockroachlabs.atlassian.net/browse/CRDB-25287

  3. https://github.com/cockroachdb/cockroach/pull/106508

  4. https://github.com/cockroachdb/cockroach/issues/107064
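As mentioned above, a rough sketch of the report-instead-of-crash idea (hypothetical flag and helper names; not the actual must API from [3]):

package kvserver

import (
	"context"
	"fmt"
)

// fatalOnClosedTSRegression would be true only in test builds; in
// production builds the regression is reported but not fatal.
// (Hypothetical flag; real gating would use the build/assertion
// machinery discussed above.)
var fatalOnClosedTSRegression = false

func onClosedTSRegression(ctx context.Context, prevClosed, cmdClosed int64) error {
	err := fmt.Errorf("raft closed timestamp regression: %d -> %d", prevClosed, cmdClosed)
	if fatalOnClosedTSRegression {
		panic(err) // stand-in for a fatal log in a real build
	}
	reportToTelemetry(ctx, err) // hypothetical Sentry reporting hook
	return err
}

func reportToTelemetry(ctx context.Context, err error) {
	// Placeholder: forward the assertion failure to error telemetry
	// without taking the node down.
	_ = ctx
	_ = err
}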

@pav-kv
Collaborator

pav-kv commented Dec 14, 2023

From a recent repro:

$ grep -o "\(Term.*Index.*\|max_lease_index.*closed_timestamp.*replic\)" closed-timestamp-log.txt | sed 's/ replic//g'

Term:6 Index:15 Type:EntryNormal  (ID 0ad4ef7915f2d9f7) by lease #6
max_lease_index:2 closed_timestamp:<wall_time:1699467454251881189 >
Term:6 Index:14 Type:EntryNormal  (ID 4a37f23fa745a2e1) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454363124760 >
Term:6 Index:13 Type:EntryNormal  (ID 174586ad919d0c61) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454220166867 >
Term:6 Index:12 Type:EntryNormal  (ID 4a37f23fa745a2e1) by lease #6
max_lease_index:1 closed_timestamp:<wall_time:1699467454363124760 >
Term:6 Index:11 Type:EntryNormal : EMPTY [recovered]

All proposals seem to be made under the same leadership and lease.
All proposals are made right at the beginning of the range's lifecycle (log indices start at 10). The EMPTY log entry (index 11) is committed when the raft leader has just been elected.

Observations on the proposals:

  • Proposals at indexes 12 and 14 have the same ID, so they are likely full-copy reproposals.
  • The proposal at index 13 was made with the same MLI = 1, so it is applied as a no-op.
  • For some reason, the proposal at index 15 was made with a lower closed timestamp than proposals 12 and 14.

This is probably self-inflicted by the test-only LAI override, which in some cases returns 1 when it wanted to return a LAI > 1.

Probably the original ordering had entries 12 and 14 submitted at LAI 3 (since their closed timestamp is the biggest of all), but the test interceptor randomly assigned LAI 1 to them. This is a bug in the test harness.

The failures reported via Sentry fail at higher indices and LAIs, though, so the issue is real; the repro is just not relevant to it.

@pav-kv
Collaborator

pav-kv commented Jan 8, 2024

Latest occurrence of this panic in kvnemesis failure here: #116541 (comment).

$ grep -o "\(Term.*Index.*\|max_lease_index.*closed_timestamp.*replic\)" ct.txt | sed 's/ replic//g'
Term:6 Index:15 Type:EntryNormal  (ID 450747ef87ce7014) by lease #9
max_lease_index:2 closed_timestamp:<wall_time:1704640779372779513 >
Term:6 Index:14 Type:EntryNormal  (ID 6f1758389525834e) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779402664522 >
Term:6 Index:13 Type:EntryNormal  (ID 5c2068c6aee7e0e0) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779361734754 >
Term:6 Index:12 Type:EntryNormal  (ID 6f1758389525834e) by lease #9
max_lease_index:1 closed_timestamp:<wall_time:1704640779402664522 >
Term:6 Index:11 Type:EntryNormal : EMPTY

Possibly the same test-harness bug as #70894 (comment).

Entries 12 and 14 were probably intended at MLI 3, but the test assigned MLI=1 to them, and one accidentally got submitted before entry 13 (which originally had MLI=1).

The fix would be to disallow the MLI=1 injection until the LAI has definitely passed 1. Then we'll avoid the unintended command reordering. Once we have this fix, nothing stops us from generalizing it: the MLI can be overridden to any value <= submitted LAI, and the command should not be applied in this case.
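A rough sketch of such a guard in the test harness (hypothetical function and parameter names; the real interceptor lives in the kvserver test code):

package kvserver

// maybeInjectLAI decides whether the test interceptor may override a
// proposal's MaxLeaseIndex with the injected value 1. It only does so once
// the range's applied lease index has definitely passed 1, so the injected
// command is guaranteed to be rejected as stale rather than reordered ahead
// of a legitimate MLI=1 proposal.
func maybeInjectLAI(appliedLAI, intendedMLI uint64) uint64 {
	const injected = 1
	if appliedLAI <= injected {
		// Too early: injecting MLI=1 now could race with the real MLI=1
		// command and invert the (MLI, closed timestamp) ordering.
		return intendedMLI
	}
	return injected
}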

@nicktrav
Collaborator

Likely closed as part of #123442. Sentry will tell us if / when we see this again.

@pav-kv
Collaborator

pav-kv commented Jun 28, 2024

#123442 saw a regression in closed timestamp sidetransport, while this issue is about a regression in the log. Not sure these are the same.

@nicktrav
Collaborator

nicktrav commented Jul 1, 2024

cc: @arulajmani - didn't we decide these were similar enough to warrant closing?

@nicktrav
Collaborator

nicktrav commented Jul 1, 2024

Re-opening just in case.
