rangefeed: don't advance resolved ts on txn abort #111218
Conversation
Force-pushed from a206214 to 116578e
Previously, when rangefeed received an ABORTED status for a pushed transaction, it would eagerly drop all known intents belonging to that transaction from its cache. This was proven incorrect because we can get an ABORTED status even when the transaction is committed, if the leaseholder has already removed the transaction record. This commit removes the shortcut and always relies on intent operations to advance the resolved timestamp. Epic: none Release note: None
Force-pushed from 116578e to 97a30d2
This makes sense as an immediate fix for the incorrect rangefeed RTS advance, and the effect should be no worse than that of committed transactions which will hold up the RTS until the intents are resolved.
However, it seems generally problematic that SynthesizeTxnFromMeta returns ABORTED for committed transactions. Is this something the txns group should look into @nvanbenschoten?
We should backport this after some baking time as well.
Reviewed 8 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
-- commits
line 15 at r1:
This deserves a release note, since it fixes a potential correctness bug with changefeeds.
pkg/kv/kvserver/rangefeed/task.go
line 326 at r1 (raw file):
ops := make([]enginepb.MVCCLogicalOp, len(pushedTxns))
var intentsToCleanup []roachpb.LockUpdate
pushed := 0
nit: we can make ops zero-length (with len(pushedTxns) capacity) and append to it, rather than tracking pushed separately.
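A minimal sketch of the pattern this nit suggests, using a hypothetical stand-in type (the real element type is enginepb.MVCCLogicalOp): allocating the slice with zero length but full capacity and appending means skipped transactions leave no zero-valued holes, and len(ops) is always the true count, so no separate pushed counter is needed.

```go
package main

import "fmt"

// mvccLogicalOp is a hypothetical stand-in for enginepb.MVCCLogicalOp.
type mvccLogicalOp struct{ txnID int }

// buildOps allocates ops with zero length but capacity for every pushed
// txn, then appends only the ops actually produced. Skipped txns leave
// no gaps, so no separate index counter is required.
func buildOps(pushedTxns []int, skip func(int) bool) []mvccLogicalOp {
	ops := make([]mvccLogicalOp, 0, len(pushedTxns))
	for _, id := range pushedTxns {
		if skip(id) {
			continue // no hole left behind in ops
		}
		ops = append(ops, mvccLogicalOp{txnID: id})
	}
	return ops
}

func main() {
	ops := buildOps([]int{1, 2, 3}, func(id int) bool { return id == 2 })
	fmt.Println(len(ops), cap(ops)) // length reflects only emitted ops
}
```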
pkg/kv/kvserver/rangefeed/task_test.go
line 527 at r1 (raw file):
updateIntentOp(txn2, hlc.Timestamp{WallTime: 2}),
// abortTxnOp(txn3),
// abortTxnOp(txn4),
nit: remove these?
PS: I'd like an approval from KV transactions as well, either @nvanbenschoten or @arulajmani.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
I do not think this PR is correct as currently written. See my comment below about transaction records for ABORTED transactions that do not carry any lock spans. In those cases, no intent resolution is on its way, so we (the rangefeed processor) need to do something about these intents. But we are also not told where the intents are in the push response.
As written, I believe resolved timestamps are going to get stuck for abandoned transactions that have written intents (e.g. after a gateway crash).
However, it seems generally problematic that SynthesizeTxnFromMeta returns ABORTED for committed transactions. Is this something the txns group should look into @nvanbenschoten?
I don't know of other cases where this is more generally problematic. Do you? Recall that a transaction can only get into this state once all of its intents have been resolved. After that state, we need some eventual way to GC its state, which requires losing information.
The problem is that a rangefeed processor may be operating on a stale view of the range, where a completed intent resolution may not have yet propagated. For operations like follower reads, this works out fine because the operation routes back through the leaseholder in these cases, which detects the stale case and handles it correctly. Rangefeed is unique in operating on stale intent information but then not validating that information with the leaseholder before making decisions based on it. See my comment below about how we could correct this by waiting for the follower to catch up to (through application) the leaseholder and avoid this problem.
Reviewed 1 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @itsbilal)
-- commits
line 9 at r1:
The other condition is that every intent has already been resolved. The problem raised in #104309 is that an intent may have been resolved on the leaseholder of its range, but not on the follower that is serving the rangefeed.
Spelling this out reveals a second solution, which is that we could keep the same behavior as we have today if, after finding an ABORTED transaction record status, we waited for the follower of the rangefeed range to catch up to the leaseholder. Something like a WaitForApplication call would be enough to achieve this.
After waiting for the follower to catch up, we could then perform the same MVCCAbortTxnOp to handle the case of an abandoned transaction.
pkg/kv/kvserver/rangefeed/task.go
line 368 at r1 (raw file):
However, if the txn happens to have its LockSpans populated, then lets
clean up the intents within the processor's range as an optimization to
help others and to prevent any rangefeed reconnections from needing to
push the same txn. If we aborted the txn, then it won't have its
LockSpans populated. If, however, we ran into a transaction that its
coordinator tried to rollback but didn't follow up with garbage
collection, then LockSpans will be populated.
This paragraph was important, because it reveals the possibility that a transaction record is aborted without carrying any lock spans. In that case, no intent resolution is coming and the transaction record isn't telling us where the intents are located. So either we need to find them through a lock table scan, or we need to stop relying on reference counting and forget about the transaction through a mechanism like MVCCAbortTxnOp.
As currently written, I think this will lead to resolved timestamp stalls in cases like gateway node crashes.
Nice catch Nathan, thanks. We should try to construct a test case for this scenario.
I don't know of other cases where this is more generally problematic. Do you?
I don't know of a concrete problem, no, but this seems like an accident waiting to happen. Ideally, we would synthesize the meta as COMMITTED rather than ABORTED, by addressing this TODO:
cockroach/pkg/kv/kvserver/replica_tscache.go
Lines 536 to 542 in 77640e5
// TODO(andrei): We could keep a bit more info in the tscache to return a
// different error for COMMITTED transactions. If the EndTxn(commit) was
// the only request in the batch, this would be sufficient for the
// client to swallow the error and declare the transaction as committed.
// If there were other requests in the EndTxn batch, then the client would
// still have trouble reconstructing the result, but at least it could
// provide a non-ambiguous error to the application.
Reviewed 1 of 9 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
we could keep the same behavior as we have today if, after finding an ABORTED transaction record status, we waited for the follower of the rangefeed range to catch up to the leaseholder
I think I'm partial to this solution. The other option, a lock table scan and intent resolution, can be expensive in the case of many intents. If the follower is lagging far behind the leaseholder, then the resolved timestamp will already be lagging far behind the current time, so waiting out the replication and application seems less problematic.
Can we limit this to the case where we are not the leaseholder and the aborted txn record has no lock spans?
Something like a WaitForApplication call would be enough to achieve this.
It would, but it's a bit annoying since we need to know which LAI to wait for, which we'll have to get from the leaseholder. And once we have that, we don't really need to go via a WaitForApplication RPC to probe the replica state, we can just wait for the local replica to reach it inline (if we're ok with blocking here) -- although we may not have easy access to the replica here, so maybe WaitForApplication is more straightforward.
We can't know which timestamp the txn would have committed at, can we? If we did, then we could also check if that's already below the closed timestamp, and omit the replication roundtrip.
Ideally, we would synthesize the meta as COMMITTED rather than ABORTED, by addressing this TODO
That TODO could help in some cases, but it's not an exhaustive solution. If we're using the timestamp cache (or any in-memory, unreplicated data structure) to store information about committed transactions then there will be cases where this information is lost. Similarly, if we're storing this information in persistent storage then we eventually need to GC it to avoid unbounded build-up. So there doesn't seem to be an approach that avoids losing information and needing to handle these ambiguous cases.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @itsbilal)
Can we limit this to the case where we are not the leaseholder and the aborted txn record has no lock spans?
I don't think we want to guarantee that if a txn record has some lock spans, that it has every lock span. You could imagine a world where long-running transactions would checkpoint their lock spans periodically to bound the number of locks that they abandon without tracking on failure.
It would, but it's a bit annoying since we need to know which LAI to wait for, which we'll have to get from the leaseholder. And once we have that, we don't really need to go via a WaitForApplication RPC to probe the replica state, we can just wait for the local replica to reach it inline (if we're ok with blocking here) -- although we may not have easy access to the replica here, so maybe WaitForApplication is more straightforward.
Agreed. If we can locate the leaseholder and grab its LAI, we should have all we need to perform a replica-local wait.
Pushing a write through Raft and waiting for it to come out the other end locally would avoid the clock synchronization dependency, but that case is probably not worth considering. I suspect that we already have a dependency on clocks from the PushTxn.
We can't know which timestamp the txn would have committed at, can we? If we did, then we could also check if that's already below the closed timestamp, and omit the replication roundtrip.
I don't understand this idea, but it sounds interesting. Mind spelling it out further?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
I don't understand this idea, but it sounds interesting. Mind spelling it out further?
This was pretty half-baked. The idea was that if we know the commit timestamp, and the closed timestamp has advanced past the commit timestamp, then we know that we've seen all intents written by the transaction, so we can flush them and advance the resolved timestamp.
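The check described above reduces to a single timestamp comparison. A minimal sketch, with a hypothetical helper name and a wall-time-only stand-in for hlc.Timestamp: if the closed timestamp has advanced past the commit timestamp, every intent the transaction wrote must already have been emitted, so the processor could drop the txn from tracking without waiting out a replication roundtrip.

```go
package main

import "fmt"

// ts is a wall-time-only stand-in for hlc.Timestamp.
type ts int64

// canForgetTxn reports whether all of a committed transaction's
// intents are guaranteed to have been seen by the rangefeed: true
// once the closed timestamp has reached the commit timestamp, since
// no new writes can land at or below a closed timestamp.
func canForgetTxn(commitTS, closedTS ts) bool {
	return commitTS <= closedTS
}

func main() {
	fmt.Println(canForgetTxn(5, 7)) // commit ts already closed: safe to flush
	fmt.Println(canForgetTxn(9, 7)) // closed ts still behind: keep tracking
}
```

The hard part, as the thread notes, is that the pusher generally does not know the commit timestamp once the transaction record is gone.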
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @itsbilal, and @nvanbenschoten)
If we can locate the leaseholder and grab its LAI, we should have all we need to perform a replica-local wait.
FWIW, I don't think we have a way to do this currently, but we can consider piggybacking on the LeaseInfo RPC by adding a field for the LAI. This complicates backports a little, in that we have to handle leaseholders that don't support it, but that seems ok.
cockroach/pkg/kv/kvpb/api.proto
Lines 1631 to 1650 in 4518a99
// LeaseInfoResponse is the response to a LeaseInfo() operation.
message LeaseInfoResponse{
  ResponseHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];
  // The last lease known by the replica serving the request. It can also be the
  // tentative future lease, if a lease transfer is in progress.
  Lease lease = 2 [(gogoproto.nullable) = false];
  // current_lease is set if `lease` represents a tentative future lease. In
  // that case, current_lease represents the lease that's currently in effect.
  Lease current_lease = 3;
  // evaluated_by returns the store that evaluated this request. This
  // corresponds to the leaseholder unless ReadConsistency=INCONSISTENT was
  // used. The response reflects the evaluator's view of the lease. When the
  // client cares to see a particular node's view, it can use this field to
  // check whether the node it intended query (by sending the request to that
  // node and using ReadConsistency=INCONSISTENT) indeed served it - it's
  // possible that even if ReadConsistency=INCONSISTENT was used, the request is
  // still not evaluated by the node it was sent to if that node's replica is a
  // learner or the node doesn't have a replica at all.
  int32 evaluated_by = 4 [(gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.StoreID"];
}
What if we had a way to indicate that the aborted transaction record was synthesized? That is the only special case we need to handle; "real" aborted transactions we can just ignore as we used to.
We actually do expose this through the TxnRecordExists flag on the QueryTxnResponse. It would be reasonable to add the same flag on PushTxnResponse.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @itsbilal)
When we synthesise transaction records from metadata in the ABORT case, if we have the oldest write timestamp for the transaction, then we don't need to use the LAI and wait for it specifically. We receive closed timestamp notifications in the processor, so we can just get rid of aborted transactions once the processor reaches that timestamp.
The problem is that the txn meta the pusher sees can be outdated if the txn was anchored on a different range and has since been pushed by someone else. So if the rangefeed pusher is looking at an old intent at t1, but the transaction instead committed and cleaned up its intents at t2, then pushing it at t1 and waiting for the closed timestamp to reach t1 will remove the intents from tracking, allowing it to emit a closed timestamp for t3 > t2 before emitting the values at t2.
I was thinking that when we push, it is the replica that owns the transaction anchor that synthesises the transaction record for the pusher. So it must know the oldest possible timestamp that it has written to. Even if that's a different range, once "our" leaseholder signals that its closed timestamp is past the transaction's timestamp, it should be safe to assume we won't receive intents with earlier timestamps.
But how do we know what that timestamp is? The transaction record is gone, and I don't believe we keep track of the latest write timestamp (although it would be useful to have). We could of course use the replica's current time, but we're incurring a ~3 second wait in that case.
Isn't it the same if we want to use
No, with the LAI we'll typically not wait more than a replication roundtrip (if that), which is significantly less than 3 seconds. And I don't think we should use
Fair point. I was referring to
Replaced by #114487.
Previously, when rangefeed received an ABORTED status for a pushed transaction, it would eagerly drop all known intents belonging to that transaction from its cache.
This was proven incorrect because we can get an ABORTED status even when the transaction is committed, if the leaseholder has already removed the transaction record.
This commit removes the shortcut and always relies on intent operations to advance the resolved timestamp.
This is largely a revert of #35889
Epic: none
Fixes: #104309
Release note: None