supply local time to InternalPushTxn when initiating the request #1102
Conversation
- expiry := args.Timestamp
+ // Compute heartbeat expiration (all replicas must see the same result).
+ expiry := args.Now // caller can set this to his wall time.
+ expiry.Forward(args.Timestamp) // if Now is not set, fallback
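For context, Forward ratchets a timestamp up to the given value, so the new lines compute the maximum of args.Now and args.Timestamp (a zero Now simply falls back to Timestamp). Below is a minimal, self-contained sketch of those semantics; the Timestamp type is a simplified stand-in for hlc.Timestamp, not the real implementation:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp, here only to
// illustrate the Forward semantics used in the diff above.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Less reports whether t orders strictly before s.
func (t Timestamp) Less(s Timestamp) bool {
	return t.WallTime < s.WallTime ||
		(t.WallTime == s.WallTime && t.Logical < s.Logical)
}

// Forward ratchets t up to s if s is larger, i.e. t becomes max(t, s).
func (t *Timestamp) Forward(s Timestamp) {
	if t.Less(s) {
		*t = s
	}
}

func main() {
	now := Timestamp{WallTime: 200}   // args.Now: the pusher's wall clock
	reqTS := Timestamp{WallTime: 150} // args.Timestamp

	// Mirrors the new code path: expiry := args.Now; expiry.Forward(args.Timestamp).
	expiry := now
	expiry.Forward(reqTS)
	fmt.Println(expiry.WallTime) // 200; with a zero Now, this would print 150 instead
}
```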
Is there any reason to send a push txn request without setting Now? It might be simpler to require that Now be set in all cases. (On the other hand, this does ensure that we use max(Now, Timestamp) if there is any risk that the pusher's clock could be behind.)
LGTM. Do we also need to set Now when pushing from gc_queue.go? (which looks like the only other non-test use of InternalPushTxnRequest)
Previously args.Timestamp was used, but there are situations in which it never changes, meaning that infinite retry loops on an abandoned Txn descriptor would ensue in storage/store, for instance in cockroachdb#877 and the bank example.
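To spell out the failure mode: the pushee only counts as abandoned once its last heartbeat falls outside the expiration window, so an expiry pinned to the original request timestamp can re-run the same check forever without crossing that threshold, while an expiry derived from the pusher's wall clock advances on every retry. A hedged sketch of that idea; the names (isAbandoned, lastHeartbeat, ttl) are hypothetical and do not come from the actual storage/store code:

```go
package main

import (
	"fmt"
	"time"
)

// isAbandoned is an illustrative version of a heartbeat-expiration check:
// the pushee counts as abandoned only if its last heartbeat is older than
// the expiry minus the heartbeat TTL. Hypothetical names, not store code.
func isAbandoned(lastHeartbeat, expiry time.Time, ttl time.Duration) bool {
	return lastHeartbeat.Add(ttl).Before(expiry)
}

func main() {
	ttl := 5 * time.Second
	lastHeartbeat := time.Now().Add(-time.Minute) // pushee stopped heartbeating a minute ago

	// Expiry pinned to the original request timestamp: every retry re-evaluates
	// the same instant, so the check may never trip and the loop spins forever.
	staticExpiry := lastHeartbeat.Add(time.Second)
	fmt.Println(isAbandoned(lastHeartbeat, staticExpiry, ttl)) // false, on every retry

	// Expiry derived from the pusher's wall clock: it advances with each retry
	// and eventually clears lastHeartbeat+ttl, letting the push succeed.
	wallClockExpiry := time.Now()
	fmt.Println(isAbandoned(lastHeartbeat, wallClockExpiry, ttl)) // true
}
```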
I've made
{nil, 0, proto.PUSH_TIMESTAMP, false},
{nil, 0, proto.ABORT_TXN, false},
{nil, 0, proto.CLEANUP_TXN, false},
{nil, 1, proto.PUSH_TIMESTAMP, false}, // using 0 is awkward
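For readers skimming the diff, the rows above read most naturally as (pushee heartbeat, pusher's current time, push type, expected success). The struct below is only a guessed shape for illustration; the field names and the placeholder types standing in for the proto package are not the actual test code:

```go
package main

import "fmt"

// Local placeholders for the proto identifiers referenced in the rows above.
type PushTxnType int

const (
	PUSH_TIMESTAMP PushTxnType = iota
	ABORT_TXN
	CLEANUP_TXN
)

type Timestamp struct{ WallTime int64 }

// pushTxnTestCase is a guess at what each row encodes: the pushee's last
// heartbeat (nil for none), the pusher's current time, the kind of push,
// and whether the push is expected to succeed.
type pushTxnTestCase struct {
	heartbeat   *Timestamp
	currentTime int64
	pushType    PushTxnType
	expSuccess  bool
}

func main() {
	cases := []pushTxnTestCase{
		{nil, 0, PUSH_TIMESTAMP, false},
		{nil, 1, PUSH_TIMESTAMP, false}, // 1 rather than 0, since 0 reads as "unset"
	}
	fmt.Println(len(cases)) // 2
}
```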
This comment won't make sense in the future: using 0 for what?
Added to the comment.
LGTM
supply local time to InternalPushTxn when initiating the request
This commit deprecates PushTxnRequest.Now and gives its responsibility to the batch header timestamp. The biggest reason to do this is so that PushTxn requests properly update their receiver's clock. This is critical because a PushTxn request can result in a timestamp cache entry being created with a value up to this time, so for safety, we need to ensure that the leaseholder updates its clock at least to this time _before_ evaluating the request. Otherwise, a lease transfer could miss the request's effect on the timestamp cache and result in a lost push/abort.

The comment on PushTxnRequest.Now mentioned that the header timestamp couldn't be used because it "does not necessarily advance with the node clock across retries and hence cannot detect abandoned transactions." This dates back all the way to cockroachdb#1102. I haven't been able to piece together what kind of retries this refers to, but I'm almost positive that they no longer apply.

Release note: None
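The safety argument above reduces to an ordering invariant: the leaseholder must ratchet its clock up to the request's timestamp before evaluation, so that any timestamp cache entry the push creates is covered by the local clock and cannot be missed by a subsequent lease transfer. A hedged sketch of that ordering with made-up names (clock, tsCache, receiveRequest), not the actual CockroachDB code:

```go
package main

import "fmt"

// Minimal stand-ins used only to illustrate the ordering invariant.
type clock struct{ wall int64 }

// Update ratchets the local clock forward to an observed timestamp.
func (c *clock) Update(ts int64) {
	if ts > c.wall {
		c.wall = ts
	}
}

type tsCache struct{ high int64 }

// Add records activity up to ts in the timestamp cache.
func (t *tsCache) Add(ts int64) {
	if ts > t.high {
		t.high = ts
	}
}

// receiveRequest shows the required ordering: update the clock first, then
// evaluate, which may create a timestamp cache entry up to the request time.
// Because step 1 precedes step 2, the cache entry is always <= the clock, so
// a lease transfer that starts the new timestamp cache at the old clock
// reading does not lose the push's effect.
func receiveRequest(c *clock, tc *tsCache, reqTS int64) {
	c.Update(reqTS) // step 1: clock now covers reqTS
	tc.Add(reqTS)   // step 2: entry is guaranteed to be <= the local clock
	fmt.Printf("clock=%d tsCache=%d\n", c.wall, tc.high)
}

func main() {
	c, tc := &clock{wall: 100}, &tsCache{}
	receiveRequest(c, tc, 150) // clock=150 tsCache=150
}
```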
previously args.Timestamp was used, but there are situations
in which that never changes, meaning that infinite retry loops
on an abandoned Txn descriptor would ensue in storage/store,
for instance in #877 and the bank example.
This needs a test. It should be fairly straightforward, so if anybody wants to help out with the debugging bonanza, that would be appreciated (but you probably want to wait for feedback to see whether this will actually land).
After this change, the following hums along relatively nicely on a three node cluster:
/bank --num-accounts=2 --num-parallel-transfers=2 --db-name=http://root@localhost:8001
Every couple of minutes, one of the writers' transactions will still deadlock, but it times out after 10 seconds (after which the process resumes). When that happens, the busy retry loop runs against it approximately 14,000 times until the 10s are over (yeah, it's really busy: 0s backoff, I think). I've checked all the pushes in such an example, and none of them go through, since the pushee's priority is pretty high:
So we only have to figure out why the occasional transaction in the bank example doesn't finish; it clearly seems to be pretty high priority-wise. I've checked the logs, and it doesn't seem to have ever pushed anyone unsuccessfully during my run: