Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

supply local time to InternalPushTxn when initiating the request #1102

Merged
merged 3 commits into from
May 26, 2015

Conversation

tbg
Copy link
Member

@tbg tbg commented May 21, 2015

previously args.Timestamp was used, but there are situations
in which that never changes, meaning that infinite retry loops
on an abandoned Txn descriptor would ensue in storage/store,
for instance in #877 and the bank example.

This needs a test. Should be fairly straightforward, so if anybody wants to help out the debugging bonanza, that would be appreciated (but probably want to wait for feedback to see if this would actually land).

After this change, the following hums along relatively nicely on a three node cluster:

/bank --num-accounts=2 --num-parallel-transfers=2 --db-name=http://root@localhost:8001

Every couple of minutes one of the writers' transactions will still deadlock, but times out after
10 seconds (after which process resumes). When that happens, the busy retry loop will run against it
approximately 14000 times until the 10s are over (yeah, it's really busy - 0s backoff, I think). I've checked all the pushes in such an example, and none of them go through since the priority is pretty high:

# count, pusher prio, pushee prio
6297 21781967 < 2132569132
   1 271073717 < 2132569132
3816 1372547470 < 2132569132
3815 2132569131 < 2132569132

So we only have to figure out why the occasional transaction in the bank example doesn't finish - it clearly seems to be pretty high priority wise. I've checked the logs, it doesn't seem to ever have unsuccessfully pushed someone during my run:

6297 21781967 < 2132569132
   1 271073717 < 2132569132
3816 1372547470 < 2132569132
3815 2132569131 < 2132569132

@tbg tbg added the PTAL label May 21, 2015
expiry := args.Timestamp
// Compute heartbeat expiration (all replicas must see the same result).
expiry := args.Now // caller can set this to his wall time.
expiry.Forward(args.Timestamp) // if Now is not set, fallback
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to send a push txn request without setting Now? It might be simpler to require that Now be set in all cases. (On the other hand, this does ensure that we use max(Now, Timestamp) if there is any risk that the pusher's clock could be behind)

@bdarnell
Copy link
Contributor

LGTM. Do we also need to set Now when pushing from gc_queue.go? (which looks like the only other non-test use of InternalPushTxnRequest)

tbg added 2 commits May 26, 2015 14:51
previously args.Timestamp was used, but there are situations
in which that never changes, meaning that infinite retry loops
on an abandoned Txn descriptor would ensue in storage/store,
for instance in cockroachdb#877 and the bank example.
@tbg
Copy link
Member Author

tbg commented May 26, 2015

I've made Now mandatory and updated the code. PTAL.

{nil, 0, proto.PUSH_TIMESTAMP, false},
{nil, 0, proto.ABORT_TXN, false},
{nil, 0, proto.CLEANUP_TXN, false},
{nil, 1, proto.PUSH_TIMESTAMP, false}, // using 0 is awkward
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment won't make sense in the future: using 0 for what?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to the comment.

@bdarnell
Copy link
Contributor

LGTM

tbg added a commit that referenced this pull request May 26, 2015
supply local time to InternalPushTxn when initiating the request
@tbg tbg merged commit c562014 into cockroachdb:master May 26, 2015
@tbg tbg removed the PTAL label May 26, 2015
@tbg tbg deleted the push_expired branch May 26, 2015 20:14
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 1, 2019
This commit deprecates PushTxnRequest.Now and gives its responsibility
to the batch header timestamp. The biggest reason to do this is so that
PushTxn requests properly update their receiver's clock. This is critical
because a PushTxn request can result in a timestamp cache entry to be
created with a value up to this time, so for safety, we need to ensure
that the leaseholder updates its clock at least to this time _before_
evaluating the request. Otherwise, a lease transfer could miss the
request's effect on the timestamp cache and result in a lost push/abort.

The comment on PushTxnRequest.Now mentioned that the header timestamp
couldn't be used because the header's timestamp "does not necessarily
advance with the node clock across retries and hence cannot detect
abandoned transactions." This dates back all the way to cockroachdb#1102. I haven't
been able to piece together what kind of retries this is referring to,
but I'm almost positive that they don't still apply.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 2, 2019
This commit deprecates PushTxnRequest.Now and gives its responsibility
to the batch header timestamp. The biggest reason to do this is so that
PushTxn requests properly update their receiver's clock. This is critical
because a PushTxn request can result in a timestamp cache entry to be
created with a value up to this time, so for safety, we need to ensure
that the leaseholder updates its clock at least to this time _before_
evaluating the request. Otherwise, a lease transfer could miss the
request's effect on the timestamp cache and result in a lost push/abort.

The comment on PushTxnRequest.Now mentioned that the header timestamp
couldn't be used because the header's timestamp "does not necessarily
advance with the node clock across retries and hence cannot detect
abandoned transactions." This dates back all the way to cockroachdb#1102. I haven't
been able to piece together what kind of retries this is referring to,
but I'm almost positive that they don't still apply.

Release note: None
pav-kv pushed a commit to pav-kv/cockroach that referenced this pull request Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants