sql: TransactionStatusError: transaction deadline exceeded #18684
I believe that message means a table descriptor lease has expired. cc @vivekmenezes @LEGO |
@jordanlewis how frequent is the error? If it is frequent enough, it will hopefully be easy to test whether the lease expiring and re-acquisition is the issue, since I have an in-flight PR right now (relevant PR #18606, issue #17227). It is peculiar that it happens specifically with more than one thread. There are a few places in the lease acquisition where I can see race conditions causing starvation, but only if the lease is immediately released or maybe under heavy load. This could go one of two ways: the above PR fixes it because lease acquisition is taking too long, or it's a race condition causing a thread to be starved. I would think the former would also happen with one thread though, so it's less likely to be that. (scratch that on the race conditions, I was misunderstanding parts of the code 😅) |
This bug is happening because we are using a lease very close to its expiration deadline (we recently started doing that in #16842, which removed the notion of MinLeaseDuration). I suspect we are seeing this here because a transaction with SNAPSHOT ISOLATION is using a table descriptor with an unexpired lease but is having its COMMIT pushed beyond the lease expiration (which is what the txn expiration is set to). lego's PR will fix this because it will reacquire a lease well before it expires, and thus transactions will use leases with expirations further out into the future (like a minute). |
All transactions in TPCC are set to Serializable isolation. |
If you can log the txn.OrigTimestamp() and expiration in getTableVersion() after the call to AcquireByName(), and panic (instead of returning the error) with the deadline timestamp and the txn timestamp, we might get a better clue. Thanks! |
Of course, there is also the possibility that we should not be using OrigTimestamp() when acquiring leases. I think I need to ask around as to what API to use to get the COMMIT timestamp for a transaction as it's running. |
Turns out using a timestamp near the deadline, while valid when picked, is bad in general, because the transaction timestamp can be pushed in the background by another transaction, and a COMMIT will see the new pushed timestamp (when it reads the transaction record from the store) and return a violation of the deadline. So #18606 is a good fix for this issue. |
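A minimal sketch of the failure mode described above, assuming a toy model in which timestamps are plain integers and the deadline is compared inclusively. The `Txn` type and its `Push` and `Commit` methods are hypothetical and are not CockroachDB's actual API; they only illustrate how a timestamp that was valid when picked can be pushed past the deadline before COMMIT.

```go
package main

import "fmt"

// Timestamp is a simplified logical timestamp (assumption: the real
// system uses hybrid logical clocks, not plain integers).
type Timestamp int64

// Txn is a toy transaction record (hypothetical type).
type Txn struct {
	Timestamp Timestamp // provisional commit timestamp; may be pushed forward
	Deadline  Timestamp // taken from the table descriptor lease expiration
}

// Push moves the timestamp forward, as a conflicting transaction might do
// in the background without the client noticing.
func (t *Txn) Push(to Timestamp) {
	if to > t.Timestamp {
		t.Timestamp = to
	}
}

// Commit fails if the (possibly pushed) timestamp is past the deadline.
func (t *Txn) Commit() error {
	if t.Timestamp > t.Deadline {
		return fmt.Errorf("transaction deadline exceeded: ts=%d deadline=%d",
			t.Timestamp, t.Deadline)
	}
	return nil
}

func main() {
	// The timestamp is picked close to the deadline, but is still valid.
	txn := &Txn{Timestamp: 98, Deadline: 100}
	fmt.Println("valid when picked:", txn.Timestamp <= txn.Deadline)

	// Another transaction pushes it past the deadline before COMMIT.
	txn.Push(105)
	fmt.Println("commit:", txn.Commit())
}
```

In this model, picking a timestamp far from the deadline only makes the required push larger; it does not remove the failure.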
@LEGO it's pretty easy to reproduce that error with TPCC. Perhaps you could try it out on your branch? Instructions:
|
#18606 seems like an improvement to things but ideally we wouldn't rely on some async process to ensure that our transactions can commit. |
Prior to #16842, we checked the time left on the lease and would (synchronously) renew it if there was less than a minute left. Why was this removed? Without it, it's easy to run into this error because there's no guarantee that there's a useful amount of time left on the lease. This is a regression compared to 1.0, so I'm tagging this for 1.1. I think we need both MinLeaseDuration and #18606: synchronously renew the lease when it's too close to expiring, and asynchronously renew it somewhat before it reaches that point. Additionally, to support arbitrarily-long-running transactions, we would need some way for the renewed lease to be fed into the coordinator (and from there probably written to the transaction record via HeartbeatTxn). That's much lower priority, though. |
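The two-tier policy proposed above can be sketched as follows. This is only an illustration: `decideRenewal`, `RenewalAction`, and both thresholds are hypothetical names and values (the one-minute figure comes from the comment above; the two-minute async window is an assumption), not code from the lease manager.

```go
package main

import (
	"fmt"
	"time"
)

// RenewalAction says what to do with a lease when a transaction wants to use it.
type RenewalAction int

const (
	UseAsIs    RenewalAction = iota // plenty of time left
	RenewAsync                      // usable now, refresh in the background
	RenewSync                       // too close to expiring, block and renew
)

func (a RenewalAction) String() string {
	return [...]string{"use-as-is", "renew-async", "renew-sync"}[a]
}

const (
	// minLeaseDuration mirrors the one-minute check that existed before #16842.
	minLeaseDuration = time.Minute
	// asyncRenewWindow is an assumed value; it only needs to exceed minLeaseDuration.
	asyncRenewWindow = 2 * time.Minute
)

// decideRenewal picks an action from the time remaining on the lease.
func decideRenewal(remaining time.Duration) RenewalAction {
	switch {
	case remaining < minLeaseDuration:
		return RenewSync
	case remaining < asyncRenewWindow:
		return RenewAsync
	default:
		return UseAsIs
	}
}

func main() {
	for _, left := range []time.Duration{5 * time.Minute, 90 * time.Second, 10 * time.Second} {
		fmt.Printf("%v left: %v\n", left, decideRenewal(left))
	}
}
```

With the async window wider than the sync threshold, the synchronous path should only trigger when the background renewal has fallen behind.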
Third, I'd like to respond to Ben's message:
What you say is true, but just in case we're not all on the same page, note that this deadline does not prevent "arbitrarily-long-running transactions". It just prevents arbitrarily-big-pushes. You can have a transaction run arbitrarily long without changing its timestamp (and the deadline is checked against the txn timestamp, not the clock). So I don't think that what we want to do is synchronously renew leases (or, rather, I'd ask you when do you renew the lease - when it's close to expiring or when the txn in question has been pushed close to the deadline, or even beyond it?). In all this discussion, keep in mind that the client may not even know that the transaction has been pushed.
Another thing I want to note is that, for a transaction to be able to commit at a particular timestamp, the gateway doesn't necessarily need to have valid leases on the descriptors that it used. Remember from the newer version of the descriptors and leases RFC that the conditions sufficient for a transaction to commit are divorced from the leases; they're about the validity window of the descriptors it used.
How about this: we augment the
Otherwise, what you're hinting at around renewing leases in the background and updating txn deadlines in the background sounds good - it'd prevent the "deadline error" from happening in most cases. But note that a lease expiration is not trivially translatable into a txn deadline: the deadline for a txn is set based on a set of descriptors, not just one (and also, again, based on validity windows which may not be related to leases). |
Agreed.
I'd only synchronously get the lease when it is first accessed. After that we can asynchronously renew before expiration for as long as the transaction's open.
I don't like the idea of retrying after an EndTransaction. Per comments in the code, we'd like to formally abort the transaction when this happens (but this is currently prevented by other issues). If we did that, we wouldn't be able to retry. Maybe we just change the comments and guarantee that attempting to commit with a failed deadline is safe, but this seems subtle to me. |
The comments you're referring to were written (presumably) under the belief that what we want to do when a transaction's deadline was exceeded is to abort the transaction. I don't think they should be used to inform decisions in the situation where we decide that we, in fact, do not want to abort. |
Merged #18824 which will help resolve the immediate issue at hand, the |
This issue has not been resolved by #18824; your PR has only reduced the incidence of the problem (a txn pushed sufficiently will still encounter the same error; you've just made it so that the required push is now expected to be larger than before). We can remove the 1.1 milestone, but otherwise the issue is still valid. |
…iration Before this patch, if a Serializable txn was pushed and the push caused the txn to exceed its deadline, the client would get a "deadline exceeded" error back. What the client should have gotten is the retriable error, as that would allow it to retry. This patch fixes this by inverting the order of error generation. As a result, Serializable txns shouldn't see "deadline exceeded" errors any more. Touches cockroachdb#18684
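A rough sketch of what "inverting the order of error generation" means, under the assumption that both conditions are checked at commit time. The `txn` struct, the two check functions, and the sentinel errors are hypothetical stand-ins, not the code touched by the patch.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRetry    = errors.New("retriable: transaction was pushed, restart")
	errDeadline = errors.New("transaction deadline exceeded")
)

// txn is a toy model of the state inspected at commit time (hypothetical).
type txn struct {
	origTS       int64 // timestamp the txn started at
	ts           int64 // current (possibly pushed) timestamp
	deadline     int64
	serializable bool
}

// checkCommitOld models the behavior before the patch: the deadline check
// runs first, so a pushed Serializable txn gets the opaque error.
func checkCommitOld(t txn) error {
	if t.ts > t.deadline {
		return errDeadline
	}
	if t.serializable && t.ts != t.origTS {
		return errRetry
	}
	return nil
}

// checkCommitNew models the behavior after the patch: the push is detected
// first, so the client gets an error it can retry on.
func checkCommitNew(t txn) error {
	if t.serializable && t.ts != t.origTS {
		return errRetry
	}
	if t.ts > t.deadline {
		return errDeadline
	}
	return nil
}

func main() {
	pushed := txn{origTS: 10, ts: 120, deadline: 100, serializable: true}
	fmt.Println("old order:", checkCommitOld(pushed))
	fmt.Println("new order:", checkCommitNew(pushed))
}
```

A non-Serializable txn in this model still reaches the deadline check, which is consistent with the patch description limiting its claim to Serializable txns.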
One plan here is to always set the deadline at commit time so that the transaction uses the deadline of a later refreshed lease with an updated (refreshed) deadline that is more likely to succeed. This can be achieved by ensuring that the transaction Commit API always uses a deadline instead of making the deadline a property of the transaction itself. At commit time the sql layer can call the API with the latest refreshed deadline for all the descriptors used in the transaction. The above will also get rid of needing to pass the transaction as a flag to the schema resolver API. |
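The plan above could look roughly like this. It is a sketch under assumptions: `leaseManager`, `commitDeadline`, and `commit` are hypothetical names, timestamps are plain integers, and a descriptor with no recorded lease is treated as already expired.

```go
package main

import (
	"fmt"
	"math"
)

type descriptorID int

// leaseManager tracks the latest lease expiration per descriptor (hypothetical).
type leaseManager struct {
	expirations map[descriptorID]int64
}

// commitDeadline returns the earliest expiration among the descriptors the
// transaction used, read at commit time so that background refreshes count.
// A descriptor with no entry yields 0, so the commit fails conservatively.
func (m *leaseManager) commitDeadline(used []descriptorID) int64 {
	deadline := int64(math.MaxInt64)
	for _, id := range used {
		if exp := m.expirations[id]; exp < deadline {
			deadline = exp
		}
	}
	return deadline
}

// commit takes the deadline as an argument instead of reading it from
// transaction state fixed when the descriptors were first resolved.
func commit(commitTS, deadline int64) error {
	if commitTS > deadline {
		return fmt.Errorf("transaction deadline exceeded: ts=%d deadline=%d",
			commitTS, deadline)
	}
	return nil
}

func main() {
	lm := &leaseManager{expirations: map[descriptorID]int64{1: 100, 2: 300}}
	used := []descriptorID{1, 2}

	deadlineAtStart := lm.commitDeadline(used) // fixed early: 100
	lm.expirations[1] = 400                    // lease refreshed in the background

	commitTS := int64(150)
	fmt.Println("deadline fixed at start:", commit(commitTS, deadlineAtStart))
	fmt.Println("deadline read at commit:", commit(commitTS, lm.commitDeadline(used)))
}
```

Taking the minimum over all used descriptors reflects the earlier point in this thread that a txn deadline is derived from a set of descriptors, not just one.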
FWIW, I just hit this in a local run of the |
@danhhz any chance you can add the error message? It now provides more information. Thanks. |
Yup. I actually saw this twice. Here are both:
|
Thanks. This surely indicates to me that I need to first extend the deadline (reducing the chance of this error) before making this a retryable error. |
Before this patch, they were opaque TransactionStatusErrors. The belief is that we should only be seeing such errors when a transaction is pushed by minutes. Shockingly, this seems to happen enough in our tests, for example as described here: cockroachdb#18684 (comment) This patch marks the error as retriable, since it technically is. Touches cockroachdb#18684 Release note (sql change): "transaction deadline exceeded" errors are now returned to the client with a retriable code.
Before this patch, they were opaque TransactionStatusErrors. The belief is that we should only be seeing such errors when a transaction is pushed by minutes. Shockingly, this seems to happen enough in our tests, for example as described here: cockroachdb#18684 (comment) This patch marks the error as retriable, since it technically is. This patch also changes the semantics of the EndTransactionRequest.Deadline field to make it exclusive so that it matches the nature of SQL leases. No migration needed. Touches cockroachdb#18684 Release note (sql change): "transaction deadline exceeded" errors are now returned to the client with a retriable code.
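To illustrate the change to an exclusive deadline, here is a small sketch. It assumes that a lease covers the half-open interval up to, but not including, its expiration, and that the previous check was inclusive; both function names are hypothetical.

```go
package main

import "fmt"

// withinDeadlineInclusive models the assumed earlier semantics: committing
// exactly at the deadline is allowed.
func withinDeadlineInclusive(commitTS, deadline int64) bool {
	return commitTS <= deadline
}

// withinDeadlineExclusive models the semantics described in the patch: the
// deadline itself is no longer a valid commit timestamp, matching a lease
// that is valid only strictly before its expiration.
func withinDeadlineExclusive(commitTS, deadline int64) bool {
	return commitTS < deadline
}

func main() {
	const leaseExpiration = 100
	for _, ts := range []int64{99, 100, 101} {
		fmt.Printf("ts=%d inclusive=%v exclusive=%v\n", ts,
			withinDeadlineInclusive(ts, leaseExpiration),
			withinDeadlineExclusive(ts, leaseExpiration))
	}
}
```

The two checks differ only when the commit timestamp equals the deadline.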
35284: storage,kv: make transaction deadline exceeded errors retriable r=andreimatei a=andreimatei Before this patch, they were opaque TransactionStatusErrors. The belief is that we should only be seeing such errors when a transaction is pushed by minutes. Shockingly, this seems to happen enough in our tests, for example as described here: #18684 (comment) This patch marks the error as retriable, since it technically is. This patch also changes the semantics of the EndTransactionRequest.Deadline field to make it exclusive so that it matches the nature of SQL leases. No migration needed. Touches #18684 Release note (sql change): "transaction deadline exceeded" errors are now returned to the client with a retriable code.
35793: storage: fix TestRangeInfo flake and re-enable follower reads by default r=ajwerner a=ajwerner This PR addresses a test flake introduced by enabling follower reads in conjunction with #35130 which makes follower reads more generally possible in the face of lease transfer. Fixes #35758. Release note: None
35865: roachtest: Skip flaky jepsen nemesis r=tbg a=bdarnell See #35599 Release note: None
Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Ben Darnell <[email protected]>
@andreimatei I've confirmed that the tpcc tests in the presence of schema changes where schema versions are changed will retry transactions when they see this error. I think we should close this issue. |
Well, there's a lot of history in here about how we shouldn't be pushing transactions over their deadline and also an analysis of very long transactions: #18684 (comment) If we close this, we should be opening separate issues for those, I think. |
I don't think this issue is closeable until random workloads don't see "transaction deadline exceeded" anymore. #35337 (comment) |
@andreimatei I've created an issue for early detection of the deadline exceeded condition. The other long running transaction analysis is no longer applicable. @jordanlewis we are no longer seeing test flakes. Closing this issue! |
Could it be true? The troll is slain? Congratulations! |
@nvanbenschoten do you want to extract the intent resolving pathology in #18684 (comment) into a separate issue? |
Pulled into #36876. |
When running TPCC with more than a single thread, occasionally a transaction will fail with the following error:
pq: TransactionStatusError: transaction deadline exceeded
What is this about? What is the referenced deadline? It seems like a bug that this message is propagated to the user, but what internal condition is causing this scenario and why?
Any ideas? cc @andreimatei @bdarnell