Update statement does not always block when a different transaction has an exclusive lock for a row #88995
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
By the way, I'm running Cockroach like this, if that helps:
Also note that in most of my test runs, the transaction already fails at step 5, because it seems that transaction TX2 blocks when reading a row if TX1 has an exclusive lock on it.
Cockroach's select for update locks are best-effort. They are an optimization, not a guarantee. The reason is that the implementation uses non-replicated, in-memory locks: if a lease transfer for the underlying data occurs, the lock is lost. There is nothing fundamentally incorrect about not locking for SELECT FOR UPDATE, because other mechanisms will enforce serializability. When the locks fail, it just means that retries may be more common. #75456 tracks the issue to make the locks survive lease transfers.
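To make that failure mode concrete, here is a two-session sketch using the tbl1 table from the reproduction below; the lease-transfer timing in the comments is a hypothetical illustration, not something reported in the thread:

    -- Session 1 (TX1): take a best-effort, unreplicated row lock
    BEGIN;
    SELECT * FROM tbl1 WHERE id = 1 FOR UPDATE;

    -- (Hypothetically, the range lease moves to another node here,
    --  so the in-memory lock is dropped.)

    -- Session 2 (TX2): normally this would block on TX1's lock...
    BEGIN;
    UPDATE tbl1 SET value = 'changed' WHERE id = 1;
    -- ...but if the lock was lost, the UPDATE can proceed immediately.
    -- Serializability is still enforced by other mechanisms, at the cost of
    -- more frequent transaction retries.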
Let me ask a yes/no question to make it clear what this means. If I change the scenario to
Will CockroachDB always fail at step 7 with a non-serializable transaction error? I.e., will TX2 always fail?
In this case, it is possible for both of these transactions to succeed. If, however, ...
That's good to know. So this means the ... Just out of curiosity, is this still ACID compliant? Consider the following pseudo-program:
In every other database that respects locks, the possible end states are
Now you are telling me that CockroachDB also allows the following end state?
Doesn't this violate consistency/isolation somehow?
I'm confused as to how your outcome ... Let's assume ... If tx2 happens first, then you'll get outcome 2. If tx1 happens first, then you'll move to ...
Yeah sorry for that confusing example. During the construction of the example I kind of messed up. My expectation is that TX2 runs while TX1 is running. Something like this:
Not sure if I can come up with a good example here, but it's confusing that this last pseudo-program can "finish" with TX2 succeeding in CockroachDB but not in other DBs.
Thinking about this a bit more, the current behavior seems to allow multiple transactions to return the same rows for select for update queries, which surely must be problematic, right? What does the SQL spec say about the effects of for update?
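For illustration, a rough sketch of the interleaving being described, assuming the tbl1 table from the reproduction; whether the second locking read actually proceeds depends on whether the first lock survived (e.g. across a lease transfer):

    -- Session 1 (TX1)
    BEGIN;
    SELECT * FROM tbl1 WHERE id = 1 FOR UPDATE;  -- returns the row, takes an unreplicated lock

    -- Session 2 (TX2), while TX1 is still open
    BEGIN;
    SELECT * FROM tbl1 WHERE id = 1 FOR UPDATE;  -- expected to block; if TX1's lock was lost,
                                                 -- this also returns the same row
    -- At this point both transactions believe they have locked the row.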
You made us curious, so we did some digging. The 2008 SQL spec does not mention FOR UPDATE in the select grammar anywhere, so there's nothing to deviate from in terms of the spec itself. The spec also has a fun note about locking and determinism not being something it specifies. As it stands, you're absolutely right that this deviates from postgres, and that's not great in terms of user experience. @michae2 linked #88262, which seems like one relevant gap. I linked #75456, which is another. In a distributed and fault-tolerant system like cockroach, in order to ensure that the locks remain in the face of server failures, we'd need to replicate the read locks, which would have a cost. In today's cockroach, if you want to ensure the desired pessimism, you'll need to incur a real write-write conflict.
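For illustration, one way to read "incur a real write-write conflict" is to issue an actual write to the row instead of (or before) the locking read, so a replicated write intent is laid down; a sketch under that assumption, not a prescription from the thread:

    -- TX1: write the row (even a no-op update) so a replicated write intent exists
    BEGIN;
    UPDATE tbl1 SET value = value WHERE id = 1;
    -- ... rest of TX1's work ...

    -- TX2: this now conflicts with TX1's write intent and waits
    -- (or hits its statement_timeout) until TX1 commits or aborts
    BEGIN;
    UPDATE tbl1 SET value = 'changed' WHERE id = 1;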
Not sure what consequences a write-write conflict has, but what I generally want here is that TX2 blocks rather than fails. Not sure how accurate the following is, but it states that the for update clause is specified for cursors, so maybe you can find the definition in a different chapter:
It is specified for cursors; we looked at that. Technically, we're not talking about cursors here. Just as importantly, that specification doesn't mention blocking as far as I could find. It just talks about the operations supported on that cursor.
Ok, I guess my expectation/intuition that ... Regardless, I think that doing a ... Think of the following program:
On other databases, after a ...
I guess this is even more problematic when thinking about the ... Is it guaranteed that when two transactions execute an update statement for the same row, only one transaction will successfully return from that update statement? Or is it possible that both transactions appear to have updated the row and only one might succeed on commit? If both transactions can get past the update statement, then the locking scheme is not implementable with CockroachDB, not even if we emit actual updates instead of ...
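To make the question concrete, here is a sketch of the kind of mutual-exclusion scheme being described; the locks table and its columns are hypothetical, not taken from the thread:

    -- Two clients try to acquire a logical lock by updating the same row.
    -- The scheme only works if at most one of these UPDATE statements returns
    -- while the other blocks (or times out) until the first transaction ends.

    -- Client A
    BEGIN;
    UPDATE locks SET owner = 'A' WHERE name = 'job-42';
    -- ... A now assumes it exclusively owns 'job-42' and coordinates other work ...
    COMMIT;

    -- Client B (running concurrently)
    BEGIN;
    UPDATE locks SET owner = 'B' WHERE name = 'job-42';
    -- If this can also return before A finishes, both clients believe they hold
    -- the lock and the scheme breaks.
    COMMIT;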
Side note: there's also some related discussion in #13546 about ...
After some internal debate, @bdarnell has suggested that it could be considered a bug if you can have two UPDATE statements on the same row both return successfully while neither transaction has committed.

In general, the feeling is that while it's convenient to rely on the locking semantics of databases for mutual exclusion in order to coordinate with other systems (as you seem to desire), it's not bulletproof there either. For example, your client may become partitioned from the database and not learn that it has failed. Given that cockroach is a distributed system and other single-node systems are not, we also have to contend with asynchronous network partitions between other parts of the system. There is a tradeoff between providing the safety property you want and retaining system liveness in the face of such failure modes. Also, deadlocks can happen such that one query is aborted. Admittedly, both of those situations are rare. Another factor is that cockroach seeks to keep client sessions available even when nodes crash or go away.

We should, to the extent possible, provide the locking semantics you seek. We don't. It's not wrong for our definitions of correctness, but it is painful for the kinds of use you imagine. We have #88262 and #75456 as open issues covering common causes of locking failures. Neither of these will prevent the loss of locks in the face of un-graceful loss of nodes. In general, we're not eager to pay the price of replicating the lock metadata for every SELECT FOR UPDATE.

As @michae2 points out, while we have not implemented it, ideally we'd focus on making ...
Since the ... I guess that everyone would understand that transactions which do locking might fail when a network/cluster partition occurs while the transaction is running.
We have marked this issue as stale because it has been inactive for ...
I'm just a Hibernate ORM developer who wanted to report this (IMO subtle) behavior, so I personally don't care whether CockroachDB behaves like all the other SQL databases. Feel free to close the issue if you don't plan to fix this.
@beikov this issue was definitely worth reporting. After you filed this, we did some significant work around SELECT FOR UPDATE and lock replication. In 23.2 and 24.1, under read committed isolation, SELECT FOR UPDATE should now behave as it does in other databases: it should always block another SELECT FOR UPDATE or a mutation on the same row. In 23.2 and 24.1, under serializable isolation, using the following settings will also get the same behavior:
So under serializable isolation, with those settings, this should be fixed now.
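As a sketch of the behavior described, assuming a 23.2+ cluster where read committed isolation has been enabled (statements are illustrative, not taken from the thread):

    -- Session 1
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT * FROM tbl1 WHERE id = 1 FOR UPDATE;  -- takes a replicated lock on the row

    -- Session 2
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    UPDATE tbl1 SET value = 'changed' WHERE id = 1;
    -- blocks until session 1 commits or rolls back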
I'm going to close this for now, since this should be fixed in 23.2 when using those settings mentioned above. |
Describe the problem
Update statement does not always block when a different transaction has an exclusive lock for a row.
To Reproduce
1. Start a transaction TX1
2. In TX1, execute select * from tbl1 t where t.id = 1 for update of t to lock a single row
3. Start a transaction TX2
4. In TX2, execute SET statement_timeout TO 100
5. In TX2, execute select * from tbl1 t where t.id = 1
6. In TX2, execute update tbl1 t set t.value = 'changed' where t.id = 1
Expected behavior
TX2 should block at step 6, since TX1 has a lock on the row, and TX2 should run into a timeout. It seems, though, that "sometimes" the statement at step 6 succeeds.
As it is with timeout-based tests, this is hard to reproduce reliably, but I wanted to report it anyway, as I believe there might be a concurrency issue lurking in CockroachDB that you might be able to find by looking at the code with this example. The issue happened after a lot of other Hibernate test cases were executed, i.e. after a lot of statements like create table, insert, select, update, delete, drop table.
Environment:
Here are SQL log excerpts from the run:
Jira issue: CRDB-20070