`FOR UPDATE` throws a single `WriteTooOldError` with many parallel connections #77119
Comments
Hello, I am Blathers. I am here to help you get the issue triaged. Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here. I was unable to automatically find someone to ping. If we have not gotten back to your issue within a few business days, you can try the following:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
cc @nvanbenschoten to triage - I think you might have the most knowledge on what's expected here.
`FOR UPDATE` provides mutual exclusion over the locked rows for the duration of the locking transaction. Once that transaction commits, the lock is removed. If I understand this example correctly, you are running N concurrent transactions that all try to lock the same row. The expectation is that one txn will hold the lock at a time and that the others will block until the lock is released. This means that the transactions will be serialized, but that they should all complete successfully.

@Nican is this issue questioning why any transaction hits a WriteTooOld error, or why only one transaction hits a WriteTooOld error? In other words, is your expectation that 300/300 txns should have succeeded, or 1/300 txns should have succeeded? Also, how many nodes are in this cluster?
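For concreteness, here is a minimal sketch of the per-worker locking transaction being described, assuming Npgsql and Dapper as in the reporter's snippets; the method name, connection handling, and row values are illustrative and not taken from the gist:

```csharp
using System.Threading.Tasks;
using Dapper;
using Npgsql;

// Sketch: one worker's transaction. SELECT ... FOR UPDATE blocks until any
// other transaction holding the row lock commits or aborts, so concurrent
// workers are serialized but should all eventually succeed.
async Task DecrementQuantityAsync(string connectionString)
{
    await using var conn = new NpgsqlConnection(connectionString);
    await conn.OpenAsync();
    await using var tx = await conn.BeginTransactionAsync();

    // Acquire the row lock.
    await conn.QuerySingleAsync<int>(
        "SELECT quantity FROM products WHERE id = 1 FOR UPDATE", transaction: tx);

    // Mutate while still holding the lock, then commit (which releases it).
    await conn.ExecuteAsync(
        "UPDATE products SET quantity = quantity - 1 WHERE id = 1", transaction: tx);

    await tx.CommitAsync();
}
```

Calling this from N concurrent tasks should serialize the workers on the row lock rather than fail any of them.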
@nvanbenschoten Thanks for taking the time to reply. This is a 1-node insecure cluster that I use for testing purposes. I expect all 300 workers to succeed. And sometimes all 300 workers do succeed, but other times 1 of the 300 workers will fail. I expect `FOR UPDATE` to serialize the waiting transactions rather than fail one of them.
@Nican The way to look at this is that `FOR UPDATE` locks are currently best-effort: they can be lost during certain cluster or range events, in which case a waiting transaction may hit a WriteTooOldError instead of acquiring the lock.
Once again @nvanbenschoten, thanks for the reply. This failure happens a little more often than just a cluster or range event. This is a test table with probably <100 KB of data, on a single-node environment. I just tested how often the lock fails, and I think I found some additional behavior that can help narrow this down. Out of running it 100 times, trying to lock the row with 10 parallel connections each time, it failed to preserve the lock correctly a total of 21 times.

I wrote the code as follows:

```csharp
for (int i = 0; i < 100; i++)
{
    await conn.QueryAsync("DROP TABLE IF EXISTS products");
    await conn.QueryAsync("CREATE TABLE products (id INT PRIMARY KEY, quantity INT NOT NULL)");
    await conn.QueryAsync($"INSERT INTO products(id,quantity) VALUES (1,{quantity})");
    // Run parallel FOR UPDATE test
}
```

(Success: 79 out of 100)

But if I write the code as follows, using `DELETE` instead of recreating the table on every iteration:

```csharp
await conn.QueryAsync("DROP TABLE IF EXISTS products");
await conn.QueryAsync("CREATE TABLE products (id INT PRIMARY KEY, quantity INT NOT NULL)");

for (int i = 0; i < 100; i++)
{
    await conn.QueryAsync("DELETE FROM products");
    await conn.QueryAsync($"INSERT INTO products(id,quantity) VALUES (1,{quantity})");
    // Run parallel FOR UPDATE test
}
```

(Success: 99 or 100 out of 100 - sometimes the first one may fail)

There seems to be some narrow case about using `CREATE TABLE`.

Even writing the code as follows, with a short delay after creating the table, seems to bring the success rate to near perfect:

```csharp
for (int i = 0; i < 100; i++)
{
    await conn.QueryAsync("DROP TABLE IF EXISTS products");
    await conn.QueryAsync("CREATE TABLE products (id INT PRIMARY KEY, quantity INT NOT NULL)");
    await Task.Delay(TimeSpan.FromMilliseconds(300)); // Wait 300ms
    await conn.QueryAsync($"INSERT INTO products(id,quantity) VALUES (1,{quantity})");
    // Run parallel FOR UPDATE test
}
```

(Success: 95 out of 100)

You may close this issue if you wish, as it is a pretty narrow case.
Hi @Nican, those experiments are very helpful! I think what's going on is that after you create a new table, there is an asynchronous range split event that is triggered to place the table in its own range. If you start acquiring locks before this range split, those will be lost during the split. This is mentioned in #75456, where a range split is considered one of the range events that currently lose best-effort unreplicated locks. Your experiment to `DELETE` rows from the table instead of creating a new table points to this. So does your experiment to add a delay between table creation time and testing time, which gives the table time to split before you begin acquiring unreplicated locks.
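As a purely illustrative consolidation of that workaround (the 300 ms pause comes from the experiment above; the helper name and the parameterized insert are assumptions of this sketch, not an official recommendation):

```csharp
using System;
using System.Threading.Tasks;
using Dapper;
using Npgsql;

// Sketch: set up the test table, then give the asynchronous range split that
// follows CREATE TABLE a moment to land before any FOR UPDATE locks are taken,
// so the best-effort unreplicated locks are not dropped by the split.
async Task SetupProductsAsync(NpgsqlConnection conn, int quantity)
{
    await conn.ExecuteAsync("DROP TABLE IF EXISTS products");
    await conn.ExecuteAsync("CREATE TABLE products (id INT PRIMARY KEY, quantity INT NOT NULL)");

    // Best-effort wait; there is no hard guarantee the split has completed.
    await Task.Delay(TimeSpan.FromMilliseconds(300));

    await conn.ExecuteAsync(
        "INSERT INTO products (id, quantity) VALUES (1, @quantity)", new { quantity });
}
```

The delay only makes it likely, not certain, that the split has completed before locking begins.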
Alright, I am going to go ahead and close the issue. Looking forward to #75456 getting fixed.
Describe the problem
I am running tests with `FOR UPDATE`, and I have reason to believe that under high loads, a narrow edge case causes the lock to be given out more than once.

I created a `products` table with `id` and `quantity` columns. I ran 10 parallel connections, such that a transaction starts, grabs a lock on the product row, decreases the `quantity` by 1, and then commits the transaction.

The behavior that I am noticing is that, regardless of whether I use 10 parallel connections or 300 parallel connections, only a single connection will fail with `WriteTooOldError`.

(A transaction is required in this use case because I also expect an insert into the `orderItems` table to be atomic.)

To Reproduce
I created sample code to debug the issue here: https://gist.github.com/Nican/5764d109e1ffc322d2f8b0ba88044fc1
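To give a rough picture without opening the gist, here is a condensed sketch of the shape of that reproduction; the gist above is authoritative, and the worker count, connection handling, and names below are assumptions:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Dapper;
using Npgsql;

// Sketch: N workers race to lock and decrement the same row. Because
// FOR UPDATE serializes them, the expectation is that all N succeed.
async Task RunWorkersAsync(string connectionString, int workers = 10)
{
    var tasks = Enumerable.Range(0, workers).Select(async id =>
    {
        await using var conn = new NpgsqlConnection(connectionString);
        await conn.OpenAsync();
        await using var tx = await conn.BeginTransactionAsync();

        await conn.QuerySingleAsync<int>(
            "SELECT quantity FROM products WHERE id = 1 FOR UPDATE", transaction: tx);
        await conn.ExecuteAsync(
            "UPDATE products SET quantity = quantity - 1 WHERE id = 1", transaction: tx);

        await tx.CommitAsync();
        Console.WriteLine($"worker {id} committed");
    }).ToArray();

    await Task.WhenAll(tasks);
}
```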
The code in the gist will sometimes succeed, and you may have to run it multiple times. Sometimes it will fail with the output:

or

Notice that with 10 connections or 300 connections, only a single request will fail with `WriteTooOldError`, and it is usually a pretty low `id` (an early connection).

Sample error message:
Expected behavior
I expect `FOR UPDATE` to always give out a single lock.

Environment:
Additional context
What was the impact?
Causing unnecessary restarts on transactions. I have fixed the issue with client-side retries, but I believe this is an issue that the CRDB team could look into.
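For completeness, a minimal sketch of that kind of client-side retry, assuming Npgsql (which surfaces CockroachDB's retryable errors as a PostgresException with SQLSTATE 40001); the attempt limit and backoff are arbitrary illustrative choices:

```csharp
using System;
using System.Threading.Tasks;
using Npgsql;

// Sketch: retry a transactional operation when CockroachDB asks the client to
// retry (SQLSTATE 40001, which covers WriteTooOldError-driven restarts).
async Task<T> WithRetriesAsync<T>(Func<Task<T>> attempt, int maxAttempts = 5)
{
    for (var i = 1; ; i++)
    {
        try
        {
            return await attempt();
        }
        catch (PostgresException e) when (e.SqlState == "40001" && i < maxAttempts)
        {
            // Back off briefly, then retry the whole transaction.
            await Task.Delay(TimeSpan.FromMilliseconds(50 * i));
        }
    }
}
```

The whole transaction, from BEGIN through COMMIT, should be re-run inside `attempt`, since a retryable error discards the prior attempt's work.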
Jira issue: CRDB-13417