database is locked during Rancher (moderate) stress test #8234

Closed
moio opened this issue Aug 24, 2023 · 12 comments · Fixed by k3s-io/kine#211

@moio
Contributor

moio commented Aug 24, 2023

Environmental Info:
K3s Version:
v1.24.12+k3s1

Node(s) CPU architecture, OS, and Version:
Linux lenovo 5.14.21-150400.24.60-default #1 SMP PREEMPT_DYNAMIC Wed Apr 12 12:13:32 UTC 2023 (93dbe2e) x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 server, no agents, k3d. Default SQLite store

Describe the bug:
"database is locked" errors are reliably produced within seconds in the Rancher logs while running a (moderate) stress test script. Such errors do not appear with embedded etcd (specifying --cluster-init).

Steps To Reproduce:

  1. Copy-paste the commands below (takes a few minutes):
# create k3d cluster
k3d cluster create k3s-test --image docker.io/rancher/k3s:v1.24.12-k3s1 --api-port 6550 --port 9443:443@server:0:direct --no-lb --k3s-arg --tls-san=k3s.local.gd@server:0

# install cert-manager
helm install --namespace=cert-manager cert-manager https://charts.jetstack.io/charts/cert-manager-v1.8.0.tgz --create-namespace --set-json=installCRDs=true

# install Rancher
helm install --namespace=cattle-system rancher https://releases.rancher.com/server-charts/latest/rancher-2.7.5.tgz --create-namespace --set-json='bootstrapPassword="admin",hostname="k3s.local.gd",replicas=1'

# wait for Rancher to be up
kubectl wait deployment/rancher --namespace cattle-system --for condition=Available=true --timeout=1h
  2. Visit https://k3s.local.gd:9443/, log in with the bootstrap password "admin", and set "adminadminadmin" as the first user's password
  3. Follow the logs in one terminal:
kubectl logs -f -n cattle-system $(kubectl get pods -n cattle-system --no-headers -l app=rancher | cut -d ' ' -f1) | grep "database is locked"
  4. Run the load test script (k6) in another terminal:
# download k6, check https://github.com/grafana/k6/releases for different OSs/archs 
wget https://github.com/grafana/k6/releases/download/v0.46.0/k6-v0.46.0-linux-amd64.tar.gz
tar xf ./k6-v0.46.0-linux-amd64.tar.gz

# download load test script
wget https://raw.githubusercontent.com/moio/scalability-tests/2a75c3fd130a20c842842dc8ce15d647c6963a1f/k6/create_roles_users.js

# run test
./k6-v0.46.0-linux-amd64/k6 run -e BASE_URL=https://k3s.local.gd:9443 -e USERNAME=admin -e PASSWORD=adminadminadmin -e ROLE_COUNT=10 -e USER_COUNT=1000 ./create_roles_users.js

Use k3d cluster delete k3s-test to delete the test environment after use.

Expected behavior:
No such log lines produced while running step 4.

Actual behavior:
Lines like the following appear:

2023/08/24 11:58:30 [ERROR] error syncing 'u-r2xmc': handler mgmt-auth-users-controller: rpc error: code = Unknown desc = database is locked, requeuing
        * rpc error: code = Unknown desc = database is locked

Additional context / logs:
I understand the reproducer is a bit convoluted, but it's been very reliable so far (and I minimized it to the extent I was able to).

I looked at past history and found that quite a few similar previous reports were closed because of staleness or suspected unrelated configuration issues - I hope this helps isolate the actual root cause.

Also note that, as suspected, adding --k3s-arg --cluster-init@server:0 to the command line in step 1 (thereby forcing embedded etcd instead of SQLite) makes the issue go away.

@moio
Contributor Author

moio commented Aug 24, 2023

I am available in case any further support on my part could help with investigation and resolution here!

@brandond
Member

brandond commented Aug 24, 2023

I am a little surprised that the database locked error is getting passed all the way back up the stack. It might be more appropriate for kine to retry internally when this happens instead of passing it back up, as it is not a normal etcd error that most clients are going to know how to respond to.
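
For illustration, the kind of internal retry hinted at above might look roughly like the following. This is a hypothetical sketch only, not kine's actual code; the helper name, retry count and backoff are illustrative assumptions.

// Hypothetical sketch: not kine's actual code. Helper name, retry count and
// backoff are illustrative assumptions.
package sqliteretry

import (
	"database/sql"
	"strings"
	"time"
)

// execWithRetry re-runs a statement that fails with SQLite's
// "database is locked" error a few times before surfacing the error.
func execWithRetry(db *sql.DB, query string, args ...interface{}) error {
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if _, err = db.Exec(query, args...); err == nil ||
			!strings.Contains(err.Error(), "database is locked") {
			return err
		}
		// Back off briefly before the next attempt.
		time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
	}
	return err
}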

What sort of underlying disk and filesystem are you running this on? Is the response time acceptable using etcd on the same hardware, or do you see latency warnings in the apiserver logs?

@moio
Contributor Author

moio commented Aug 24, 2023

I am using two local SSDs (one SCSI, one NVMe) in md-raid 0 on XFS.

Performance is adequate under etcd, as far as I can tell. I did not notice anything suspicious in the logs.

As for whether it's good to pass the error up the stack, I'm unsure. IIUC, the only way this error can happen is if at some point there are multiple open connections to the same database file at the same time (either from different goroutines or processes), with at least two writing at the same time. That is something I naively thought should never happen (in my mind: one kine - one process - one connection). Are there known cases/reasons why multiple connections can be open?

@brandond
Member

brandond commented Aug 24, 2023

There should never be concurrent access to the database from multiple processes unless you're doing something weird, like trying to run multiple copies of K3s backed by the same disk and path. I think you might be thinking about SQLITE_BUSY, which is not what you're getting - you're getting SQLITE_LOCKED, which is different, and can happen within a single process. go-sqlite3 does allow for multiple concurrent reads from different goroutines within the same process.

See the discussion at mattn/go-sqlite3#632 - we use WAL mode with a shared cache: https://github.com/k3s-io/kine/blob/7f6ba014aa711f83baf4b875181f8e44d5419d48/pkg/drivers/sqlite/sqlite.go#L60C2-L60C2

I suspect that you might be running into issues with the compact transaction locking the database for short periods while it cleans up old rows? That is to be expected, as the database would grow without bound if we didn't compact it periodically.
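
For reference, here is a minimal sketch (not the actual kine driver code; the file name and table are illustrative) of opening SQLite via mattn/go-sqlite3 with the journal and cache options mentioned above:

// Minimal sketch, assuming mattn/go-sqlite3; file name and table are illustrative.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

func main() {
	// _journal=WAL enables write-ahead logging; cache=shared lets connections
	// within this process share a page cache.
	db, err := sql.Open("sqlite3", "./state.db?_journal=WAL&cache=shared")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// database/sql maintains a connection pool, so several goroutines in this
	// one process can hold separate connections to the same file.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS kv (name TEXT PRIMARY KEY, value BLOB)`); err != nil {
		log.Fatal(err)
	}
}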

@moio
Contributor Author

moio commented Aug 25, 2023

I believe what we are seeing is actually SQLITE_BUSY, identified by the error string "database is locked" - and not SQLITE_LOCKED identified by the string "database table is locked":

    /* SQLITE_BUSY        */ "database is locked",
    /* SQLITE_LOCKED      */ "database table is locked",

Source: https://github.com/sqlite/sqlite/blob/5ecb51c260d49edbac05d7f0197af3d07560a188/src/main.c#L1601-L1602

So we should be in this case:

The SQLITE_BUSY result code indicates that the database file could not be written (or in some cases read) because of concurrent activity by some other database connection, usually a database connection in a separate process.
[...]
The SQLITE_BUSY result code differs from SQLITE_LOCKED in that SQLITE_BUSY indicates a conflict with a separate database connection, probably in a separate process, whereas SQLITE_LOCKED indicates a conflict within the same database connection

Source: https://www.sqlite.org/rescode.html

In this case I believe we can exclude the possibility of multiple processes acting on the same SQLite database - there is only one k3s process in the k3d container, and no shared volumes are mounted so the actual filesystem paths must be separate. OTOH, we do expect multiple open DB connections as automatically managed by sql.DB itself.

It turns out there is a busy-timeout retry mechanism, exposed as a SQLite PRAGMA, that can be set in the connection string. I wonder if that could help.
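
For example (a minimal sketch; the file name is illustrative and 30000 ms is just one possible value), the PRAGMA can be passed through go-sqlite3's connection string so that every pooled connection retries locked writes instead of failing immediately with SQLITE_BUSY:

// Minimal sketch, assuming mattn/go-sqlite3; file name and timeout value are illustrative.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// _busy_timeout=30000 makes SQLite retry for up to 30s when the database
	// is locked by another connection, rather than returning SQLITE_BUSY.
	dsn := "./state.db?_journal=WAL&cache=shared&_busy_timeout=30000"
	db, err := sql.Open("sqlite3", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The pragma is applied per connection; query it back to confirm it took effect.
	var ms int
	if err := db.QueryRow("PRAGMA busy_timeout").Scan(&ms); err != nil {
		log.Fatal(err)
	}
	log.Printf("busy_timeout = %d ms", ms)
}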

@brandond
Member

brandond commented Aug 25, 2023

It was unclear to me if the "usually in a separate process" comment indicated that this error is more likely to be caused by another process for architectural reasons, or if that is just a more frequently observed use pattern that causes that error.

If setting a timeout would help, it should be easy enough to see if that makes a difference:
k3s server '--datastore-endpoint=sqlite:///var/lib/rancher/k3s/server/db_state.db?_journal=WAL&cache=shared&_busy_timeout=30000'

@moio
Contributor Author

moio commented Aug 28, 2023

It was unclear to me if the "usually in a separate process" comment indicated that this error is more likely to be caused by another process for architectural reasons, or if that is just a more frequently observed use pattern that causes that error.

I believe it indicates a frequently observed use pattern. In this case there is just one process - I even reproduced it with a standalone kine just to double check.

I did try with --datastore-endpoint and I do not get any more "database is locked" errors in the logs, so it seems like it worked. By the way, performance did not decrease in my micro-benchmark - if anything it improved slightly. And as a side effect, fewer HTTP requests fail (that's only partially related to the errors investigated here, but still nice):

Without _busy_timeout=30000:

http_req_duration..............: avg=464.84ms min=4.54ms   med=199.52ms max=5.24s   p(90)=1.36s p(95)=1.91s
  { expected_response:true }...: avg=465.11ms min=7.33ms   med=200.41ms max=4.51s   p(90)=1.36s p(95)=1.91s
http_req_failed................: 0.24%  ✓ 15        ✗ 5998
     
http_req_duration..............: avg=412.92ms min=3.1ms   med=172.94ms max=6.11s   p(90)=1.2s  p(95)=1.75s
  { expected_response:true }...: avg=420.38ms min=5.5ms   med=176.8ms  max=6.11s   p(90)=1.23s p(95)=1.76s
http_req_failed................: 1.82%  ✓ 110       ✗ 590

With _busy_timeout=30000:

http_req_duration..............: avg=413.91ms min=6.56ms  med=181.85ms max=6.09s   p(90)=1.17s p(95)=1.64s
  { expected_response:true }...: avg=413.91ms min=6.56ms  med=181.85ms max=6.09s   p(90)=1.17s p(95)=1.64s
http_req_failed................: 0.00%   ✓ 0        ✗ 6013

http_req_duration..............: avg=417.16ms min=5.5ms   med=185.3ms  max=5.99s   p(90)=1.14s p(95)=1.67s
  { expected_response:true }...: avg=417.16ms min=5.5ms   med=185.3ms  max=5.99s   p(90)=1.14s p(95)=1.67s
http_req_failed................: 0.00%   ✓ 0        ✗ 6013

(note that the p(90) and p(95) numbers improved; moreover, http_req_failed is consistently zero)

JFI I also re-ran the test with embedded etcd (k3s server --cluster-init) and that's significantly faster - but that's for another issue:

http_req_duration..............: avg=108.33ms min=4.76ms  med=53.84ms  max=926.69ms p(90)=288.51ms p(95)=408.63ms
  { expected_response:true }...: avg=108.34ms min=4.76ms  med=53.84ms  max=926.69ms p(90)=288.53ms p(95)=408.77ms
http_req_failed................: 0.03%  ✓ 2         ✗ 6011

So, from all I can tell from this microbenchmark, setting _busy_timeout=30000 as the default would help and solve this problem.

Thanks for the guidance!

moio added a commit to moio/kine that referenced this issue Aug 28, 2023
Under stress, concurrent operations (notably while garbage collection is in progress) can result in failed calls with error SQLITE_BUSY (database is locked).

The SQLITE_BUSY result code is an expected outcome when the database file could not be written (or in some cases read) because of concurrent activity: https://www.sqlite.org/rescode.html

SQLite ships a busy timeout retry mechanism as a PRAGMA (https://www.sqlite.org/pragma.html#pragma_busy_timeout), that can be set via go-sqlite3's connection string.

That makes go-sqlite3 retry operations that fail with SQLITE_BUSY transparently to users.

Fixes k3s-io/k3s#8234
moio added a commit to moio/kine that referenced this issue Aug 28, 2023
@brandond
Member

@moio thanks for the research on this! Did you evaluate any different values for the timeout? 30 seconds was a somewhat arbitrary choice; I'm curious if there is a better value we could use as the default.

@moio
Contributor Author

moio commented Aug 29, 2023

Not really.

As you mentioned, the value is arbitrary - it's a halting problem, undecidable by definition.

My understanding is: the longer the timeout, the less likely SQLITE_BUSY is produced - so there is an incentive to set the value as high as possible. That's because there will always be combinations of high load, slow hardware and bad timing that make a query take a long time, and we do not want it to fail just because we were not patient enough.

OTOH, there can be pathological cases in which bugs result in particularly problematic usage patterns that cause more contention than necessary - and the only way for us to discover that is to see SQLITE_BUSY errors. That's an opposite incentive to set the value as low as possible.

FMPOV, I do not expect dramatic changes in access patterns at this point of maturity of k3s and kine - garbage collection is "a known quantity". 30s ought to be enough for the vast majority of cases, and the occasional error should be generally tolerated by software layers above.

I can only suggest keeping an eye on user reports and tweaking depending on what others see. I do not think it makes sense to over-fit a value to my current microbenchmark.

HTH!

brandond pushed a commit to k3s-io/kine that referenced this issue Aug 29, 2023
github-project-automation bot moved this from New to Done Issue in K3s Development Aug 29, 2023
@brandond
Member

brandond commented Aug 29, 2023

Reopening to track updating kine in K3s; didn't notice that the Kine PR used the Fixes keyword to auto-close this issue.

@brandond
Member

brandond commented Oct 19, 2023

This was bumped and backported in #8323

@moio are you able to retest on the September releases?

@moio
Contributor Author

moio commented Oct 20, 2023

I tested this on k3s v1.26.9-k3s1 / Rancher 2.7.6 and confirm this issue is fixed.

Good to close!

Thanks!

github-project-automation bot moved this from To Test to Done Issue in K3s Development Oct 20, 2023
gshilei pushed a commit to gshilei/kine that referenced this issue Oct 31, 2023