database is locked during Rancher (moderate) stress test #8234

Closed
moio opened this issue Aug 24, 2023 · 12 comments · Fixed by k3s-io/kine#211

@moio
Contributor

moio commented Aug 24, 2023

Environmental Info:
K3s Version:
v1.24.12+k3s1

Node(s) CPU architecture, OS, and Version:
Linux lenovo 5.14.21-150400.24.60-default #1 SMP PREEMPT_DYNAMIC Wed Apr 12 12:13:32 UTC 2023 (93dbe2e) x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 server, no agents, k3d. Default SQLite store

Describe the bug:
"database is locked" errors are reliably produced within seconds in the Rancher logs while running a (moderate) stress test script. Such errors do not appear with embedded etcd (specifying --cluster-init).

Steps To Reproduce:

  1. Copy-paste the commands below (takes a few minutes):
# create k3d cluster
k3d cluster create k3s-test --image docker.io/rancher/k3s:v1.24.12-k3s1 --api-port 6550 --port 9443:443@server:0:direct --no-lb --k3s-arg --tls-san=k3s.local.gd@server:0

# install cert-manager
helm install --namespace=cert-manager cert-manager https://charts.jetstack.io/charts/cert-manager-v1.8.0.tgz --create-namespace --set-json=installCRDs=true

# install Rancher
helm install --namespace=cattle-system rancher https://releases.rancher.com/server-charts/latest/rancher-2.7.5.tgz --create-namespace --set-json='bootstrapPassword="admin",hostname="k3s.local.gd",replicas=1'

# wait for Rancher to be up
kubectl wait deployment/rancher --namespace cattle-system --for condition=Available=true --timeout=1h
  2. Visit https://k3s.local.gd:9443/, log in with the bootstrap password "admin", and set "adminadminadmin" as the first user's password
  3. Follow the logs in one terminal:
kubectl logs -f -n cattle-system $(kubectl get pods -n cattle-system --no-headers -l app=rancher | cut -d ' ' -f1) | grep "database is locked"
  4. Run the load test script (k6) in another terminal:
# download k6, check https://github.com/grafana/k6/releases for different OSs/archs 
wget https://github.com/grafana/k6/releases/download/v0.46.0/k6-v0.46.0-linux-amd64.tar.gz
tar xf ./k6-v0.46.0-linux-amd64.tar.gz

# download load test script
wget https://raw.githubusercontent.com/moio/scalability-tests/2a75c3fd130a20c842842dc8ce15d647c6963a1f/k6/create_roles_users.js

# run test
./k6-v0.46.0-linux-amd64/k6 run -e BASE_URL=https://k3s.local.gd:9443 -e USERNAME=admin -e PASSWORD=adminadminadmin -e ROLE_COUNT=10 -e USER_COUNT=1000 ./create_roles_users.js

Use k3d cluster delete k3s-test to delete the test environment after use.

Expected behavior:
No such log lines produced while running step 4.

Actual behavior:
Lines like the following appear:

2023/08/24 11:58:30 [ERROR] error syncing 'u-r2xmc': handler mgmt-auth-users-controller: rpc error: code = Unknown desc = database is locked, requeuing
        * rpc error: code = Unknown desc = database is locked

Additional context / logs:
I understand the reproducer is a bit convoluted, but it's been very reliable so far (and I minimized it to the extent I was able to).

I looked at past history and found that quite a few similar previous reports were closed because of staleness or suspected unrelated configuration issues - I hope this helps isolate the actual root cause.

Also note that, as suspected, adding --k3s-arg --cluster-init@server:0 to the command line in step 1 (thereby forcing embedded etcd instead of SQLite) makes the issue go away.

@moio
Contributor Author

moio commented Aug 24, 2023

I am available in case any further support on my part could help with investigation and resolution here!

@brandond
Member

brandond commented Aug 24, 2023

I am a little surprised that the database locked error is getting passed all the way back up the stack. It might be more appropriate for kine to retry internally when this happens instead of passing it back up, as it is not a normal etcd error that most clients are going to know how to respond to.
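
For illustration, the kind of internal retry hinted at above might look roughly like the following. This is a hypothetical sketch only, not kine's actual code; the helper name, retry count and backoff are illustrative assumptions.

// Hypothetical sketch: not kine's actual code. Helper name, retry count and
// backoff are illustrative assumptions.
package sqliteretry

import (
	"database/sql"
	"strings"
	"time"
)

// execWithRetry re-runs a statement that fails with SQLite's
// "database is locked" error a few times before surfacing the error.
func execWithRetry(db *sql.DB, query string, args ...interface{}) error {
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if _, err = db.Exec(query, args...); err == nil ||
			!strings.Contains(err.Error(), "database is locked") {
			return err
		}
		// Back off briefly before the next attempt.
		time.Sleep(time.Duration(attempt+1) * 50 * time.Millisecond)
	}
	return err
}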

What sort of underlying disk and filesystem are you running this on? Is the response time acceptable using etcd on the same hardware, or do you see latency warnings in the apiserver logs?

@moio
Contributor Author

moio commented Aug 24, 2023

I am using two local SSDs (one SCSI, one NVMe) in md-raid 0 on XFS.

Performance is adequate under etcd, as far as I can tell. I did not notice anything suspicious in the logs.

As for whether it's good to pass the error up the stack, I'm unsure. IIUC, the only way this error can happen is if at some point there are multiple open connections to the same database file at the same time (either from different goroutines or processes), with at least two writing at the same time. That is something I naively thought should never happen (in my mind: one kine - one process - one connection). Are there known cases/reasons why multiple connections can be open?

@brandond
Member

brandond commented Aug 24, 2023

There should never be concurrent access to the database from multiple processes unless you're doing something weird, like trying to run multiple copies of K3s backed by the same disk and path. I think you might be thinking about SQLITE_BUSY, which is not what you're getting - you're getting SQLITE_LOCKED, which is different, and can happen within a single process. go-sqlite3 does allow for multiple concurrent reads from different goroutines within the same process.

See the discussion at mattn/go-sqlite3#632 - we use WAL mode with a shared cache: https://github.com/k3s-io/kine/blob/7f6ba014aa711f83baf4b875181f8e44d5419d48/pkg/drivers/sqlite/sqlite.go#L60C2-L60C2

I suspect that you might be running into issues with the compact transaction locking the database for short periods while it cleans up old rows? That is to be expected, as the database would grow without bound if we didn't compact it periodically.
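
For reference, here is a minimal sketch (not the actual kine driver code; the file name and table are illustrative) of opening SQLite via mattn/go-sqlite3 with the journal and cache options mentioned above:

// Minimal sketch, assuming mattn/go-sqlite3; file name and table are illustrative.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver
)

func main() {
	// _journal=WAL enables write-ahead logging; cache=shared lets connections
	// within this process share a page cache.
	db, err := sql.Open("sqlite3", "./state.db?_journal=WAL&cache=shared")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// database/sql maintains a connection pool, so several goroutines in this
	// one process can hold separate connections to the same file.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS kv (name TEXT PRIMARY KEY, value BLOB)`); err != nil {
		log.Fatal(err)
	}
}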

@moio
Contributor Author

moio commented Aug 25, 2023

I believe what we are seeing is actually SQLITE_BUSY, identified by the error string "database is locked" - and not SQLITE_LOCKED identified by the string "database table is locked":

    /* SQLITE_BUSY        */ "database is locked",
    /* SQLITE_LOCKED      */ "database table is locked",

Source: https://github.com/sqlite/sqlite/blob/5ecb51c260d49edbac05d7f0197af3d07560a188/src/main.c#L1601-L1602

So we should be in this case:

The SQLITE_BUSY result code indicates that the database file could not be written (or in some cases read) because of concurrent activity by some other database connection, usually a database connection in a separate process.
[...]
The SQLITE_BUSY result code differs from SQLITE_LOCKED in that SQLITE_BUSY indicates a conflict with a separate database connection, probably in a separate process, whereas SQLITE_LOCKED indicates a conflict within the same database connection

Source: https://www.sqlite.org/rescode.html

In this case I believe we can exclude the possibility of multiple processes acting on the same SQLite database - there is only one k3s process in the k3d container, and no shared volumes are mounted so the actual filesystem paths must be separate. OTOH, we do expect multiple open DB connections as automatically managed by sql.DB itself.

It turns out there is a busy-timeout retry mechanism, exposed as a SQLite PRAGMA, that can be set in the connection string. I wonder if that could help.
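
For example (a minimal sketch; the file name is illustrative and 30000 ms is just one possible value), the PRAGMA can be passed through go-sqlite3's connection string so that every pooled connection retries locked writes instead of failing immediately with SQLITE_BUSY:

// Minimal sketch, assuming mattn/go-sqlite3; file name and timeout value are illustrative.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// _busy_timeout=30000 makes SQLite retry for up to 30s when the database
	// is locked by another connection, rather than returning SQLITE_BUSY.
	dsn := "./state.db?_journal=WAL&cache=shared&_busy_timeout=30000"
	db, err := sql.Open("sqlite3", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The pragma is applied per connection; query it back to confirm it took effect.
	var ms int
	if err := db.QueryRow("PRAGMA busy_timeout").Scan(&ms); err != nil {
		log.Fatal(err)
	}
	log.Printf("busy_timeout = %d ms", ms)
}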

@brandond
Member

brandond commented Aug 25, 2023

It was unclear to me if the "usually in a separate process" comment indicated that this error is more likely to be caused by another process for architectural reasons, or if that is just a more frequently observed use pattern that causes that error.

If setting a timeout would help, it should be easy enough to see if that makes a difference:
k3s server '--datastore-endpoint=sqlite:///var/lib/rancher/k3s/server/db_state.db?_journal=WAL&cache=shared&_busy_timeout=30000'

@moio
Contributor Author

moio commented Aug 28, 2023

It was unclear to me if the "usually in a separate process" comment indicated that this error is more likely to be caused by another process for architectural reasons, or if that is just a more frequently observed use pattern that causes that error.

I believe it indicates a frequently observed use pattern. In this case there is just one process - I even reproduced it with a standalone kine just to double check.

I did try with --datastore-endpoint and I do not get any more "database is locked" errors in the logs, so it seems like it worked. By the way, performance did not decrease in my micro-benchmark - if anything it improved slightly. And as a side effect, fewer HTTP requests fail (that's only partially related to the errors investigated here, but still nice):

Without _busy_timeout=30000:

http_req_duration..............: avg=464.84ms min=4.54ms   med=199.52ms max=5.24s   p(90)=1.36s p(95)=1.91s
  { expected_response:true }...: avg=465.11ms min=7.33ms   med=200.41ms max=4.51s   p(90)=1.36s p(95)=1.91s
http_req_failed................: 0.24%  ✓ 15        ✗ 5998
     
http_req_duration..............: avg=412.92ms min=3.1ms   med=172.94ms max=6.11s   p(90)=1.2s  p(95)=1.75s
  { expected_response:true }...: avg=420.38ms min=5.5ms   med=176.8ms  max=6.11s   p(90)=1.23s p(95)=1.76s
http_req_failed................: 1.82%  ✓ 110       ✗ 590

With _busy_timeout=30000:

http_req_duration..............: avg=413.91ms min=6.56ms  med=181.85ms max=6.09s   p(90)=1.17s p(95)=1.64s
  { expected_response:true }...: avg=413.91ms min=6.56ms  med=181.85ms max=6.09s   p(90)=1.17s p(95)=1.64s
http_req_failed................: 0.00%   ✓ 0        ✗ 6013

http_req_duration..............: avg=417.16ms min=5.5ms   med=185.3ms  max=5.99s   p(90)=1.14s p(95)=1.67s
  { expected_response:true }...: avg=417.16ms min=5.5ms   med=185.3ms  max=5.99s   p(90)=1.14s p(95)=1.67s
http_req_failed................: 0.00%   ✓ 0        ✗ 6013

(note that the p(90) and p(95) numbers improved; moreover, http_req_failed is consistently zero)

JFI I also re-ran the test with embedded etcd (k3s server --cluster-init) and that's significantly faster - but that's for another issue:

http_req_duration..............: avg=108.33ms min=4.76ms  med=53.84ms  max=926.69ms p(90)=288.51ms p(95)=408.63ms
  { expected_response:true }...: avg=108.34ms min=4.76ms  med=53.84ms  max=926.69ms p(90)=288.53ms p(95)=408.77ms
http_req_failed................: 0.03%  ✓ 2         ✗ 6011

So, from all I can tell from this microbenchmark, setting _busy_timeout=30000 as the default would help and solve this problem.

Thanks for the guidance!

moio added a commit to moio/kine that referenced this issue Aug 28, 2023
Under stress, concurrent operations (notably while garbage collection is in progress) can result in failed calls with error SQLITE_BUSY (database is locked).

The SQLITE_BUSY result code is an expected outcome when the database file could not be written (or in some cases read) because of concurrent activity: https://www.sqlite.org/rescode.html

SQLite ships a busy timeout retry mechanism as a PRAGMA (https://www.sqlite.org/pragma.html#pragma_busy_timeout), that can be set via go-sqlite3's connection string.

That makes go-sqlite3 retry operations that fail with SQLITE_BUSY transparently to users.

Fixes k3s-io/k3s#8234
moio added a commit to moio/kine that referenced this issue Aug 28, 2023
@brandond
Member

@moio thanks for the research on this! Did you evaluate any different values for the timeout? 30 seconds was a somewhat arbitrary choice; I'm curious if there is a better value we could use as the default.

@moio
Contributor Author

moio commented Aug 29, 2023

Not really.

As you mentioned, the value is arbitrary - it's a halting problem, undecidable by definition.

My understanding is: the longer the timeout, the less likely SQLITE_BUSY is produced - so there is an incentive to set the value as high as possible. That's because there will always be combinations of high load, slow hardware and bad timing that make a query take a long time, and we do not want it to fail just because we were not patient enough.

OTOH, there can be pathological cases in which bugs result in particularly problematic usage patterns that cause more contention than necessary - and the only way for us to discover that is to see SQLITE_BUSY errors. That's an opposite incentive to set the value as low as possible.

FMPOV, I do not expect dramatic changes in access patterns at this point of maturity of k3s and kine - garbage collection is "a known quantity". 30s ought to be enough for the vast majority of cases, and the occasional error should be generally tolerated by software layers above.

I can only suggest keeping an eye on user reports and tweaking depending on what others see. I do not think it makes sense to over-fit a value to my current microbenchmark.

HTH!

brandond pushed a commit to k3s-io/kine that referenced this issue Aug 29, 2023
github-project-automation bot moved this from New to Done Issue in K3s Development Aug 29, 2023
@brandond
Member

brandond commented Aug 29, 2023

Reopening to track updating kine in K3s; didn't notice that the Kine PR used the Fixes keyword to auto-close this issue.

@brandond
Member

brandond commented Oct 19, 2023

This was bumped and backported in #8323

@moio are you able to retest on the September releases?

@moio
Contributor Author

moio commented Oct 20, 2023

I tested this on k3s v1.26.9-k3s1 / Rancher 2.7.6 and confirm this issue is fixed.

Good to close!

Thanks!

github-project-automation bot moved this from To Test to Done Issue in K3s Development Oct 20, 2023
gshilei pushed a commit to gshilei/kine that referenced this issue Oct 31, 2023