How to clean up a Kine sqlite DB? #213

I have successfully used K3s with a sqlite backend for 3.5 years now, running constantly on a Raspberry Pi 4 cluster. Just recently I started seeing severe performance issues with "Slow SQL" queries. Looking into the database, I see a huge number of revisions for the same objects; I don't think that many revisions of one object are expected to be kept (the leases all have the same 'create_revision' value, but different 'prev_revision' values).

My questions: Can I remove all but one revision, at least for the leases? And why are so many revisions kept, e.g. for the DaemonSets?

The overall sqlite DB is now 4.9 GB in size, with 640k rows in the table 'kine'.
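A quick way to see which keys are hoarding revisions is a grouping query against the `kine` table; this is only a diagnostic sketch, assuming the stock kine schema where every revision is stored as a row keyed by `name`:

```sql
-- Top 20 keys by number of stored revisions.
SELECT name, COUNT(*) AS revisions
FROM kine
GROUP BY name
ORDER BY revisions DESC
LIMIT 20;
```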
You didn't fill out the issue template, so I have to ask - what version of k3s are you on?
Sorry (but there was no issue template to select when creating a "New issue"). I'm running the latest stable version of k3s:
Running on Raspberry Pi OS, 64-bit.
Naively deleting all lease rows except the latest revision didn't help; I now get tons of other errors (restoring from my backup now).
Ah, that's because this issue would probably have been better asked in the k3s repo, since you're using kine embedded in k3s. We have several issue templates there that ask for the relevant information. Check your k3s journald logs for messages that look like these, once every 5 minutes:
If you don't see those, and see an error instead, then for some reason your compaction is failing.
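If the journal has already rotated those messages away, the compact progress can also be read straight out of the database. This assumes the 'compact_rev_key' bookkeeping row that kine uses internally to record how far it has compacted; treat it as read-only diagnostics:

```sql
-- Revision kine has compacted up to (NULL if compaction has never succeeded).
-- Assumes kine's internal bookkeeping row named 'compact_rev_key'.
SELECT MAX(prev_revision) AS compact_rev FROM kine WHERE name = 'compact_rev_key';

-- Compare against the newest revision overall:
SELECT MAX(id) AS current_rev FROM kine;
```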
Yeah, don't do that. You'd want to shut down k3s and use the standalone sqlite binary to run a manual compact query: delete from kine where id in (select id from (select id, name from kine where id not in (select max(id) as id from kine group by name)));
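For anyone following along, here is a minimal sketch of running that cleanup by hand, assuming the default embedded-sqlite path `/var/lib/rancher/k3s/server/db/state.db` (stop k3s first and back the file up):

```sql
-- Run inside: sqlite3 /var/lib/rancher/k3s/server/db/state.db
-- (the path is the k3s default; adjust if yours differs)
SELECT COUNT(*) FROM kine;   -- row count before cleanup

DELETE FROM kine
WHERE id IN (
  SELECT id FROM (
    SELECT id, name FROM kine
    WHERE id NOT IN (SELECT MAX(id) AS id FROM kine GROUP BY name)
  )
);

VACUUM;                      -- reclaim the freed pages on disk
SELECT COUNT(*) FROM kine;   -- row count after cleanup
```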
Thanks for the quick feedback! Indeed, I see something like:
Running your proposed cleanup statement now, then a vacuum, and then I'll try again. Will report back, but it might take a bit :-)
@brandond, that helped! I'm now back from 4.9 GB down to 62 MB, and to only 20k entries. My takeaways from this are:
Unfortunately I'm not completely done, because my original issue was that an invalid CR broke the K8s garbage collection loop, so that tons of completed job and pod objects piled up (from a 5-minute cron job) and never got garbage collected. I only noticed it when there were over 2k of those jobs. I've deleted them, but they are still contained in the sqlite DB (3.5k rows for jobs & pods), and now they re-appear as zombies in the journal log with:
There are no corresponding Jobs resources in
Sorry to bother you with this, I know it's not originally a kine issue. Feel free to close the issue if this is completely off.
But maybe you can help me with the question of what "k8s" as …
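For reference, counting those leftover rows is straightforward, assuming the usual Kubernetes key layout of `/registry/<type>/<namespace>/<name>` that kine stores:

```sql
-- How many revisions are still stored for jobs and pods?
SELECT COUNT(*) FROM kine WHERE name LIKE '/registry/jobs/%';
SELECT COUNT(*) FROM kine WHERE name LIKE '/registry/pods/%';
```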
Deleted resources will get cleaned up by compaction eventually. The actual compact query run internally is slightly more complex and includes deleted objects. Were you able to find the log messages regarding compaction? How long is it currently taking?
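To make the "includes deleted objects" part concrete: the following is not kine's actual internal query, only an illustrative sketch, assuming the `deleted` column in the kine schema flags tombstone rows. A key whose newest revision is a delete marker can be dropped entirely:

```sql
-- Illustration only, not kine's real compact query: remove every revision of
-- keys whose most recent revision is a delete marker (the `deleted` flag).
DELETE FROM kine
WHERE name IN (
  SELECT name
  FROM kine
  GROUP BY name
  HAVING MAX(id) = MAX(CASE WHEN deleted THEN id END)
);
```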
@brandond, that is all I've found wrt compaction: #213 (comment), and as you can see, it ran into a timeout. (That is what I meant: the compaction itself ran into a transaction timeout during the delete, which means that from that point on no compaction works anymore, because only compaction reduces the data size, which in turn could bring query times back within the allowed transaction time.) I think I'm finally good after a lot of trial and error (also getting rid of some container zombies, as described in k3s-io/k3s#6185 (comment), and of the cronjob and its enqueued pods). Thanks again for your support; the manual compaction statement from above, which took me from 600k down to 20k entries, was absolutely key to getting back into a usable state.
Right, I was hoping to hear whether or not you're getting the same errors after running manual compaction, or what the new elapsed time is if compaction is succeeding. If you are still seeing the automatic compact fail, you'll just run into problems again further down the road - if compaction fails regularly, the database will continue to grow until it is multiple GB again.
Ah, ok. No, after the manual compaction, the regular compaction works as expected again:
It was only the transaction timeout (because of the too-large DB; the manual compaction also took around 15 minutes) that broke the compaction.
OK, great. That's what I was hoping to see. FWIW, kine will also run a vacuum at startup, so as long as compaction is working and you restart k3s periodically, the database should stay pretty small. |
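For anyone who wants to reproduce that by hand instead of waiting for a restart, the same reclamation plus a size check looks like this in the sqlite shell (run it while k3s is stopped, since VACUUM needs the database to itself):

```sql
VACUUM;             -- rewrite the DB file, reclaiming pages freed by compaction
PRAGMA page_count;  -- on-disk size in pages...
PRAGMA page_size;   -- ...multiply the two for total bytes
```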
You're commenting on an issue about sqlite, but you're not running sqlite. If you're using an external DB, check the K3s log messages about compaction, and make sure that your database is able to keep up with compact requests; otherwise it'll fill up as you are seeing. Please take any further conversation about this to a new issue.