Improve mysql write performance and stability when offloading #13290

imliuda · 2024-07-02T03:48:18Z

Summary

We enabled node status offload and workflows archiving, and we have observed some performance and stability issues.

there are many slow queries of mysql when running thousands of workflows parallelly, most of which is insert into argo_workflows table, and daily archived workflows cleaning-up may take very long time
if mysql temporarily unavailable, then many workflows will become Error, this is because controller can not hydrate workflow or dehydrate workflow with mysql
there may be workflows failed with message "Dead lock found when trying to get lock; trying restarting transaction"

Use Cases

When running a large cluster, kube-apiserver may be the performance bottleneck, we need offload and archive to improve kube-apiserver performance. currently, it needs some optimisation and improvement to make it stable in a production eviroment.

Some thoughts

add appropriate indexes to archive_workflows and argo_workflows table
compress nodes statues or workflow before save it to mysql
add workflow to work queue again when Hydrate failed, and don't Dehydrate workflows when mysql is unavailable
considering using redis or s3 to save node status or archived workflows

agilgur5 · 2024-07-02T04:21:51Z

there are many slow queries of mysql when running thousands of workflows parallelly, most of which is insert into argo_workflows table

add appropriate indexes to archive_workflows and argo_workflows table

If it's slow on inserts, I don't think an index would improve that performance. Otherwise it would be good to know what slow queries you see on what version of Argo

compress nodes statues or workflow before save it to mysql

We might be able to do this for status.nodes specifically, but other parts of the status for archived workflows are used during queries (see also #13203).

considering using redis or s3 to save node status or archived workflows

There's an existing feature request for blob storage: #4162.
It would be insufficient for archived workflows though, given that they won't be queryable as a blob, as one of the comments lays out (as well as some workarounds). Similarly, I'm not sure Redis/Valkey or similar would suffice for archived workflows queries (it potentially could, but may require some transactions, denormalization, and/or manual indexes).

But both might suffice for status offloading 🤔

and daily archived workflows cleaning-up may take very long time

We might want to add configurable workers for this; i.e. --archive-ttl-workers

imliuda · 2024-07-02T06:20:17Z

If it's slow on inserts, I don't think an index would improve that performance. Otherwise it would be good to know what slow queries you see on what version of Argo

I add following index, workflows that Error caused by Dead lock found when trying to get lock; trying restarting transaction disappear, I guess it is caused by delete query right next to the successful insert sql in the OffloadNodeStatusRepo.Save() method.

alter table argo_workflows add index `argo_workflows_i2` (`clustername`,`uid`, `version`, `updatedat`);

agilgur5 · 2024-07-03T15:20:59Z

Hmm looks like the argo_workflows table may not have indexes? #8860 added them for archived workflows (which do have different query patterns).

Would you like to submit a PR to add them for offloaded Workflows?

agilgur5 · 2024-07-03T15:26:21Z

I guess it is caused by delete query right next to the successful insert sql in the OffloadNodeStatusRepo.Save() method.

Yea that might make more sense since a deletion has to do a lookup. That specific deletion may be removed in #13286

Otherwise it would be good to know what slow queries you see on what version of Argo

This would still be useful -- your Server and Controller logs should show any slow queries

imliuda · 2024-07-05T02:33:19Z

argo version is v3.4.8, most of slow query is insert.

imliuda · 2024-07-05T02:43:11Z

When there are lots of slow queries, that delete may cause deadlock or lock wait timeout, I think #13286 will help, if this pr is merged, I think it is not necessary of above secondary index. But if we remove that delete code, is there a situation that table records increase continuously due to periodic gc speed is slower then insert?

imliuda · 2024-07-07T11:24:18Z

@agilgur5 Anton, I opened a pull request, add some code to compress nodes for both offloading and archiving, can you help review this code when you you have time?

agilgur5 · 2024-07-09T12:13:20Z

argo version is v3.4.8, most of slow query is insert.

@imliuda when providing logs, please use text instead of images, as text is much more accessible.

Otherwise that's a great data point, I'm surprised INSERT is so significantly slower on so many queries. DELETE looks like it could be worthwhile to add an index to as well

I think #13286 will help, if this pr is merged, I think it is not necessary of above secondary index

Mmm the index would still make the periodic GC deletes faster

But if we remove that delete code, is there a situation that table records increase continuously due to periodic gc speed is slower then insert?

Technically, I think that's always been possible.

I mentioned above that we could add --archive-ttl-workers to configure this. Maybe --archive-ttl-frequency too.

Oh, this is for offloading though. We might be able to add something similar but I'm actually not familiar with the code for offloading's periodic GC if it works the same way as all other Controller GC (which generally have a configurable number of workers)

RyanDevlin · 2024-07-19T13:18:46Z

For this proposal, I see it was mentioned "considering using redis or s3 to save node status or archived workflows". Have we considered alternatives to SQL for offloading?

My company is facing similar issues as mentioned here, where we run super large-scale batch workflows which consume more than 50% of etcd under load. Having a performant alternative to etcd as Argo's stateful backing store would greatly benefit us. We also have an organizational aversion to relational databases, so I'm wondering if we've considered things like S3 or DynamoDB for offloading? These are both extremely performant over a network and might be easier to manage than a relational DB?

agilgur5 · 2024-07-19T13:51:55Z

I did respond to that above already. Given that the current performance is being addressed already (i.e. this issue is likely to close out soon), I would suggest opening a separate feature request for that and referencing this issue and #4162

imliuda added the type/feature Feature request label Jul 2, 2024

agilgur5 added the area/workflow-archive label Jul 2, 2024

agilgur5 added the area/controller Controller issues, panics label Jul 3, 2024

agilgur5 changed the title ~~Improve mysql performance and stability when enable offload node statues and archive workflows~~ Improve mysql performance and stability when offload and archive workflows Jul 4, 2024

imliuda linked a pull request Jul 6, 2024 that will close this issue

fix: compress node status when offloading to database. Fixes #13290 #13313

Open

agilgur5 changed the title ~~Improve mysql performance and stability when offload and archive workflows~~ Improve mysql write performance and stability when offload and archive workflows Jul 9, 2024

agilgur5 changed the title ~~Improve mysql write performance and stability when offload and archive workflows~~ Improve mysql write performance and stability when offloading Jul 19, 2024

agilgur5 added the area/offloading Node status offloading label Aug 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve mysql write performance and stability when offloading #13290

Improve mysql write performance and stability when offloading #13290

imliuda commented Jul 2, 2024 •

edited by agilgur5

Loading

agilgur5 commented Jul 2, 2024 •

edited

Loading

imliuda commented Jul 2, 2024 •

edited by agilgur5

Loading

agilgur5 commented Jul 3, 2024

agilgur5 commented Jul 3, 2024

imliuda commented Jul 5, 2024 •

edited by agilgur5

Loading

imliuda commented Jul 5, 2024

imliuda commented Jul 7, 2024

agilgur5 commented Jul 9, 2024 •

edited

Loading

RyanDevlin commented Jul 19, 2024

agilgur5 commented Jul 19, 2024

Improve mysql write performance and stability when offloading #13290

Improve mysql write performance and stability when offloading #13290

Comments

imliuda commented Jul 2, 2024 • edited by agilgur5 Loading

Summary

Use Cases

Some thoughts

agilgur5 commented Jul 2, 2024 • edited Loading

imliuda commented Jul 2, 2024 • edited by agilgur5 Loading

agilgur5 commented Jul 3, 2024

agilgur5 commented Jul 3, 2024

imliuda commented Jul 5, 2024 • edited by agilgur5 Loading

imliuda commented Jul 5, 2024

imliuda commented Jul 7, 2024

agilgur5 commented Jul 9, 2024 • edited Loading

RyanDevlin commented Jul 19, 2024

agilgur5 commented Jul 19, 2024

imliuda commented Jul 2, 2024 •

edited by agilgur5

Loading

agilgur5 commented Jul 2, 2024 •

edited

Loading

imliuda commented Jul 2, 2024 •

edited by agilgur5

Loading

imliuda commented Jul 5, 2024 •

edited by agilgur5

Loading

agilgur5 commented Jul 9, 2024 •

edited

Loading