Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve mysql write performance and stability when offloading #13290

Open
imliuda opened this issue Jul 2, 2024 · 10 comments · May be fixed by #13313
Open

Improve mysql write performance and stability when offloading #13290

imliuda opened this issue Jul 2, 2024 · 10 comments · May be fixed by #13313
Labels
area/controller Controller issues, panics area/offloading Node status offloading area/workflow-archive type/feature Feature request

Comments

@imliuda
Copy link
Contributor

imliuda commented Jul 2, 2024

Summary

We enabled node status offload and workflows archiving, and we have observed some performance and stability issues.

  • there are many slow queries of mysql when running thousands of workflows parallelly, most of which is insert into argo_workflows table, and daily archived workflows cleaning-up may take very long time
  • if mysql temporarily unavailable, then many workflows will become Error, this is because controller can not hydrate workflow or dehydrate workflow with mysql
  • there may be workflows failed with message "Dead lock found when trying to get lock; trying restarting transaction"

Use Cases

When running a large cluster, kube-apiserver may be the performance bottleneck, we need offload and archive to improve kube-apiserver performance. currently, it needs some optimisation and improvement to make it stable in a production eviroment.

Some thoughts

  • add appropriate indexes to archive_workflows and argo_workflows table
  • compress nodes statues or workflow before save it to mysql
  • add workflow to work queue again when Hydrate failed, and don't Dehydrate workflows when mysql is unavailable
  • considering using redis or s3 to save node status or archived workflows
@imliuda imliuda added the type/feature Feature request label Jul 2, 2024
@agilgur5
Copy link

agilgur5 commented Jul 2, 2024

  • there are many slow queries of mysql when running thousands of workflows parallelly, most of which is insert into argo_workflows table
  • add appropriate indexes to archive_workflows and argo_workflows table

If it's slow on inserts, I don't think an index would improve that performance. Otherwise it would be good to know what slow queries you see on what version of Argo

  • compress nodes statues or workflow before save it to mysql

We might be able to do this for status.nodes specifically, but other parts of the status for archived workflows are used during queries (see also #13203).

  • considering using redis or s3 to save node status or archived workflows

There's an existing feature request for blob storage: #4162.
It would be insufficient for archived workflows though, given that they won't be queryable as a blob, as one of the comments lays out (as well as some workarounds). Similarly, I'm not sure Redis/Valkey or similar would suffice for archived workflows queries (it potentially could, but may require some transactions, denormalization, and/or manual indexes).

But both might suffice for status offloading 🤔

and daily archived workflows cleaning-up may take very long time

We might want to add configurable workers for this; i.e. --archive-ttl-workers

@imliuda
Copy link
Contributor Author

imliuda commented Jul 2, 2024

If it's slow on inserts, I don't think an index would improve that performance. Otherwise it would be good to know what slow queries you see on what version of Argo

I add following index, workflows that Error caused by Dead lock found when trying to get lock; trying restarting transaction disappear, I guess it is caused by delete query right next to the successful insert sql in the OffloadNodeStatusRepo.Save() method.

alter table argo_workflows add index `argo_workflows_i2` (`clustername`,`uid`, `version`, `updatedat`);

@agilgur5
Copy link

agilgur5 commented Jul 3, 2024

Hmm looks like the argo_workflows table may not have indexes? #8860 added them for archived workflows (which do have different query patterns).

Would you like to submit a PR to add them for offloaded Workflows?

@agilgur5
Copy link

agilgur5 commented Jul 3, 2024

I guess it is caused by delete query right next to the successful insert sql in the OffloadNodeStatusRepo.Save() method.

Yea that might make more sense since a deletion has to do a lookup. That specific deletion may be removed in #13286

Otherwise it would be good to know what slow queries you see on what version of Argo

This would still be useful -- your Server and Controller logs should show any slow queries

@agilgur5 agilgur5 added the area/controller Controller issues, panics label Jul 3, 2024
@agilgur5 agilgur5 changed the title Improve mysql performance and stability when enable offload node statues and archive workflows Improve mysql performance and stability when offload and archive workflows Jul 4, 2024
@imliuda
Copy link
Contributor Author

imliuda commented Jul 5, 2024

argo version is v3.4.8, most of slow query is insert.

image

@imliuda
Copy link
Contributor Author

imliuda commented Jul 5, 2024

When there are lots of slow queries, that delete may cause deadlock or lock wait timeout, I think #13286 will help, if this pr is merged, I think it is not necessary of above secondary index. But if we remove that delete code, is there a situation that table records increase continuously due to periodic gc speed is slower then insert?

image

@imliuda
Copy link
Contributor Author

imliuda commented Jul 7, 2024

@agilgur5 Anton, I opened a pull request, add some code to compress nodes for both offloading and archiving, can you help review this code when you you have time?

@agilgur5
Copy link

agilgur5 commented Jul 9, 2024

argo version is v3.4.8, most of slow query is insert.

@imliuda when providing logs, please use text instead of images, as text is much more accessible.

Otherwise that's a great data point, I'm surprised INSERT is so significantly slower on so many queries. DELETE looks like it could be worthwhile to add an index to as well

I think #13286 will help, if this pr is merged, I think it is not necessary of above secondary index

Mmm the index would still make the periodic GC deletes faster

But if we remove that delete code, is there a situation that table records increase continuously due to periodic gc speed is slower then insert?

Technically, I think that's always been possible.

I mentioned above that we could add --archive-ttl-workers to configure this. Maybe --archive-ttl-frequency too.

Oh, this is for offloading though. We might be able to add something similar but I'm actually not familiar with the code for offloading's periodic GC if it works the same way as all other Controller GC (which generally have a configurable number of workers)

@agilgur5 agilgur5 changed the title Improve mysql performance and stability when offload and archive workflows Improve mysql write performance and stability when offload and archive workflows Jul 9, 2024
@RyanDevlin
Copy link

For this proposal, I see it was mentioned "considering using redis or s3 to save node status or archived workflows". Have we considered alternatives to SQL for offloading?

My company is facing similar issues as mentioned here, where we run super large-scale batch workflows which consume more than 50% of etcd under load. Having a performant alternative to etcd as Argo's stateful backing store would greatly benefit us. We also have an organizational aversion to relational databases, so I'm wondering if we've considered things like S3 or DynamoDB for offloading? These are both extremely performant over a network and might be easier to manage than a relational DB?

@agilgur5
Copy link

I did respond to that above already. Given that the current performance is being addressed already (i.e. this issue is likely to close out soon), I would suggest opening a separate feature request for that and referencing this issue and #4162

@agilgur5 agilgur5 changed the title Improve mysql write performance and stability when offload and archive workflows Improve mysql write performance and stability when offloading Jul 19, 2024
@agilgur5 agilgur5 added the area/offloading Node status offloading label Aug 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/offloading Node status offloading area/workflow-archive type/feature Feature request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants