[frontend] Cannot list runs and/or artifacts (upstream request timeout) #10230
Comments
I've cleaned up old entries from the MySQL DB and it helped a bit - now it is slow when showing details of
# MySQL pipeline runs cleanup
# log into the mysql pod:
kubectl -n kubeflow exec -it mysql-67f89d87cb-ntnrp -- bash
# log into the db:
mysql
use mlpipeline;
show tables;
desc run_details;
select count(*) from run_details where CreatedAtInSec < UNIX_TIMESTAMP('2023-11-01 00:00:00');
delete from run_details where CreatedAtInSec < UNIX_TIMESTAMP('2023-11-01 00:00:00');
commit;
select count(*) from resource_references;
select count(*) from resource_references where (ResourceUUID, ResourceType, ReferenceType) in (select res.ResourceUUID, res.ResourceType, res.ReferenceType from (select rr.ResourceUUID, rr.ResourceType, rr.ReferenceType, rd.UUID from resource_references rr left join run_details rd on rr.ResourceUUID=rd.UUID) res where UUID is null);
delete from resource_references where (ResourceUUID, ResourceType, ReferenceType) in (select res.ResourceUUID, res.ResourceType, res.ReferenceType from (select rr.ResourceUUID, rr.ResourceType, rr.ReferenceType, rd.UUID from resource_references rr left join run_details rd on rr.ResourceUUID=rd.UUID) res where UUID is null);
commit;
optimize table resource_references;
optimize table run_details;
exit |
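The orphan cleanup above can also be written as an anti-join and run in small batches, so each transaction stays short. This is a minimal sketch, assuming a toy two-table schema on an in-memory SQLite database rather than the real mlpipeline MySQL schema. The function name, the batch size, and the `ResourceType = 'Run'` filter (added so that references owned by non-run resources are left alone) are assumptions, not part of the original commands.

```python
import sqlite3


def delete_orphaned_references(conn, batch_size=1000):
    """Delete run references whose ResourceUUID has no row in run_details.

    Works in batches and commits after each one. Returns the number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM resource_references
            WHERE rowid IN (
                SELECT rr.rowid
                FROM resource_references rr
                WHERE rr.ResourceType = 'Run'  -- assumption: only touch run-owned rows
                  AND NOT EXISTS (
                      SELECT 1 FROM run_details rd WHERE rd.UUID = rr.ResourceUUID
                  )
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Toy data: run-1 still exists, run-2 and run-3 were deleted, job-1 is not a run.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE run_details (UUID TEXT PRIMARY KEY, CreatedAtInSec INTEGER);
    CREATE TABLE resource_references (
        ResourceUUID TEXT, ResourceType TEXT, ReferenceType TEXT
    );
    """
)
conn.execute("INSERT INTO run_details VALUES ('run-1', 1700000000)")
conn.executemany(
    "INSERT INTO resource_references VALUES (?, ?, ?)",
    [
        ("run-1", "Run", "Experiment"),
        ("run-2", "Run", "Experiment"),
        ("run-3", "Run", "Namespace"),
        ("job-1", "Job", "Experiment"),
    ],
)

deleted = delete_orphaned_references(conn, batch_size=1)
remaining = sorted(
    conn.execute("SELECT ResourceUUID, ResourceType FROM resource_references")
)
print(deleted, remaining)
```

MySQL has no `rowid` and rejects `LIMIT` inside an `IN` subquery, so the statement would need adapting there; the sketch only illustrates the anti-join and the batching.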
The Run details page still keeps loading for a few minutes. Any idea how to analyse/debug/fix it? |
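One way to narrow this down is to time the list-runs API call directly, bypassing the UI, and see whether the backend itself is slow. This is a rough sketch, assuming the `ml-pipeline` service has been port-forwarded to `localhost:8888` (for example with `kubectl -n kubeflow port-forward svc/ml-pipeline 8888:8888`) and that the v1beta1 REST API is served there. The helper names and the `team-a` namespace are hypothetical, and a multi-user deployment may also require auth headers that are not shown.

```python
import time
import urllib.parse
import urllib.request


def list_runs_url(base_url, namespace, page_size=10):
    """Build a v1beta1 list-runs URL scoped to one namespace, newest first."""
    query = urllib.parse.urlencode(
        {
            "page_size": page_size,
            "sort_by": "created_at desc",
            "resource_reference_key.type": "NAMESPACE",
            "resource_reference_key.id": namespace,
        }
    )
    return f"{base_url.rstrip('/')}/apis/v1beta1/runs?{query}"


def time_request(url, timeout=120):
    """Fetch the URL and return (HTTP status, elapsed seconds)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        status = resp.status
    return status, time.monotonic() - start


# Example (needs the port-forward to be running):
#   status, seconds = time_request(list_runs_url("http://localhost:8888", "team-a"))
#   print(status, round(seconds, 1))
```

If the direct call is also slow, the bottleneck is more likely in the API server or the database than in the frontend.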
/assign @jlyaoyuli |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Thank you all for the help; it is definitely worth using Kubeflow with a lot of pipelines. |
@zijianjoy @chensun @connor-mccarthy There is definitely some kind of scaling issue on the I wonder if we can improve the scaling of the two main queries that are hanging:
I bet the problem is the
|
Related issues:
There was also a PR #9806 which may have partially resolved this in KFP version |
@tomaszstachera I bet you can back-port the fix from #9806 into older versions of KFP (like You can create the same indexes as that PR with the following MySQL query:
CREATE INDEX namespace_createatinsec ON run_details (Namespace, CreatedAtInSec);
CREATE INDEX namespace_conditions_finishedatinsec ON run_details (Namespace, Conditions, FinishedAtInSec); |
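To see why composite indexes like these can help a namespace-filtered, time-sorted listing, the query plan can be compared before and after creating them. This is a minimal sketch on an in-memory SQLite database, so the planner is not MySQL's, and the table definition and the `LIST_RUNS` query are simplified assumptions rather than the actual statements KFP issues. Only the two index definitions are taken from the comment above.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's EXPLAIN QUERY PLAN detail strings joined into one line."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the run_details table (assumed columns only).
conn.execute(
    """
    CREATE TABLE run_details (
        UUID TEXT PRIMARY KEY,
        Namespace TEXT,
        Conditions TEXT,
        CreatedAtInSec INTEGER,
        FinishedAtInSec INTEGER
    )
    """
)

# Hypothetical listing query: newest runs in one namespace.
LIST_RUNS = (
    "SELECT UUID FROM run_details "
    "WHERE Namespace = ? ORDER BY CreatedAtInSec DESC LIMIT 10"
)

plan_before = query_plan(conn, LIST_RUNS, ("team-a",))

# The two indexes from the comment above.
conn.execute(
    "CREATE INDEX namespace_createatinsec "
    "ON run_details (Namespace, CreatedAtInSec)"
)
conn.execute(
    "CREATE INDEX namespace_conditions_finishedatinsec "
    "ON run_details (Namespace, Conditions, FinishedAtInSec)"
)

plan_after = query_plan(conn, LIST_RUNS, ("team-a",))

print("before:", plan_before)
print("after: ", plan_after)
```

On MySQL the equivalent check would be running `EXPLAIN` on the slow query before and after creating the indexes.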
Thanks @thesuperzapper! |
Closing this issue, as the fix is already in place. /close |
@rimolive: Closing this issue. |
We're on 2.0.5 and I've confirmed that those indexes are present in the Here are the indexes in question:
|
Manually creating the indexes helped to shrink the response time for |
Environment
https://awslabs.github.io/kubeflow-manifests/release-v1.7.0-aws-b1.0.3/docs/deployment/vanilla/guide/
build version dev_local (https://awslabs.github.io/kubeflow-manifests/release-v1.7.0-aws-b1.0.3/docs/deployment/vanilla/guide/)
Steps to reproduce
Run a lot of pipelines (a few thousand over a few months)
Expected result
KF UI should list Runs and/or Artifacts.
Error while listing artifacts in the UI - incomplete response.
Error while listing runs in the UI (no matter in which namespace) - upstream request timeout.
Materials and Reference
ml-pipeline-ui logs:
ml-pipeline logs:
mysql:
Also, I think the metadata store starved one of my nodes of resources (AWS EKS event):
Which Pod manages this metadata_store process?
Impacted by this bug? Give it a 👍.