
[frontend] Cannot list runs and/or artifacts (upstream request timeout) #10230

Closed · tomaszstachera opened this issue Nov 13, 2023 · 13 comments

@tomaszstachera

Environment

Steps to reproduce

Run a lot of pipelines (a few thousand over a few months)

Expected result

KF UI should list Runs and/or Artifacts.

Actual result

Error while listing artifacts in the UI - incomplete response.
[screenshot: kf_ui_art_error]

Error while listing runs in the UI (in any namespace) - upstream request timeout.
[screenshot: kf_ui_err_runs]

Materials and Reference

ml-pipeline-ui logs:

GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
[HPM] Error occurred while trying to proxy request /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz from localhost:8016 to http://172.20.149.51:8888 (ECONNRESET) (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while trying to proxy request /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D from localhost:8016 to http://172.20.149.51:8888 (ECONNRESET) (https://nodejs.org/api/errors.html#errors_common_system_errors)
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz

ml-pipeline logs:

I1113 14:06:14.593826       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.670332       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.674296       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.730478       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.733527       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.780486       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.783656       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.830303       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.833368       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.890462       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.893445       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.940586       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:15.000506       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:15.058869       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting

mysql:

mysql> select count(*) from run_details;
+----------+
| count(*) |
+----------+
|   207719 |
+----------+
1 row in set (4.51 sec)

I also think the metadata store starved one of the nodes of resources (AWS EKS event):

Warning SystemOOM a few seconds ago kubelet System OOM encountered, victim process: metadata_store_, pid: 10313

Which Pod manages this metadata_store process?
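
A quick way to locate it (a sketch, assuming a standard KFP install in the kubeflow namespace; pod and deployment names may differ):

# List pods whose name mentions metadata; the metadata_store_ process from the
# OOM event is the MLMD (ml-metadata) gRPC server, which normally runs in the
# metadata-grpc-deployment pod.
kubectl -n kubeflow get pods | grep -i metadata
# Then inspect the candidate pod's containers and resource requests/limits:
kubectl -n kubeflow describe pod <metadata-grpc-deployment-pod-name>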

Impacted by this bug? Give it a 👍.

@tomaszstachera (Author)

I've cleaned up the old entries from the MySQL database and it helped a bit - the Run details page is still slow, but at least it works.

# MySQL pipeline runs cleanup
# log into the mysql pod:
kubectl -n kubeflow exec -it mysql-67f89d87cb-ntnrp -- bash
# log into the db:
mysql
use mlpipeline;
show tables;
desc run_details;
-- count, then delete, runs created before the cutoff date:
select count(*) from run_details where CreatedAtInSec < UNIX_TIMESTAMP('2023-11-01 00:00:00');
delete from run_details where CreatedAtInSec < UNIX_TIMESTAMP('2023-11-01 00:00:00');
commit;
-- count, then delete, resource_references rows whose ResourceUUID no longer
-- matches any run_details.UUID (left dangling by the deletion above):
select count(*) from resource_references;
select count(*) from resource_references where (ResourceUUID, ResourceType, ReferenceType) in (select res.ResourceUUID, res.ResourceType, res.ReferenceType from (select rr.ResourceUUID, rr.ResourceType, rr.ReferenceType, rd.UUID from resource_references rr left join run_details rd on rr.ResourceUUID=rd.UUID) res where UUID is null);
delete from resource_references where (ResourceUUID, ResourceType, ReferenceType) in (select res.ResourceUUID, res.ResourceType, res.ReferenceType from (select rr.ResourceUUID, rr.ResourceType, rr.ReferenceType, rd.UUID from resource_references rr left join run_details rd on rr.ResourceUUID=rd.UUID) res where UUID is null);
commit;
-- reclaim space and refresh index statistics:
optimize table resource_references;
optimize table run_details;
exit

@tomaszstachera (Author)

The Run details page still keeps loading for a few minutes - any idea how to analyse/debug/fix it?
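
One way to narrow it down (a sketch, assuming you can reach the MySQL pod as in the cleanup snippet above) is to enable the slow query log, reload the slow page, and look at exactly which statements the API server issues:

-- Standard MySQL settings; the 1-second threshold is just an example value.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SET GLOBAL log_output = 'TABLE';
-- Reload the Run details / Runs page in the UI, then inspect what was captured:
SELECT start_time, query_time, LEFT(sql_text, 200) AS snippet
FROM mysql.slow_log
ORDER BY start_time DESC
LIMIT 10;

Each captured statement can then be fed to EXPLAIN to check whether it does a full table scan.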

@zijianjoy (Collaborator)

/assign @jlyaoyuli

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Mar 14, 2024
@tomaszstachera (Author)

Thank you all for the help; it is definitely worth using Kubeflow with a lot of pipelines.
#sarcasm

stale bot removed the lifecycle/stale label on Mar 14, 2024
@thesuperzapper (Member) commented Apr 2, 2024

@zijianjoy @chensun @connor-mccarthy There is definitely some kind of scaling issue on the "runs" page, when you have a very full database. I can reproduce this error ("runs" page never loading) on a large test cluster after thousands of runs.

I wonder if we can improve the scaling of the two main queries that are hanging:

  • Query 1:
    • /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=<THE_USER_NAMESPACE>
  • Query 2:
    • /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=<THE_USER_NAMESPACE>&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
    • NOTE: filter decodes to: {"predicates":[{"key":"storage_state","op":"NOT_EQUALS","string_value":"STORAGESTATE_ARCHIVED"}]}

I bet the problem is the sort_by=created_at, which will require a full table scan unless created_at is indexed.
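
For anyone who wants to check this on their own database, here is a rough sketch; the exact SQL the API server generates may differ, but an approximation of the namespace-scoped listing (using the run_details column names shown elsewhere in this thread) can be run through EXPLAIN:

-- Approximation only; type=ALL (full table scan) and "Using filesort" in the
-- EXPLAIN output would indicate the WHERE/ORDER BY is not index-backed.
EXPLAIN
SELECT UUID, Namespace, CreatedAtInSec
FROM run_details
WHERE Namespace = 'tomasz'
ORDER BY CreatedAtInSec DESC
LIMIT 5;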

Can someone more experienced with MySQL than me confirm if that would be a good solution, or share other ideas?

Also, I wonder if we should be cleaning up the run_details table automatically?
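
As a sketch of what automatic cleanup could look like (illustration only: deleting rows behind the KFP API's back can leave dangling resource_references and metadata, as the manual cleanup earlier in this thread shows):

-- Hypothetical retention job using the MySQL event scheduler; 90 days is an
-- arbitrary example value.
SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS purge_old_runs
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM run_details
  WHERE CreatedAtInSec < UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY);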

@thesuperzapper (Member)

Related issues:

There was also PR #9806, which may have partially resolved this in KFP version 2.0.2.

@thesuperzapper (Member)

@tomaszstachera I bet you can back-port the fix from #9806 into older versions of KFP (like 2.0.0-alpha.7, which is the one from Kubeflow 1.7).

You can create the same indexes as that PR with the following MySQL queries:

CREATE INDEX namespace_createatinsec ON run_details (Namespace, CreatedAtInSec);

CREATE INDEX namespace_conditions_finishedatinsec ON run_details (Namespace, Conditions, FinishedAtInSec);
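
After creating them, a quick sanity check (the SHOW INDEX output further down this thread shows roughly what to expect):

-- Confirm the indexes exist and that the namespace-sorted listing uses one
-- (look for key=namespace_createatinsec in the EXPLAIN output, not a full scan).
SHOW INDEX FROM run_details;
EXPLAIN SELECT UUID FROM run_details
WHERE Namespace = '<your-namespace>'
ORDER BY CreatedAtInSec DESC LIMIT 10;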

@vanHavel

Thanks @thesuperzapper!
Manually creating the indexes made our kfp v1 list_runs much faster!

@rimolive (Member)

Closing this issue, as the fix is already in place.

/close


@rimolive: Closing this issue.

In response to this:

Closing this issue, as the fix is already in place.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@droctothorpe (Contributor) commented May 3, 2024

We're on 2.0.5 and I've confirmed that those indexes are present in the run_details table, but it still regularly takes 5-10 seconds to load the run page. Is anyone else still running into this, even with the indexes in place? I filed #10778, which includes a lot more details.

Here are the indexes in question:

+-------------+------------+-------------------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table       | Non_unique | Key_name                                  | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------------+------------+-------------------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| run_details |          0 | PRIMARY                                   |            1 | UUID            | A         |       20638 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_createatinsec              |            1 | ExperimentUUID  | A         |        1193 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_createatinsec              |            2 | CreatedAtInSec  | A         |       20660 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_conditions_finishedatinsec |            1 | ExperimentUUID  | A         |        1033 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_conditions_finishedatinsec |            2 | Conditions      | A         |        1407 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_conditions_finishedatinsec |            3 | FinishedAtInSec | A         |       20230 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_createatinsec                   |            1 | Namespace       | A         |          69 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_createatinsec                   |            2 | CreatedAtInSec  | A         |       20660 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_conditions_finishedatinsec      |            1 | Namespace       | A         |          58 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_conditions_finishedatinsec      |            2 | Conditions      | A         |         202 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_conditions_finishedatinsec      |            3 | FinishedAtInSec | A         |       20660 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-------------+------------+-------------------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

@tomaszstachera (Author)

> @tomaszstachera I bet you can back-port the fix from #9806 into older versions of KFP (like 2.0.0-alpha.7, which is the one from Kubeflow 1.7).
>
> You can create the same indexes as that PR with the following MySQL queries:
>
> CREATE INDEX namespace_createatinsec ON run_details (Namespace, CreatedAtInSec);
> CREATE INDEX namespace_conditions_finishedatinsec ON run_details (Namespace, Conditions, FinishedAtInSec);

Manually creating the indexes helped shrink the response time for Runs to 5 seconds. Thanks!
