
[frontend] Cannot list runs and/or artifacts (upstream request timeout) #10230

Closed · tomaszstachera opened this issue Nov 13, 2023 · 13 comments

@tomaszstachera

Environment

Steps to reproduce

Run a lot of pipelines (a few thousand over a few months)

Expected result

KF UI should list Runs and/or Artifacts.

Actual result

Error while listing artifacts in the UI - incomplete response.
[screenshot: kf_ui_art_error]

Error while listing runs in the UI (in any namespace) - upstream request timeout.
[screenshot: kf_ui_err_runs]

Materials and Reference

ml-pipeline-ui logs:

GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
[HPM] Error occurred while trying to proxy request /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz from localhost:8016 to http://172.20.149.51:8888 (ECONNRESET) (https://nodejs.org/api/errors.html#errors_common_system_errors)
[HPM] Error occurred while trying to proxy request /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=tomasz&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D from localhost:8016 to http://172.20.149.51:8888 (ECONNRESET) (https://nodejs.org/api/errors.html#errors_common_system_errors)
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz
GET /apis/v1beta1/healthz

ml-pipeline logs:

I1113 14:06:14.593826       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.670332       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.674296       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.730478       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.733527       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.780486       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.783656       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.830303       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.833368       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.890462       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:14.893445       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting
I1113 14:06:14.940586       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:15.000506       8 interceptor.go:37] /api.ReportService/ReportScheduledWorkflow handler finished
I1113 14:06:15.058869       8 interceptor.go:29] /api.ReportService/ReportScheduledWorkflow handler starting

mysql:

mysql> select count(*) from run_details;
+----------+
| count(*) |
+----------+
|   207719 |
+----------+
1 row in set (4.51 sec)

I also think the metadata store starved one of the nodes of resources (AWS EKS event):

Warning SystemOOM a few seconds ago kubelet System OOM encountered, victim process: metadata_store_, pid: 10313

Which Pod manages this metadata_store process?
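
A quick way to locate it (a sketch, assuming a standard KFP install in the kubeflow namespace; pod and deployment names may differ):

# List pods whose name mentions metadata; the metadata_store_ process from the
# OOM event is the MLMD (ml-metadata) gRPC server, which normally runs in the
# metadata-grpc-deployment pod.
kubectl -n kubeflow get pods | grep -i metadata
# Then inspect the candidate pod's containers and resource requests/limits:
kubectl -n kubeflow describe pod <metadata-grpc-deployment-pod-name>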

Impacted by this bug? Give it a 👍.

@tomaszstachera (Author)

I've cleaned up the old entries from the MySQL database and it helped a bit - the Run details page is still slow, but at least it works.

# MySQL pipeline runs cleanup
# log into the mysql pod:
kubectl -n kubeflow exec -it mysql-67f89d87cb-ntnrp -- bash
# log into the db:
mysql
use mlpipeline;
show tables;
desc run_details;
-- count, then delete, runs created before the cutoff date:
select count(*) from run_details where CreatedAtInSec < UNIX_TIMESTAMP('2023-11-01 00:00:00');
delete from run_details where CreatedAtInSec < UNIX_TIMESTAMP('2023-11-01 00:00:00');
commit;
-- count, then delete, resource_references rows whose ResourceUUID no longer
-- matches any run_details.UUID (left dangling by the deletion above):
select count(*) from resource_references;
select count(*) from resource_references where (ResourceUUID, ResourceType, ReferenceType) in (select res.ResourceUUID, res.ResourceType, res.ReferenceType from (select rr.ResourceUUID, rr.ResourceType, rr.ReferenceType, rd.UUID from resource_references rr left join run_details rd on rr.ResourceUUID=rd.UUID) res where UUID is null);
delete from resource_references where (ResourceUUID, ResourceType, ReferenceType) in (select res.ResourceUUID, res.ResourceType, res.ReferenceType from (select rr.ResourceUUID, rr.ResourceType, rr.ReferenceType, rd.UUID from resource_references rr left join run_details rd on rr.ResourceUUID=rd.UUID) res where UUID is null);
commit;
-- reclaim space and refresh index statistics:
optimize table resource_references;
optimize table run_details;
exit

@tomaszstachera (Author)

The Run details page still keeps loading for a few minutes - any idea how to analyse/debug/fix it?
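
One way to narrow it down (a sketch, assuming you can reach the MySQL pod as in the cleanup snippet above) is to enable the slow query log, reload the slow page, and look at exactly which statements the API server issues:

-- Standard MySQL settings; the 1-second threshold is just an example value.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
SET GLOBAL log_output = 'TABLE';
-- Reload the Run details / Runs page in the UI, then inspect what was captured:
SELECT start_time, query_time, LEFT(sql_text, 200) AS snippet
FROM mysql.slow_log
ORDER BY start_time DESC
LIMIT 10;

Each captured statement can then be fed to EXPLAIN to check whether it does a full table scan.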

@zijianjoy (Collaborator)

/assign @jlyaoyuli

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the lifecycle/stale label on Mar 14, 2024
@tomaszstachera (Author)

Thank you all for the help; it is definitely worth using Kubeflow with a lot of pipelines.
#sarcasm

stale bot removed the lifecycle/stale label on Mar 14, 2024
@thesuperzapper (Member) commented Apr 2, 2024

@zijianjoy @chensun @connor-mccarthy There is definitely some kind of scaling issue on the "runs" page, when you have a very full database. I can reproduce this error ("runs" page never loading) on a large test cluster after thousands of runs.

I wonder if we can improve the scaling of the two main queries that are hanging:

  • Query 1:
    • /apis/v1beta1/runs?page_size=5&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=<THE_USER_NAMESPACE>
  • Query 2:
    • /apis/v1beta1/runs?page_token=&page_size=10&sort_by=created_at%20desc&resource_reference_key.type=NAMESPACE&resource_reference_key.id=<THE_USER_NAMESPACE>&filter=%257B%2522predicates%2522%253A%255B%257B%2522key%2522%253A%2522storage_state%2522%252C%2522op%2522%253A%2522NOT_EQUALS%2522%252C%2522string_value%2522%253A%2522STORAGESTATE_ARCHIVED%2522%257D%255D%257D
    • NOTE: filter decodes to: {"predicates":[{"key":"storage_state","op":"NOT_EQUALS","string_value":"STORAGESTATE_ARCHIVED"}]}

I bet the problem is the sort_by=created_at, which will require a full table scan unless created_at is indexed.
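
For anyone who wants to check this on their own database, here is a rough sketch; the exact SQL the API server generates may differ, but an approximation of the namespace-scoped listing (using the run_details column names shown elsewhere in this thread) can be run through EXPLAIN:

-- Approximation only; type=ALL (full table scan) and "Using filesort" in the
-- EXPLAIN output would indicate the WHERE/ORDER BY is not index-backed.
EXPLAIN
SELECT UUID, Namespace, CreatedAtInSec
FROM run_details
WHERE Namespace = 'tomasz'
ORDER BY CreatedAtInSec DESC
LIMIT 5;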

Can someone more experienced with MySQL than me confirm if that would be a good solution, or share other ideas?

Also, I wonder if we should be cleaning up the run_details table automatically?
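
As a sketch of what automatic cleanup could look like (illustration only: deleting rows behind the KFP API's back can leave dangling resource_references and metadata, as the manual cleanup earlier in this thread shows):

-- Hypothetical retention job using the MySQL event scheduler; 90 days is an
-- arbitrary example value.
SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS purge_old_runs
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM run_details
  WHERE CreatedAtInSec < UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY);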

@thesuperzapper (Member)

Related issues:

There was also PR #9806, which may have partially resolved this in KFP version 2.0.2.

@thesuperzapper (Member)

@tomaszstachera I bet you can back-port the fix from #9806 into older versions of KFP (like 2.0.0-alpha.7, which is the one from Kubeflow 1.7).

You can create the same indexes as that PR with the following MySQL queries:

CREATE INDEX namespace_createatinsec ON run_details (Namespace, CreatedAtInSec);

CREATE INDEX namespace_conditions_finishedatinsec ON run_details (Namespace, Conditions, FinishedAtInSec);
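
After creating them, a quick sanity check (the SHOW INDEX output further down this thread shows roughly what to expect):

-- Confirm the indexes exist and that the namespace-sorted listing uses one
-- (look for key=namespace_createatinsec in the EXPLAIN output, not a full scan).
SHOW INDEX FROM run_details;
EXPLAIN SELECT UUID FROM run_details
WHERE Namespace = '<your-namespace>'
ORDER BY CreatedAtInSec DESC LIMIT 10;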

@vanHavel

Thanks @thesuperzapper!
Manually creating the indexes made our kfp v1 list_runs much faster!

@rimolive (Member)

Closing this issue, as the fix is already in place.

/close


@rimolive: Closing this issue.

In response to this:

Closing this issue, as the fix is already in place.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@droctothorpe (Contributor) commented May 3, 2024

We're on 2.0.5 and I've confirmed that those indexes are present in the run_details table, but it still regularly takes 5-10 seconds to load the run page. Is anyone else still running into this, even with the indexes in place? I filed #10778, which includes a lot more details.

Here are the indexes in question:

+-------------+------------+-------------------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table       | Non_unique | Key_name                                  | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-------------+------------+-------------------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| run_details |          0 | PRIMARY                                   |            1 | UUID            | A         |       20638 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_createatinsec              |            1 | ExperimentUUID  | A         |        1193 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_createatinsec              |            2 | CreatedAtInSec  | A         |       20660 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_conditions_finishedatinsec |            1 | ExperimentUUID  | A         |        1033 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_conditions_finishedatinsec |            2 | Conditions      | A         |        1407 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | experimentuuid_conditions_finishedatinsec |            3 | FinishedAtInSec | A         |       20230 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_createatinsec                   |            1 | Namespace       | A         |          69 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_createatinsec                   |            2 | CreatedAtInSec  | A         |       20660 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_conditions_finishedatinsec      |            1 | Namespace       | A         |          58 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_conditions_finishedatinsec      |            2 | Conditions      | A         |         202 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| run_details |          1 | namespace_conditions_finishedatinsec      |            3 | FinishedAtInSec | A         |       20660 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-------------+------------+-------------------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+

@tomaszstachera (Author)

> @tomaszstachera I bet you can back-port the fix from #9806 into older versions of KFP (like 2.0.0-alpha.7, which is the one from Kubeflow 1.7).
>
> You can create the same indexes as that PR with the following MySQL queries:
>
> CREATE INDEX namespace_createatinsec ON run_details (Namespace, CreatedAtInSec);
> CREATE INDEX namespace_conditions_finishedatinsec ON run_details (Namespace, Conditions, FinishedAtInSec);

Manually creating the indexes helped shrink the response time for Runs to 5 seconds. Thanks!
