Benchmark and performance report for KFP instance #3259
The first stage is done (client-side scripts have been submitted), so move this issue to post-1.0 for the second-stage work (i.e., stats collection in the inverse proxy and on the server side).
Stage 2: Leverage one of the open-source monitoring tools, such as Prometheus or TimescaleDB, to support internal monitoring of our servers. That will give us recorded time-series data and visualization over those data.
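As a hedged illustration of the Stage 2 idea (not KFP's actual instrumentation): a minimal sketch using the official `prometheus_client` Python library, where the metric names, labels, and port are illustrative assumptions.

```python
# A minimal sketch of server-side metrics export, assuming Python and the
# official prometheus_client library. The metric names and the port are
# illustrative, not KFP's actual instrumentation.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics a KFP API server might export.
REQUEST_LATENCY = Histogram(
    "kfp_api_request_latency_seconds", "Latency of API requests", ["endpoint"]
)
REQUEST_COUNT = Counter(
    "kfp_api_requests_total", "Total API requests", ["endpoint", "status"]
)

def handle_request(endpoint: str) -> None:
    """Simulate serving a request while recording latency and a counter."""
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUEST_COUNT.labels(endpoint=endpoint, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request("/apis/v1beta1/runs")
```

Prometheus would then be configured to scrape the `/metrics` endpoint, and the recorded time series can be visualized in a tool such as Grafana.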
Both stages are done; let's mark this issue as closed.
A good next step is to have a pipeline that load-tests KFP. |
KFP users constantly wonder about the scalability and capacity of their KFP deployment, e.g., what is the maximum number of pipelines/runs I can create in one KFP deployment?
That might appear to be an easy question, but in reality it disguises multiple related yet independent questions. If something unexpected happens on a KFP instance and seems to be triggered when a certain number of pipelines/runs/run metrics is reached, the actual cause under the hood can be (1) they went beyond the designed capacity, (2) they haven't allocated enough cloud resources, or (3) a KFP bug. E.g., it has been reported that upgrading to machines with more CPUs/RAM once solved a multi-threading issue in a TFX pipeline training step. Another example: our DB schema used to limit the pipeline description to a certain length. A further example: a previous bug caused our DB query to get truncated when the number of run metrics went beyond a certain threshold.
In short, even if an issue in KFP usage seems to be triggered by the number of pipelines/runs/metrics, that doesn't necessarily mean it is a scalability or capacity issue. However, if it is, we can optimize our design/implementation to lift the upper bound of KFP's capacity and make KFP more scalable.
And the very first step to measuring and improving our scalability and capacity is to load-test our KFP. For load testing, a simple solution is a benchmark script that repeatedly creates pipelines, pipeline versions, experiments, runs, and recurring runs with different amounts of run metrics, and then verifies the output. A starting point would look like the script here: https://github.com/dldaisy/KFP_CI_samples/blob/master/sdk-client-version-test/test_create_pipeline_and_version.py (see the sketch below).
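As a hedged sketch in that spirit, assuming the KFP 1.x Python SDK (`kfp.Client`), a reachable KFP endpoint, and a pre-compiled pipeline package; the host URL, package path, and load parameters below are placeholders, not values from this issue:

```python
# A hedged load-test loop against a KFP instance, assuming the KFP 1.x SDK
# (`pip install kfp`). HOST, PACKAGE, and the counts are placeholders.
import time

import kfp

HOST = "http://localhost:8080"       # placeholder KFP API endpoint
PACKAGE = "benchmark_pipeline.yaml"  # placeholder compiled pipeline package
NUM_PIPELINES = 10                   # placeholder load parameters
RUNS_PER_PIPELINE = 5

client = kfp.Client(host=HOST)
experiment = client.create_experiment("load-test")

latencies = []
for i in range(NUM_PIPELINES):
    # Each iteration uploads a fresh pipeline and fans out several runs.
    pipeline = client.upload_pipeline(PACKAGE, pipeline_name=f"bench-{i}")
    for j in range(RUNS_PER_PIPELINE):
        start = time.time()
        run = client.run_pipeline(
            experiment_id=experiment.id,
            job_name=f"bench-{i}-run-{j}",
            pipeline_id=pipeline.id,
        )
        # Verify the run actually succeeds, not just that the API accepted it.
        result = client.wait_for_run_completion(run.id, timeout=600)
        assert result.run.status == "Succeeded", result.run.status
        latencies.append(time.time() - start)

print(f"{len(latencies)} runs, mean end-to-end latency "
      f"{sum(latencies) / len(latencies):.1f}s")
```

Sequential runs keep the measurement simple; a concurrency knob (e.g., a thread pool) could be added later to probe behavior under parallel load.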
But we can have a discussion here and refine our approach.
Later, when we finalize our script, we'll also have it run periodically to prevent introducing changes that impact performance in unexpected ways. Moreover, we can provide the script to our users in case they are interested in running it against their own deployment (just be aware of the resource cost when load-testing your own deployment).
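As a hedged illustration of what such a periodic regression check could look like, assuming the load-test script writes its per-run latencies to a JSON file (the file name and threshold here are hypothetical):

```python
# Hypothetical regression gate for a periodic CI job; fails loudly when the
# 95th-percentile latency exceeds a baseline budget.
import json
import statistics
import sys

# Hypothetical input: per-run latencies (seconds) written by the load-test script.
with open("latencies.json") as f:
    latencies = json.load(f)

p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile (Python 3.8+)
BASELINE_P95 = 120.0  # hypothetical budget in seconds, tuned from past runs

if p95 > BASELINE_P95:
    print(f"Performance regression: p95 {p95:.1f}s exceeds baseline {BASELINE_P95}s")
    sys.exit(1)  # fail the periodic job so the regression is noticed
print(f"OK: p95 {p95:.1f}s within baseline {BASELINE_P95}s")
```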
/cc @rmgogogo
/cc @Bobgy
/cc @Ark-kun