Benchmark and performance report for KFP instance #3259

Closed
jingzhang36 opened this issue Mar 11, 2020 · 6 comments
Labels: area/backend, lifecycle/stale, priority/p1, status/triaged

Comments

@jingzhang36
Contributor

KFP users constantly wonder about the scalability and capacity of their KFP deployment, e.g., what is the maximum number of pipelines/runs I can create in one KFP deployment?

That might appear to be an easy question, but in reality it disguises multiple related yet independent questions. If something unexpected happens on a KFP instance and seems to be triggered once a certain number of pipelines/runs/run metrics is reached, the actual cause under the hood can be (1) they went beyond the designed capacity, (2) they haven't allocated enough cloud resources, or (3) a KFP bug. E.g., it has been reported that upgrading to machines with more CPUs/RAM once solved a multi-threading issue in a TFX pipeline training step. Another example: our DB schema used to limit the pipeline description to a certain length. A further example: a previous bug caused our DB queries to get truncated when the number of run metrics went beyond a certain threshold.

In short, even if an issue in KFP usage seems to be triggered by the number of pipelines/runs/metrics, that doesn't necessarily mean it is a scalability or capacity issue. When it is, though, we can optimize our design/implementation to lift the upper bound of KFP capacity and make KFP more scalable.

The very first step to measuring and improving our scalability and capacity is to load test KFP. For load testing, a simple solution is a benchmark script that repeatedly creates pipelines, pipeline versions, experiments, runs, and recurring runs with varying numbers of run metrics, and then verifies the output. A starting point would be a script like https://github.com/dldaisy/KFP_CI_samples/blob/master/sdk-client-version-test/test_create_pipeline_and_version.py (see the sketch below).
But we can discuss here and refine our approach.
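
For illustration, here is a minimal sketch of such a benchmark loop, assuming the KFP v1 Python SDK (`kfp.Client`). The `HOST` endpoint, the counts, and the pre-compiled `pipeline.tar.gz` package are placeholders, not values from the script linked above:

```python
# Minimal load-test sketch using the KFP v1 SDK.
# HOST, the counts, and pipeline.tar.gz are placeholders; adjust to your deployment.
import time
import kfp

HOST = "http://localhost:8888"   # hypothetical KFP API endpoint
PACKAGE = "pipeline.tar.gz"      # any pre-compiled pipeline package
N_PIPELINES = 50                 # scale these up to probe capacity
N_VERSIONS = 5
N_RUNS = 10

client = kfp.Client(host=HOST)
experiment = client.create_experiment("load-test")

for i in range(N_PIPELINES):
    pipeline = client.upload_pipeline(PACKAGE, pipeline_name=f"load-test-{i}")
    for v in range(N_VERSIONS):
        client.upload_pipeline_version(
            PACKAGE, f"load-test-{i}-v{v}", pipeline_id=pipeline.id)
    for r in range(N_RUNS):
        start = time.time()
        run = client.run_pipeline(
            experiment.id, f"load-test-{i}-run-{r}", pipeline_id=pipeline.id)
        print(f"run {run.id} created in {time.time() - start:.2f}s")
```

Scaling up N_PIPELINES/N_VERSIONS/N_RUNS while recording per-call latency is the simplest way to find where a deployment starts to degrade.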

Later, when we finalize our script, we'll also run it periodically to catch changes that regress performance in unexpected ways. Moreover, we can provide the script to users who are interested in running it against their own deployments (just be aware of the resource cost when load testing your own deployment).

/cc @rmgogogo
/cc @Bobgy
/cc @Ark-kun

@jingzhang36 jingzhang36 self-assigned this Mar 11, 2020
@Bobgy Bobgy added the status/triaged Whether the issue has been explicitly triaged label Mar 18, 2020
@jingzhang36
Contributor Author

Additional request from #2571:
"
Since this matters mostly in production scenarios, we should prioritize CloudSQL/GCS testing over in-cluster storage.

Related issues
#2071
"

@jingzhang36
Contributor Author

The first stage (client-side scripts) has been submitted. So move this issue to post-1.0 for the second-stage work (i.e., stats collection in the inverse proxy and on the server side).

@jingzhang36
Contributor Author

Stage 2:

Leverage one of the open source monitoring tools, such as Prometheus or TimescaleDB, to support internal monitoring of our servers. That will give us recorded time-series data and visualization over those data.
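
As a concrete (hypothetical) illustration of what such instrumentation could look like, here is a sketch using the Python `prometheus_client` library; the metric names, labels, and port are made up for illustration and are not KFP's actual metrics (the real API server is written in Go):

```python
# Hypothetical server-side instrumentation sketch with prometheus_client.
# Metric names and port are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "kfp_api_request_seconds", "API request latency", ["endpoint"])
REQUEST_ERRORS = Counter(
    "kfp_api_request_errors_total", "API request errors", ["endpoint"])

def handle_create_run():
    # Stand-in for a real handler such as CreateRun.
    with REQUEST_LATENCY.labels(endpoint="create_run").time():
        try:
            time.sleep(random.uniform(0.01, 0.1))  # simulated work
        except Exception:
            REQUEST_ERRORS.labels(endpoint="create_run").inc()
            raise

if __name__ == "__main__":
    start_http_server(9090)  # Prometheus scrapes /metrics on this port
    while True:
        handle_create_run()
```

Prometheus would then scrape the /metrics endpoint periodically, giving us the time series needed to graph request latency and error rates under load.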

@stale

stale bot commented Sep 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Sep 11, 2020
@Bobgy
Contributor

Bobgy commented Sep 11, 2020

Both stages done, let's mark as closed

@Bobgy Bobgy closed this as completed Sep 11, 2020
@jingzhang36
Contributor Author

A good next step is to have a pipeline that load-tests KFP.
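
A rough sketch of what such a wrapper pipeline could look like, assuming the KFP v1 DSL; the container image and script path are hypothetical placeholders:

```python
# Hypothetical wrapper pipeline that runs the load-test script as a step.
# The image and script path are placeholders.
import kfp.dsl as dsl

@dsl.pipeline(name="kfp-load-test",
              description="Runs the load-test script against a KFP instance.")
def load_test_pipeline(host: str = "http://ml-pipeline:8888"):
    dsl.ContainerOp(
        name="load-test",
        image="gcr.io/my-project/kfp-load-test:latest",  # hypothetical image
        command=["python", "load_test.py"],
        arguments=["--host", host],
    )
```

Running this pipeline on a schedule (as a recurring run) would double as the periodic performance regression check mentioned above.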
