Benchmark and performance report for KFP instance #3259
The first stage is done (client-side scripts have been submitted), so move this issue to post-1.0 for the second-stage work (i.e., stats collection in the inverse proxy and on the server side).
Stage 2: Leverage one of the open-source monitoring tools, such as Prometheus or TimescaleDB, to support internal monitoring of our servers. That will give us recorded time-series data and visualization over those data.
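As a hedged illustration of the Stage 2 idea (not KFP's actual instrumentation): a minimal sketch using the official `prometheus_client` Python library, where the metric names, labels, and port are illustrative assumptions.

```python
# A minimal sketch of server-side metrics export, assuming Python and the
# official prometheus_client library. The metric names and the port are
# illustrative, not KFP's actual instrumentation.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics a KFP API server might export.
REQUEST_LATENCY = Histogram(
    "kfp_api_request_latency_seconds", "Latency of API requests", ["endpoint"]
)
REQUEST_COUNT = Counter(
    "kfp_api_requests_total", "Total API requests", ["endpoint", "status"]
)

def handle_request(endpoint: str) -> None:
    """Simulate serving a request while recording latency and a counter."""
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUEST_COUNT.labels(endpoint=endpoint, status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request("/apis/v1beta1/runs")
```

Prometheus would then be configured to scrape the `/metrics` endpoint, and the recorded time series can be visualized in a tool such as Grafana.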
Both stages are done; let's mark this issue as closed.
A good next step is to have a pipeline that load-tests KFP. |
KFP users constantly wonder about the scalability and capacity of their KFP deployment, e.g., what is the maximum number of pipelines/runs I can create in one KFP deployment?
That might appear to be an easy question, but in reality it disguises multiple related yet independent questions. If something unexpected happens on a KFP instance and seems to be triggered when a certain number of pipelines/runs/run metrics is reached, the actual cause under the hood can be (1) they went beyond the designed capacity, (2) they haven't allocated enough cloud resources, or (3) a KFP bug. E.g., it has been reported that upgrading to machines with more CPUs/RAM once solved a multi-threading issue in a TFX pipeline training step. Another example: our DB schema used to limit the pipeline description to a certain length. A further example: a previous bug caused our DB query to get truncated when the number of run metrics went beyond a certain threshold.
In short, even if an issue in KFP usage seems to be triggered by the number of pipelines/runs/metrics, that doesn't necessarily mean it is a scalability or capacity issue. However, if it is, we can optimize our design/implementation to lift the upper bound of KFP's capacity and make KFP more scalable.
And the very first step to measuring and improving our scalability and capacity is to load-test our KFP. For load testing, a simple solution is a benchmark script that repeatedly creates pipelines, pipeline versions, experiments, runs, and recurring runs with different amounts of run metrics, and then verifies the output. A starting point would look like the script here: https://github.com/dldaisy/KFP_CI_samples/blob/master/sdk-client-version-test/test_create_pipeline_and_version.py (see the sketch below).
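As a hedged sketch in that spirit, assuming the KFP 1.x Python SDK (`kfp.Client`), a reachable KFP endpoint, and a pre-compiled pipeline package; the host URL, package path, and load parameters below are placeholders, not values from this issue:

```python
# A hedged load-test loop against a KFP instance, assuming the KFP 1.x SDK
# (`pip install kfp`). HOST, PACKAGE, and the counts are placeholders.
import time

import kfp

HOST = "http://localhost:8080"       # placeholder KFP API endpoint
PACKAGE = "benchmark_pipeline.yaml"  # placeholder compiled pipeline package
NUM_PIPELINES = 10                   # placeholder load parameters
RUNS_PER_PIPELINE = 5

client = kfp.Client(host=HOST)
experiment = client.create_experiment("load-test")

latencies = []
for i in range(NUM_PIPELINES):
    # Each iteration uploads a fresh pipeline and fans out several runs.
    pipeline = client.upload_pipeline(PACKAGE, pipeline_name=f"bench-{i}")
    for j in range(RUNS_PER_PIPELINE):
        start = time.time()
        run = client.run_pipeline(
            experiment_id=experiment.id,
            job_name=f"bench-{i}-run-{j}",
            pipeline_id=pipeline.id,
        )
        # Verify the run actually succeeds, not just that the API accepted it.
        result = client.wait_for_run_completion(run.id, timeout=600)
        assert result.run.status == "Succeeded", result.run.status
        latencies.append(time.time() - start)

print(f"{len(latencies)} runs, mean end-to-end latency "
      f"{sum(latencies) / len(latencies):.1f}s")
```

Sequential runs keep the measurement simple; a concurrency knob (e.g., a thread pool) could be added later to probe behavior under parallel load.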
But we can have a discussion here and refine our approach.
Later, when we finalize our script, we'll also have it run periodically to prevent introducing changes that impact performance in unexpected ways. Moreover, we can provide the script to our users in case they are interested in running it against their own deployment (just be aware of the resource cost when load-testing your own deployment).
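As a hedged illustration of what such a periodic regression check could look like, assuming the load-test script writes its per-run latencies to a JSON file (the file name and threshold here are hypothetical):

```python
# Hypothetical regression gate for a periodic CI job; fails loudly when the
# 95th-percentile latency exceeds a baseline budget.
import json
import statistics
import sys

# Hypothetical input: per-run latencies (seconds) written by the load-test script.
with open("latencies.json") as f:
    latencies = json.load(f)

p95 = statistics.quantiles(latencies, n=20)[18]  # 95th percentile (Python 3.8+)
BASELINE_P95 = 120.0  # hypothetical budget in seconds, tuned from past runs

if p95 > BASELINE_P95:
    print(f"Performance regression: p95 {p95:.1f}s exceeds baseline {BASELINE_P95}s")
    sys.exit(1)  # fail the periodic job so the regression is noticed
print(f"OK: p95 {p95:.1f}s within baseline {BASELINE_P95}s")
```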
/cc @rmgogogo
/cc @Bobgy
/cc @Ark-kun