Spike: design example kedro projects that can be used to assess performance issues #3957
Comments
Heavy dependency imports would be great here too |
Core dependencies of Kedro or just any? |
Sorry, I meant things like PyTorch / TensorFlow / Spark / pandas |
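As a rough illustration of how the import cost of a heavy dependency could be measured, here is a minimal sketch using only the standard library. The helper name `time_import` is hypothetical, and `json` and `csv` are cheap stand-ins for packages like PyTorch or Spark. The number is only approximate, because transitive dependencies that are already loaded stay cached in the process.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the seconds taken to (re)import `module_name` in this process."""
    # Drop cached copies of the module and its submodules so the import
    # machinery has to do the work again. Modules it depends on stay cached.
    for name in list(sys.modules):
        if name == module_name or name.startswith(module_name + "."):
            del sys.modules[name]
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in module names; swap in the heavy dependencies of interest.
    for module in ["json", "csv"]:
        print(f"{module}: {time_import(module) * 1e3:.2f} ms")
```

For a true cold-start figure, timing a fresh interpreter per module would likely be more reliable than evicting entries from `sys.modules`.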
As mentioned in #3732, large parameter files seem to slow things down. |
Gathered some topics from the Kedro Slack archive & GitHub
Bottleneck:
What is considered slow?
We could also approach this at a component level first, i.e. how slow is DataCatalog when the number of datasets scales up, and how slow is pipeline sum when the number of pipelines scales up? The outcome of this issue is to create some ideas/scripts that can be reused, so that we can benchmark performance on an ongoing basis (maybe included in CI, or triggered manually from time to time). |
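The component-level idea above could be sketched roughly as follows. This is not Kedro code: `build_fake_catalog` is a hypothetical stand-in that only builds a dictionary of dataset entries, and `benchmark` is an illustrative helper. In a real script the builder would construct the actual component under test.

```python
import time
from typing import Callable, Dict, List


def build_fake_catalog(n_datasets: int) -> Dict[str, dict]:
    """Build a hypothetical catalog-like config with `n_datasets` entries."""
    return {
        f"dataset_{i}": {"type": "MemoryDataset", "index": i}
        for i in range(n_datasets)
    }


def benchmark(
    build: Callable[[int], object], sizes: List[int], repeats: int = 3
) -> Dict[int, float]:
    """Return the best-of-`repeats` build time in seconds for each size."""
    results: Dict[int, float] = {}
    for size in sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            build(size)
            timings.append(time.perf_counter() - start)
        # The minimum is less sensitive to noise from other processes.
        results[size] = min(timings)
    return results


if __name__ == "__main__":
    sizes = [10, 100, 1_000, 10_000]
    for size, seconds in benchmark(build_fake_catalog, sizes).items():
        print(f"{size:>6} datasets: {seconds * 1e3:.3f} ms")
```

Because `benchmark` only takes a callable and a list of sizes, the same loop could be reused for other components, such as summing a growing number of pipelines.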
I upvote the tests for:
Some less obvious:
|
I suggest focus on two things:
|
I spoke to @rashidakanchwala today and we concluded that the size of the pipeline is usually not the bottleneck for Viz, so we will forgo creating a project with complex (nested) modular pipelines. There is some evidence (Improve resume pipeline suggestion for SequentialRunner by jmholzer · Pull Request #1795 · kedro-org/kedro · GitHub) that a pipeline usually scales reasonably well with the number of nodes, up to 1000. This is my initial idea; I would like to tackle this in two parts:
Pipeline stress test:
The goal of this is to reduce the overhead of setting up a realistic, complex project. This usually includes remote storage, a PySpark connection, etc. We can use this as an example:
Component stress test:
The direction here is simple: we want to measure how the time changes with the number of entries. We would start with Datasets and Catalog, as this fits in the
This can address:
While we are creating the pipeline, we should think about how to scale this in the future (if we have a new thing to test, where and how do we add it?). This may need some flags to turn tests on/off, plus documentation. |
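One way to picture the pipeline stress test without remote storage or a Spark connection is a chain of dummy nodes with an artificial delay. The sketch below is plain Python, not the Kedro API: `make_node`, `make_chain`, and `run_chain` are hypothetical names, and `time.sleep` is only a stand-in for real I/O latency.

```python
import time
from typing import Callable, Dict, List, Tuple

# (function, input name, output name); a simplified stand-in for a node.
Node = Tuple[Callable[[int], int], str, str]


def make_node(delay: float) -> Callable[[int], int]:
    """Create a dummy node function that waits `delay` seconds, then adds 1."""
    def _node(value: int) -> int:
        time.sleep(delay)  # simulated I/O latency (assumption)
        return value + 1
    return _node


def make_chain(n_nodes: int, delay: float = 0.0) -> List[Node]:
    """Build a linear chain where node i reads data_i and writes data_{i+1}."""
    return [
        (make_node(delay), f"data_{i}", f"data_{i + 1}")
        for i in range(n_nodes)
    ]


def run_chain(nodes: List[Node], initial: int = 0) -> Dict[str, int]:
    """Run the nodes in order, keeping every output in an in-memory store."""
    store = {"data_0": initial}
    for func, src, dst in nodes:
        store[dst] = func(store[src])
    return store


if __name__ == "__main__":
    start = time.perf_counter()
    outputs = run_chain(make_chain(n_nodes=100, delay=0.001))
    print(f"final value: {outputs['data_100']}")
    print(f"elapsed: {time.perf_counter() - start:.3f} s")
```

The `n_nodes` and `delay` arguments play the role of the on/off flags mentioned above: new scenarios can be added as extra parameters rather than as new projects.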
Thanks for the summary @noklam. Just one thought on the Pipeline stress test: Not sure if
This sounds OK. Maybe we need a bit more clarity on what it means to create a synthetic project, e.g. test
Otherwise looking for a "realistic" project might be hard. About component stress test, the plan sounds good 👍🏼 |
Thank you, @noklam!
|
What do you have in mind for stress testing Runners? Generate some dummy nodes and use different types of runners to execute them? Or do we need different types of workload for the Runners, e.g. I/O-bound for ThreadRunner and CPU-bound for ParallelRunner? |
I was thinking of having at least three different pipelines for them—one per runner to stress them: one random for sequential, one with external I/O for ThreadRunner, and one that can be run in parallel (at least several processes) for ParallelRunner. So we can check that their main functionality is not affected by changes and makes sense. That's also useful for upcoming |
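To make the I/O-bound versus CPU-bound distinction concrete, here is a minimal sketch with the standard library. It does not use the Kedro runners: `ThreadPoolExecutor` is only a stand-in for thread-based execution, and all function names are hypothetical. A process-based variant for the CPU-bound case is left out to keep the example short.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def io_bound_task(delay: float) -> float:
    """Simulate external I/O by sleeping for `delay` seconds."""
    time.sleep(delay)
    return delay


def cpu_bound_task(n: int) -> int:
    """Simulate CPU work by summing the squares of 0..n-1."""
    return sum(i * i for i in range(n))


def run_sequential(func: Callable, args: Sequence) -> List:
    """Run the tasks one after another."""
    return [func(arg) for arg in args]


def run_threaded(func: Callable, args: Sequence, max_workers: int = 4) -> List:
    """Run the tasks on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, args))


def timed(runner: Callable, func: Callable, args: Sequence) -> Tuple[List, float]:
    """Return the runner's results and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    results = runner(func, args)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.05] * 8
    _, seq = timed(run_sequential, io_bound_task, delays)
    _, thr = timed(run_threaded, io_bound_task, delays)
    print(f"I/O-bound: sequential {seq:.2f} s, threaded {thr:.2f} s")
```

Threads should help noticeably on the sleeping tasks, while the CPU-bound task is the kind of workload where a process-based runner would be the more relevant comparison.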
It may be interesting to have memory profiling too; it would be helpful for addressing issues like
Yes let's include it. |
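For the memory profiling part, a minimal sketch with the standard library's `tracemalloc` might look like this. The helper `measure_peak_memory` is a hypothetical name, and the list allocation is only a stand-in workload. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside native libraries may not show up.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def measure_peak_memory(func: Callable, *args: Any, **kwargs: Any) -> Tuple[Any, int]:
    """Run `func` and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def allocate(n: int) -> list:
    """Stand-in workload: allocate a list with `n` elements."""
    return [0] * n


if __name__ == "__main__":
    _, peak = measure_peak_memory(allocate, 1_000_000)
    print(f"peak traced memory: {peak / 1e6:.1f} MB")
```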
I've moved this to There is some additional scope from the review comments; I'd like to split it into additional tickets to make sure the scope of this ticket doesn't grow too big. Implementation will be carried out in #3866; I believe @lrcouto has already started on the pipeline test. |
Happy to go forward with the approach of creating a project for pipeline stress testing and separately stress test components. Please go ahead and create follow up tickets. One thing I don't see suggestions on yet is the maintenance model for these testing projects and when and how they'll get run: automatically, before a release, on every PR, etc? |
I think it would be good to create some sort of automated process to run the projects before releases for sure. On every PR, as part of regular CI or similar, I think could be a bit slow or cumbersome. |
@merelcht I have opened a new ticket. My current idea is that the tests should be easy to run both locally and as a GitHub Action. We may use a tag or branch name to conditionally trigger the performance tests. For example, |
Description
Prework for #3866
Context
In order to create example Kedro projects that can be used to assess the performance of Kedro and Kedro-Viz, we need to gather requirements for what defines a complex pipeline. Some of the moving parts are the number of nodes, the number of pipelines, and the number of datasets, but that might not be all that's required to create a proper "family" of test projects.
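One possible way to describe such a "family" is a small grid over the moving parts listed above. The sketch below is only an illustration: `ProjectShape` and `project_family` are hypothetical names, and the three dimensions are the ones mentioned in this description, not an agreed list.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence


@dataclass(frozen=True)
class ProjectShape:
    """Size parameters for one synthetic test project (assumed dimensions)."""
    n_pipelines: int
    n_nodes: int  # nodes per pipeline
    n_datasets: int

    @property
    def name(self) -> str:
        """Short identifier, usable as a project or report name."""
        return f"p{self.n_pipelines}_n{self.n_nodes}_d{self.n_datasets}"


def project_family(
    pipelines: Sequence[int], nodes: Sequence[int], datasets: Sequence[int]
) -> Iterator[ProjectShape]:
    """Yield one shape per combination of the given sizes."""
    for p, n, d in product(pipelines, nodes, datasets):
        yield ProjectShape(n_pipelines=p, n_nodes=n, n_datasets=d)


if __name__ == "__main__":
    for shape in project_family([1, 10], [10, 100, 1000], [10, 1000]):
        print(shape.name)
```

Further dimensions could be added as extra fields once the requirements are gathered.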
Possible Implementation
Good starting point: https://github.com/noklam/kedro-example/blob/master/stress-test-pipeline/src/stress_test_pipeline/pipeline.py