Create massive pipeline to test with flowchart on Kedro-viz #1064

Closed
1 task done
rashidakanchwala opened this issue Sep 14, 2022 · 14 comments

Comments

@rashidakanchwala
Contributor

rashidakanchwala commented Sep 14, 2022

Description

Create a massive kedro-viz pipeline to stress-test flowchart features.

Context

The fluidity of flowchart interactions depends on the size of the pipeline. Currently we don't have any massive pipelines, so we cannot stress-test many Kedro-Viz features. We know that a lot of data science projects have huge pipelines. This issue is to make sure we build Kedro-Viz so that it also handles massive pipelines.

Possible Implementation

Maybe we can just create a big JSON file containing multiple large pipelines.
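A minimal sketch of how such a file could be generated, using only the standard library. The `make_synthetic_pipeline` helper and the `nodes`/`edges` layout are hypothetical and chosen for illustration; this is not the JSON schema Kedro-Viz actually consumes, so the output would still need to be mapped onto the real format.

```python
import json
import random


def make_synthetic_pipeline(n_tasks: int, max_inputs: int = 3, seed: int = 0) -> dict:
    """Build a random DAG of task nodes connected through datasets.

    NOTE: the "nodes"/"edges" layout is a hypothetical schema for
    illustration, not the format Kedro-Viz reads.
    """
    rng = random.Random(seed)
    nodes = [{"id": "data_0", "type": "data"}]
    edges = []
    datasets = ["data_0"]  # datasets that exist so far
    for i in range(n_tasks):
        task_id = f"task_{i}"
        output_id = f"data_{i + 1}"
        # Reading only from already-existing datasets keeps the graph acyclic.
        n_inputs = min(len(datasets), rng.randint(1, max_inputs))
        inputs = rng.sample(datasets, k=n_inputs)
        nodes.append({"id": task_id, "type": "task"})
        nodes.append({"id": output_id, "type": "data"})
        edges.extend({"source": src, "target": task_id} for src in inputs)
        edges.append({"source": task_id, "target": output_id})
        datasets.append(output_id)
    return {"nodes": nodes, "edges": edges}


graph = make_synthetic_pipeline(1000)
payload = json.dumps(graph)  # write this string to a file to get the "big JSON"
print(len(graph["nodes"]), "nodes,", len(graph["edges"]), "edges")
```

Fixing the seed keeps the generated graph reproducible between runs, which matters if the file is used as a benchmark fixture.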

Checklist

  • Include labels so that we can categorise your feature request
@rashidakanchwala rashidakanchwala changed the title <Title> Create massive pipeline to test with flowchart with on Kedro-viz Sep 14, 2022
@rashidakanchwala rashidakanchwala changed the title Create massive pipeline to test with flowchart with on Kedro-viz Create massive pipeline to test with flowchart on Kedro-viz Sep 14, 2022
@tynandebold tynandebold moved this to Inbox in Kedro-Viz Sep 14, 2022
@rashidakanchwala
Contributor Author

@jmholzer recently did this in kedro-org/kedro#1795 (comment), where he tested the runner with 1000 nodes. I am wondering if we can create a JSON from that 1000-node pipeline and use it for the above.

@tynandebold
Member

Great idea. Let's try to build this into the demo project so we don't have to maintain two data sources.

Thoughts from backlog grooming.

  • Default pipeline is our current view
  • In the pipeline dropdown we have an item that, when selected, loops through and generates a massive pipeline.

@tynandebold tynandebold moved this from Inbox to Backlog in Kedro-Viz Oct 4, 2022
@tynandebold
Member

Another idea: find a team that has a massive pipeline and get it from them.

@astrojuanlu
Member

I know a few of them 😄

@tynandebold
Member

Please let us know where we can get one!

@rashidakanchwala
Contributor Author

rashidakanchwala commented Jan 15, 2024

We will use the insurex (QB vertical team) sanitized pipeline for this.

@ravi-kumar-pilla
Contributor

Hi Team,

Update:

I reached out to Shubham from CommercialX and got one of their pipelines. He also shared a Box link that walks through the setup. I have set it up locally and kedro viz run seems to load normally, though I had to comment out the Spark session initialization step.

Observations:

  1. If the Spark session is instantiated without using hooks, ignoring hooks by default will have no effect.
  2. Since it is a huge pipeline, an option to align nodes horizontally or vertically would be of great help.
  3. Quickly filtering the DAG by dataset type (for example, to see only SparkDatasets) is not possible. At the moment our filter panel is limited; we should add more filter options.
  4. The load time of the Kedro-Viz DAG is not bad (for this pipeline at least), but it might take longer because of Spark sessions. (Each step needs further investigation.)
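To make observation 3 concrete, here is a small sketch of what filtering by dataset type could look like on the data side. The node dictionaries, the `dataset_type` key and the `filter_by_dataset_type` helper are assumptions made for illustration; they do not describe the existing Kedro-Viz filter panel or its data model.

```python
from typing import Dict, Iterable, List


def filter_by_dataset_type(
    nodes: Iterable[Dict[str, str]], dataset_type: str
) -> List[Dict[str, str]]:
    """Return only the data nodes whose (hypothetical) "dataset_type" matches."""
    return [
        node
        for node in nodes
        if node.get("type") == "data" and node.get("dataset_type") == dataset_type
    ]


# Illustrative nodes; the ids and type names are made up for this example.
nodes = [
    {"id": "raw_sales", "type": "data", "dataset_type": "SparkDataset"},
    {"id": "lookup", "type": "data", "dataset_type": "CSVDataset"},
    {"id": "clean_sales", "type": "task"},
    {"id": "features", "type": "data", "dataset_type": "SparkDataset"},
]

spark_only = filter_by_dataset_type(nodes, "SparkDataset")
print([node["id"] for node in spark_only])
```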

I would like to get some help from the framework team (@SajidAlamQB, @ankatiyar, if anyone has some time) to speed up the local Spark setup and successfully execute kedro run.

Thank you

@ravi-kumar-pilla
Contributor

CommercialX Kedro Viz Testing -

Observations:

  1. Populating the pipelines dict (pipelines) takes about 50% of the time needed to start the server
  2. Kedro Catalog creation takes up considerable time as well

Size of the data -

[Image: size of the data]

RUN 1 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 2.6968612670898438
Time taken to create a kedro session:: 0.44796109199523926
[04/24/24 19:43:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(

Time taken to create a kedro context:: 0.12806415557861328
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 15.315791845321655
[04/24/24 19:44:31] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(

Time taken to create pipeline dictionary:: 23.553779125213623
Time taken to create stats dictionary:: 7.510185241699219e-05
Time taken to load kedro project data:: 42.1427047252655
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 19:44:33] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 19:44:34] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3385379314422607
Time taken to start uvicorn server:: 43.49144387245178
Kedro Viz started successfully.

RUN 2 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.7348659038543701
Time taken to create a kedro session:: 0.2879657745361328
[04/24/24 19:59:22] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(

Time taken to create a kedro context:: 0.12883210182189941
Time taken to create a kedro session store:: 0.0
Time taken to create a kedro catalog:: 13.26403284072876
[04/24/24 19:59:54] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(

Time taken to create pipeline dictionary:: 21.121844053268433
Time taken to create stats dictionary:: 6.508827209472656e-05
Time taken to load kedro project data:: 36.5377631187439
Time taken to populate pipelines:: 1.1920928955078125e-06
[04/24/24 19:59:57] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.4388270378112793
Time taken to start uvicorn server:: 37.98678135871887
Kedro Viz started successfully.

Immediate RUN 3 -

Starting Kedro Viz ...
Time taken to configure/bootstrap project:: 1.6473729610443115
Time taken to create a kedro session:: 0.2387540340423584
[04/24/24 20:01:57] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/kedro/framework/session/session.py:267: KedroDeprecationWarning: Jinja2TemplatedConfigLoader will be deprecated in warnings.py:109
Kedro 0.19. Please use the OmegaConfigLoader instead. To consult the documentation for OmegaConfigLoader, see here:
https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader
warnings.warn(

Time taken to create a kedro context:: 0.12455415725708008
Time taken to create a kedro session store:: 9.5367431640625e-07
Time taken to create a kedro catalog:: 9.044120073318481
[04/24/24 20:02:15] WARNING /Users/Ravi_Kumar_Pilla/opt/anaconda3/envs/promotion/lib/python3.9/site-packages/pyspark/pandas/init.py:47: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is warnings.py:109
required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context
already launched.
warnings.warn(

Time taken to create pipeline dictionary:: 9.573238134384155
Time taken to create stats dictionary:: 4.982948303222656e-05
Time taken to load kedro project data:: 20.628222227096558
Time taken to populate pipelines:: 9.5367431640625e-07
[04/24/24 20:02:16] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
[04/24/24 20:02:17] WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
WARNING Cannot find parameter feature_generation.item.transaction.window.drop_columns_list in the catalog. flowchart.py:1006
Time taken to populate viz repositories:: 1.3532860279083252
Time taken to start uvicorn server:: 21.99152898788452
Kedro Viz started successfully.
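The shares behind the two observations can be checked directly from the timings in the three runs. The sketch below copies the catalog, pipeline dictionary and uvicorn figures from the logs; it assumes that the "start uvicorn server" figure is the cumulative startup time, which looks plausible because it is close to the sum of "load kedro project data" and "populate viz repositories".

```python
# Timings in seconds, copied from the three runs in the logs.
# Assumption: "total" (time to start the uvicorn server) is cumulative.
runs = {
    "RUN 1": {"catalog": 15.315791845321655, "pipelines": 23.553779125213623, "total": 43.49144387245178},
    "RUN 2": {"catalog": 13.26403284072876, "pipelines": 21.121844053268433, "total": 37.98678135871887},
    "RUN 3": {"catalog": 9.044120073318481, "pipelines": 9.573238134384155, "total": 21.99152898788452},
}

shares = {
    name: {
        "catalog": t["catalog"] / t["total"],
        "pipelines": t["pipelines"] / t["total"],
    }
    for name, t in runs.items()
}

for name, share in shares.items():
    print(f"{name}: catalog {share['catalog']:.1%}, pipeline dictionary {share['pipelines']:.1%}")
```

On these numbers the pipeline dictionary accounts for roughly 44% to 56% of startup and the catalog for roughly 35% to 41%, so together they make up the large majority of the time in every run.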

@ravi-kumar-pilla ravi-kumar-pilla moved this from In Progress to Todo in Kedro-Viz Apr 25, 2024
@astrojuanlu
Member

  • Populating the pipelines dict (pipelines) takes about 50% of the time needed to start the server
  • Kedro Catalog creation takes up considerable time as well

Good to know. What are the next steps?

The logs are a bit difficult to read. Maybe it would help to see a flamegraph, like this kedro-org/kedro#3033 (comment)
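As a rough starting point, profiling data of this kind can be collected with the standard library's `cProfile`. This is only a sketch: `load_project_data` is a hypothetical stand-in for the slow startup work, and `cProfile` produces a table of cumulative times rather than a flamegraph, so an external viewer would be needed to render the saved profile graphically.

```python
import cProfile
import io
import pstats
import time


def load_project_data() -> None:
    """Hypothetical stand-in for the slow startup work being investigated."""
    time.sleep(0.05)


profiler = cProfile.Profile()
profiler.enable()
load_project_data()
profiler.disable()

# Render the ten most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

Calling `profiler.dump_stats("startup.prof")` instead of printing would save the raw profile to a file for use with a visualisation tool.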

@astrojuanlu
Member

Also notice that, while testing with internal projects is useful, for us to confidently move forward with this we will probably have to generate some open source synthetic projects to test. See kedro-org/kedro#3790 for past discussion about this

@ravi-kumar-pilla ravi-kumar-pilla moved this from Todo to In Progress in Kedro-Viz May 1, 2024
@ravi-kumar-pilla
Contributor

Hi @astrojuanlu, thank you for the suggestions. I tested with the tools you mentioned and also prepared some rough notes on the next steps here.

To summarize: as a first step, loading the Kedro data asynchronously (async loading test branch) would help improve the Kedro-Viz load time for larger pipelines. If there are any new findings on the internal implementation of Kedro, I would be happy to discuss them in the next Tech design.
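The general idea can be sketched with `asyncio`: the blocking load runs in a worker thread while the event loop stays free to do other work. This illustrates the pattern only and is not the code in the async loading test branch; `load_kedro_data` and its return value are hypothetical placeholders.

```python
import asyncio
import time


def load_kedro_data() -> dict:
    """Hypothetical stand-in for the blocking catalog/pipeline loading step."""
    time.sleep(0.1)
    return {"pipelines": ["__default__"]}


async def main() -> dict:
    loop = asyncio.get_running_loop()
    # Run the blocking load in a worker thread so the event loop is not blocked.
    loading = loop.run_in_executor(None, load_kedro_data)
    # A real server could already start accepting requests at this point.
    print("server is up; data still loading")
    data = await loading
    print("data ready")
    return data


result = asyncio.run(main())
```

The trade-off is that anything served before the load completes has to cope with the data not being available yet.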

Thank you

@astrojuanlu
Member

Thanks @ravi-kumar-pilla. To summarize from the internal document:

Insights

  • It takes a long time to initialise the Kedro modules and reach the actual kedro viz run command (already sort of known, [spike] Improve Kedro CLI startup time kedro#1476)
  • The expensive operation before starting the viz server is loading the data from the Kedro session (possibly related to Lazy Loading of Catalog Items kedro#2829?)
  • Most of the time taken to load the data is from catalog and pipelines_dict resolution, which worsens as the pipeline count increases

Next steps

  1. Stress test with https://github.com/noklam/kedro-example/tree/master/stress-test-pipeline and summarize the results
  2. Check for internals of _get_catalog() and pipelines to further optimize

And if I may add, I think

@astrojuanlu
Member

Adding a bit more context after a quick discussion:

  • These performance bottlenecks affect all projects, not only large ones, because startup times for Kedro are exceedingly long, and also the data is seemingly loaded in sequence cc @yetudada
  • We will likely need not 1, but several "massive pipelines" to do a comprehensive performance analysis, where "massive" means

@astrojuanlu
Member

I'm not sure there's anything else for us to do here.

  • We did extensive benchmarking and found that, because of how Kedro Viz waits for all the data to be ready, the main bottleneck is Kedro itself
  • We addressed all the issues we found, and added extensive benchmarks
  • We haven't found, or heard from users, that the rendering step is slow
  • We opened Enhancing Kedro-Viz Performance with Lazy Loading #1806 to track the possibility of making the Kedro Viz UI launch before collecting the data

Let's close this issue as completed until we have more concrete actions.

@github-project-automation github-project-automation bot moved this from Backlog to Done in Kedro-Viz Nov 7, 2024
Projects
Status: Done

5 participants