Lightweight Kedro Viz Experimentation using AST #1966

Merged
ravi-kumar-pilla merged 63 commits into main from feature/kedro-viz-lite
Sep 3, 2024

Contributor

@ravi-kumar-pilla ravi-kumar-pilla commented Jul 3, 2024

Description

Related to #1742

Kedro-Viz has lots of heavy dependencies. At the same time, it needs to import the pipeline code to be able to function, even when doing an initial export with --save-file. This means that sometimes using Kedro-Viz is difficult or impossible if Viz dependencies clash with the project dependencies, which can happen often.

One example of that has been the push for Pydantic v2 support #1603.

Another example, @inigohidalgo says "due to the heavy deps from viz i usually have my dev venv but I create another one just for viz where i just install viz over whatever project I have installed, overriding the project's dependencies with viz's" and asks "do you know if anybody has tested using kedro viz as an "app", so installing it through pipx or smth similar? is that even possible with how viz works?". https://linen-slack.kedro.org/t/16380121/question-regarding-kedro-viz-why-is-there-a-restriction-on-p#38213e99-ba9d-4b60-9001-c0add0e2555b

The acceptance criterion for this is simple: as a user, I shouldn't need a full Spark installation to view Kedro-Viz for a project which uses Spark to process data.

Development notes

Added a --lite option to the Kedro-Viz CLI. When users run kedro viz --lite, it takes the approach described below.

Using AST + mock imports:

Steps:
1. Parse the Kedro project using AST
2. Locate all the import statements
3. Try importing the located modules
4. Mock the dependencies in case of an import error
5. Patch sys.modules with the mocked modules before retrieving the pipeline information from Kedro.
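
The steps above can be sketched roughly as follows. This is a minimal illustration of the AST + mock-imports idea, not the code in lite_parser.py: the function names find_imported_modules and mock_missing_modules and the module names in the sample source are made up for the example, and it handles a single source string rather than a whole project.

```python
import ast
import importlib
import sys
from unittest.mock import MagicMock, patch


def find_imported_modules(source: str) -> set[str]:
    """Steps 1-2: parse the source and collect absolute import targets."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) point inside the project; skip them.
            modules.add(node.module)
    return modules


def mock_missing_modules(module_names) -> dict[str, MagicMock]:
    """Steps 3-4: try each import and build mocks for the ones that fail."""
    mocked = {}
    for name in sorted(module_names):
        try:
            importlib.import_module(name)
        except ImportError:
            # Mock the module and any parent package that is not already loaded,
            # e.g. both "pkg" and "pkg.sub" when "pkg" itself is missing.
            parts = name.split(".")
            for i in range(1, len(parts) + 1):
                prefix = ".".join(parts[:i])
                if prefix not in sys.modules:
                    mocked.setdefault(prefix, MagicMock())
    return mocked


# Hypothetical project source; the two "missing" packages do not exist.
source = """
import os
import definitely_missing_pkg.sub
from another_missing_pkg import thing
from . import sibling
"""

found = find_imported_modules(source)
mocked = mock_missing_modules(found)
print(sorted(mocked))

# Step 5: patch sys.modules only while the project code is being imported.
with patch.dict(sys.modules, mocked):
    from another_missing_pkg import thing  # resolves to a MagicMock attribute
```

Using patch.dict keeps the mocks scoped: once the with block exits, sys.modules is restored, so the mocked names do not leak into the rest of the process.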

Testing:

I have tested basic Kedro projects with -

  • Dataset factory patterns
  • Starter projects
  • Demo-Project on Kedro-Viz
  • MLOps project
  • Spark pet project with spark.driver.host configured to localhost. The idea was to test spark initialization both via hooks and outside of hooks
  • Dynamic pipelines project from the GetInData tutorial

Observations:

On macOS Sonoma (2.4 GHz 8-core Intel i9, 64 GB). These observations might differ on other machines, as my system was a bit slow during the tests, but they should give a basic idea of performance. All tests were run using time <command>. To summarize, kedro viz --lite was faster (by roughly 10-15 s): although get_mocked_modules() took about 1-2 s, initializing DataCatalogLite instead of DataCatalog saved time.

**Project: demo_project**

Run 1 - kedro viz  2.69s user 0.71s system 7% cpu 47.519 total
Run 2 - kedro viz  2.78s user 0.75s system 7% cpu 44.181 total

Run 1 - kedro viz --lite  2.79s user 0.74s system 7% cpu 46.126 total
Run 2 - kedro viz --lite  2.76s user 0.74s system 7% cpu 44.393 total

In env with no dependencies installed -

Run 1 - kedro viz --lite  2.43s user 0.93s system 9% cpu 37.082 total
Run 2 - kedro viz --lite  2.32s user 0.67s system 10% cpu 28.553 total

Lite_Parser.get_mocked_modules() - python3 -m lite_parser  1.76s user 0.35s system 86% cpu 2.433 total

**Project: MLOps**

Run 1 - kedro viz  4.48s user 1.86s system 11% cpu 54.817 total
Run 2 - kedro viz  4.52s user 1.59s system 13% cpu 46.505 total

Run 1 - kedro viz --lite  3.98s user 1.42s system 24% cpu 22.159 total
Run 2 - kedro viz --lite  3.99s user 1.42s system 19% cpu 28.372 total

In env with no dependencies installed -

Run 1 - kedro viz --lite  2.35s user 0.71s system 11% cpu 27.291 total
Run 2 - kedro viz --lite  2.34s user 0.69s system 13% cpu 23.242 total

Lite_Parser.get_mocked_modules() - python3 -m lite_parser  0.14s user 0.12s system 15% cpu 1.658 total

**Project: Pyspark dummy project**

Run 1 - kedro viz  3.66s user 1.32s system 6% cpu 1:14.60 total
Run 2 - kedro viz  3.22s user 1.05s system 32% cpu 13.233 total

Run 1 - kedro viz --lite  3.70s user 1.18s system 16% cpu 29.910 total
Run 2 - kedro viz --lite  3.18s user 1.07s system 32% cpu 13.083 total

In env with no dependencies installed -

Run 1 - kedro viz --lite  2.38s user 0.75s system 10% cpu 28.534 total
Run 2 - kedro viz --lite  2.03s user 0.60s system 23% cpu 11.011 total

Lite_Parser.get_mocked_modules() - python3 -m lite_parser  0.51s user 0.18s system 54% cpu 1.269 total

**Project: spaceflights with dynamic pipelines**

Run 1 - kedro viz  2.73s user 0.89s system 7% cpu 46.366 total
Run 2 - kedro viz  2.41s user 0.60s system 20% cpu 14.372 total

Run 1 - kedro viz --lite  2.73s user 0.73s system 9% cpu 35.341 total
Run 2 - kedro viz --lite  2.42s user 0.62s system 11% cpu 27.509 total

In env with no dependencies installed -

Run 1 - kedro viz --lite  2.00s user 0.57s system 16% cpu 15.596 total
Run 2 - kedro viz --lite  1.98s user 0.57s system 21% cpu 11.801 total

Lite_Parser.get_mocked_modules() - python3 -m lite_parser  0.52s user 0.21s system 53% cpu 1.369 total
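
For anyone who wants to compare the wall-clock figures above, here is a rough sketch that pulls the elapsed ("total") time out of lines in the format pasted above and averages the runs. It assumes the zsh-style time output shown in this description (seconds, or minutes:seconds, followed by "total"); the helper name elapsed_seconds is made up for the example, and only the MLOps runs are included.

```python
import re
from statistics import mean

# Trailing "<elapsed> total" field, e.g. "47.519 total" or "1:14.60 total".
# Hour-long runs (h:mm:ss) are not handled by this sketch.
_TOTAL = re.compile(r"(?:(\d+):)?(\d+(?:\.\d+)?) total\s*$")


def elapsed_seconds(line: str) -> float:
    """Return the wall-clock time of one pasted `time` line, in seconds."""
    match = _TOTAL.search(line)
    if match is None:
        raise ValueError(f"no elapsed time found in: {line!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + float(seconds)


# Lines copied from the MLOps project results above.
mlops_full = [
    "Run 1 - kedro viz  4.48s user 1.86s system 11% cpu 54.817 total",
    "Run 2 - kedro viz  4.52s user 1.59s system 13% cpu 46.505 total",
]
mlops_lite = [
    "Run 1 - kedro viz --lite  3.98s user 1.42s system 24% cpu 22.159 total",
    "Run 2 - kedro viz --lite  3.99s user 1.42s system 19% cpu 28.372 total",
]

full_avg = mean(elapsed_seconds(line) for line in mlops_full)
lite_avg = mean(elapsed_seconds(line) for line in mlops_lite)
print(f"kedro viz: {full_avg:.1f}s, kedro viz --lite: {lite_avg:.1f}s")
```

With only two runs per configuration the averages are indicative at best, which matches the caveat that these numbers give a basic idea rather than a benchmark.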

Note

I have also profiled this with line_profiler, which gave similar results. This ticket may not improve kedro viz performance, but it makes Kedro-Viz run with missing external dependencies. However, we can improve the overall performance once #1920 and #1920 (comment) are implemented.

Limitations:
1. If the datasets are not resolved in the catalog, they default to MemoryDataset.
2. Since MemoryDatasets do not have layer information, layers will not be shown in the flowchart if the datasets are not resolved.
3. Experiment Tracking will not work if the datasets are not resolved and the prerequisite of having kedro-datasets version 2.1.0 or above is not met.
4. The metadata panel for a data node shows the data node type as MemoryDataset if the dataset is not resolved.
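
To illustrate why an unresolved dataset loses its layer, here is a simplified sketch of the fallback behaviour described in the limitations. It does not use Kedro or the DataCatalogLite code from this PR: MemoryDatasetStandIn, ResolvedDataset, build_dataset and layer_of are hypothetical stand-ins, the type values are full import paths rather than Kedro's short dataset names, and the metadata layout is assumed for the example.

```python
import importlib
from dataclasses import dataclass, field


@dataclass
class MemoryDatasetStandIn:
    """Stand-in for an in-memory dataset: carries no layer metadata."""
    metadata: dict = field(default_factory=dict)


@dataclass
class ResolvedDataset:
    """Hypothetical wrapper for a dataset whose class could be imported."""
    type_name: str
    metadata: dict


def load_class(type_path: str):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = type_path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


def build_dataset(config: dict):
    """Build a dataset from its config, falling back to the stand-in."""
    try:
        cls = load_class(config["type"])
    except (ImportError, AttributeError, ValueError):
        # The dataset class is unavailable, so the configured metadata
        # (including the layer) is dropped along with the real type.
        return MemoryDatasetStandIn()
    return ResolvedDataset(cls.__name__, config.get("metadata", {}))


def layer_of(dataset):
    """Read the layer from the assumed metadata layout, if present."""
    return dataset.metadata.get("kedro-viz", {}).get("layer")


layer_meta = {"kedro-viz": {"layer": "raw"}}
resolved = build_dataset({"type": "collections.OrderedDict", "metadata": layer_meta})
fallback = build_dataset({"type": "missing_pkg_for_demo.CSVDataset", "metadata": layer_meta})

print(type(resolved).__name__, layer_of(resolved))
print(type(fallback).__name__, layer_of(fallback))
```

The second dataset comes back as the in-memory stand-in with no layer, which is the effect limitations 1, 2 and 4 describe for the flowchart and the metadata panel.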

Next Steps:

  • Once this PR is reviewed and good to go, I will add documentation as needed in a separate PR.
  • Replace MemoryDataset with a custom ImportErrorDataset in case of missing dataset dependencies
  • CLI warning message listing missing imports (if there are any) - This is already done, but needs to be confirmed with @stephkaiser
  • FE notice warning about missing imports. This needs a separate ticket and discussion with @stephkaiser
  • Spread the word about the --lite flag. Once this PR is merged and the above tasks are complete, I will demo this feature in the Coffee chat (first or second week of September).

QA notes

Steps to test -

  1. Create a new conda env - conda create -n viz-parser-test python=3.11
  2. Activate the created env - conda activate viz-parser-test
  3. Install Kedro - pip install kedro
  4. Install Kedro-Viz (current parser branch) -
       git clone https://github.com/kedro-org/kedro-viz.git
       cd kedro-viz
       git checkout feature/kedro-viz-lite
       pip install -e package
  5. Create a spaceflights starter project - kedro new --starter=spaceflights-pandas
  6. Navigate to your Kedro project - cd spaceflights-pandas
  7. Run kedro viz
  8. It throws an error -
       raise DatasetError(
       kedro.io.core.DatasetError: An exception occurred when parsing config for dataset 'companies':
       Class 'pandas.CSVDataset' not found, is this a typo?
  9. Run kedro viz --lite
  10. Kedro Viz should start successfully
image

Credits:

  1. Thank you Kedro TSC for suggesting the approach taken in this PR
  2. Thank you @noklam for having an initial look at the PR and suggesting the sys modules patch and custom DataCatalog implementation
  3. Thank you Kedro Team for writing an awesome test suite for DataCatalog, which I reused in DataCatalogLite for coverage. (Let me know if there is a better way to do this than duplicating the test code.)

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: ravi-kumar-pilla <[email protected]>
@ravi-kumar-pilla ravi-kumar-pilla changed the title from "Feature/kedro viz lite" to "Lightweight Kedro Viz Experimentation using AST" on Jul 3, 2024
Contributor

@noklam noklam left a comment

I have only reviewed the high-level approach without diving into details yet. Can you tell me how I should test this PR, or test the parser separately?

Review threads (outdated, resolved):
package/kedro_viz/integrations/kedro/data_loader.py
package/kedro_viz/integrations/kedro/lite_parser.py (2 threads)
Contributor

@noklam noklam left a comment

Great work! Really happy to see this is getting to the finish line. This would be useful in many ways and speed up kedro viz a lot!

I approved with a minor comment as I don't want to block the PR.

@astrojuanlu
Member

[08/23/24 16:33:59] WARNING  Kedro-Viz has mocked the following             data_loader.py:173
                             dependencies for lite-mode.                                      
                             ['sklearn.base', 'matplotlib.pyplot',                            
                             'matplotlib', 'seaborn', 'PIL',                                  
                             'sklearn.model_selection', 'sklearn',                            
                             'sklearn.metrics']                                               
                             In order to get a complete experience of Viz,                    
                             please install the missing Kedro project                         
                             dependencies 

💯

Member

@astrojuanlu astrojuanlu left a comment

I gave this a quick test and it works! 💯 Thanks @ravi-kumar-pilla!

Contributor

@rashidakanchwala rashidakanchwala left a comment

AMAZZING!!!

@ravi-kumar-pilla ravi-kumar-pilla merged commit 023a05b into main Sep 3, 2024
33 checks passed
@ravi-kumar-pilla ravi-kumar-pilla deleted the feature/kedro-viz-lite branch September 3, 2024 17:38
4 participants