Kedro-viz --lite : Build DAG without importing the code #1742
Comments
Tangentially related: https://openlineage.io/ (as a means to export Kedro pipelines) |
I've seen OpenLineage in a few issues, but is it related to this? From what I understand, it's more about understanding the lineage between systems: how data flows from different databases/tables to downstream applications, etc. |
I think some of the concepts in this ticket are relevant too |
The acceptance criteria for this is simple - As a user I shouldn't need a full Spark installation to view Kedro-Viz for a project which uses Spark to process data. |
Hi @datajoely , I started looking at the issue and I am pretty new to the Spark environment. I tried testing the Kedro starter project spaceflights-pyspark-viz which uses kedro-datasets -> spark.SparkDataset . For this project, the minimum steps required to get kedro viz up were -
I know the starter project might not give me the full picture of the issue. It would be great if we could connect, or if you could point me to any Kedro project which uses a full Spark installation to process data. Thank you |
Hi @astrojuanlu, this ticket (building the DAG without importing the code) needs a significant refactor, as we depend heavily on the kedro session to load data. I would like to take this in 3 steps -
I have a few questions regarding the kedro session -
Thank you |
Great work Ravi - to articulate my point a bit better:
|
Quick answers:
|
@astrojuanlu , Yes. At this moment, we need to know the information regarding pipelines which is only possible by having all the kedro project dependencies resolved. i.e., We use I am trying to use Thank you |
I think this is the right approach - I know @imdoroshenko has had success with the libcst library too |
One further point - I think this sessionless pipeline construction should live in kedro core longer term rather than just in Viz, lots of uses for other purposes. |
@astrojuanlu https://github.com/noklam/kedro-viz-lite, glad you asked. I'd love to see kedro-viz become more lightweight. I attempted to make it work in a notebook before (I forget whether I succeeded in the end, but it still required a session). Interestingly, I just saw that #1459 exists,
My opinion: The parsing approach is interesting and love to learn more, though I don't think working with
This is basically |
Yeah perhaps AST isn't needed - the actual I'd love to imagine a future where the |
I am also of the same opinion. Can the default be no hook, but an additional flag to turn on |
Sure, we can ignore hooks by default if it only affects a few users. Let me create a ticket. Thanks! |
Approach 1 - exporting the pipeline
Approach 2 - Problem with ast:
Approach 3 - Parser Approach:
I am quite confident that approach 3 will work, but the effort won't be small (maybe 2 weeks for a prototype?). I have a small PoC with the parser, but there is limited time that I can commit outside of work for this. I'd love to work on this if it gets prioritised, but LSP is my first priority after review :P. P.S. (What I am saying is: assign a 13-point estimate and put me on the ticket in the next two or three months. 😆) |
I don't think we need a Concrete Syntax Tree for this, since we don't need to retain comments or formatting. An Abstract Syntax Tree should in theory suffice, or am I missing something?
I'm confused. Doesn't For example:
In [4]: import test_parser.pipelines.data_processing
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[4], line 1
----> 1 import test_parser.pipelines.data_processing
File ~/Projects/QuantumBlackLabs/tmp/test-parser/src/test_parser/pipelines/data_processing/__init__.py:3
1 """Complete Data Processing pipeline for the spaceflights tutorial"""
----> 3 from .pipeline import create_pipeline # NOQA
File ~/Projects/QuantumBlackLabs/tmp/test-parser/src/test_parser/pipelines/data_processing/pipeline.py:1
----> 1 from kedro.pipeline import Pipeline, node, pipeline
3 from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles
6 def create_pipeline(**kwargs) -> Pipeline:
ModuleNotFoundError: No module named 'kedro'
In [5]:
Do you really want to exit ([y]/n)? ^D
~/Projects/QuantumBlackLabs/tmp/test-parser ······················································································ 1m 17s test-parser 08:19:0
❯ python -m ast src/test_parser/pipelines/data_processing/pipeline.py
Module(
body=[
ImportFrom(
module='kedro.pipeline',
names=[
alias(name='Pipeline'),
alias(name='node'),
alias(name='pipeline')],
level=0),
ImportFrom(
module='nodes',
names=[
alias(name='create_model_input_table'),
alias(name='preprocess_companies'),
alias(name='preprocess_shuttles')],
level=1),
FunctionDef(
name='create_pipeline',
args=arguments(
posonlyargs=[],
args=[],
kwonlyargs=[],
kw_defaults=[],
kwarg=arg(arg='kwargs'),
...
~/Projects/QuantumBlackLabs/tmp/test-parser ······························································································· test-parser 08:17:4
❯ ipython
In [1]: import ast
In [2]: with open("src/test_parser/pipelines/data_processing/pipeline.py") as fh:
   ...:     tree = ast.parse(fh.read())
   ...:
In [9]: pipeline_func_nodes = []
   ...:
   ...: class PipelineLocator(ast.NodeVisitor):
   ...:     def visit_FunctionDef(self, node):
   ...:         if node.name == "create_pipeline":
   ...:             pipeline_func_nodes.append(node)
   ...:         self.generic_visit(node)
   ...:
In [10]: PipelineLocator().visit(tree)
In [11]: pipeline_func_nodes
Out[11]: [<ast.FunctionDef at 0x103a7fe20>]
In [12]: print(ast.dump(pipeline_func_nodes[0], indent=2))
FunctionDef(
name='create_pipeline',
args=arguments(
posonlyargs=[],
args=[],
kwonlyargs=[],
kw_defaults=[],
kwarg=arg(arg='kwargs'),
defaults=[]),
body=[
Return(
value=Call(
func=Name(id='pipeline', ctx=Load()),
args=[
List(
elts=[
Call(
func=Name(id='node', ctx=Load()),
args=[],
keywords=[
keyword(
arg='func',
value=Name(id='preprocess_companies', ctx=Load())),
keyword(
arg='inputs',
value=Constant(value='companies')),
keyword(
arg='outputs',
value=Constant(value='preprocessed_companies')),
keyword(
arg='name',
value=Constant(value='preprocess_companies_node'))]),
Call(
func=Name(id='node', ctx=Load()),
args=[],
keywords=[
keyword(
arg='func',
value=Name(id='preprocess_shuttles', ctx=Load())),
keyword(
arg='inputs',
value=Constant(value='shuttles')),
keyword(
arg='outputs',
value=Constant(value='preprocessed_shuttles')),
keyword(
arg='name',
value=Constant(value='preprocess_shuttles_node'))]),
Call(
func=Name(id='node', ctx=Load()),
args=[],
keywords=[
keyword(
arg='func',
value=Name(id='create_model_input_table', ctx=Load())),
keyword(
arg='inputs',
value=List(
elts=[
Constant(value='preprocessed_shuttles'),
Constant(value='preprocessed_companies'),
Constant(value='reviews')],
ctx=Load())),
keyword(
arg='outputs',
value=Constant(value='model_input_table')),
keyword(
arg='name',
value=Constant(value='create_model_input_table_node'))])],
ctx=Load())],
keywords=[]))],
decorator_list=[],
returns=Name(id='Pipeline', ctx=Load()))
This of course is only the beginning: one then needs to keep visiting the node to "unwind" the pipeline definition. What happens in the Long story short, a POC would be something that works for "canonical" pipeline definitions like
How to get from this 80/20 thing to something that is more robust for real-world pipeline definitions is a big mystery. That's why my initial proposal stated AST as an alternative solution.
I am not sure what custom parsing capabilities you're referring to but I think we should stay away from the business of parsing Python code. 2 weeks for a Prototype sounds like something that can get out of hand pretty quickly. |
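To make the "keep visiting the node" step from the AST dump above a bit more concrete, here is a minimal sketch of how one might "unwind" a canonical pipeline definition without importing it. It is not Kedro or Kedro-Viz code: `NodeCallCollector` and `extract_nodes` are hypothetical names, the sample source is reconstructed from the dump, and it only handles `node(...)` calls whose `inputs`, `outputs` and `name` are plain literals, which is exactly the 80/20 limitation discussed here.

```python
import ast

# Sample source reconstructed from the AST dump above (assumption: the
# canonical spaceflights data_processing pipeline).
PIPELINE_SOURCE = '''
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import create_model_input_table, preprocess_companies, preprocess_shuttles


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ]
    )
'''


class NodeCallCollector(ast.NodeVisitor):
    """Hypothetical helper: record the keyword arguments of each `node(...)` call."""

    def __init__(self):
        self.nodes = []

    def visit_Call(self, call):
        if isinstance(call.func, ast.Name) and call.func.id == "node":
            spec = {}
            for kw in call.keywords:
                if kw.arg == "func":
                    # Only a bare function name is understood here.
                    spec["func"] = kw.value.id if isinstance(kw.value, ast.Name) else None
                elif kw.arg in {"inputs", "outputs", "name"}:
                    try:
                        spec[kw.arg] = ast.literal_eval(kw.value)
                    except ValueError:
                        spec[kw.arg] = None  # not a literal: static analysis gives up
            self.nodes.append(spec)
        # Keep descending so nested calls are also seen.
        self.generic_visit(call)


def extract_nodes(source):
    """Return one dict per `node(...)` call found inside `create_pipeline`."""
    collector = NodeCallCollector()
    for stmt in ast.walk(ast.parse(source)):
        if isinstance(stmt, ast.FunctionDef) and stmt.name == "create_pipeline":
            collector.visit(stmt)
    return collector.nodes


for spec in extract_nodes(PIPELINE_SOURCE):
    print(spec["name"], spec["inputs"], "->", spec["outputs"])
```

Anything dynamic (positional arguments, aliased imports of `node`, pipelines built in loops, names resolved at runtime) falls through to `None` or is missed entirely, which illustrates why a purely static approach gets fragile for real-world projects.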
@astrojuanlu you are right about this. If we don't care about comment/docstring etc we can go with |
An internal user asked about this
|
Is this work already started? |
Hi @noklam, I wanted to start some research in this sprint with ast (#1742 (comment)) but did not get time to explore. I would also like to get your thoughts on this. Let's connect next week and discuss when you are free. Thank you |
Just copying the comment I left in the discussion.
^ I think what's clear from the discussion is that a purely static approach has proven difficult and error-prone because of edge cases. We cannot get rid of actually executing the code; instead, we should think about how to execute only the part of the code that we are interested in.
There is also a comment about what to mock: we don't want to mock imports that bring in pipelines from other modules, or constants that are used to construct pipelines dynamically. (The question is: how do we know which ones are important? Is there a way to identify them?) @sbrugman also brought up a good point: kedro-viz in CI/CD would benefit a lot from lightweight dependencies, without needing the full project dependencies. |
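As an illustration of the "execute the code, but mock what is missing" idea, here is a rough sketch under stated assumptions. It uses `ast` only to discover absolute imports, checks which top-level packages cannot be found, and temporarily registers `MagicMock` objects for them in `sys.modules` while the source is executed. The function names and the `heavy_dep_not_installed` package are made up for the example, and this is not necessarily how Kedro-Viz implements it.

```python
import ast
import importlib.util
import sys
from unittest.mock import MagicMock, patch

# Hypothetical project code that imports a dependency we assume is not installed.
SOURCE = '''
import json
from heavy_dep_not_installed.sql import Session


def create_pipeline():
    return {"session_cls": Session, "payload": json.dumps([1])}
'''


def missing_module_names(source):
    """Return dotted module names imported by `source` whose root package is absent."""
    missing = set()
    for stmt in ast.walk(ast.parse(source)):
        if isinstance(stmt, ast.Import):
            names = [alias.name for alias in stmt.names]
        elif isinstance(stmt, ast.ImportFrom) and stmt.level == 0 and stmt.module:
            names = [stmt.module]
        else:
            continue  # relative imports belong to the project itself: never mock them
        for name in names:
            parts = name.split(".")
            root = parts[0]
            if root in sys.modules or importlib.util.find_spec(root) is not None:
                continue  # the package is available, so import it for real
            # Register every dotted prefix so that submodule imports resolve too.
            for i in range(1, len(parts) + 1):
                missing.add(".".join(parts[:i]))
    return missing


def exec_with_mocks(source):
    """Execute `source` with missing modules replaced by MagicMock objects."""
    missing = missing_module_names(source)
    mocked = {name: MagicMock() for name in missing}
    namespace = {}
    # patch.dict restores sys.modules when the block exits.
    with patch.dict(sys.modules, mocked):
        exec(compile(source, "<pipeline>", "exec"), namespace)
    return namespace, missing


namespace, missing = exec_with_mocks(SOURCE)
print(sorted(missing))
print(namespace["create_pipeline"]()["payload"])
```

Because relative imports and installed packages are left untouched, pipelines imported from other project modules still execute for real; only third-party dependencies that cannot be found are stubbed out. Deciding what else is safe to mock remains the open question raised above.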
Closed in #1966 ! |
Description
kedro-viz has lots of heavy dependencies. At the same time, it needs to import the pipeline code to be able to function, even when doing an initial export with --save-file. This means that sometimes using Kedro Viz is difficult or impossible if Viz dependencies clash with the project dependencies, which can happen often.
One outstanding example of that has been the push for Pydantic v2 support #1603.
Another example, @inigohidalgo says "due to the heavy deps from viz i usually have my dev venv but I create another one just for viz where i just install viz over whatever project I have installed, overriding the project's dependencies with viz's" and asks "do you know if anybody has tested using kedro viz as an "app", so installing it through pipx or smth similar? is that even possible with how viz works?". https://linen-slack.kedro.org/t/16380121/question-regarding-kedro-viz-why-is-there-a-restriction-on-p#38213e99-ba9d-4b60-9001-c0add0e2555b
Possible Implementation
One way to do it is to tell Kedro users to write their pipelines in YAML kedro-org/kedro#650, kedro-org/kedro#1963
Possible Alternatives
Another way would be to do some sort of AST scanning of the Python code, assuming that in some cases this would fail or not be accurate.
Yet another way would be to extract the minimal amount of code that does the --save-file and decouple it from the web application that serves it with --load-file.
There are possibly other alternatives.
Checklist