Workflow of debugging Kedro pipeline in notebook #1832
This is a great issue, thank you for opening it! I agree with a lot of what you say here, including which bits might be feasible to improve within kedro. I myself have followed pretty much exactly the same debugging workflow that you describe many, many times. I think it's quite a common way of working, even when IDE breakpoints are available. Especially for users coming from a notebook background, debugging things using an interactive Python session in Jupyter is way easier than trying to use a traditional debugger. The mooted `%load_node` magic (see #1721) would help here: you can then break the function up into multiple cells and inspect intermediate values as you go.
Another idea: when a pipeline fails, the error message could say "try running …" with a suggested command for debugging the failing node. I think there are two distinct but related use cases here:
I don't know whether the same solution might work for both of these or whether we need separate solutions. Something we've wondered before is whether you should be able to open up the node code in a Jupyter cell in kedro-viz. In fact @limdauto had a rough prototype of this: basically, the bit that shows node code in the metadata panel becomes a mini interactive Jupyter cell that executes the node.
Note for myself: create a demo for the existing debugging workflow.
One question for Antony - this would work if the error is within the node function, but would it work if the error is deeper in the node? For example:

```python
# node.py
def some_func():
    b = a()
    d = c(b)  # assume the error is in c -> you would then also need to
              # copy-paste the code of c to make it work
```
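To make the extra copy-pasting concrete, here is a hedged sketch of debugging the inner function in a notebook; `my_project.nodes` is a hypothetical import path, and `a` and `c` come from the snippet above:

```python
# In a notebook cell: reproduce the failure one level down without
# re-running the whole pipeline.
from my_project.nodes import a, c  # hypothetical import path

b = a()   # recompute the intermediate value
d = c(b)  # call the failing inner function directly; raises the same error
```

In practice you would then paste the body of `c` into a cell and edit it until the error is understood, which is exactly the copy-pasting overhead being described.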
Potentially useful IPython magic
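The specific magic wasn't captured above, but a likely candidate is IPython's built-in post-mortem debugger `%debug`, which drops you into the stack of the last exception and so also covers the "error buried deep in the node" case:

```python
# After a cell raises (e.g. a failing `session.run`), run the standard
# IPython post-mortem magic in the next cell:
%debug
# Inside the pdb prompt, walk the stack with `u`/`d` and inspect locals
# at the exact frame where the error happened.
```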
Discussed in Technical Design. The general agreement is that we need to improve the debugging workflow of Kedro in notebooks. Concrete actions to achieve this:
Background
Kedro's philosophy is pretty much that you should use a notebook wisely and keep your code in a Python module. But there are situations where you have to debug in a notebook, because the data infrastructure is tied to the platform.
What are the pain points with debugging a Kedro pipeline?
I think 3 is something Kedro should solve, and I would love more feedback about this. 1 is not a Kedro-specific problem, but it's more common for Kedro users due to the data science/ML workflow, and we may try to make it easier. I don't have any workaround for 2.
My opinion is:
I talked to Tom earlier and tried to understand the debugging process that he uses.
Steps to debug Kedro pipeline in a notebook

1. Run the pipeline; an error is thrown.
2. Edit `catalog.yml` to persist intermediate datasets, and re-run the pipeline; the error is thrown again. `session` has already been used once, so if you call `session` again it will throw an error (so he had a wrapper function that recreates `session` and does something similar to `session.run`; see the sketch after this list). Or maybe `%reload_kedro`?
3. `catalog.load` the persisted dataset and pass it to the failing function, i.e. `func(catalog.load("some_data"))`.
4. Copy `func` to the notebook. This works if the function itself is the node function, but if it is some function buried deep down, that's a lot more copy-pasting and maybe changes of imports.

Note that if this were a local development environment, all you would do is set a breakpoint. But with a notebook you will have to touch a few files, i.e. `catalog.yml`.
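A minimal sketch of the wrapper from step 2, assuming a Kedro 0.18-style project with the notebook running at the project root; `rerun` is a hypothetical name, while `bootstrap_project` and `KedroSession.create` are the real APIs for standing up a fresh session:

```python
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = Path.cwd()  # assumption: the notebook runs at the project root
bootstrap_project(PROJECT_PATH)

def rerun(pipeline_name: str = "__default__"):
    # A KedroSession can only be run once, so create a fresh one per call
    # instead of re-using the already-consumed `session` object.
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        return session.run(pipeline_name=pipeline_name)
```

Step 3 is then a plain catalog call in a fresh cell, e.g. `func(catalog.load("some_data"))`, where `func` stands in for whichever node function is under suspicion.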
Problems
- `KedroSession` cannot be re-run, but users will call `session.run` multiple times for debugging purposes.
- `session.run` doesn't give the correct output, and this issue tries to address that problem: Should we change the output of `session.run`? #1802
- Why would this be less of a problem with stuff like Airflow?
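For context, a small illustration of the second point, assuming the behaviour discussed in #1802, where `session.run` only returns outputs that the `DataCatalog` does not capture:

```python
outputs = session.run(pipeline_name="__default__")
# `outputs` only contains the pipeline's free outputs (datasets that are not
# registered in the catalog), so the intermediate results you actually want
# to inspect while debugging are usually missing -- see #1802.
print(outputs.keys())
```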
Proposal
We are definitely not trying to re-create the debugger experience. Ideally, it would be great if Kedro could just pop up the correct context at the exact line of code (similar to putting a breakpoint right before an error happens).
- Is `%reload_kedro` enough? If we want to keep things in memory then `reload_kedro` does not fit well.
- The `%load_node` proposal mentioned in "Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it gets printed" #1721, which should address Step 1 - Step 7.
- A `session` with which you can do `session.run(dataset=["a","b","c"])` and keep the specific datasets you are interested in, or even …

Some of this can reuse the backtracking logic we have in #1795 so we don't have to rerun the entire pipeline.
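To make those proposals concrete, here is a purely hypothetical usage sketch; neither the `dataset` keyword nor `%load_node` exists at the time of writing, both are ideas from this thread and #1721:

```python
# Proposed, not implemented: keep the named intermediate datasets in memory
# after a run so they can be inspected without editing catalog.yml.
session.run(dataset=["a", "b", "c"])

# Proposed, not implemented: pull a node's inputs and function body into
# the notebook so it can be debugged cell by cell.
%load_node my_failing_node
```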