Should we change the output of `session.run`? #1802

noklam · 2022-08-23T21:39:35Z

Background

What is the output of session.run()? Currently this is not as clear as you might think, and it isn't documented anywhere. The logic is defined in runner.py and can be counter-intuitive in some cases. Is there a good reason why we want it to behave this way?

kedro/kedro/runner/runner.py

Lines 78 to 91 in f491420

    
    free_outputs = pipeline.outputs() - set(catalog.list())
    unregistered_ds = pipeline.data_sets() - set(catalog.list())
    for ds_name in unregistered_ds:
        catalog.add(ds_name, self.create_default_data_set(ds_name))
    if self._is_async:
        self._logger.info(
            "Asynchronous mode is enabled for loading and saving data"
        )
    self._run(pipeline, catalog, hook_manager, session_id)
    self._logger.info("Pipeline execution completed successfully.")
    return {ds_name: catalog.load(ds_name) for ds_name in free_outputs}
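
To make the rule concrete, here is a minimal sketch using plain Python sets, with no Kedro import. The dataset names are made up for illustration; only the set arithmetic is meant to mirror the snippet above.

```python
# Hypothetical dataset names; only the set arithmetic mirrors the runner code.
pipeline_outputs = {"model", "metrics"}  # stands in for pipeline.outputs()
pipeline_data_sets = {"raw", "features", "model", "metrics"}  # pipeline.data_sets()
catalog_entries = {"raw", "metrics"}  # stands in for catalog.list()

# Outputs that no node consumes AND that have no catalog entry.
free_outputs = pipeline_outputs - catalog_entries

# Everything the pipeline touches that the catalog does not know about;
# the runner registers a default dataset for each of these names.
unregistered_ds = pipeline_data_sets - catalog_entries

print(sorted(free_outputs))
print(sorted(unregistered_ds))
```

In this toy setup `metrics` is also a final output, but because it has a catalog entry it would not appear in the returned dict, and `features` is intermediate so it would never be returned either.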

Kedro has improved a lot in how a pipeline can be run as a standalone application through packaging and KedroSession; #1423 documents the different ways to do it. Personally, I think it is still not easy enough for someone inexperienced with Kedro to integrate with it. #1423 mentions how a pipeline can be called programmatically. Even though the pipeline is invoked through a function call, it doesn't behave like a function: you can't easily pass an input as an argument (it has to be a catalog entry), and the output of the pipeline is also very restricted.

Motivation

Kedro works really well within the Kedro world, but that also means Kedro works very differently from the rest of the Python world.

This issue mainly focuses on the output side, which would improve the experience of integrating a Kedro pipeline as an upstream component. In an oversimplified world, this should be straightforward to do. Currently I think we make a strong assumption that people work within a "Kedro project", but if we are moving towards a Kedro package, i.e. using from kedro_package import main, it should behave just like a Python function. I think this is a reasonable expectation.

    df = get_some_data()
    model = my_kedro_pipeline(input={'my_pipeline_input_df': df})
    app = PredictionWebService(model)

Questions

What should session.run return?

Things to consider

How can any Python developer integrate with the kedro pipeline easily? Can it behave just like a function?
In an interactive workflow, it may make sense to keep all intermediate outputs in the resulting dict.
Is there a known reason why the output is defined as it is?

Related issues:

It would enable a better interactive workflow: "Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed" #1721
"Improve kedro run as a package" #1423 is trying to improve and simplify how we run a Kedro pipeline as a standalone package
"How to distribute and extend kedro pipelines" #795 discusses how to use Kedro as an upstream or downstream component
"Workflow of debugging Kedro pipeline in notebook" #1832

antonymilne · 2022-08-23T22:54:16Z

This is a very interesting question. I think it's right to focus just on the output side here so I'll save my comments on input for another time 🙂

I think we'd need @idanov or maybe even @tsanikgr to explain exactly why session.run returns what it does. AFAIK it's always been this way. Intuitively it kind of feels like the right thing to me, since those are the "unprocessed" datasets which you might want to work with further. All intermediate datasets have already been consumed by the pipeline and so shouldn't be required further downstream. If you really want to make them available then you could make a mock identity node that copies them to a free output. Returning all intermediate datasets feels like too much to me.

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

Also, technically it looks to me like the code that finds free_outputs is not quite right. If I define something explicitly as a MemoryDataSet in my catalog (unusual, but not unheard of, e.g. to change the copy_mode) then it won't count as a free_output when probably it should do. It's an edge case, but worth mentioning since we're discussing it here. What free_output means in the code is just "output that's not defined in the catalog", which is a subset of "output that's not a MemoryDataSet".
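
The edge case can be illustrated with a small self-contained sketch. It does not use Kedro: `ToyDataset` and its `persisted` flag are hypothetical stand-ins, and the catalog is just a dict, so this only shows how the two definitions diverge.

```python
from dataclasses import dataclass


@dataclass
class ToyDataset:
    """Stand-in for a dataset; `persisted` is a hypothetical flag."""

    persisted: bool


# Hypothetical catalog: "model" is registered explicitly as an in-memory
# dataset (e.g. to change its copy mode), "metrics" is written to disk.
catalog = {
    "model": ToyDataset(persisted=False),
    "metrics": ToyDataset(persisted=True),
}
pipeline_outputs = {"model", "metrics", "report"}  # "report" has no entry

# Definition used by the current code: output that is not in the catalog.
not_in_catalog = pipeline_outputs - set(catalog)

# Broader definition: output that is not persisted, wherever it is declared.
in_memory = {
    name
    for name in pipeline_outputs
    if name not in catalog or not catalog[name].persisted
}

print(sorted(not_in_catalog))
print(sorted(in_memory))
```

Under the first definition the explicitly registered in-memory `model` is silently dropped from the result, while the second definition keeps it.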

noklam · 2022-10-03T15:15:53Z

Adding this related SO question, "How to run a kedro pipeline interactively like a function". This issue only focuses on the output of a pipeline; what about the input? I think that will be the next question.

noklam · 2022-10-04T10:06:41Z

Notes for Tech Design

The reason I only say kind of above is that it seems more questionable to me that we only return those outputs that are MemoryDataSet. I think there's an argument that we should return all unconsumed outputs, even if they have been persisted to disk. i.e. we could have free_output = pipeline.outputs()

1. Less controversial: change the default. The definition of free_output is a bit buggy, so we should change it.

2. Open up an optional argument for session.run to return any targeted datasets, even if they are intermediate or persisted datasets. This is more useful for interactive workflows (i.e. notebooks) or for debugging. Currently it's tricky to make this work. This one is closely related to Workflow of debugging Kedro pipeline in notebook #1832
2.1. If it's an intermediate memory dataset, you can't really get it.
2.2. If it's an intermediate persisted dataset, you need to first call session.run and then catalog.load.

merelcht · 2022-10-05T14:05:00Z

Notes from Technical Design session:

There was agreement that the "free outputs" output from session isn't very clear. It was suggested to simply return all output from nodes that is not consumed, even if it's defined in the catalog. However, this could lead to very large amounts of data being returned. Instead we'll change it to return all free outputs and additionally any MemoryDataSets that are defined in the catalog.

The second point about adding an optional argument for session.run() to return any targeted datasets was discussed briefly, but it was decided to talk about it more thoroughly in a separate workstream about node debugging.

Change "free outputs" to also return MemoryDataSet entries from the catalog #1900

noklam · 2022-10-05T15:16:50Z

A supplement to the above comments, to address @AntonyMilneQB's question:

i.e. we could have free_output = pipeline.outputs()

The answer is that there is a catalog.load call at the end, which is expensive and potentially memory-hungry. So persisted datasets are released from memory once they are no longer needed. A MemoryDataSet is already loaded in memory, so there is no harm in returning it.

noklam · 2022-10-06T13:55:07Z

I just gave it a go to see what it would take to make the initial idea work, partly because I wanted to test how the nbdev system works. See DebugRunner

https://noklam.github.io/kedro-debug-runner/core.html

noklam · 2023-03-23T15:01:31Z

Adding this as inspiration on whether we should have some kind of argument or debug mode that can specifically return output easily without editing configuration.

At the moment, the proper way to inspect data is:

For "free memory data": it is returned by session.run.
For "intermediate memory data": it is deleted as soon as it is no longer needed, so it cannot be returned by a session directly. The user needs to do a session.run that makes the "targeted dataset" a "free output".
For a "persisted dataset": the user needs to call `catalog.load("dataset_name")`.

The complication is mainly because kedro run needs to be efficient, so some data is deleted on the fly to reduce the memory footprint.
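
As a rough illustration of that trade-off, here is a toy runner in plain Python. It is not Kedro's implementation: `toy_run`, the tuple-based node format, and the `keep` argument are all hypothetical, and the sketch only shows the idea of releasing a dataset once its last consumer has run, with an opt-out for debugging.

```python
from collections import Counter


def toy_run(nodes, inputs, keep=()):
    """Run (func, input_name, output_name) nodes in order.

    A dataset is dropped from memory once its last consumer has run,
    unless its name is listed in `keep` (a hypothetical argument).
    """
    # How many nodes still need to read each dataset.
    pending_loads = Counter(input_name for _, input_name, _ in nodes)
    memory = dict(inputs)
    for func, input_name, output_name in nodes:
        memory[output_name] = func(memory[input_name])
        pending_loads[input_name] -= 1
        if pending_loads[input_name] == 0 and input_name not in keep:
            del memory[input_name]  # release as soon as nobody needs it
    return memory


nodes = [
    (lambda xs: [x * 2 for x in xs], "raw", "doubled"),
    (sum, "doubled", "total"),
]

default_result = toy_run(nodes, {"raw": [1, 2, 3]})
debug_result = toy_run(nodes, {"raw": [1, 2, 3]}, keep={"doubled"})
```

With the default call only the unconsumed output `total` survives, whereas passing `keep={"doubled"}` also returns the intermediate dataset, at the cost of holding it in memory for the whole run.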

The question is how can we improve the user experience? It's hard to reason about what is a "free output" and what is not.
I would also argue that a significant number of users work with moderately sized data, where keeping everything in memory isn't a problem and makes the development experience smoother. kedro-org/kedro-plugins#44
Is there a way to let users do what they want without touching any configuration?

Minor improvements in the IPython and Jupyter Notebook workflows #1075 (comment)

noklam added this to Kedro Framework Aug 23, 2022

noklam added the Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation label Aug 23, 2022

antonymilne mentioned this issue Aug 23, 2022

Improve resume pipeline suggestion for SequentialRunner #1795

Merged

5 tasks

noklam added this to the Improve the Interactive Jupyter notebook workflow milestone Sep 6, 2022

noklam mentioned this issue Sep 6, 2022

Workflow of debugging Kedro pipeline in notebook #1832

Open

3 tasks

merelcht mentioned this issue Oct 5, 2022

Change "free outputs" to also return MemoryDataSet entries from the catalog #1900

Closed

jmholzer mentioned this issue Oct 6, 2022

Add an attribute to dataset classes to flag persistence #1910

Closed

noklam mentioned this issue Mar 10, 2023

Pipeline to_outputs doesn't accept list as mentioned in the documentations. #2293

Closed

yetudada modified the milestones: Improve the Interactive Jupyter notebook workflow, Improving the debugging experience with Jupyter Notebook Jun 30, 2023

kedro-org locked and limited conversation to collaborators Mar 27, 2024

merelcht converted this issue into discussion #3745 Mar 27, 2024

github-project-automation bot moved this to Done in Kedro Framework Mar 27, 2024
