Improve resume pipeline suggestion for SequentialRunner #1795

jmholzer · 2022-08-18T14:09:39Z

Description

Resolves #1477

Development notes

After a failed run, Kedro suggests a command to the user:

You can resume the pipeline run by adding the following argument to your previous command: --from-nodes "node4_B"

Before this PR, the suggested command would resume from the last nodes to be executed, regardless of whether their inputs were persisted. If any input to the listed nodes is not persisted, the resumed run immediately fails again.

After this PR, the suggested command will run from the closest successfully executed nodes with persisted inputs:

You can resume the pipeline run from the nearest nodes with persisted inputs by adding the following argument to your previous command: --from-nodes "node1_B,node1_A"

This is achieved by performing a breadth-first search, starting at the last successfully executed nodes. This backward search yields a set of the nearest nodes that have persisted inputs.
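The backward search described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the dictionary-based pipeline model, the dataset names (`ds0_A`, etc.) and the function name `find_resume_nodes` are all assumptions made for the example, and "persisted" is modelled simply as membership in a set.

```python
from collections import deque

# Hypothetical X-shaped pipeline: node name -> (input datasets, output datasets).
NODES = {
    "node1_A": (["ds0_A"], ["ds1_A"]),
    "node1_B": (["ds0_B"], ["ds1_B"]),
    "node2": (["ds1_A", "ds1_B"], ["ds2_A", "ds2_B"]),
    "node3_A": (["ds2_A"], ["ds3_A"]),
    "node3_B": (["ds2_B"], ["ds3_B"]),
    "node4_A": (["ds3_A"], ["ds4_A"]),
    "node4_B": (["ds3_B"], ["ds4_B"]),
}


def find_resume_nodes(nodes, persisted, start):
    """Search backwards (breadth-first) from `start` for the nearest nodes
    whose inputs are all persisted."""
    # Map each dataset to the node that produces it.
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    queue, visited = deque(start), set(start)
    resume_from = set()
    while queue:
        name = queue.popleft()
        inputs, _ = nodes[name]
        if all(ds in persisted for ds in inputs):
            # All inputs can be reloaded, so the run can restart here.
            resume_from.add(name)
            continue
        # Otherwise keep walking towards the parents of this node.
        for ds in inputs:
            parent = producer.get(ds)
            if parent is not None and parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return resume_from


# Only the raw inputs are persisted; everything else lives in memory.
persisted = {"ds0_A", "ds0_B"}
print(sorted(find_resume_nodes(NODES, persisted, ["node4_B"])))
# → ['node1_A', 'node1_B']
```

With only the raw inputs persisted, the search walks all the way back to `node1_A` and `node1_B`, which matches the example message above. If `ds2_B` were also persisted, the same call would stop at `node3_B`.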

Six tests are added to the test_sequential_runner test suite to test different cases on an X-shaped pipeline.

Limitations

This change is a significant improvement, but there are still two important limitations:

Persisted inputs are defined to be any that are not MemoryDataSets. This definition has limitations; it does not account for custom datasets that are not persisted.
Neither the approach in this PR nor the previous one handles the case where nodes append to datasets. Running these nodes repeatedly could have unintended consequences.

In the future, I think it would be a good idea to add a method to the API of AbstractDataSet that checks for persistence. I would love to hear thoughts on this.
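To make the idea concrete, here is a rough sketch of how such a persistence check might look. Everything in it is hypothetical: the classes are stand-ins rather than Kedro's real dataset classes, and the method name `is_persistent` is an assumption, not an existing API.

```python
class SketchDataSet:
    """Stand-in for a base dataset class (not Kedro's AbstractDataSet)."""

    def is_persistent(self) -> bool:
        # Hypothetical default: datasets persist unless they say otherwise.
        return True


class SketchMemoryDataSet(SketchDataSet):
    """Stand-in for an in-memory dataset."""

    def is_persistent(self) -> bool:
        return False


class SketchCSVDataSet(SketchDataSet):
    """Stand-in for a dataset written to disk."""


class SketchCacheDataSet(SketchDataSet):
    """Hypothetical custom dataset that only keeps data in the process."""

    def is_persistent(self) -> bool:
        return False


def naive_is_persisted(dataset) -> bool:
    # The definition described above: anything that is not a memory dataset.
    return not isinstance(dataset, SketchMemoryDataSet)


def has_persistent_inputs(input_names, catalog) -> bool:
    # Ask each dataset directly instead of checking its type.
    return all(catalog[name].is_persistent() for name in input_names)


catalog = {
    "raw": SketchCSVDataSet(),
    "intermediate": SketchMemoryDataSet(),
    "cache": SketchCacheDataSet(),
}

print(naive_is_persisted(catalog["cache"]))       # misclassified as persisted
print(has_persistent_inputs(["raw"], catalog))
print(has_persistent_inputs(["raw", "cache"], catalog))
```

The type-based check treats the custom cache dataset as persisted, which is the first limitation listed above, while the method-based check lets each dataset class declare its own behaviour.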

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Jannic Holzer <[email protected]>

noklam

Great work tackling a tough issue! I don't have many concerns, since this feature wasn't doing much before; even if it doesn't work for all cases, it will still be an improvement. But, as Antony said, it would be good to check whether it works for the other runners.

jmholzer · 2022-09-01T16:47:14Z

> Just one question from before that's left over but very happy to approve here! 🙂
>
> does this work for parallel runner? There's some funky stuff going on there with SharedMemoryDataSet so would be good to check it still works.

Thanks for the re-review! You're right I dropped this one somewhere, I'm sorry. I've been looking into this, I'll give an update when I've finished. _SharedMemoryDataset could be a problem.

antonymilne · 2022-09-01T16:51:30Z

> Thanks for the re-review! You're right I dropped this one somewhere, I'm sorry. I've been looking into this, I'll give an update when I've finished. _SharedMemoryDataset could be a problem.

Cool, no worries. As @noklam says, if it doesn't work then it's not a showstopper. I'm happy to merge with it just working on sequential runner and we can fall back on using the previous inferior _suggest_resume_scenario for the parallel runner case if it's not easy to fix. Would be nice to have it working for parallel runner too, but it's not worth spending a huge amount of time on.

jmholzer · 2022-09-02T11:37:57Z

Alright, I finished my investigation into ParallelRunner.

It is possible to implement the new scheme proposed in this PR for ParallelRunner, though it involves some workarounds due to _SharedMemoryDataSet.

Unfortunately, it isn't of much use, since the sequence in which nodes are run (and the resulting exception is reached) is not deterministic for ParallelRunner. This causes problems for both the new and the existing logic for generating suggestions. For example, with the existing logic a run with ParallelRunner will produce the message:

You can resume the pipeline run by adding the following argument to your previous command:
--from-nodes "node3_B,node4_A"

Another identical run will (stochastically) produce the message:

You can resume the pipeline run by adding the following argument to your previous command:
--from-nodes "node4_A"

Similar results are seen for the new logic implemented in this PR. One message is correct while the other isn't. Since these conflicting messages occur with roughly the same frequency, I don't think we should be suggesting a resume command at all at the moment for ParallelRunner. I think implementing this will first require that the order of execution of nodes is made deterministic, which is a large enough task to be a separate PR.

@noklam @AntonyMilneQB it would be good to hear your thoughts on this. If you agree with me, I will turn off this feature for ParallelRunner for the time being in a new commit, merge this PR and then write up an issue.

antonymilne · 2022-09-02T12:42:32Z

That sounds like a perfect plan, thanks very much @jmholzer. Note that until recently the sequential runner was also not deterministic in the order of running nodes (something @noklam fixed). I don't know if the same sort of fix would be relevant for the parallel runner.

noklam · 2022-09-02T13:18:16Z

I am happy that this is added just for SequentialRunner.

Note that there may be 2 sources of non-deterministic behavior:

1. Kedro itself may order the nodes non-deterministically (this used to be the case with SequentialRunner, due to a set operation) before distributing them to subprocesses.
2. The nature of parallelism: the order of execution depends on when each computation finishes, so I think it is non-deterministic by nature.

It's impossible to have a deterministic node execution order for ParallelRunner, but there may be things that can be made more deterministic for 1.
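As an illustration of the first source only, the sketch below shows one way a scheduling order can be made independent of set iteration order: sort the nodes that are ready at each step. It is not Kedro's implementation; the function name `deterministic_order` and the dependency dictionary are assumptions made for this example.

```python
def deterministic_order(dependencies):
    """Return a topological order that does not depend on set iteration order.

    `dependencies` maps each node name to the set of nodes it depends on.
    """
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    order = []
    while remaining:
        # Sorting removes any dependence on hash-based set ordering.
        ready = sorted(node for node, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("Cycle detected in dependencies")
        order.extend(ready)
        for node in ready:
            del remaining[node]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


# Hypothetical X-shaped pipeline expressed as node -> dependencies.
DEPENDENCIES = {
    "node4_B": {"node3_B"},
    "node4_A": {"node3_A"},
    "node3_B": {"node2"},
    "node3_A": {"node2"},
    "node2": {"node1_A", "node1_B"},
    "node1_B": set(),
    "node1_A": set(),
}

print(deterministic_order(DEPENDENCIES))
```

This only fixes the order in which nodes are handed out. Once they run in separate processes, the order in which they finish still depends on timing, which is the second source.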


jmholzer · 2022-09-02T14:18:37Z

Thanks for the feedback @noklam and @AntonyMilneQB! It's much appreciated.

@noklam thanks for the hint in 1. Regarding 2, you're right about this, the execution order is inherently indeterminate. Nonetheless I think we can at least reach a deterministic 'solution' (in this case, the correct warning) using join(s). I will open an issue and explain my thinking.
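The comment does not spell out the approach, so the following is only one possible reading, offered as an assumption: wait for every in-flight worker to finish before working out which nodes succeeded, so that the suggestion is computed from a settled state rather than from whatever had completed when the first exception surfaced. The sketch uses threads from the standard library for simplicity (ParallelRunner uses processes), and `run_node` and `settled_results` are hypothetical names.

```python
from concurrent.futures import ALL_COMPLETED, ThreadPoolExecutor, wait


def run_node(name, should_fail):
    # Stand-in for running a pipeline node.
    if should_fail:
        raise RuntimeError(f"{name} failed")
    return name


def settled_results(tasks):
    """Run independent tasks and report results only after all have finished.

    `tasks` maps a node name to a flag saying whether that node should fail.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(run_node, name, fail): name for name, fail in tasks.items()
        }
        # Join: block until every worker has finished, failed or not.
        wait(futures, return_when=ALL_COMPLETED)
    succeeded = sorted(n for f, n in futures.items() if f.exception() is None)
    failed = sorted(n for f, n in futures.items() if f.exception() is not None)
    return succeeded, failed


print(settled_results({"node3_B": False, "node4_A": True, "node3_A": False}))
# → (['node3_A', 'node3_B'], ['node4_A'])
```

In this toy case the tasks are independent and all submitted up front, so the settled result is the same on every run. A real runner would also have to decide what to do with nodes that were never scheduled because an upstream node failed.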


* Add _find_first_persistent_ancestors and stubs for supporting functions.
* Add body to _enumerate_parents.
* Add function to check persistence of node outputs.
* Modify _suggest_resume_scenario to use _find_first_persistent_ancestors
* Pass catalog to self._suggest_resume_scenario
* Track and return all ancestor nodes that must be re-run during DFS.
* Integrate DFS with original _suggest_resume_scenario.
* Implement backwards-DFS strategy on all boundary nodes.
* Switch to multi-node start BFS approach to finding persistent ancestors.
* Add a useful error message if no nodes ran.
* Add docstrings to new functions.
* Add catalog argument to self._suggest_resume_scenario
* Modify exception_fn to allow it to take multiple arguments
* Add test for AbstractRunner._suggest_resume_scenario
* Add docstring for _suggest_resume_scenario
* Improve formatting
* Move new functions out of AbstractRunner
* Remove bare except
* Fix broad except clause
* Access datasets __dict__ using vars()
* Sort imports
* Improve resume message
* Add a space to resume suggestion message
* Modify DFS logic to eliminate possible queue duplicates
* Modify catalog.datasets to catalog._data_sets w/ disabled linter warning
* Move all pytest fixtures to conftest.py
* Modify all instances of Pipeline to pipeline
* Fix typo in the name of TestSequentialRunnerBranchedPipeline
* Remove spurious assert in save of persistent_dataset_catalog
* Replace instantiations of Pipeline with pipeline
* Modify test_suggest_resume_scenario fixture to use node names
* Add disable=unused-argument to _save
* Remove resume suggestion for ParallelRunner
* Remove spurious try / except

Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: nickolasrm <[email protected]>

jmholzer added 23 commits August 19, 2022 10:53

Add _find_first_persistent_ancestors and stubs for supporting functions.

426824a

Signed-off-by: Jannic Holzer <[email protected]>

Add body to _enumerate_parents.

f90daf7

Signed-off-by: Jannic Holzer <[email protected]>

Add function to check persistence of node outputs.

e1cb2e3

Signed-off-by: Jannic Holzer <[email protected]>

Modify _suggest_resume_scenario to use _find_first_persistent_ancestors

18a6105

Signed-off-by: Jannic Holzer <[email protected]>

Pass catalog to self._suggest_resume_scenario

7753486

Signed-off-by: Jannic Holzer <[email protected]>

Track and return all ancestor nodes that must be re-run during DFS.

a402aa7

Signed-off-by: Jannic Holzer <[email protected]>

Integrate DFS with original _suggest_resume_scenario.

699a9f5

Signed-off-by: Jannic Holzer <[email protected]>

Implement backwards-DFS strategy on all boundary nodes.

a49a6f7

Signed-off-by: Jannic Holzer <[email protected]>

Switch to multi-node start BFS approach to finding persistent ancestors.

7955d0d

Signed-off-by: Jannic Holzer <[email protected]>

Add a useful error message if no nodes ran.

68764f7

Signed-off-by: Jannic Holzer <[email protected]>

Add docstrings to new functions.

74c60f7

Signed-off-by: Jannic Holzer <[email protected]>

Add catalog argument to self._suggest_resume_scenario

958fb91

Signed-off-by: Jannic Holzer <[email protected]>

Modify exception_fn to allow it to take multiple arguments

d61a19b

Signed-off-by: Jannic Holzer <[email protected]>

Add test for AbstractRunner._suggest_resume_scenario

8724923

Signed-off-by: Jannic Holzer <[email protected]>

Add docstring for _suggest_resume_scenario

f57c431

Signed-off-by: Jannic Holzer <[email protected]>

Improve formatting

9fda4c0

Signed-off-by: Jannic Holzer <[email protected]>

Move new functions out of AbstractRunner

3a79059

Signed-off-by: Jannic Holzer <[email protected]>

Remove bare except

13063dd

Signed-off-by: Jannic Holzer <[email protected]>

Fix broad except clause

f29bbf5

Signed-off-by: Jannic Holzer <[email protected]>

Access datasets __dict__ using vars()

01d5ab0

Signed-off-by: Jannic Holzer <[email protected]>

Sort imports

1dae5e7

Signed-off-by: Jannic Holzer <[email protected]>

Improve resume message

d572896

Signed-off-by: Jannic Holzer <[email protected]>

Add a space to resume suggestion message

af405ed

Signed-off-by: Jannic Holzer <[email protected]>

jmholzer force-pushed the feat/improve-resume-scenario-suggestion branch from 122839f to af405ed August 19, 2022 09:54

Merge branch 'main' into feat/improve-resume-scenario-suggestion

d1b6693

jmholzer marked this pull request as ready for review August 19, 2022 10:29

jmholzer requested a review from idanov as a code owner August 19, 2022 10:29

jmholzer requested review from merelcht, AhdraMeraliQB and antonymilne August 19, 2022 10:30

antonymilne requested a review from noklam September 1, 2022 13:04

noklam approved these changes Sep 1, 2022

jmholzer requested review from noklam and antonymilne September 2, 2022 11:39

antonymilne approved these changes Sep 2, 2022

noklam approved these changes Sep 2, 2022

jmholzer and others added 2 commits September 2, 2022 14:56

Remove resume suggestion for ParallelRunner

99e0bb2

Signed-off-by: Jannic Holzer <[email protected]>

Merge branch 'main' into feat/improve-resume-scenario-suggestion

85f7609

jmholzer added 2 commits September 2, 2022 15:23

Remove spurious try / except

a74fef6

Signed-off-by: Jannic Holzer <[email protected]>

Merge branch 'feat/improve-resume-scenario-suggestion' of github.com:…

74d0cec

…kedro-org/kedro into feat/improve-resume-scenario-suggestion Signed-off-by: Jannic Holzer <[email protected]>

jmholzer merged commit 6428dd9 into main Sep 2, 2022

jmholzer deleted the feat/improve-resume-scenario-suggestion branch September 2, 2022 14:59

noklam mentioned this pull request Sep 5, 2022

kedro run CLI incorrectly splits the names of nodes at commas #1828

Closed

This was referenced Sep 5, 2022

Add resume suggestion to parallel runner #1830

Open

Replace Pipeline with pipeline across all tests #1833

Closed

noklam mentioned this pull request Sep 6, 2022

Workflow of debugging Kedro pipeline in notebook #1832

Open

3 tasks

noklam changed the title ~~Improve resume pipeline suggestion~~ Improve resume pipeline suggestion for SequentialRunner Sep 20, 2022

rashidakanchwala mentioned this pull request Sep 20, 2022

Create massive pipeline to test with flowchart on Kedro-viz kedro-org/kedro-viz#1064

Closed

1 task

jmholzer mentioned this pull request Oct 6, 2022

Add an attribute to dataset classes to flag persistence #1910

Closed

ondrejzacha mentioned this pull request Sep 3, 2023

Improve resume pipeline suggestions #3002

Closed

AhdraMeraliQB mentioned this pull request Jan 5, 2024

Create QA Kedro test projects for stress testing and performance and evaluation #3489

Closed

This was referenced Aug 20, 2024

[Stress Testing] - Create example projects to assess Kedro performance for complex pipelines #3866

Closed

Spike: design example kedro projects that can be used to assess performance issues #3957

Closed
