
Promote private _compiler() via public to_pipeline() method #251

Closed
wants to merge 8 commits

Conversation

cisaacstern
Member

@cisaacstern cisaacstern commented Nov 30, 2021

Here's an idea for how we might wrap #238 to make a more expressive manual interface for step-through recipe debugging. I've been using NamedManualStages to run pangeo-forge/staged-recipes#93 today. The use pattern looks something like

>>> from pangeo_forge_recipes.executors.python import NamedManualStages
>>> from recipe import recipe

>>> nms = NamedManualStages(recipe)
>>> nms.stage_names

('cache_input', 'prepare_target', 'store_chunk', 'finalize_target')

>>> nms.metadata

{'cache_input': {'index': 0, 'ncalls': 1},
 'prepare_target': {'index': 1, 'ncalls': 1},
 'store_chunk': {'index': 2, 'ncalls': 3},
 'finalize_target': {'index': 3, 'ncalls': 1}}

>>> nms.execute_stage("cache_input")

pangeo_forge_recipes.executors.python - INFO - Executing `'cache_input'`, stage 1 of 4 stages.
pangeo_forge_recipes.executors.python - INFO - Call 1 of 1 for this stage.
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Caching input 'time-0'
# etc.

I haven't written tests yet because I'm unsure whether this should be made to fit the PipelineExecutor model or is perhaps its own standalone concept. If the former, a test fixture might look like

@pytest.fixture(params=[pytest.param(0, marks=pytest.mark.executor_manual)])
def execute_recipe_manual():
    def execute(recipe):
        for stage_name, f in recipe.to_manual():
            # here `f` would be the same `NamedManualStages.execute_stage` method for every iteration
            f(stage_name)

    return execute

Though arguably this is not a PipelineExecutor, because by definition it's not used for executing recipes start-to-finish, but rather piecewise. The ncall_range argument of execute_stage allows the user to specify, for example

>>> nms.execute_stage("cache_input", ncall_range=(30, 40))
# caches only the 30th through 39th inputs for the recipe

which is useful for manually patching a known gap in the cache, or perhaps debugging why an FTP server seems to timeout when particular inputs are requested from it.
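For concreteness, here's a minimal, self-contained sketch of how an ncall_range slice might be applied to a stage's mappable. The Stage stand-in and the exact (start, stop) half-open semantics are assumptions for illustration, not the actual pangeo_forge_recipes implementation:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Stage:
    # Minimal stand-in for the package's Stage dataclass.
    function: Callable
    mappable: Optional[Iterable] = None


def execute_stage(stage, config=None, ncall_range=None):
    # For mappable stages, ncall_range=(start, stop) limits execution
    # to calls start..stop-1; None runs every call.
    if stage.mappable is None:
        stage.function(config=config)
        return
    calls = list(stage.mappable)
    if ncall_range is not None:
        start, stop = ncall_range
        calls = calls[start:stop]
    for m in calls:
        stage.function(m, config=config)


seen = []
stage = Stage(function=lambda m, config=None: seen.append(m), mappable=range(10))
execute_stage(stage, ncall_range=(3, 6))
print(seen)  # prints [3, 4, 5]
```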

@rabernat, what do you think? Is this a reasonable basis for the lowest level of manual execution? If so, where does this fit in the executor model: within, beside? If it's not a PipelineExecutor I can write one-off tests for it (rather than trying to make it fit within the pipeline testing scheme).

@rabernat
Contributor

rabernat commented Dec 1, 2021

Charles -- thanks for working on this and sorry for not giving feedback earlier.

I'm looking at a lot of new lines of code here and trying to consider the maintenance and documentation costs of adding them. It strikes me that the executor you are proposing basically just repacks the Pipeline object into a new, similar, yet distinct data structure. That makes me think the complexity here may not be worth it compared to just giving the user the Pipeline object directly and explaining how to iterate through it.

What if, instead of this, we promoted the private Recipe._compile() method to a public Recipe.to_pipeline() method and added documentation about how to manually step through a pipeline. Wouldn't that achieve the same level of introspection without adding more code to the package?

@cisaacstern
Member Author

I think I grok the PipelineExecutor model now, and have rewritten this PR to fit within that frame. Going to take a stab at the documentation, and will open this as Ready for review after that's complete.

@cisaacstern
Member Author

Wouldn't that achieve the same level of introspection without adding more code to the package?

Interesting idea. Yes, that probably works. I will try that now.

@cisaacstern cisaacstern changed the title Add NamedManualStages convenience wrapper for to_generator Promote private _compiler() via public to_pipeline() method Dec 2, 2021
@cisaacstern cisaacstern marked this pull request as ready for review December 2, 2021 00:26
@cisaacstern
Member Author

cisaacstern commented Dec 2, 2021

@rabernat, I'm going to chalk my two abandoned ideas on this PR (a dataclass, then briefly a PipelineExecutor) up to a good learning tour through the new executor model. But ultimately, yes, just using the existing Pipeline object works better than either of those approaches, and is way more succinct too.

The updates to the Manual Execution docs reflect what I think is an approachable and generalizable style for manually executing any recipe class. (Which is actually syntactically quite close to what we had before, albeit a bit more generalized.)

You'll note from the use pattern described in the docs that, to streamline manual execution, I've added __iter__ and __getitem__ methods to Pipeline and an ismappable property to Stage. This felt necessary to make manual execution as expressive and human-readable a user experience as possible.

@cisaacstern cisaacstern requested a review from rabernat December 2, 2021 00:30
Contributor

@rabernat rabernat left a comment


Charles I really appreciate your work on this.

I want to continue to push to not add any new methods or properties to the Pipeline class (except the repr suggestions), because I truly don't think they are necessary. I think we can accomplish everything we need through better reprs for the class plus documentation.

Sorry if this comes off as negative. I just feel strongly that simplest is really best.

Thanks for your patience with my nit picking. 🙃

Comment on lines +30 to +32
@property
def ismappable(self):
return True if self.mappable is not None else False
Contributor


I feel like this property is not necessary, because checking stage.mappable directly will accomplish the same thing.

Comment on lines +40 to +49
def __iter__(self):
for stage in self.stages:
yield stage.name

def __getitem__(self, name):
names = [s.name for s in self.stages]
if name not in names:
raise KeyError(f"'{name}' not a stage name in this pipeline. Must be one of {names}.")
stage = [s for s in self.stages if s.name == name][0]
return stage
Contributor


I don't see why it's necessary to add dictionary semantics to the Pipeline class. Is it really so much simpler to say

for stage in pipeline:
    ...

instead of

for stage in pipeline.stages:
    ...

The iterator here also returns only the stage name, rather than the stage itself. That necessitates the additional __getitem__ method, which is also unnecessary if you just iterate through pipeline.stages directly.

What I would instead recommend is to provide custom, verbose __repr__ and _repr_html_ methods on the class. These, together with a better docstring, should give the user all the information they need to explore the pipeline.
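As a rough illustration of the repr idea, a verbose __repr__ might summarize each stage's name and call count. The Stage and Pipeline stand-ins below are simplified assumptions for the sketch, not the package's actual classes:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Stage:
    # Simplified stand-in for the package's Stage dataclass.
    name: str
    function: Callable
    mappable: Optional[Iterable] = None


class Pipeline:
    # Simplified stand-in; only the repr logic is of interest here.
    def __init__(self, stages, config=None):
        self.stages = stages
        self.config = config

    def __repr__(self):
        lines = [f"<Pipeline with {len(self.stages)} stages>"]
        for i, s in enumerate(self.stages):
            ncalls = len(list(s.mappable)) if s.mappable is not None else 1
            lines.append(f"  {i}: {s.name} ({ncalls} call{'s' if ncalls != 1 else ''})")
        return "\n".join(lines)


p = Pipeline([Stage("cache_input", print, range(3)), Stage("finalize_target", print)])
print(repr(p))
```

A matching _repr_html_ would render the same summary as an HTML table in Jupyter.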

@cisaacstern
Member Author

cisaacstern commented Dec 2, 2021

I agree with the majority of this review, particularly that Pipeline.__iter__ is unnecessary, and that adding __repr__ and _repr_html_ methods is a much better way to provide introspection than the one-off Stages.ismappable.

the additional __getitem__ method, which is also unnecessary if you just iterate through pipeline.stages directly

This is the one point I'm interested in some further discussion on. If I understand correctly, your proposal is that we should encourage users to debug (and therefore, I'd argue, conceptualize) recipes (in their serial mode) as one big loop, like in the function executor

for stage in pipeline.stages:
    if stage.mappable is not None:
        for m in stage.mappable:
            stage.function(m, config=pipeline.config)
    else:
        stage.function(config=pipeline.config)

This may work well for certain recipes and learning styles. For me, however, the act of explicitly typing out stage names has been an integral part of familiarizing myself with what's happening under the hood of XarrayZarrRecipe. For my learning style, the loop through pipeline.stages is quite opaque and therefore considerably less instructive than actually typing, e.g.

recipe.prepare_target()
# or, because that will be deprecated soon,
pipeline["prepare_target"].function(config=recipe)

Of course even without a __getitem__ (or equivalent), at the beginning of a debugging session we can do

for stage in pipeline.stages:
    if stage.name == "the_stage_i_want_to_debug":
        the_stage_i_want_to_debug = stage

But being able to jump directly to

pipeline["the_stage_i_want_to_debug"]

feels easier, more expressive, and less error-prone to me.

And I suspect this style will be quite useful for entraining new contributors as well, which gets at an aspect of this I'm realizing in real time: namely (no pun intended), that being able to surface stages by name directly via the top level of the Pipeline object (rather than retrieving them in a loop) feels not just like a debugging convenience but also an invaluable teaching tool for helping new users understand what a given recipe actually does. Thanks for your critical eye, which has helped me distill my main goal here; curious to hear your thoughts.

@rabernat
Contributor

rabernat commented Dec 2, 2021

What you say makes a lot of sense, and your perspective as a recipe debugger is unique.

My main goal is to avoid redundant code. If an OrderedDict is a good representation of a Pipeline, let's just make that the fundamental data model. We could change the Pipeline object itself to actually subclass OrderedDict and remove name as an attribute of Stage.

This would mean refactoring quite a bit, but it should be quite doable.

Probably not coincidentally, this would bring a Pipeline close to a Dask Graph, which is a dictionary.

@martindurant
Contributor

Probably not coincidentally, this would bring a Pipeline close to a Dask Graph, which is a dictionary.

(or layers of a HighLevelGraph, which we also mentioned elsewhere)

@cisaacstern
Member Author

We could change the Pipeline object itself to actually subclass OrderedDict, and remove name as an attribute of Stage.
This would mean refactoring quite a bit, but it should be quite doable.

This seems like a great idea. Okay if I attempt this refactor? You've typically handled the larger structural changes, but I feel like I have a reasonable sense of what this entails, and I think it would be a great opportunity to dig deeper into the executor model.

@rabernat
Contributor

rabernat commented Dec 2, 2021

Please go ahead! 🚀

I'm not quite sure the best way to extend an OrderedDict. Perhaps just

@dataclass(frozen=True)
class Pipeline(OrderedDict[str, Stage]):
    config: Optional[Config] = None

The type hints will be the hard part. And maybe it can't be frozen?
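For reference, a minimal sketch of what such a subclass could look like, assuming the stage name moves into the dict key and skipping the frozen dataclass entirely (dict subclasses stay mutable through item assignment regardless, so freezing would be partial at best):

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Stage:
    # `name` is gone: the pipeline's dict key names the stage.
    function: Callable
    mappable: Optional[Iterable] = None


class Pipeline(OrderedDict):
    """Maps stage name -> Stage; `config` rides along as a plain attribute."""

    def __init__(self, stages=None, config=None):
        super().__init__(stages or {})
        self.config = config


pipeline = Pipeline(
    {"cache_input": Stage(print, range(3)), "finalize_target": Stage(print)}
)
```

Iteration (for name in pipeline) and lookup (pipeline["cache_input"]) then come for free from dict semantics, which is roughly the dictionary shape a Dask graph has.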

@cisaacstern
Member Author

The type hints will be the hard part

What do you expect will be hard re: type hints?

@cisaacstern
Member Author

I'm going to close this and reopen a new PR for the OrderedDict refactor when it's ready.

@cisaacstern cisaacstern closed this Dec 3, 2021