
Promote private _compiler() via public to_pipeline() method #251

Closed
wants to merge 8 commits

Conversation

cisaacstern
Member

@cisaacstern cisaacstern commented Nov 30, 2021

Here's an idea for how we might wrap #238 to make a more expressive manual interface for step-through recipe debugging. I've been using NamedManualStages to run pangeo-forge/staged-recipes#93 today. The use pattern looks something like

>>> from pangeo_forge_recipes.executors.python import NamedManualStages
>>> from recipe import recipe

>>> nms = NamedManualStages(recipe)
>>> nms.stage_names

('cache_input', 'prepare_target', 'store_chunk', 'finalize_target')

>>> nms.metadata

{'cache_input': {'index': 0, 'ncalls': 1},
 'prepare_target': {'index': 1, 'ncalls': 1},
 'store_chunk': {'index': 2, 'ncalls': 3},
 'finalize_target': {'index': 3, 'ncalls': 1}}

>>> nms.execute_stage("cache_input")

pangeo_forge_recipes.executors.python - INFO - Executing `'cache_input'`, stage 1 of 4 stages.
pangeo_forge_recipes.executors.python - INFO - Call 1 of 1 for this stage.
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Caching input 'time-0'
# etc.

I haven't written tests yet because I'm unsure whether this should be made to fit the PipelineExecutor model or is perhaps its own standalone concept. If the former, a test fixture might look like

@pytest.fixture(params=[pytest.param(0, marks=pytest.mark.executor_manual)])
def execute_recipe_manual():
    def execute(recipe):
        for stage_name, f in recipe.to_manual():
            # here `f` would be the same `NamedManualStages.execute_stage` method for every iteration
            f(stage_name)

    return execute

Though arguably this is not a PipelineExecutor, because by definition it's not used for executing recipes start-to-finish, but rather piecewise. The ncall_range argument of execute_stage allows the user to specify, for example

>>> nms.execute_stage("cache_input", ncall_range=(30, 40))
# caches only the 30th through 39th inputs for the recipe

which is useful for manually patching a known gap in the cache, or perhaps debugging why an FTP server seems to timeout when particular inputs are requested from it.
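For concreteness, here's a minimal, self-contained sketch of how an ncall_range slice might be applied to a stage's mappable. The Stage stand-in and the exact (start, stop) half-open semantics are assumptions for illustration, not the actual pangeo_forge_recipes implementation:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Stage:
    # Minimal stand-in for the package's Stage dataclass.
    function: Callable
    mappable: Optional[Iterable] = None


def execute_stage(stage, config=None, ncall_range=None):
    # For mappable stages, ncall_range=(start, stop) limits execution
    # to calls start..stop-1; None runs every call.
    if stage.mappable is None:
        stage.function(config=config)
        return
    calls = list(stage.mappable)
    if ncall_range is not None:
        start, stop = ncall_range
        calls = calls[start:stop]
    for m in calls:
        stage.function(m, config=config)


seen = []
stage = Stage(function=lambda m, config=None: seen.append(m), mappable=range(10))
execute_stage(stage, ncall_range=(3, 6))
print(seen)  # prints [3, 4, 5]
```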

@rabernat, what do you think? Is this a reasonable basis for the lowest level of manual execution? If so, where does this fit in the executor model: within, beside? If it's not a PipelineExecutor I can write one-off tests for it (rather than trying to make it fit within the pipeline testing scheme).

@rabernat
Contributor

rabernat commented Dec 1, 2021

Charles -- thanks for working on this and sorry for not giving feedback earlier.

I'm looking at a lot of new lines of code here and trying to consider the maintenance and documentation costs of adding them. It strikes me that the executor you are proposing basically just repacks the Pipeline object into a new, similar, yet distinct data structure. That makes me think the complexity here may not be worth it compared to just giving the user the Pipeline object directly and explaining how to iterate through it.

What if, instead of this, we promoted the private Recipe._compile() method to a public Recipe.to_pipeline() method and added documentation about how to manually step through a pipeline. Wouldn't that achieve the same level of introspection without adding more code to the package?

@cisaacstern
Member Author

I think I grok the PipelineExecutor model now, and have rewritten this PR to fit within that frame. Going to take a stab at the documentation, and will open this as Ready for review after that's complete.

@cisaacstern
Member Author

Wouldn't that achieve the same level of introspection without adding more code to the package?

Interesting idea. Yes, that probably works. I will try that now.

@cisaacstern cisaacstern changed the title Add NamedManualStages convenience wrapper for to_generator Promote private _compiler() via public to_pipeline() method Dec 2, 2021
@cisaacstern cisaacstern marked this pull request as ready for review December 2, 2021 00:26
@cisaacstern
Member Author

cisaacstern commented Dec 2, 2021

@rabernat, I'm going to chalk my two abandoned ideas on this PR (a dataclass, then briefly a PipelineExecutor) up to a good learning tour through the new executor model. But ultimately, yes, just using the existing Pipeline object works better than either of those approaches, and is way more succinct too.

The updates to the Manual Execution docs reflect what I think is an approachable and generalizable style for manually executing any recipe class. (Which is actually syntactically quite close to what we had before, albeit a bit more generalized.)

You'll note from the use pattern described in the docs that, to streamline manual execution, I've added __iter__ and __getitem__ methods to Pipeline and an ismappable property to Stage. This felt necessary to make manual execution as expressive and human-readable a user experience as possible.

@cisaacstern cisaacstern requested a review from rabernat December 2, 2021 00:30
Contributor

@rabernat rabernat left a comment


Charles I really appreciate your work on this.

I want to continue to push to not add any new methods or properties to the Pipeline class (except the repr suggestions), because I truly don't think they are necessary. I think we can accomplish everything we need through better reprs for the class plus documentation.

Sorry if this comes off as negative. I just feel strongly that simplest is really best.

Thanks for your patience with my nit picking. 🙃

Comment on lines +30 to +32
@property
def ismappable(self):
return True if self.mappable is not None else False
Contributor


I feel like this property is not necessary, because checking stage.mappable directly will accomplish the same thing.

Comment on lines +40 to +49
def __iter__(self):
for stage in self.stages:
yield stage.name

def __getitem__(self, name):
names = [s.name for s in self.stages]
if name not in names:
raise KeyError(f"'{name}' not a stage name in this pipeline. Must be one of {names}.")
stage = [s for s in self.stages if s.name == name][0]
return stage
Contributor


I don't see why it's necessary to add dictionary semantics to the Pipeline class. Is it really so much simpler to say

for stage in pipeline:
    ...

instead of

for stage in pipeline.stages:
    ...

The iterator here also returns only the stage name, rather than the stage itself. That necessitates the additional __getitem__ method, which is also unnecessary if you just iterate through pipeline.stages directly.

What I would instead recommend is to provide custom, verbose __repr__ and _repr_html_ methods on the class. These, together with a better docstring, should give the user all the information they need to explore the pipeline.
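As a rough illustration of the repr idea, a verbose __repr__ might summarize each stage's name and call count. The Stage and Pipeline stand-ins below are simplified assumptions for the sketch, not the package's actual classes:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Stage:
    # Simplified stand-in for the package's Stage dataclass.
    name: str
    function: Callable
    mappable: Optional[Iterable] = None


class Pipeline:
    # Simplified stand-in; only the repr logic is of interest here.
    def __init__(self, stages, config=None):
        self.stages = stages
        self.config = config

    def __repr__(self):
        lines = [f"<Pipeline with {len(self.stages)} stages>"]
        for i, s in enumerate(self.stages):
            ncalls = len(list(s.mappable)) if s.mappable is not None else 1
            lines.append(f"  {i}: {s.name} ({ncalls} call{'s' if ncalls != 1 else ''})")
        return "\n".join(lines)


p = Pipeline([Stage("cache_input", print, range(3)), Stage("finalize_target", print)])
print(repr(p))
```

A matching _repr_html_ would render the same summary as an HTML table in Jupyter.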

@cisaacstern
Member Author

cisaacstern commented Dec 2, 2021

I agree with the majority of this review, particularly that Pipeline.__iter__ is unnecessary, and that adding __repr__ and _repr_html_ methods is a much better way to provide introspection than the one-off Stages.ismappable.

the additional __getitem__ method, which is also unnecessary if you just iterate through pipeline.stages directly

This is the one point I'm interested in some further discussion on. If I understand correctly, your proposal is that we should encourage users to debug (and therefore, I'd argue, conceptualize) recipes (in their serial mode) as one big loop, like in the function executor

for stage in pipeline.stages:
    if stage.mappable is not None:
        for m in stage.mappable:
            stage.function(m, config=pipeline.config)
    else:
        stage.function(config=pipeline.config)

This may work well for certain recipes and learning styles. For me, however, the act of explicitly typing out stage names has been an integral part of familiarizing myself with what's happening under the hood of XarrayZarrRecipe. For my learning style, the loop through pipeline.stages is quite opaque and therefore considerably less instructive than actually typing, e.g.

recipe.prepare_target()
# or, because that will be deprecated soon,
pipeline["prepare_target"].function(config=recipe)

Of course even without a __getitem__ (or equivalent), at the beginning of a debugging session we can do

for stage in pipeline.stages:
    if stage.name == "the_stage_i_want_to_debug":
        the_stage_i_want_to_debug = stage

But being able to jump directly to

pipeline["the_stage_i_want_to_debug"]

feels easier, more expressive, and less error-prone to me.

And I suspect this style will be quite useful for entraining new contributors as well, which gets at an aspect of this I'm realizing in real time: namely (no pun intended), that being able to surface stages by name directly via the top level of the Pipeline object (rather than retrieving them in a loop) feels not just like a debugging convenience but also an invaluable teaching tool for helping new users understand what a given recipe actually does. Thanks for your critical eye, which has helped me distill my main goal here; curious to hear your thoughts.

@rabernat
Contributor

rabernat commented Dec 2, 2021

What you say makes a lot of sense, and your perspective as a recipe debugger is unique.

My main goal is to avoid redundant code. If an OrderedDict is a good representation of a Pipeline, let's just make that the fundamental data model. We could change the Pipeline object itself to actually subclass OrderedDict and remove name as an attribute of Stage.

This would mean refactoring quite a bit, but it should be quite doable.

Probably not coincidentally, this would bring a Pipeline close to a Dask Graph, which is a dictionary.

@martindurant
Contributor

Probably not coincidentally, this would bring a Pipeline close to a Dask Graph, which is a dictionary.

(or layers of a HighLevelGraph, which we also mentioned elsewhere)

@cisaacstern
Member Author

We could change the Pipeline object itself to actually subclass OrderedDict, and remove name as an attribute of Stage.
This would mean refactoring quite a bit, but it should be quite doable.

This seems like a great idea. Okay if I attempt this refactor? You've typically handled the larger structural changes, but I feel like I have a reasonable sense of what this entails, and I think it would be a great opportunity to dig deeper into the executor model.

@rabernat
Contributor

rabernat commented Dec 2, 2021

Please go ahead! 🚀

I'm not quite sure the best way to extend an OrderedDict. Perhaps just

@dataclass(frozen=True)
class Pipeline(OrderedDict[str, Stage]):
    config: Optional[Config] = None

The type hints will be the hard part. And maybe it can't be frozen?
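For reference, a minimal sketch of what such a subclass could look like, assuming the stage name moves into the dict key and skipping the frozen dataclass entirely (dict subclasses stay mutable through item assignment regardless, so freezing would be partial at best):

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Stage:
    # `name` is gone: the pipeline's dict key names the stage.
    function: Callable
    mappable: Optional[Iterable] = None


class Pipeline(OrderedDict):
    """Maps stage name -> Stage; `config` rides along as a plain attribute."""

    def __init__(self, stages=None, config=None):
        super().__init__(stages or {})
        self.config = config


pipeline = Pipeline(
    {"cache_input": Stage(print, range(3)), "finalize_target": Stage(print)}
)
```

Iteration (for name in pipeline) and lookup (pipeline["cache_input"]) then come for free from dict semantics, which is roughly the dictionary shape a Dask graph has.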

@cisaacstern
Member Author

The type hints will be the hard part

What do you expect will be hard re: type hints?

@cisaacstern
Member Author

I'm going to close this and reopen a new PR for the OrderedDict refactor when it's ready.

@cisaacstern cisaacstern closed this Dec 3, 2021