Promote private `_compiler()` via public `to_pipeline()` method #251
Charles -- thanks for working on this and sorry for not giving feedback earlier. I'm looking at a lot of new lines of code here and trying to consider the maintenance and documentation costs of adding it. It strikes me that this executor you are proposing is basically just packing the What if, instead of this, we promoted the private
I think I grok the
Interesting idea. Yes, that probably works. I will try that now.
Title changed from "`NamedManualStages` convenience wrapper for to_generator" to "Promote private `_compiler()` via public `to_pipeline()` method"
@rabernat, I'm going to chalk my two abandoned ideas on this PR (a dataclass, then briefly a PipelineExecutor) up to a good learning tour through the new executor model. But ultimately, yes, just using the existing The updates to the Manual Execution docs reflect what I think is an approachable and generalizable style for manually executing any recipe class. (Which is actually syntactically quite close to what we had before, albeit a bit more generalized.) You'll note from the use pattern described in the docs that, to streamline manual execution, I've added
Charles I really appreciate your work on this.
I want to continue to push to not add any new methods or properties to the Pipeline class (except the repr suggestions), because I truly don't think they are necessary. I think we can accomplish everything we need through better `repr`s for the class plus documentation.
Sorry if this comes off as negative. I just feel strongly that simplest is really best.
Thanks for your patience with my nit picking. 🙃
    @property
    def ismappable(self):
        return True if self.mappable is not None else False
I feel like this property is not necessary because `if stage.mappable` will accomplish the same thing.
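A minimal sketch of the comparison above. The `Stage` dataclass here is a hypothetical stand-in carrying only the `name`, `function`, and `mappable` attributes mentioned in this thread, and the stage names other than `prepare_target` are made up for illustration. One caveat: the truthiness check and the `is not None` check agree for `None` and for a non-empty collection, but they differ when `mappable` is an empty list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for a pipeline stage (assumed fields only)."""

    name: str
    function: Callable
    mappable: Optional[Iterable] = None


def noop(*args, **kwargs):
    """Placeholder stage function; does nothing."""
    return None


plain = Stage("prepare_target", noop)                 # no mappable
mapped = Stage("stage_b", noop, mappable=[0, 1, 2])   # non-empty mappable
empty = Stage("stage_c", noop, mappable=[])           # empty mappable

# Truthiness check, as suggested in the review comment
truthy = [bool(s.mappable) for s in (plain, mapped, empty)]

# Explicit None check, as written in the proposed property
not_none = [s.mappable is not None for s in (plain, mapped, empty)]
```

If an empty `mappable` should still count as "mappable", `stage.mappable is not None` at the call site keeps that behavior without adding a property.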
    def __iter__(self):
        for stage in self.stages:
            yield stage.name

    def __getitem__(self, name):
        names = [s.name for s in self.stages]
        if name not in names:
            raise KeyError(f"'{name}' not a stage name in this pipeline. Must be one of {names}.")
        stage = [s for s in self.stages if s.name == name][0]
        return stage
I don't see why it's necessary to add dictionary semantics to the Pipeline class. Is it really so much simpler to say

    for stage in pipeline:
        ...

instead of

    for stage in pipeline.stages:
        ...

The iterator here also returns only the stage name, rather than the stage itself. That necessitates the additional `__getitem__` method, which is also unnecessary if you just iterate through `pipeline.stages` directly.
What I would instead recommend is to provide custom and verbose `__repr__` and `_repr_html_` methods on the class. These, together with a better docstring, should give the user all the information they need to explore the pipeline.
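A rough sketch of what such repr methods could look like. `Stage` and `Pipeline` below are hypothetical stand-ins built only from the attributes named in this thread (`stages`, `config`, `name`, `function`, `mappable`); the output layout is an assumption, not the project's actual format. `_repr_html_` is the hook Jupyter/IPython looks for when rendering an object as HTML.

```python
from dataclasses import dataclass
from html import escape
from typing import Any, Callable, Iterable, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for a pipeline stage (assumed fields only)."""

    name: str
    function: Callable
    mappable: Optional[Iterable] = None


@dataclass(frozen=True, repr=False)  # repr=False: we supply our own below
class Pipeline:
    """Hypothetical stand-in for the Pipeline class discussed here."""

    stages: Sequence[Stage]
    config: Any = None

    def __repr__(self) -> str:
        # One line per stage, in execution order
        lines = [f"Pipeline with {len(self.stages)} stage(s):"]
        for i, s in enumerate(self.stages):
            kind = "mapped" if s.mappable is not None else "single call"
            lines.append(f"  [{i}] {s.name!r} ({kind})")
        return "\n".join(lines)

    def _repr_html_(self) -> str:
        # Minimal HTML table for notebook display
        rows = "".join(
            f"<tr><td>{i}</td><td>{escape(s.name)}</td></tr>"
            for i, s in enumerate(self.stages)
        )
        return f"<table><tr><th>#</th><th>stage</th></tr>{rows}</table>"


pipeline = Pipeline([Stage("a", print), Stage("b", print, mappable=[1])])
text = repr(pipeline)
html = pipeline._repr_html_()
```

With something like this, a user can discover the stage names by printing the pipeline and then iterate over `pipeline.stages` directly.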
I agree with the majority of this review, particularly that
This is the one point I'm interested in some further discussion on. If I understand correctly, your proposal is that we should encourage users to debug (and therefore, I'd argue, conceptualize) recipes (in their serial mode) as one big loop, like in the function executor pangeo-forge-recipes/pangeo_forge_recipes/executors/python.py Lines 41 to 46 in 5355479
This may work well for certain recipes and learning styles. For me, however, the act of explicitly typing out stage names has been an integral part of familiarizing myself with what's happening under the hood of

    recipe.prepare_target()
    # or, because that will be deprecated soon,
    pipeline["prepare_target"].function(config=recipe)

Of course even without a

    for stage in pipeline.stages:
        if stage.name == "the_stage_i_want_to_debug":
            the_stage_i_want_to_debug = stage

But being able to jump directly to `pipeline["the_stage_i_want_to_debug"]` feels easier, more expressive, and less error-prone to me. And I suspect this style will be quite useful for entraining new contributors as well. Which I guess gets at an aspect of this I'm realizing in real time, namely (no pun intended) that being able to surface stages by name directly via the top level of the
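For comparison, here is a sketch of name-based access that needs no new methods on the class: build a plain dict from `pipeline.stages` once, then index it. `Stage`, `Pipeline`, the stage functions, and the `function(key, config=...)` calling convention for mapped stages are assumptions made for illustration; only `function(config=...)` appears in this thread.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for a pipeline stage (assumed fields only)."""

    name: str
    function: Callable
    mappable: Optional[Iterable] = None


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical stand-in for the Pipeline class discussed here."""

    stages: List[Stage]
    config: Any = None


calls = []  # records what ran, so the effect is visible


def prepare_target(config=None):
    calls.append(("prepare_target", config))


def store_chunk(key, config=None):
    calls.append(("store_chunk", key, config))


pipeline = Pipeline(
    stages=[
        Stage("prepare_target", prepare_target),
        Stage("store_chunk", store_chunk, mappable=[0, 1]),
    ],
    config="recipe-config",
)

# Name -> stage lookup, built outside the Pipeline class
by_name = {s.name: s for s in pipeline.stages}

# Jump straight to one stage by name and run it
by_name["prepare_target"].function(config=pipeline.config)

# For a mapped stage, call the function once per element of `mappable`
stage = by_name["store_chunk"]
for key in stage.mappable:
    stage.function(key, config=pipeline.config)
```

This keeps the explicit "type out the stage name" workflow while leaving the class itself unchanged; the trade-off is one extra line for the user.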
What you say makes a lot of sense, and your perspective as a recipe debugger is unique. My main goal is to avoid redundant code. If an OrderedDict is a good representation of a Pipeline, let's just make that the fundamental data model. We could change the Pipeline object itself to actually subclass This would mean refactoring quite a bit, but it should be quite doable. Probably not coincidentally, this would bring a Pipeline close to a Dask Graph, which is a dictionary.
(or layers of a HighLevelGraph, which we also mentioned elsewhere)
This seems like a great idea. Okay if I attempt this refactor? You've typically handled the larger structural changes, but I feel like I have a reasonable sense of what this entails, and I think it would be a great opportunity to dig deeper into the executor model.
Please go ahead! 🚀 I'm not quite sure the best way to extend an

    @dataclass(frozen=True)
    class Pipeline(OrderedDict[str, Stage]):
        config: Optional[Config] = None

The type hints will be the hard part. And maybe it can't be frozen?
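One possible shape for this, sketched as a plain (non-dataclass, non-frozen) subclass to sidestep the open questions about combining `@dataclass(frozen=True)` with `OrderedDict`. Everything here is hypothetical: the constructor signature, the `Stage` stand-in, and storing `config` as an instance attribute are assumptions, not the refactor that was eventually written.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional


@dataclass(frozen=True)
class Stage:
    """Hypothetical stand-in for a pipeline stage (assumed fields only)."""

    name: str
    function: Callable
    mappable: Optional[Iterable] = None


class Pipeline(OrderedDict):
    """Hypothetical: maps stage name -> Stage, preserving execution order."""

    def __init__(self, stages: Iterable[Stage] = (), config: Any = None):
        # Key each stage by its name; insertion order is execution order
        super().__init__((s.name, s) for s in stages)
        self.config = config


def noop(*args, **kwargs):
    """Placeholder stage function; does nothing."""
    return None


pipeline = Pipeline(
    [Stage("first", noop), Stage("second", noop, mappable=[1, 2])],
    config="cfg",
)

names = list(pipeline)        # iterating yields stage names, as with any dict
second = pipeline["second"]   # direct lookup by name
```

Because the constructor takes an iterable of stages rather than a mapping, inherited helpers that rebuild the object from itself (such as `copy()`) would need overriding in a real implementation, and the mapping stays mutable.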
What do you expect will be hard re: type hints?
I'm going to close this and reopen a new PR for the OrderedDict refactor when it's ready.
Here's an idea for how we might wrap #238 to make a more expressive manual interface for step-through recipe debugging. I've been using `NamedManualStages` to run pangeo-forge/staged-recipes#93 today. The use pattern looks something like

I haven't written tests yet because I'm unsure whether or not this should be made to fit the `PipelineExecutor` model, or if it's perhaps just its own standalone concept. If the former, a test fixture might look like

Though arguably this is not a `PipelineExecutor` because by definition it's not used for executing recipes start-to-finish, but rather piecewise. The `ncall_range` argument of `execute_stage` allows the user to specify, for example

which is useful for manually patching a known gap in the cache, or perhaps debugging why an FTP server seems to time out when particular inputs are requested from it.

@rabernat, what do you think? Is this a reasonable basis for the lowest level of manual execution? If so, where does this fit in the executor model: within, beside? If it's not a `PipelineExecutor` I can write one-off tests for it (rather than trying to make it fit within the pipeline testing scheme).