-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move the canonical class to parsing and core modules #100
Conversation
|
||
|
||
def test_workflow_canonicalization(): | ||
def test_workflow_(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A test for Workflow.from_config_workflow
already exists. If this now tests something in addition to the existing one, move it to test_workflow
(and deduplicate). Else, remove the module for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok! will check this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok done!
@@ -541,11 +548,12 @@ class ConfigWorkflow(BaseModel): | |||
|
|||
""" | |||
|
|||
rootdir: None | Path = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We took rootdir
out of here, because it can not be supplied inside the yaml file, based on Matthieu's comment on #89 . What is the reasoning for reintroducing it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be supplied by using the from_config_file
constructor, it seems to me also reasonable that ConfigWorkflow
is aware of its location. It is not perfect since you are right that the information is not supplied in the yaml file. The alternative, however, to make the core.Workflow
aware of how the ConfigWorkflow
should be used for parsing seems to me less ideal. So I think it is not the perfect but the alternative seems worse to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what way was core.Workflow
aware of any of this in #89? Does it not provide exactly that alternative without the drawbacks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this logic was in _yaml_data_models
so core.Workflow
was not aware of the parsing logic, I remembered it wrongly. Before the canonicalization the rootdir
was updated in the _yaml_data_models.load_workflow_config
function. Then this was moved to the construction of the CanonicalWorkflow
. In this PR we move it to the ConfigWorkflow
into the new from_config_file
constructor so it is handled within the class which is more logic than adding it an util function as it was before the canonicalization. I think this an implementation imitating the benefits of the CanonicalWorkflow
but moving it to the ConfigWorkflow
. One can use rootdir
is an indicator if the workflow was created from a file or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain in which way moving from_config_file
from the module to the class makes it "more logical"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do this, this PR should add versions of each test that uses ConfigWorkflow
with rootdir=None
and check the appropriate outcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain in which way moving from_config_file from the module to the class makes it "more logical"?
Since it is essentially a method that constructs an instance of the class.
If we do this, this PR should add versions of each test that uses ConfigWorkflow with rootdir=None and check the appropriate outcome.
Added tests
cycles: Annotated[list[ConfigCycle], AfterValidator(list_not_empty)] | ||
tasks: Annotated[list[ConfigTask], AfterValidator(list_not_empty)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can add the validators here, then the CanonicalWorkflow
class only adds the rootdir. I agree that then, there is a more elegant solution that duplicates less code and is orthogonal to the changes to workflow.Workflow
in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest replacing CanonicalWorkflow
with a dataclass / pydantic model / namedtuple (whatever), which puts a type name to ConfigWorkflow
+ rootdir
. workflow.Workflow
should then only deal with that type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can add the validators here, then the
CanonicalWorkflow
class only adds the rootdir. I agree that then, there is a more elegant solution that duplicates less code and is orthogonal to the changes toworkflow.Workflow
in this PR.
CanonicalWorkflow
also makes name
non-optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used now a reasonable default argument for rootdir
that I would have been favor for in a future PR, but since it simplifies the handling of the rootdir
with both constructors, I introduced it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we move rootdir
outside of the ConfigWorkflow I think it would become problematic with setting rootdir
inside the yaml file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had sort of assumed that there was a design decision to not include rootdir
in the yaml config. If we can (and also make it and name
mandatory) then CanonicalWorkflow
can be removed without replacement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had sort of assumed that there was a design decision to not include rootdir in the yaml config
I am not sure. I think the name rootdir
indicates that it should not be set by the user, but right now it acts more like a working directory which I would argue should be settable by the user. One could introduce a new variable that makes this difference apparent (so we have rootdir
and working_dir
. On the other hand, the only the usage of rootdir
is as a working directory, ergo one could replace it. What do you think @DropD @leclairm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agoscinski
rootdir
is the directory where we find the full configuration, which consists not only of the yaml file but also all the other ones like ICON namelists, user scripts for shell tasks, etc... Subsequently, this is also the root path used to resolve the relative paths provided in the yaml file for these other config files. So it has to simply be the parent directory of the yaml file, nothing else, and must not be set by the user in the yaml file.
Regarding its name, rootdir
seems logical as it is originally a property of CanonicalWorkflow
being itself derived from ConfigWorkflow
. Then it becomes Workflow.config_rootdir
which still makes sense. The only name we could change here for consistency is CanonicalWorkflow
-> CanonicalConfigWorkflow
(if the canonical class is kept, I still need to read through the PR again to understand what it enables)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rootdir is the directory where we find the full configuration, which consists not only of the yaml file but also all the other ones like ICON namelists, user scripts for shell tasks, etc... Subsequently, this is also the root path used to resolve the relative paths provided in the yaml file for these other config files. So it has to simply be the parent directory of the yaml file, nothing else, and must not be set by the user in the yaml file.
But here again you argue that there is must be rootdir
, then we use it as working directory. I argue there must be a working directory, and if this working directory is not specified we use the rootdir
as it. I don't see why we should not give the user the possibility to specify where the data for the workflow is so the user has more flexibility in arranging his/her folder structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided in a meeting to keep rootdir
as a name because working_directory
implies too much that the calculation happens there
src/sirocco/core/workflow.py
Outdated
data_dict: dict[str, ConfigAvailableData | ConfigGeneratedData] = {data.name: data for data in chain(data.available, data.generated)} | ||
task_dict: dict[str, ConfigTask] = {task.name: task for task in tasks} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The properties could also be here (see the other comment about properties).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or, for performance reasons, they could store the output of the properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are just temporaries that are needed for the creation of the Workflow. One can add them as properties but they are nowhere else in the code used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then properties are the way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand, why introduce a property for something that is computed once? Do you want to capsulate the logic to test it? This I can understand, but then I would make them private properties, since we expose it to the API of the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. no as property it also does not make sense as it is created from the tasks: ConfigTask
that are only provided during initialization in core.Workflow
, so accessing it afterwards does not make so much sense. If you want to test it I would propose to imitate the build_internal_dicts
function in core.Workflow
def _build_data_dict(self, data: ConfigData) -> dict[str, ConfigAvailableData | ConfigGeneratedData]:
return data_dict = {data.name: data for data in data.available} | {
data.name: data for data in data.generated
}
same for task dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it now like this, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a discussion with @DropD we decided to remove it since it is only needed at one place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes strike me as orthogonal to the goal of canonicalizing the yaml models. They are improvements along the same lines in that they make the workflow.Workflow
class and everything that depends on it more testable. Nevertheless, they could be isolated and put into their own PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure we we can merge it after #89 but it needs to be based on test-config-workflow
to see a proper diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant isolated from other changes, like undoing the shift to properties instead of duplicating information, moving the validators to the ConfigWorkflow and replacing CanonicalWorkflow with some other way of storing ConfigWorkflow + rootdir.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of this PR is to move all the logic that was implemented in CanonicalWorkflow
to the classes Workflow
and ConfigWorkflow
, that has been done so far. I feel like splitting it up into several PRs seems not fully taking all logic that are currently implemented in account.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can split it up maybe in multiple commits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I splitted up the PR in multiple commits but a lot of things are not orthogonal improvements and need to be in the first commit, I am very verbose in the commit message to explain the dependency of the changes within the commit
20c25bb
to
f104396
Compare
src/sirocco/core/workflow.py
Outdated
def __init__(self, | ||
name: str, | ||
rootdir: Path, | ||
cycles: list[ConfigCycle], | ||
tasks: list[ConfigTask], | ||
data: ConfigData, | ||
parameters: dict[str, list], # TODO type should be more specified but not in this PR | ||
) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This signature change is valuable, especially together with the from_config_workflow
classmethod.
src/sirocco/core/workflow.py
Outdated
if not isinstance(rootdir, Path): | ||
msg = f"Need to provide path for rootdir but got {type(rootdir)}" | ||
raise TypeError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving this logic here is what we agreed not to do, as it is part of dealing with user input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay removed, see latest commit
# TODO maybe not best design, since we have multiple entry points, but | ||
# on the other hand we don't have rootdir ambiguity,I think this can be | ||
# better integrated with pydantic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which way is it true that "we don't have rootdir ambiguity"? .rootdir
is still type hinted as Path | None
and a check in workflow.Workflow.__init__
is still required, is there another definition of "ambiguity"?
@@ -541,11 +548,12 @@ class ConfigWorkflow(BaseModel): | |||
|
|||
""" | |||
|
|||
rootdir: None | Path = None | |||
name: str | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since now ambiguity in the type of name
is reintroduced, all tests are required to have a version for name=None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rootdir
and name
cannot be None anymore
7e6e0a7
to
f2eae39
Compare
484c50d
to
c53383c
Compare
@classmethod | ||
def from_config_workflow(cls: type[Self], config_workflow: ConfigWorkflow) -> Workflow: | ||
return cls( | ||
name=config_workflow.name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail mypy checks, because ConfigWorkflow.name
can be None
. That was the main job of CanonicalWorkflow
to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR needs to catch that now. In general, please run mypy
manually and check that it isn't a problem anywhere. This is the bar for any PR that wants to replace CanonicalWorkflow
(You can run mypy there and compare).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name
cannot be anymore None, by separating the yaml parsing and validation one can set it in the from_config_file
constructor. The only reason why we allowed None
was because before we could not determine if name
is specified in the given file.
src/sirocco/core/workflow.py
Outdated
def from_config_workflow(cls: type[Self], config_workflow: ConfigWorkflow) -> Workflow: | ||
return cls( | ||
name=config_workflow.name, | ||
rootdir=config_workflow.rootdir, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail mypy checks, because ConfigWorkflow.rootdir
can be None. That was the main job of CanonicalWorkflow to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rootdir
and name
cannot be None anymore
rootdir: Path | ||
name: str | ||
cycles: Annotated[list[ConfigCycle], AfterValidator(list_not_empty)] | ||
tasks: Annotated[list[ConfigTask], AfterValidator(list_not_empty)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we also use check_parameters_lists
as BeforeValidator
annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just replicated the existing check. In this case it does not make a real difference. If I am not wrong, it just specifies if it happens before the cycles are converted from list of dictionaries to a list of ConfigCycles or after. Doing it after or before should not change it logically. But yes we could change it to BeforeValidator
.
return canonicalize_workflow(config_workflow=parsed_workflow, rootdir=rootdir) | ||
# return parsed_workflow | ||
@classmethod | ||
def from_config_file(cls, config_filename: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config_filename
-> config_file_path
would be more logical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
depends, path could also mean that you need to specify a pathlib.Path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah misunderstood, yes in this case we actually specify a path not a filename, will do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed, I used config_path since this name was used in the visualizer
src/sirocco/core/workflow.py
Outdated
@@ -88,5 +107,21 @@ def cycle_dates(cycle_config: ConfigCycle) -> Iterator[datetime]: | |||
yield date | |||
|
|||
@classmethod | |||
def from_yaml(cls: type[Self], config_path: str) -> Self: | |||
return cls(load_workflow_config(config_path)) | |||
def from_yaml(cls: type[Self], config_filename: str) -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here also config_file_path
is preferable to config_filename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I wanted to change this in another PR to keep it separate, but one could also do it in this PR and make another commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed now in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used config_path
since this name was used in the visualizer
I'm fine with the PR, once we've handled the |
a719ac9
to
42ffff2
Compare
…rkflow` The validation of `cycles` and `tasks` in `list_not_empty` is moved to the `ConfigWorkflow`. This subsequently required a modification of the tests that initalize a `ConfigWorkflow` with empty `cycles` and `tasks` as this is not anymore allowed through the tests. We switched to `BeforeValidator` as the check can be happen before pydantic validation. The `data_dict` and `task_dict` member variables are moved to the constructor of the `core.Workflow`. The constructor of `core.Workflow` is split into two: `from_config_workflow` constructor, which replicates the behavir of the previous constructor and the default construcor now accepts all required that are extracted from the passed `ConfigWorkflow` in the previous constructor as individual parameters. The introduction of the `from_config_file` constructor in ConfigWorkflow, which replicates the behavior of `load_workflow_config`, allows the `rootdir` to become a part of `ConfigWorkflow`. This approach eliminates the need for an external utility function for workflow creation from a file. Since the utility function was essentially acting as a constructor for `ConfigWorkflow`, it simplifies the interface by enabling direct access to the `rootdir` within `core.Workflow`. In addition we make `rootdir` and `name` not optional as these parameters cane determined in the new `from_config_file` and passed to the default constructor. Furthermore, We introduce the util function `validate_yaml_content` that allows a more generic usage of creating an instance from a `Conig*` class from a yaml string, especially used in the docstring tests.
The doctest was referencing a task that was never used in the tasks and also did not reference `task_a`. So `task_b` was renamed to `task_b` to solve these issues.
47052df
to
672576e
Compare
There is a bug the
Task.plugin_classes
are not yet initialized. Maybe it is because of the constructor approach I take. Need to investigate