Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move the canonical class to parsing and core modules #100

Merged
merged 3 commits into from
Jan 31, 2025
Merged

Conversation

agoscinski
Copy link
Collaborator

There is a bug the Task.plugin_classes are not yet initialized. Maybe it is because of the constructor approach I take. Need to investigate



def test_workflow_canonicalization():
def test_workflow_():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test for Workflow.from_config_workflow already exists. If this now tests something in addition to the existing one, move it to test_workflow (and deduplicate). Else, remove the module for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok! will check this

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok done!

@@ -541,11 +548,12 @@ class ConfigWorkflow(BaseModel):

"""

rootdir: None | Path = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We took rootdir out of here, because it can not be supplied inside the yaml file, based on Matthieu's comment on #89 . What is the reasoning for reintroducing it here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be supplied by using the from_config_file constructor, it seems to me also reasonable that ConfigWorkflow is aware of its location. It is not perfect since you are right that the information is not supplied in the yaml file. The alternative, however, to make the core.Workflow aware of how the ConfigWorkflow should be used for parsing seems to me less ideal. So I think it is not the perfect but the alternative seems worse to me

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what way was core.Workflow aware of any of this in #89? Does it not provide exactly that alternative without the drawbacks?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this logic was in _yaml_data_models so core.Workflow was not aware of the parsing logic, I remembered it wrongly. Before the canonicalization the rootdir was updated in the _yaml_data_models.load_workflow_config function. Then this was moved to the construction of the CanonicalWorkflow. In this PR we move it to the ConfigWorkflow into the new from_config_file constructor so it is handled within the class which is more logic than adding it an util function as it was before the canonicalization. I think this an implementation imitating the benefits of the CanonicalWorkflow but moving it to the ConfigWorkflow. One can use rootdir is an indicator if the workflow was created from a file or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain in which way moving from_config_file from the module to the class makes it "more logical"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do this, this PR should add versions of each test that uses ConfigWorkflow with rootdir=None and check the appropriate outcome.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain in which way moving from_config_file from the module to the class makes it "more logical"?

Since it is essentially a method that constructs an instance of the class.

If we do this, this PR should add versions of each test that uses ConfigWorkflow with rootdir=None and check the appropriate outcome.

Added tests

Comment on lines 553 to 684
cycles: Annotated[list[ConfigCycle], AfterValidator(list_not_empty)]
tasks: Annotated[list[ConfigTask], AfterValidator(list_not_empty)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can add the validators here, then the CanonicalWorkflow class only adds the rootdir. I agree that then, there is a more elegant solution that duplicates less code and is orthogonal to the changes to workflow.Workflow in this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest replacing CanonicalWorkflow with a dataclass / pydantic model / namedtuple (whatever), which puts a type name to ConfigWorkflow + rootdir. workflow.Workflow should then only deal with that type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can add the validators here, then the CanonicalWorkflow class only adds the rootdir. I agree that then, there is a more elegant solution that duplicates less code and is orthogonal to the changes to workflow.Workflow in this PR.

CanonicalWorkflow also makes name non-optional

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used now a reasonable default argument for rootdir that I would have been favor for in a future PR, but since it simplifies the handling of the rootdir with both constructors, I introduced it here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we move rootdir outside of the ConfigWorkflow I think it would become problematic with setting rootdir inside the yaml file

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had sort of assumed that there was a design decision to not include rootdir in the yaml config. If we can (and also make it and name mandatory) then CanonicalWorkflow can be removed without replacement.

Copy link
Collaborator Author

@agoscinski agoscinski Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had sort of assumed that there was a design decision to not include rootdir in the yaml config

I am not sure. I think the name rootdir indicates that it should not be set by the user, but right now it acts more like a working directory which I would argue should be settable by the user. One could introduce a new variable that makes this difference apparent (so we have rootdirand working_dir. On the other hand, the only the usage of rootdir is as a working directory, ergo one could replace it. What do you think @DropD @leclairm

Copy link
Contributor

@leclairm leclairm Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agoscinski
rootdir is the directory where we find the full configuration, which consists not only of the yaml file but also all the other ones like ICON namelists, user scripts for shell tasks, etc... Subsequently, this is also the root path used to resolve the relative paths provided in the yaml file for these other config files. So it has to simply be the parent directory of the yaml file, nothing else, and must not be set by the user in the yaml file.

Regarding its name, rootdir seems logical as it is originally a property of CanonicalWorkflow being itself derived from ConfigWorkflow. Then it becomes Workflow.config_rootdir which still makes sense. The only name we could change here for consistency is CanonicalWorkflow -> CanonicalConfigWorkflow (if the canonical class is kept, I still need to read through the PR again to understand what it enables)

Copy link
Collaborator Author

@agoscinski agoscinski Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rootdir is the directory where we find the full configuration, which consists not only of the yaml file but also all the other ones like ICON namelists, user scripts for shell tasks, etc... Subsequently, this is also the root path used to resolve the relative paths provided in the yaml file for these other config files. So it has to simply be the parent directory of the yaml file, nothing else, and must not be set by the user in the yaml file.

But here again you argue that there is must be rootdir, then we use it as working directory. I argue there must be a working directory, and if this working directory is not specified we use the rootdir as it. I don't see why we should not give the user the possibility to specify where the data for the workflow is so the user has more flexibility in arranging his/her folder structure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided in a meeting to keep rootdir as a name because working_directory implies too much that the calculation happens there

Comment on lines 44 to 47
data_dict: dict[str, ConfigAvailableData | ConfigGeneratedData] = {data.name: data for data in chain(data.available, data.generated)}
task_dict: dict[str, ConfigTask] = {task.name: task for task in tasks}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The properties could also be here (see the other comment about properties).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or, for performance reasons, they could store the output of the properties.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are just temporaries that are needed for the creation of the Workflow. One can add them as properties but they are nowhere else in the code used

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then properties are the way to go.

Copy link
Collaborator Author

@agoscinski agoscinski Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand, why introduce a property for something that is computed once? Do you want to capsulate the logic to test it? This I can understand, but then I would make them private properties, since we expose it to the API of the class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. no as property it also does not make sense as it is created from the tasks: ConfigTask that are only provided during initialization in core.Workflow, so accessing it afterwards does not make so much sense. If you want to test it I would propose to imitate the build_internal_dicts function in core.Workflow

    def _build_data_dict(self, data: ConfigData) -> dict[str, ConfigAvailableData | ConfigGeneratedData]:
        return data_dict = {data.name: data for data in data.available} | {
            data.name: data for data in data.generated
        }

same for task dict

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it now like this, what do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a discussion with @DropD we decided to remove it since it is only needed at one place

Copy link
Collaborator

@DropD DropD Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes strike me as orthogonal to the goal of canonicalizing the yaml models. They are improvements along the same lines in that they make the workflow.Workflow class and everything that depends on it more testable. Nevertheless, they could be isolated and put into their own PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure we we can merge it after #89 but it needs to be based on test-config-workflow to see a proper diff

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant isolated from other changes, like undoing the shift to properties instead of duplicating information, moving the validators to the ConfigWorkflow and replacing CanonicalWorkflow with some other way of storing ConfigWorkflow + rootdir.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of this PR is to move all the logic that was implemented in CanonicalWorkflow to the classes Workflow and ConfigWorkflow, that has been done so far. I feel like splitting it up into several PRs seems not fully taking all logic that are currently implemented in account.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can split it up maybe in multiple commits?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I splitted up the PR in multiple commits but a lot of things are not orthogonal improvements and need to be in the first commit, I am very verbose in the commit message to explain the dependency of the changes within the commit

Base automatically changed from test-config-workflow to main January 21, 2025 14:26
Comment on lines 28 to 35
def __init__(self,
name: str,
rootdir: Path,
cycles: list[ConfigCycle],
tasks: list[ConfigTask],
data: ConfigData,
parameters: dict[str, list], # TODO type should be more specified but not in this PR
) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This signature change is valuable, especially together with the from_config_workflow classmethod.

Comment on lines 38 to 40
if not isinstance(rootdir, Path):
msg = f"Need to provide path for rootdir but got {type(rootdir)}"
raise TypeError(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this logic here is what we agreed not to do, as it is part of dealing with user input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay removed, see latest commit

Comment on lines 583 to 585
# TODO maybe not best design, since we have multiple entry points, but
# on the other hand we don't have rootdir ambiguity,I think this can be
# better integrated with pydantic
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which way is it true that "we don't have rootdir ambiguity"? .rootdir is still type hinted as Path | None and a check in workflow.Workflow.__init__ is still required, is there another definition of "ambiguity"?

@@ -541,11 +548,12 @@ class ConfigWorkflow(BaseModel):

"""

rootdir: None | Path = None
name: str | None = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since now ambiguity in the type of name is reintroduced, all tests are required to have a version for name=None

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rootdir and name cannot be None anymore

@agoscinski agoscinski force-pushed the move-to-core branch 3 times, most recently from 484c50d to c53383c Compare January 27, 2025 09:42
@agoscinski agoscinski requested a review from DropD January 27, 2025 10:59
@agoscinski agoscinski changed the title WIP: moving the canonical class to config and core Moving the canonical class to parsing and core modules Jan 27, 2025
@agoscinski agoscinski marked this pull request as ready for review January 27, 2025 11:03
@classmethod
def from_config_workflow(cls: type[Self], config_workflow: ConfigWorkflow) -> Workflow:
return cls(
name=config_workflow.name,
Copy link
Collaborator

@DropD DropD Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail mypy checks, because ConfigWorkflow.name can be None. That was the main job of CanonicalWorkflow to avoid.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR needs to catch that now. In general, please run mypy manually and check that it isn't a problem anywhere. This is the bar for any PR that wants to replace CanonicalWorkflow (You can run mypy there and compare).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name cannot be anymore None, by separating the yaml parsing and validation one can set it in the from_config_file constructor. The only reason why we allowed None was because before we could not determine if name is specified in the given file.

def from_config_workflow(cls: type[Self], config_workflow: ConfigWorkflow) -> Workflow:
return cls(
name=config_workflow.name,
rootdir=config_workflow.rootdir,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail mypy checks, because ConfigWorkflow.rootdir can be None. That was the main job of CanonicalWorkflow to avoid.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rootdir and name cannot be None anymore

src/sirocco/parsing/_yaml_data_models.py Outdated Show resolved Hide resolved
rootdir: Path
name: str
cycles: Annotated[list[ConfigCycle], AfterValidator(list_not_empty)]
tasks: Annotated[list[ConfigTask], AfterValidator(list_not_empty)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we also use check_parameters_lists as BeforeValidator annotation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just replicated the existing check. In this case it does not make a real difference. If I am not wrong, it just specifies if it happens before the cycles are converted from list of dictionaries to a list of ConfigCycles or after. Doing it after or before should not change it logically. But yes we could change it to BeforeValidator.

src/sirocco/core/workflow.py Outdated Show resolved Hide resolved
src/sirocco/core/workflow.py Outdated Show resolved Hide resolved
src/sirocco/core/workflow.py Outdated Show resolved Hide resolved
return canonicalize_workflow(config_workflow=parsed_workflow, rootdir=rootdir)
# return parsed_workflow
@classmethod
def from_config_file(cls, config_filename: str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config_filename -> config_file_path would be more logical

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depends, path could also mean that you need to specify a pathlib.Path

Copy link
Collaborator Author

@agoscinski agoscinski Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah misunderstood, yes in this case we actually specify a path not a filename, will do it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed, I used config_path since this name was used in the visualizer

@@ -88,5 +107,21 @@ def cycle_dates(cycle_config: ConfigCycle) -> Iterator[datetime]:
yield date

@classmethod
def from_yaml(cls: type[Self], config_path: str) -> Self:
return cls(load_workflow_config(config_path))
def from_yaml(cls: type[Self], config_filename: str) -> Self:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here also config_file_path is preferable to config_filename

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I wanted to change this in another PR to keep it separate, but one could also do it in this PR and make another commit

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed now in this PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used config_path since this name was used in the visualizer

@leclairm
Copy link
Contributor

I'm fine with the PR, once we've handled the ConfigWorkflow.rootdir issue (cannot be set by the user).

@agoscinski agoscinski force-pushed the move-to-core branch 2 times, most recently from a719ac9 to 42ffff2 Compare January 30, 2025 16:47
…rkflow`

The validation of `cycles` and `tasks` in `list_not_empty` is moved to the
`ConfigWorkflow`. This subsequently required a modification of the tests that
initalize a `ConfigWorkflow` with empty `cycles` and `tasks` as this is not
anymore allowed through the tests. We switched to `BeforeValidator`
as the check can be happen before pydantic validation.

The `data_dict` and `task_dict` member variables are moved to the
constructor of the `core.Workflow`. The constructor of `core.Workflow`
is split into two: `from_config_workflow` constructor, which replicates
the behavir of the previous constructor and the default construcor now
accepts all required that are extracted from the passed `ConfigWorkflow`
in the previous constructor as individual parameters.

The introduction of the `from_config_file` constructor in ConfigWorkflow, which
replicates the behavior of `load_workflow_config`, allows the `rootdir` to
become a part of `ConfigWorkflow`. This approach eliminates the need for an
external utility function for workflow creation from a file. Since the utility
function was essentially acting as a constructor for `ConfigWorkflow`, it
simplifies the interface by enabling direct access to the `rootdir` within
`core.Workflow`. In addition we make `rootdir` and `name` not optional as these
parameters cane determined in the new `from_config_file` and passed to the
default constructor. Furthermore, We introduce the util function
`validate_yaml_content` that allows a more generic usage of creating an instance
from a `Conig*` class from a yaml string, especially used in the docstring
tests.
The doctest was referencing a task that was never used in the tasks and
also did not reference `task_a`. So `task_b` was renamed to `task_b` to
solve these issues.
@agoscinski agoscinski changed the title Moving the canonical class to parsing and core modules Move the canonical class to parsing and core modules Jan 31, 2025
@agoscinski agoscinski merged commit 99e779d into main Jan 31, 2025
6 checks passed
@agoscinski agoscinski deleted the move-to-core branch January 31, 2025 10:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants