
Refactor config #118

Merged
merged 8 commits into main on Jun 26, 2024
Conversation


@geoo89 geoo89 commented Apr 16, 2024

Refactor the config so that the sequence of pipeline steps can be customized.

Data sources factored out of pipeline steps.

In parallel, here's an example of a refactored config that this code can be tested on (if both config.json and config.py are present, the former is used): https://github.com/IDEMSInternational/parenttext-crisis/tree/pipeline-refactor

This PR uses setuptools_scm for the pipeline to determine its own version from git tags. The main tool checks whether the config version (reasonably) matches the pipeline version. The example config has version 1.0.0; in order to have the pipeline version match, after pulling this branch, open a terminal in the root folder and proceed as follows:

git tag -a 1.0.0 -m "test version"
python3 -m pip install -e .

You can then run the pipeline:

python3 -m parenttext_pipeline pull_data compile_flows
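For reference, a package versioned with setuptools_scm can read its own tag-derived version at runtime via the standard library. This is a minimal sketch, assuming the package is installed (editable or otherwise) under the distribution name `parenttext_pipeline`; the fallback value is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Version is derived from the git tag by setuptools_scm at install time
    PIPELINE_VERSION = version("parenttext_pipeline")
except PackageNotFoundError:
    # Fallback when running from a checkout that has not been pip-installed
    PIPELINE_VERSION = "0.0.0"
```

This is why the `git tag` plus `pip install -e .` steps above are needed: without a tag, setuptools_scm derives a development version that will not match the config's 1.0.0.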

@dataclass(kw_only=True)
class CreateFlowsStepConfig(StepConfig):
# Name of the Python module containing data models describing the data sheets
models_module: str = None
Collaborator Author:

Do we possibly want to store the models with the sources rather than the steps, so that, if necessary, we can run sanity checks already when pulling? Or even convert to nested representations? We would then still need meta-information associating models with sheets, which is not straightforward. In RPFT this is done via the content index: certain types are known, and data sheets specifically have their model specified in a column.

Collaborator:

We should move model definitions from Python into the spreadsheets. I think this is the only option really worth considering.

Collaborator Author:

That makes sense. You won't have access to the full palette of features that way (e.g. aliases), but I don't see a problem with that, as those are only used by the very specific pre-defined models for flows, the content index, etc. I also want to implement auto-inference at some point.

In that case, I guess I won't worry about working out inheritance of models. We'll just assume there's a global model file with everything until we don't need that anymore (and can use sheets/jsons/auto-inference).

Collaborator Author:

I assume we probably want to specify the models within repo where we invoke the pipeline, rather than globally within the pipeline?

We'll need to devise a way of merging multiple models files.

Collaborator Author:

Should this file be split depending on the type of config?

Collaborator:

I think they should all be in the same module because they are all related to the same config file.

print("Result written to output folder")


STEP_MAPPING = {
Collaborator Author:

I use this pattern a few times; maybe there's a better way of doing this.

Collaborator:

If all the steps are in the steps module, `getattr(steps, name)` would suffice.

Otherwise, it may be worth using function decorators.

STEP_MAPPING = dict()

def step(func):
    STEP_MAPPING[func.__name__] = func
    return func

def get_step(name):
    return STEP_MAPPING[name]

@step
def create_flows(...):
    ...

config = load_config()

config_pipeline_version = config.meta["pipeline_version"]
if config_pipeline_version != PIPELINE_VERSION:
Collaborator Author:

Pipeline version should ideally be guaranteed to be in sync with a GitHub tag? Not sure how to do that.

Also, this is probably a bit too strict. Maybe we could have versioning so that if only the last number changes, we assume backward compatibility. So major and minor must be the same, and we must have revision (config) ≤ revision (pipeline)?
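The rule being proposed here could be sketched as follows. This is illustrative only, not the PR's actual check; it assumes plain `major.minor.patch` version strings with no setuptools_scm dev/local suffixes.

```python
def versions_compatible(config_version: str, pipeline_version: str) -> bool:
    """Major and minor must match; the config's revision may not exceed the pipeline's."""
    c_major, c_minor, c_patch = (int(x) for x in config_version.split("."))
    p_major, p_minor, p_patch = (int(x) for x in pipeline_version.split("."))
    return (c_major, c_minor) == (p_major, p_minor) and c_patch <= p_patch

versions_compatible("1.0.0", "1.0.3")  # compatible: only the revision differs
versions_compatible("1.1.0", "1.0.3")  # incompatible: minor versions differ
```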

Collaborator:

Minor releases should be backwards-compatible as well. If the config file states v1.2, but the pipeline is v1.1 it should still be able to run to completion, although this may not be what the user intended, so a warning may be more appropriate than stopping execution.

I have tried using setuptools-scm to version the goals API based on tags, so you may want to try that as well.

Collaborator Author:

Thanks, I'll look into that.

@@ -29,9 +29,9 @@ classifiers = [
]
dependencies = [
"beautifulsoup4~=4.12",
Collaborator Author:

Should we make tags so we don't need to reference commits?

Collaborator:

I have no preference.


@dataclass(kw_only=True)
class Config:
meta: dict
Collaborator Author:

I envision this containing a pipeline_version field, and probably a version field.

This version should probably be stored somewhere in the output folder.

When pulling data, we should also store the timestamp (as a version ref), and store that for the output as well? Then we know which input version and config version were used, and can detect mismatches if someone committed config/input/output that were out of sync.
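The mismatch-detection idea above could look roughly like this. A hypothetical sketch only: the function name, field names, and `meta.json` filename are all illustrative, not part of this PR.

```python
import json
import os
from datetime import datetime, timezone


def write_run_metadata(output_folder, config_version, pipeline_version, pull_timestamp):
    """Record which config, pipeline, and data-pull versions produced this output."""
    meta = {
        "config_version": config_version,
        "pipeline_version": pipeline_version,
        "data_pull_timestamp": pull_timestamp,  # timestamp recorded when pull_data ran
        "compiled_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(os.path.join(output_folder, "meta.json"), "w") as f:
        json.dump(meta, f, indent=2)
```

A later run (or a CI check) could then compare the recorded versions against the committed config and flag out-of-sync artifacts.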

Collaborator:

OK.

@dataclass(kw_only=True)
class Config:
meta: dict
parents: list[dict] = None
Collaborator Author:

I'm envisioning that steps of the pipeline may reference sources from parent repos (recursively). Of course, this means that parent step config is not inherited (if this causes a lot of duplication, maybe we could consider mechanisms where steps can also inherit from parent steps of the same name). Any other suggestions for how inheritance could work?

If we go with this, I think either a config should have a single parent (whose sources can be referenced via parent.{source}), or a dict (rather than a list) of parents (so that we can reference sources via {key}.{source}).
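For concreteness, the dict-of-parents variant might look something like this in config.json. This is purely a hypothetical shape; the keys, the parent name `base`, the URL, and the source names are all made up for illustration.

```json
{
  "meta": { "pipeline_version": "1.0.0" },
  "parents": {
    "base": { "location": "https://example.org/parent-repo" }
  },
  "sources": {
    "local_sheets": { "files": ["sheets/deployment_specific.xlsx"] }
  },
  "steps": [
    { "type": "create_flows", "sources": ["base.flow_sheets", "local_sheets"] }
  ]
}
```

Here a step references a parent's source via the `{key}.{source}` form (`base.flow_sheets`) alongside a locally defined one.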

Collaborator:

I was thinking that sources would be a single list containing all of the spreadsheets required to create the flows. It could include spreadsheets from other repositories. The order in which the spreadsheets appear in the list would dictate how sheets within them might be merged or replaced. If we then push as much information as possible into the spreadsheets, it would all be subject to the same rules regarding inheritance.

I used to think that it would be useful to be able to explicitly configure which steps are included in the pipeline, but over time I've come to realise that the steps are pretty much the same for every deployment, and that a step can decide whether it should run or not based on whether it has sufficient information (in config or spreadsheets) to do its thing.

Collaborator Author:

In the long run, once everything is in the same basic format and the toolkit takes care of all the steps, that sounds sensible. Right now I still see a need for modularization of the sources, though, as we need to support different types (for some steps), and having a grouping mechanism also seems useful to avoid replicating full lists of sheets from the parent. (Now that makes me wonder if we should also have a grouping mechanism for steps, to avoid duplicating sequences.) The merge override strategy I'm implementing is the same as you suggest: later sheets overwrite earlier ones.

Regarding configurable steps, good point. I've seen some instances where steps are disabled by passing blank (but valid) data, but that could be done in a cleaner way, as you say. Configuration of the steps may be useful going forward though, to maintain backward compatibility more easily when new steps are introduced or steps are refactored/replaced with different tooling.
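The "later sheets overwrite earlier ones" strategy described above reduces to a simple last-wins merge. A minimal sketch, representing each source's sheets as a dict keyed by sheet name (the representation is illustrative, not the pipeline's actual data structure):

```python
def merge_sheets(sheet_lists):
    """Merge sheets from ordered sources; later sources overwrite same-named sheets."""
    merged = {}
    for sheets in sheet_lists:  # earlier sources first, later sources last
        merged.update(sheets)   # later entries replace same-named earlier ones
    return merged

merge_sheets([{"intro": "v1", "faq": "v1"}, {"intro": "v2"}])
# → {"intro": "v2", "faq": "v1"}
```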

@geoo89 geoo89 force-pushed the refactor-config branch from 8ff0c18 to 09c6862 Compare May 16, 2024 15:08

geoo89 commented May 16, 2024

This has been tested by running the pipeline for parenttext-crisis using version 0.2.1 (HEAD with a dependency bump; without it the pipeline would crash, as the sheets were using features that were only added in later versions of rpft) and using version 1.0.0 (with the same old config format). The outputs were identical, modulo UUIDs.

@geoo89 geoo89 force-pushed the refactor-config branch from 5abad07 to f12d241 Compare June 21, 2024 10:48

geoo89 commented Jun 21, 2024

This PR should close automatically once #126 is merged.

Please make any further amendments to #126.

@geoo89 geoo89 merged commit f12d241 into main Jun 26, 2024
1 check passed
@geoo89 geoo89 deleted the refactor-config branch July 10, 2024 06:48