Refactor config #118
Conversation
@dataclass(kw_only=True)
class CreateFlowsStepConfig(StepConfig):
    # Name of the Python module containing data models describing the data sheets
    models_module: str = None
Do we possibly want to store the models with the sources rather than the steps, so that if necessary, we can do sanity checks already when pulling? Or even convert to nested representations? We then still need meta-information associating models to sheets, which is not straightforward. In RPFT this is done via the content index: certain types are known, and data sheets specifically have their model specified in a column.
We should move model definitions from Python into the spreadsheets. I think this is the only option really worth considering.
That makes sense. (You won't have access to the full palette of features that way, e.g. aliases, but I don't see a problem with that, as these are only used by the very specific pre-defined models for flows/content index etc.) I also want to implement auto-inference at some point.
In that case, I guess I won't worry about working out inheritance of models. We'll just assume there's a global model file with everything until we don't need that anymore (and can use sheets/jsons/auto-inference).
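For context, a minimal sketch of how a models_module string could be resolved to the model definitions it names; this is only an illustration of the config field, not necessarily how the pipeline implements it:

import importlib

def load_models(models_module: str):
    # Import the Python module that contains the data models for the sheets,
    # e.g. models_module = "myproject.models" (hypothetical name).
    return importlib.import_module(models_module)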
I assume we probably want to specify the models within the repo where we invoke the pipeline, rather than globally within the pipeline?
We'll need to devise a way of merging multiple models files.
Should this file be split depending on the type of config?
I think they should all be in the same module because they are all related to the same config file.
print("Result written to output folder") | ||
|
||
|
||
STEP_MAPPING = { |
I use this pattern a few times, maybe there's a better way of doing this.
If all the steps are in the steps module: getattr(steps, name). Otherwise, it may be worth using function decorators:

STEP_MAPPING = dict()

def step(func):
    STEP_MAPPING[func.__name__] = func
    return func

def get_step(name):
    return STEP_MAPPING[name]

@step
def create_flows(...):
    ...
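With a registry like that in place, looking a step up by name is just a dict access, e.g. (hypothetical step name):

step_func = get_step("create_flows")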
src/parenttext_pipeline/cli.py
config = load_config()

config_pipeline_version = config.meta["pipeline_version"]
if config_pipeline_version != PIPELINE_VERSION:
Pipeline version should ideally be guaranteed to be in sync with a github tag? Not sure how to do that.
Also, this is probably a bit too strict. Maybe we could have versioning such that if only the last number changes, we assume backward compatibility. So major and minor must be the same, and we must have revision (config) < revision (pipeline)?
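A minimal sketch of that rule, interpreting it as also allowing equal revisions (an assumption on my part) and assuming plain major.minor.revision version strings:

def config_is_compatible(config_version: str, pipeline_version: str) -> bool:
    # Major and minor must match; the config's revision must not be newer
    # than the pipeline's revision.
    c_major, c_minor, c_rev = (int(x) for x in config_version.split("."))
    p_major, p_minor, p_rev = (int(x) for x in pipeline_version.split("."))
    return (c_major, c_minor) == (p_major, p_minor) and c_rev <= p_rev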
Minor releases should be backwards-compatible as well. If the config file states v1.2, but the pipeline is v1.1 it should still be able to run to completion, although this may not be what the user intended, so a warning may be more appropriate than stopping execution.
I have tried using setuptools-scm to version the goals api based on tags, so you may want to try that also.
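If setuptools-scm is used, one common way (an assumption here, not necessarily how the goals api does it) to expose the tag-derived version at runtime is to read it back from the installed package metadata:

from importlib.metadata import version

# The distribution name is assumed; adjust to whatever the package is published as.
PIPELINE_VERSION = version("parenttext_pipeline")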
Thanks, I'll look into that.
@@ -29,9 +29,9 @@ classifiers = [
]
dependencies = [
    "beautifulsoup4~=4.12",
Make tags so we don't need to reference commits?
I have no preference.
@dataclass(kw_only=True)
class Config:
    meta: dict
I envision this to contain a pipeline_version field, and probably a version field.
This version should probably be stored somewhere in the output folder.
When pulling data, we should also store the timestamp (as a version ref), and store that as well for the output? Then, we know which input version and config version was used, and can detect mismatches if someone committed config/input/output that were out of sync.
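For illustration, the meta block could then look something like the following; the field names other than pipeline_version and version are made up here, not a decided schema:

meta = {
    "pipeline_version": "1.0.0",  # version of the pipeline this config is written for
    "version": "1.0.0",           # version of the config itself
    # A pull timestamp could also be recorded and copied into the output, so that
    # mismatched config/input/output can be detected later, e.g.:
    # "pulled_at": "2023-11-01T12:00:00Z",
}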
ok.
@dataclass(kw_only=True)
class Config:
    meta: dict
    parents: list[dict] = None
I'm envisioning that steps of the pipeline may reference sources from parent repos (recursively). Of course, this means that parent step config is not inherited (if this causes a lot of duplication, maybe we could consider defining mechanisms where steps can also inherit from parent steps of the same name). Any other suggestions for how inheritance could work?
If we go with this, I think either a config should have a single parent (whose sources can be referenced via parent.{source}), or a dict (rather than a list) of parents (so that we can reference sources via {key}.{source}).
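To illustrate the dict-of-parents variant, here is a hypothetical config fragment (all names and fields are made up, not the actual schema):

config = {
    "parents": {
        # Key "base" lets steps refer to the parent's sources as "base.<source>"
        "base": {"location": "https://example.org/parent-repo.git"},
    },
    "sources": {
        "flow_sheets": {"files": ["sheets/flows.xlsx"]},
    },
    "steps": [
        # A step using a local source plus a source inherited from "base"
        {"name": "create_flows", "sources": ["flow_sheets", "base.flow_sheets"]},
    ],
}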
I was thinking that sources would be a single list containing all of the spreadsheets required to create the flows. It could include spreadsheets from other repositories. The order in which the spreadsheets appear in the list would dictate how sheets within them might be merged or replaced. If we then push as much information as possible into the spreadsheets, it would all be subject to the same rules regarding inheritance.
I used to think that it would be useful to be able to explicitly configure which steps are included in the pipeline, but over time I've come to realise that the steps are pretty much the same for every deployment, and that a step can decide whether it should run or not based on whether it has sufficient information (in config or spreadsheets) to do its thing.
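As a sketch of that idea (hypothetical names, not an existing API), a step could simply skip itself when its inputs are missing:

def run_safeguarding_step(config, sources):
    # Hypothetical example: skip cleanly when nothing is configured for this step.
    if not sources.get("safeguarding_words"):
        print("Skipping safeguarding step: no input configured")
        return
    ...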
In the long run, once everything is in the same basic format and the toolkit takes care of all the steps, that sounds sensible. Right now I still see a need for modularization of the sources though, as we need to support different types (for some steps), and having a grouping mechanism also seems useful to avoid replicating full lists of sheets from the parent. (Now that makes me wonder if we should also have a grouping mechanism for steps, to avoid duplicating sequences.) The merge/override strategy I'm implementing is the same as you suggest: later sheets overwrite earlier ones.
Regarding configurable steps, good point. I've seen some instances where steps are disabled by passing blank (but valid) data, but that could be done in a cleaner way, as you say. Configuration of the steps may be useful going forward though, to maintain backward compatibility more easily when new steps are introduced or steps are refactored/replaced with different tooling.
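For clarity, a minimal sketch of the merge/override strategy mentioned above, assuming sheets are keyed by name (that keying is an assumption, not part of this PR):

def merge_sheet_collections(collections):
    # Earlier collections first; sheets from later collections overwrite
    # earlier ones that have the same name.
    merged = {}
    for sheets in collections:
        merged.update(sheets)
    return merged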
Config format is refactored so that entries reflect which pipeline steps they belong to. Pipeline steps are still hardcoded.
This has been tested by running the pipeline for parenttext-crisis using version 0.2.1 (HEAD with dependency bump: the pipeline would crash as the sheets were using features that were only added in a later version of rpft) and using version 1.0.0 (using the same old config format). The outputs were identical, modulo UUIDs.
Refactor the config so that the sequence of pipeline steps can be customized.
Data sources factored out of pipeline steps.
In parallel, here's an example of a refactored config that this code can be tested on (if both config.json and config.py are present, the former is used): https://github.com/IDEMSInternational/parenttext-crisis/tree/pipeline-refactor
This PR uses setuptools_scm for the pipeline to determine its own version from git tags. The main tool checks whether the config version (reasonably) matches the pipeline version. The example config has version 1.0.0; in order to have the pipeline version match, after pulling this branch, open a terminal in the root folder and proceed as follows:
You can then run the pipeline: