
Refactor config #118

Merged
merged 8 commits into main on Jun 26, 2024
Conversation


@geoo89 geoo89 commented Apr 16, 2024

Refactor the config so that the sequence of pipeline steps can be customized.

Data sources factored out of pipeline steps.

In parallel, here's an example of a refactored config that this code can be tested on (if both config.json and config.py are present, the former is used): https://github.com/IDEMSInternational/parenttext-crisis/tree/pipeline-refactor

This PR uses setuptools_scm for the pipeline to determine its own version from git tags. The main tool checks whether the config version (reasonably) matches the pipeline version. The example config has version 1.0.0; in order to have the pipeline version match, after pulling this branch, open a terminal in the root folder and proceed as follows:

git tag -a 1.0.0 -m "test version"
python3 -m pip install -e .

You can then run the pipeline:

python3 -m parenttext_pipeline pull_data compile_flows
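For reference, a package versioned with setuptools_scm can read its own tag-derived version at runtime via the standard library. This is a minimal sketch, assuming the package is installed (editable or otherwise) under the distribution name `parenttext_pipeline`; the fallback value is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Version is derived from the git tag by setuptools_scm at install time
    PIPELINE_VERSION = version("parenttext_pipeline")
except PackageNotFoundError:
    # Fallback when running from a checkout that has not been pip-installed
    PIPELINE_VERSION = "0.0.0"
```

This is why the `git tag` plus `pip install -e .` steps above are needed: without a tag, setuptools_scm derives a development version that will not match the config's 1.0.0.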

@dataclass(kw_only=True)
class CreateFlowsStepConfig(StepConfig):
# Name of the Python module containing data models describing the data sheets
models_module: str = None
Collaborator Author:

Do we possibly want to store the models with the sources rather than the steps, so that, if necessary, we can run sanity checks already when pulling? Or even convert to nested representations? We would then still need meta-information associating models with sheets, which is not straightforward. In RPFT this is done via the content index: certain types are known, and data sheets specifically have their model specified in a column.

Collaborator:

We should move model definitions from Python into the spreadsheets. I think this is the only option really worth considering.

Collaborator Author:

That makes sense. You won't have access to the full palette of features that way (e.g. aliases), but I don't see a problem with that, as those are only used by the very specific pre-defined models for flows, the content index, etc. I also want to implement auto-inference at some point.

In that case, I guess I won't worry about working out inheritance of models. We'll just assume there's a global model file with everything until we don't need that anymore (and can use sheets/jsons/auto-inference).

Collaborator Author:

I assume we probably want to specify the models within repo where we invoke the pipeline, rather than globally within the pipeline?

We'll need to devise a way of merging multiple models files.

Collaborator Author:

Should this file be split depending on the type of config?

Collaborator:

I think they should all be in the same module because they are all related to the same config file.

print("Result written to output folder")


STEP_MAPPING = {
Collaborator Author:

I use this pattern a few times; maybe there's a better way of doing this.

Collaborator:

If all the steps are in the steps module, `getattr(steps, name)` would suffice.

Otherwise, it may be worth using function decorators.

STEP_MAPPING = dict()

def step(func):
    STEP_MAPPING[func.__name__] = func
    return func

def get_step(name):
    return STEP_MAPPING[name]

@step
def create_flows(...):
    ...

config = load_config()

config_pipeline_version = config.meta["pipeline_version"]
if config_pipeline_version != PIPELINE_VERSION:
Collaborator Author:

Pipeline version should ideally be guaranteed to be in sync with a GitHub tag? Not sure how to do that.

Also, this is probably a bit too strict. Maybe we could have versioning so that if only the last number changes, we assume backward compatibility. So major and minor must be the same, and we must have revision (config) ≤ revision (pipeline)?
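The rule being proposed here could be sketched as follows. This is illustrative only, not the PR's actual check; it assumes plain `major.minor.patch` version strings with no setuptools_scm dev/local suffixes.

```python
def versions_compatible(config_version: str, pipeline_version: str) -> bool:
    """Major and minor must match; the config's revision may not exceed the pipeline's."""
    c_major, c_minor, c_patch = (int(x) for x in config_version.split("."))
    p_major, p_minor, p_patch = (int(x) for x in pipeline_version.split("."))
    return (c_major, c_minor) == (p_major, p_minor) and c_patch <= p_patch

versions_compatible("1.0.0", "1.0.3")  # compatible: only the revision differs
versions_compatible("1.1.0", "1.0.3")  # incompatible: minor versions differ
```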

Collaborator:

Minor releases should be backwards-compatible as well. If the config file states v1.2, but the pipeline is v1.1 it should still be able to run to completion, although this may not be what the user intended, so a warning may be more appropriate than stopping execution.

I have tried using setuptools-scm to version the goals API based on tags, so you may want to try that as well.

Collaborator Author:

Thanks, I'll look into that.

@@ -29,9 +29,9 @@ classifiers = [
]
dependencies = [
"beautifulsoup4~=4.12",
Collaborator Author:

Should we make tags so we don't need to reference commits?

Collaborator:

I have no preference.


@dataclass(kw_only=True)
class Config:
meta: dict
Collaborator Author:

I envision this containing a pipeline_version field, and probably a version field.

This version should probably be stored somewhere in the output folder.

When pulling data, we should also store the timestamp (as a version ref), and store that for the output as well? Then we know which input version and config version were used, and can detect mismatches if someone committed config/input/output that were out of sync.
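The mismatch-detection idea above could look roughly like this. A hypothetical sketch only: the function name, field names, and `meta.json` filename are all illustrative, not part of this PR.

```python
import json
import os
from datetime import datetime, timezone


def write_run_metadata(output_folder, config_version, pipeline_version, pull_timestamp):
    """Record which config, pipeline, and data-pull versions produced this output."""
    meta = {
        "config_version": config_version,
        "pipeline_version": pipeline_version,
        "data_pull_timestamp": pull_timestamp,  # timestamp recorded when pull_data ran
        "compiled_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(os.path.join(output_folder, "meta.json"), "w") as f:
        json.dump(meta, f, indent=2)
```

A later run (or a CI check) could then compare the recorded versions against the committed config and flag out-of-sync artifacts.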

Collaborator:

OK.

@dataclass(kw_only=True)
class Config:
meta: dict
parents: list[dict] = None
Collaborator Author:

I'm envisioning that steps of the pipeline may reference sources from parent repos (recursively). Of course, this means that parent step config is not inherited (if this causes a lot of duplication, maybe we could consider mechanisms where steps can also inherit from parent steps of the same name). Any other suggestions for how inheritance could work?

If we go with this, I think either a config should have a single parent (whose sources can be referenced via parent.{source}), or a dict (rather than a list) of parents (so that we can reference sources via {key}.{source}).
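For concreteness, the dict-of-parents variant might look something like this in config.json. This is purely a hypothetical shape; the keys, the parent name `base`, the URL, and the source names are all made up for illustration.

```json
{
  "meta": { "pipeline_version": "1.0.0" },
  "parents": {
    "base": { "location": "https://example.org/parent-repo" }
  },
  "sources": {
    "local_sheets": { "files": ["sheets/deployment_specific.xlsx"] }
  },
  "steps": [
    { "type": "create_flows", "sources": ["base.flow_sheets", "local_sheets"] }
  ]
}
```

Here a step references a parent's source via the `{key}.{source}` form (`base.flow_sheets`) alongside a locally defined one.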

Collaborator:

I was thinking that sources would be a single list containing all of the spreadsheets required to create the flows. It could include spreadsheets from other repositories. The order in which the spreadsheets appear in the list would dictate how sheets within them might be merged or replaced. If we then push as much information as possible into the spreadsheets, it would all be subject to the same rules regarding inheritance.

I used to think that it would be useful to be able to explicitly configure which steps are included in the pipeline, but over time I've come to realise that the steps are pretty much the same for every deployment, and that a step can decide whether it should run or not based on whether it has sufficient information (in config or spreadsheets) to do its thing.

Collaborator Author:

In the long run, once everything is in the same basic format and the toolkit takes care of all the steps, that sounds sensible. Right now I still see a need for modularization of the sources, though, as we need to support different types (for some steps), and having a grouping mechanism also seems useful to avoid replicating full lists of sheets from the parent. (Now that makes me wonder if we should also have a grouping mechanism for steps, to avoid duplicating sequences.) The merge override strategy I'm implementing is the same as you suggest: later sheets overwrite earlier ones.

Regarding configurable steps, good point. I've seen some instances where steps are disabled by passing blank (but valid) data, but that could be done in a cleaner way, as you say. Configuration of the steps may be useful going forward though, to maintain backward compatibility more easily when new steps are introduced or steps are refactored/replaced with different tooling.
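The "later sheets overwrite earlier ones" strategy described above reduces to a simple last-wins merge. A minimal sketch, representing each source's sheets as a dict keyed by sheet name (the representation is illustrative, not the pipeline's actual data structure):

```python
def merge_sheets(sheet_lists):
    """Merge sheets from ordered sources; later sources overwrite same-named sheets."""
    merged = {}
    for sheets in sheet_lists:  # earlier sources first, later sources last
        merged.update(sheets)   # later entries replace same-named earlier ones
    return merged

merge_sheets([{"intro": "v1", "faq": "v1"}, {"intro": "v2"}])
# → {"intro": "v2", "faq": "v1"}
```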

@geoo89 geoo89 force-pushed the refactor-config branch from 8ff0c18 to 09c6862 Compare May 16, 2024 15:08

geoo89 commented May 16, 2024

This has been tested by running the pipeline for parenttext-crisis using version 0.2.1 (HEAD with a dependency bump; without it the pipeline would crash, as the sheets were using features that were only added in later versions of rpft) and using version 1.0.0 (with the same old config format). The outputs were identical, modulo UUIDs.

@geoo89 geoo89 force-pushed the refactor-config branch from 5abad07 to f12d241 Compare June 21, 2024 10:48

geoo89 commented Jun 21, 2024

This PR should close automatically once #126 is merged.

Please make any further amendments to #126.

@geoo89 geoo89 merged commit f12d241 into main Jun 26, 2024
1 check passed
@geoo89 geoo89 deleted the refactor-config branch July 10, 2024 06:48