Support for parameterized sequential pipelines #10627

Open
henrypickler opened this issue Nov 21, 2024 · 1 comment

Comments

@henrypickler

I want to parameterize how many times a stage is repeated, where each repetition depends on the one before it. For example, consider the list [0, 0.2, 0.5, 0.75] and a held-out dataset. I want a pipeline that does the following:

  • Start: train an initial model_0
  • re-train@0: process the first 20% of the held-out dataset with model_0 and re-train a new model, model_20, including the newly processed samples.
  • re-train@1: process the next 30% of the held out data with model_20 and re-train model_50
  • re-train@2: process the next 25% with model_50 and re-train model_75

Ideally, I want to be able to change the list to a different size, for example [0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95], which would define re-train@0 through re-train@5. Better still, it could then reuse the cached model_0 and model_20 (model_50 and model_75 are different now because they depend on model_40).
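To make the arithmetic above concrete, here is a minimal Python sketch of how the per-step fractions and the reusable models follow from the list. The helper names `increments` and `reusable_prefix` are made up for illustration, and the reuse rule assumes a model depends only on the cutoffs that come before it in the list.

```python
def increments(cutoffs):
    """Fraction of the held-out data processed at each re-train step."""
    # Rounding only hides floating-point noise in the differences.
    return [round(b - a, 10) for a, b in zip(cutoffs, cutoffs[1:])]


def reusable_prefix(old, new):
    """Cutoffs whose models stay valid: the shared leading run of both lists."""
    shared = []
    for a, b in zip(old, new):
        if a != b:
            break
        shared.append(a)
    return shared


old = [0, 0.2, 0.5, 0.75]
new = [0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95]

print(increments(old))            # [0.2, 0.3, 0.25]
print(reusable_prefix(old, new))  # [0, 0.2]
```

With the two lists from the example, only the models for cutoffs 0 and 0.2 are shared, which matches the expectation that model_0 and model_20 can come from the cache.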

I tried to define the stage with foreach. However, each iteration needs to reference the previous iteration's output as a dependency, and foreach cannot express that. If something like this were possible:

re-train:
    foreach: [0,0.2,0.5,0.75]
    do:
        cmd: python train.py --reference-model=model_${prev_item} --output-model=model_${item}
        deps: [model_${prev_item}]
        outs: [model_${item}]

then it would be fairly easy to chain the stages. However, AFAIK this is not possible, so my workaround is to use a list of objects defined in vars, such as:

re-trains:
  - {curr: 0.2, prev: 0}
  - {curr: 0.5, prev: 0.2}
  - {curr: 0.75, prev: 0.5}

I then reference ${item.curr} and ${item.prev}. However, this is error-prone (a wrong prev silently gives weird results, with no warning) and a bit of a hassle to maintain.
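One way to reduce the risk of a wrong prev is to derive the pairs from the flat list instead of writing them by hand, or at least to check a hand-written list before using it. The sketch below does both; `chain_pairs` and `check_chain` are hypothetical helper names, not part of DVC.

```python
def chain_pairs(cutoffs):
    """Build the {curr, prev} objects from a flat list of cutoffs."""
    return [{"curr": c, "prev": p} for p, c in zip(cutoffs, cutoffs[1:])]


def check_chain(pairs, start=0):
    """Raise if any 'prev' does not match the preceding 'curr'."""
    expected = start
    for i, pair in enumerate(pairs):
        if pair["prev"] != expected:
            raise ValueError(
                f"entry {i}: prev is {pair['prev']}, expected {expected}"
            )
        expected = pair["curr"]


pairs = chain_pairs([0, 0.2, 0.5, 0.75])
check_chain(pairs)  # passes silently for a consistent chain
print(pairs)
```

The generated list could be pasted into the vars section, or the check could run as a pre-step so that a broken chain fails loudly instead of producing odd results.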

I use dbt very frequently, so I think Jinja2 templating could be a good tool for cases like this. For example, my situation would be solved by something like this:

{% set stages = [0.2, 0.5, 0.75] %}

train:
  cmd: python train.py --output-model=model_0
  outs: [model_0]

{% for stage in stages %}
re-train@{{ loop.index0 }}:
    {% set input_model = 'model_0' if loop.first else 'model_' ~ stages[loop.index0 - 1] | replace(".", "_") %}
    {% set output_model = 'model_' ~ stage | replace(".", "_") %}
    cmd: python train.py --reference-model={{ input_model }} --output-model={{ output_model }}
    deps:
      - {{ input_model }}
    outs:
      - {{ output_model }}
{% endfor %}

Putting it in a template renderer gives:

Rendered output

  train:
    cmd: python train.py --output-model=model_0
    outs: [model_0]


  re-train@0:
      cmd: python train.py --reference-model=model_0 --output-model=model_0_2
      deps:
        - model_0
      outs:
        - model_0_2

  re-train@1:
      cmd: python train.py --reference-model=model_0_2 --output-model=model_0_5
      deps:
        - model_0_2
      outs:
        - model_0_5

  re-train@2:
      cmd: python train.py --reference-model=model_0_5 --output-model=model_0_75
      deps:
        - model_0_5
      outs:
        - model_0_75
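Until templating like this exists, the same chain can be generated outside DVC. The following is a rough sketch in plain Python that builds the stage definitions as a dictionary, with each re-train stage depending on the previous model and producing the next one. The function names are hypothetical, the stage names are copied from the template above (whether DVC accepts `@` in hand-written stage names is not checked here), and writing the result to dvc.yaml would need a YAML library, which is left out.

```python
import json


def model_name(cutoff):
    """Mirror the template's naming: 0.2 -> 'model_0_2'."""
    return "model_" + str(cutoff).replace(".", "_")


def build_stages(cutoffs):
    """Chain one re-train stage per cutoff, starting from model_0."""
    stages = {
        "train": {
            "cmd": "python train.py --output-model=model_0",
            "outs": ["model_0"],
        }
    }
    prev = "model_0"
    for i, cutoff in enumerate(cutoffs):
        out = model_name(cutoff)
        stages[f"re-train@{i}"] = {
            "cmd": f"python train.py --reference-model={prev} --output-model={out}",
            "deps": [prev],   # the previous model is the input
            "outs": [out],    # the newly trained model is the output
        }
        prev = out
    return {"stages": stages}


pipeline = build_stages([0.2, 0.5, 0.75])
print(json.dumps(pipeline, indent=2))
```

Because prev is carried through the loop, changing the list automatically rewires every dependency, which is the property the hand-written prev/curr pairs lack.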

I searched the repo for jinja2, and it seems it has been considered before (and deemed too weird/ugly which, honestly, I agree with, especially for beginners). Drawing inspiration from it, though, another approach would be to allow arithmetic in DVC string interpolation and to expose more loop values, for example an idx, which would enable something like:

vars:
    - retrains: [0.2,0.5,0.75]

train:
    cmd: python train.py --output-model=model_0
    outs: [model_0]

re-train:
    foreach: ${retrains}
    do:
        cmd: python train.py --reference-model=model_${idx} --output-model=model_${idx+1}
        deps: [model_${idx}]
        outs: [model_${idx+1}]

which is much cleaner.
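To illustrate how the proposed syntax could behave, here is a small Python sketch that expands `${idx}` and `${idx+N}` in a foreach body. This emulates the proposal only: DVC does not currently provide idx or arithmetic in interpolation, and the `expand` and `expand_foreach` names are invented for this example.

```python
import re

# Matches ${idx} or ${idx+N}, where N is a non-negative integer.
_IDX = re.compile(r"\$\{idx(?:\+(\d+))?\}")


def expand(template, idx):
    """Substitute the loop index (plus an optional offset) into a string."""
    return _IDX.sub(lambda m: str(idx + int(m.group(1) or 0)), template)


def expand_foreach(items, do):
    """Produce one expanded stage body per item in the foreach list."""
    stages = []
    for idx, _item in enumerate(items):
        stage = {}
        for key, value in do.items():
            if isinstance(value, list):
                stage[key] = [expand(v, idx) for v in value]
            else:
                stage[key] = expand(value, idx)
        stages.append(stage)
    return stages


do = {
    "cmd": "python train.py --reference-model=model_${idx} --output-model=model_${idx+1}",
    "deps": ["model_${idx}"],
    "outs": ["model_${idx+1}"],
}

for stage in expand_foreach([0.2, 0.5, 0.75], do):
    print(stage)
```

With index-based names, the first iteration reads model_0 from the train stage and each later iteration reads the model written by the one before it, so the chain needs no explicit prev value.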

@PythonFZ
Contributor

This might not be exactly what you want, but with ZnTrack you can construct a dvc.yaml dynamically and so achieve what you are looking for.

Disclaimer: I am the author of ZnTrack.

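import zntrack

# Placeholder node; the training logic is elided here.
class YourModelTrainingCls(zntrack.Node): ...

project = zntrack.Project()

# Nodes instantiated inside the project context become part of the graph.
with project:
    model = YourModelTrainingCls(**kwargs)

project.repro()

# Placeholder condition: inspect a metric of the trained model and,
# if needed, add further nodes and reproduce again.
if model.?<metric>:
    with project:
        ...

    project.repro()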
