Support for parameterized sequential pipelines #10627

henrypickler · 2024-11-21T13:00:04Z

I want to parameterize how many times to repeat a stage which depends on previous stages. For example, consider the list [0, 0.2, 0.5, 0.75] and a held-out dataset. I want to have a pipeline that does the following:

Start: train an initial model_0
re-train@0: process the first 20% of the held-out dataset with model_0 and re-train a new model, model_20, that includes the newly processed samples.
re-train@1: process the next 30% of the held-out data with model_20 and re-train to get model_50.
re-train@2: process the next 25% with model_50 and re-train to get model_75.
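To make the slicing in the steps above concrete, here is a minimal sketch of how the cumulative fractions could be turned into index ranges over the held-out set. The function name `slice_bounds` and the dataset size of 1000 are assumptions for illustration only; nothing here is part of DVC.

```python
def slice_bounds(fractions, n_samples):
    """Return one (start, stop) index pair per re-train step.

    `fractions` are cumulative cut points such as [0, 0.2, 0.5, 0.75].
    """
    if list(fractions) != sorted(fractions):
        raise ValueError("fractions must be in ascending order")
    if not all(0 <= f <= 1 for f in fractions):
        raise ValueError("fractions must lie between 0 and 1")
    cuts = [round(f * n_samples) for f in fractions]
    # Pair each cut point with the next one: (0, 200), (200, 500), ...
    return list(zip(cuts[:-1], cuts[1:]))


bounds = slice_bounds([0, 0.2, 0.5, 0.75], n_samples=1000)
print(bounds)  # [(0, 200), (200, 500), (500, 750)]
```

Each pair corresponds to one re-train step: the first covers 20% of the data, the second the next 30%, and the third the next 25%.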

Ideally, I could change the list to a different length, for example [0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95], which would define re-train@0 through re-train@5. Better still, the pipeline could then re-use the cached model_0 and model_20 (model_50 and model_75 would be different now because they depend on model_40).
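The caching expectation can be sketched as a common-prefix check: a model is re-usable only if every cut point up to and including its own is unchanged. This is a rough illustration under the assumption that each model depends only on its predecessor; the helper names are hypothetical and this is not how DVC itself decides what to restore from cache.

```python
def model_name(fraction):
    """Name a model after its cumulative percentage, e.g. 0.2 -> model_20."""
    return f"model_{round(fraction * 100)}"


def reusable_models(old, new):
    """Return the models whose whole chain of predecessors is unchanged."""
    reusable = []
    for old_cut, new_cut in zip(old, new):
        if old_cut != new_cut:
            break  # everything after the first difference must be re-trained
        reusable.append(model_name(old_cut))
    return reusable


old = [0, 0.2, 0.5, 0.75]
new = [0, 0.2, 0.4, 0.5, 0.75, 0.85, 0.95]
print(reusable_models(old, new))  # ['model_0', 'model_20']
```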

I tried doing this using a foreach to define my stage. However, each stage needs to reference the output of the previous one, and a foreach item cannot refer to the item before it. For example, if something like this were possible:

re-train:
    foreach: [0,0.2,0.5,0.75]
    do:
        cmd: python train.py --reference-model=model_${prev_item} --output-model=model_${item}
        deps: [model_${prev_item}]
        outs: [model_${item}]

Then it would be fairly easy to chain the stages. However, AFAIK this is not possible, so my workaround is to define a list of objects in vars, such as:

re-trains:
  - {curr: 0.2, prev: 0}
  - {curr: 0.5, prev: 0.2}
  - {curr: 0.75, prev: 0.5}

And then referencing ${item.curr} and ${item.prev}. However, this is error-prone (setting prev incorrectly gives weird results without any warning) and a bit of a hassle to maintain.
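One way to reduce the risk of a wrong prev is to generate the pairs from the flat list instead of writing them by hand. The sketch below is only a possible stopgap; `chain_pairs` is a hypothetical helper, and the printed lines would still need to be pasted or written into the vars or params file yourself.

```python
def chain_pairs(fractions):
    """Build the {curr, prev} objects from a flat, strictly increasing list."""
    if len(fractions) < 2:
        raise ValueError("need at least a start value and one re-train value")
    if any(b <= a for a, b in zip(fractions, fractions[1:])):
        raise ValueError("fractions must be strictly increasing")
    # Each entry pairs a value with the one right before it.
    return [{"curr": curr, "prev": prev}
            for prev, curr in zip(fractions, fractions[1:])]


pairs = chain_pairs([0, 0.2, 0.5, 0.75])
print("re-trains:")
for pair in pairs:
    print(f"  - {{curr: {pair['curr']}, prev: {pair['prev']}}}")
```

Because prev is always derived from the neighbouring element, a mistyped or out-of-order value raises an error instead of silently chaining the wrong models.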

I use dbt very frequently, so I think Jinja2 templating could be a good tool for dealing with these cases. For example, my situation would be solved by doing something like this:

{% set stages = [0.2, 0.5, 0.75] %}

train:
  cmd: python train.py --output-model=model_0
  outs: [model_0]

{% for stage in stages %}
re-train@{{ loop.index0 }}:
    {% set input_model = 'model_0' if loop.first else 'model_' ~ stages[loop.index0 - 1] | replace(".", "_") %}
    {% set output_model = 'model_' ~ stage | replace(".", "_") %}
    cmd: python train.py --reference-model={{ input_model }} --output-model={{ output_model }}
    deps:
      - {{ input_model }}
    outs:
      - {{ output_model }}
{% endfor %}

Putting it in a template renderer gives:

train:
  cmd: python train.py --output-model=model_0
  outs: [model_0]

re-train@0:
    cmd: python train.py --reference-model=model_0 --output-model=model_0_2
    deps:
      - model_0
    outs:
      - model_0_2

re-train@1:
    cmd: python train.py --reference-model=model_0_2 --output-model=model_0_5
    deps:
      - model_0_2
    outs:
      - model_0_5

re-train@2:
    cmd: python train.py --reference-model=model_0_5 --output-model=model_0_75
    deps:
      - model_0_5
    outs:
      - model_0_75

I searched for Jinja2 in the repo and it seems that it has been considered previously (and deemed too weird/ugly, which, honestly, I agree with, especially for beginners). However, drawing inspiration from it, another approach would be to allow arithmetic in DVC string interpolation and to expose more loop values in foreach. For example, providing idx would enable something like:

vars:
    - retrains: [0.2,0.5,0.75]

train:
    cmd: python train.py --output-model=model_0
    outs: [model_0]

re-train:
    foreach: ${retrains}
    do:
        cmd: python train.py --reference-model=model_${idx} --output-model=model_${idx+1}
        deps: [model_${idx}]
        outs: [model_${idx+1}]

This is much cleaner.
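Until something like this is supported, the chained stages can be generated outside DVC. The following is a minimal sketch that builds the dvc.yaml text from the flat list using only the standard library. The stage names (`re-train-0`, ...), the model naming scheme, and the function names are assumptions for illustration; it only prints the text and leaves writing the file to you.

```python
def model_name(fraction):
    """Turn 0.2 into model_0_2 (same naming as the Jinja2 example)."""
    return "model_" + str(fraction).replace(".", "_")


def render_dvc_yaml(fractions):
    """Render a chain of stages where each re-train depends on the previous model."""
    if not fractions:
        raise ValueError("need at least the initial value")
    first, rest = fractions[0], fractions[1:]
    lines = [
        "stages:",
        "  train:",
        f"    cmd: python train.py --output-model={model_name(first)}",
        f"    outs: [{model_name(first)}]",
    ]
    prev = first
    for i, curr in enumerate(rest):
        src, dst = model_name(prev), model_name(curr)
        lines += [
            f"  re-train-{i}:",
            f"    cmd: python train.py --reference-model={src} --output-model={dst}",
            f"    deps: [{src}]",
            f"    outs: [{dst}]",
        ]
        prev = curr  # the model just produced feeds the next stage
    return "\n".join(lines) + "\n"


print(render_dvc_yaml([0, 0.2, 0.5, 0.75]))
```

Changing the list and re-running the script regenerates the whole chain, so the previous/current bookkeeping never has to be edited by hand.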


PythonFZ · 2024-11-27T12:08:11Z

This might not be exactly what you want, but with ZnTrack you can construct a dvc.yaml dynamically and thus achieve what you are looking for.

Disclaimer: I am the author of ZnTrack.

import zntrack

class YourModelTrainingCls(zntrack.Node): ...

project = zntrack.Project()

with project:
    model = YourModelTrainingCls(**kwargs)

project.repro()

# ?<metric> is a placeholder for a metric of your node
if model.?<metric>:
    with project:
        ...

    project.repro()
