
Make workers configurable #254

Closed
eu9ene opened this issue Nov 8, 2023 · 4 comments
Assignees
bhearsum
Labels
on-prem: Running the pipeline on-premises machines
taskcluster: Issues related to the Taskcluster implementation of the training pipeline

Comments

eu9ene (Collaborator) commented on Nov 8, 2023:

Currently, it requires changing each step if we, say, want to train the whole pipeline on a different worker pool. For example, we might want to experiment with different GPU types, or run the pipeline on the on-prem cluster and set the specific GPU we want to use.

Can we configure workers from the training config? I think it should be possible with transforms.

eu9ene added the taskcluster label on Nov 8, 2023
gabrielBusta (Member) commented:

It sounds possible. The transform can load the config and tweak the worker field in the task.
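
A minimal sketch of that approach, assuming the training config is exposed through the graph parameters under a hypothetical training_config key with an illustrative worker-type-overrides mapping (neither name exists today):

    # Sketch only: "training_config" and "worker-type-overrides" are assumed,
    # illustrative names, not existing keys in the repository.
    from taskgraph.transforms.base import TransformSequence

    transforms = TransformSequence()

    @transforms.add
    def override_worker_type(config, jobs):
        training_config = config.params.get("training_config", {})
        overrides = training_config.get("worker-type-overrides", {})
        for job in jobs:
            # Swap the worker pool alias when the training config asks for it,
            # e.g. pointing "b-linux-v100-gpu-4-1tb" at an on-prem pool instead.
            alias = job.get("worker-type")
            if alias in overrides:
                job["worker-type"] = overrides[alias]
            yield job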

eu9ene added the on-prem label on Nov 9, 2023
bhearsum (Collaborator) commented:

Note that transforms will work if each worker pool has a consistent value. If we have multiple workers in the same pool that require different values, we won't know which ones are correct until runtime.

IMO, in an ideal world, the workers would set GPUS/WORKSPACE themselves, and the tasks would inherit these values (since they are a property of the worker). I'm not sure at this point if that's possible. If not, runtime detection in a script is necessary for this case.
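
As a rough illustration of that runtime-detection fallback, a small script on the worker could derive GPUS before training starts. This assumes nvidia-smi is available on the worker and that GPUS is consumed as a space-separated list of device indices; WORKSPACE would still need its own source.

    # Rough sketch of runtime GPU detection on the worker. Assumes nvidia-smi is
    # on PATH and that GPUS is read as a space-separated list of device indices.
    import os
    import subprocess

    def detect_gpus():
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: Tesla V100 ...".
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
        count = len([line for line in out.stdout.splitlines() if line.strip()])
        return " ".join(str(i) for i in range(count))

    # Only fill in GPUS when the task definition did not already provide it.
    if "GPUS" not in os.environ:
        os.environ["GPUS"] = detect_gpus()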

eu9ene (Collaborator, Author) commented on Nov 17, 2023:

Yes, ideally those values should depend on the worker. Related to #253.

eu9ene added the p1 label on Nov 20, 2023
bhearsum (Collaborator) commented:

We talked about this a bit on Zoom today. We agreed that in the short (and maybe medium) term we would keep worker pools consistent as far as their GPU type and count go. With that in mind, we should be able to define the GPUS and WORKSPACE values in https://github.com/mozilla/firefox-translations-training/blob/main/taskcluster/ci/config.yml (this will have to be a new top-level key, as https://github.com/taskcluster/taskgraph/blob/0401b911ec0a5d3a8b66bdacc183362aa7811871/src/taskgraph/config.py#L43 does not allow extra configuration). From there, a transform can pull the values and insert them into the env of the necessary tasks.
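
A minimal sketch of what that transform could look like, assuming a hypothetical worker-resources top-level key in config.yml that maps worker aliases to gpus/workspace values (the key and field names here are illustrative, not decided):

    # Sketch only: "worker-resources", "gpus" and "workspace" are assumed names
    # for the proposed new top-level key in taskcluster/ci/config.yml.
    from taskgraph.transforms.base import TransformSequence

    transforms = TransformSequence()

    @transforms.add
    def set_resource_env(config, jobs):
        resources = {}
        if "worker-resources" in config.graph_config:
            resources = config.graph_config["worker-resources"]
        for job in jobs:
            values = resources.get(job.get("worker-type"), {})
            if values:
                env = job.setdefault("worker", {}).setdefault("env", {})
                # Explicit values in the task definition win; the per-pool
                # defaults only fill in what is missing.
                env.setdefault("GPUS", str(values["gpus"]))
                env.setdefault("WORKSPACE", str(values["workspace"]))
            yield job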

When we have other worker pools we want to train on, we'll just need to adjust those entries and the worker provisioner/worker-type in config.yml to switch to, e.g., the on-prem workers. For example, right now much of the training is done on:

        b-linux-v100-gpu-4-1tb:
            provisioner: '{trust-domain}-{level}'
            implementation: generic-worker
            os: linux
            worker-type: '{alias}'

When the on-prem machines are available, this entry would change to something like:

        b-linux-v100-gpu-4-1tb:
            provisioner: '{trust-domain}-onprem'
            implementation: generic-worker
            os: linux
            worker-type: snakepit

(The b-linux-v100-gpu-4-1tb key is what the kinds use to look up workers, while the provisioner and worker-type are what map it back to the machines doing the work.)

bhearsum self-assigned this on Nov 27, 2023