
Make workers configurable #254

Closed
eu9ene opened this issue Nov 8, 2023 · 4 comments
Assignees
bhearsum
Labels
on-prem: Running the pipeline on-premises machines
taskcluster: Issues related to the Taskcluster implementation of the training pipeline

Comments

eu9ene (Collaborator) commented on Nov 8, 2023:

Currently, it requires changing each step if we, say, want to train the whole pipeline on a different worker pool. For example, we might want to experiment with different GPU types, or run the pipeline on the on-prem cluster and set the specific GPU we want to use.

Can we configure workers from the training config? I think it should be possible with transforms.

eu9ene added the taskcluster label on Nov 8, 2023
gabrielBusta (Member) commented:

It sounds possible. The transform can load the config and tweak the worker field in the task.
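
A minimal sketch of that approach, assuming the training config is exposed through the graph parameters under a hypothetical training_config key with an illustrative worker-type-overrides mapping (neither name exists today):

    # Sketch only: "training_config" and "worker-type-overrides" are assumed,
    # illustrative names, not existing keys in the repository.
    from taskgraph.transforms.base import TransformSequence

    transforms = TransformSequence()

    @transforms.add
    def override_worker_type(config, jobs):
        training_config = config.params.get("training_config", {})
        overrides = training_config.get("worker-type-overrides", {})
        for job in jobs:
            # Swap the worker pool alias when the training config asks for it,
            # e.g. pointing "b-linux-v100-gpu-4-1tb" at an on-prem pool instead.
            alias = job.get("worker-type")
            if alias in overrides:
                job["worker-type"] = overrides[alias]
            yield job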

eu9ene added the on-prem label on Nov 9, 2023
bhearsum (Collaborator) commented:

Note that transforms will work if each worker pool has a consistent value. If we have multiple workers in the same pool that require different values, we won't know which ones are correct until runtime.

IMO, in an ideal world, the workers would set GPUS/WORKSPACE themselves, and the tasks would inherit these values (since they are a property of the worker). I'm not sure at this point if that's possible. If not, runtime detection in a script is necessary for this case.
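
As a rough illustration of that runtime-detection fallback, a small script on the worker could derive GPUS before training starts. This assumes nvidia-smi is available on the worker and that GPUS is consumed as a space-separated list of device indices; WORKSPACE would still need its own source.

    # Rough sketch of runtime GPU detection on the worker. Assumes nvidia-smi is
    # on PATH and that GPUS is read as a space-separated list of device indices.
    import os
    import subprocess

    def detect_gpus():
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: Tesla V100 ...".
        out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True, check=True)
        count = len([line for line in out.stdout.splitlines() if line.strip()])
        return " ".join(str(i) for i in range(count))

    # Only fill in GPUS when the task definition did not already provide it.
    if "GPUS" not in os.environ:
        os.environ["GPUS"] = detect_gpus()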

eu9ene (Collaborator, Author) commented on Nov 17, 2023:

Yes, ideally those values should depend on the worker. Related to #253.

eu9ene added the p1 label on Nov 20, 2023
bhearsum (Collaborator) commented:

We talked about this a bit on Zoom today. We agreed that in the short (and maybe medium) term we would keep worker pools consistent as far as their GPU type and count go. With that in mind, we should be able to define the GPUS and WORKSPACE values in https://github.com/mozilla/firefox-translations-training/blob/main/taskcluster/ci/config.yml (this will have to be a new top-level key, as https://github.com/taskcluster/taskgraph/blob/0401b911ec0a5d3a8b66bdacc183362aa7811871/src/taskgraph/config.py#L43 does not allow extra configuration). From there, a transform can pull the values and insert them into the env of the necessary tasks.
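
A minimal sketch of what that transform could look like, assuming a hypothetical worker-resources top-level key in config.yml that maps worker aliases to gpus/workspace values (the key and field names here are illustrative, not decided):

    # Sketch only: "worker-resources", "gpus" and "workspace" are assumed names
    # for the proposed new top-level key in taskcluster/ci/config.yml.
    from taskgraph.transforms.base import TransformSequence

    transforms = TransformSequence()

    @transforms.add
    def set_resource_env(config, jobs):
        resources = {}
        if "worker-resources" in config.graph_config:
            resources = config.graph_config["worker-resources"]
        for job in jobs:
            values = resources.get(job.get("worker-type"), {})
            if values:
                env = job.setdefault("worker", {}).setdefault("env", {})
                # Explicit values in the task definition win; the per-pool
                # defaults only fill in what is missing.
                env.setdefault("GPUS", str(values["gpus"]))
                env.setdefault("WORKSPACE", str(values["workspace"]))
            yield job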

When we have other worker pools we want to train on, we'll just need to adjust those entries and the worker provisioner/worker-type in config.yml to switch to, e.g., the on-prem workers. For example, right now much of the training is done on:

        b-linux-v100-gpu-4-1tb:
            provisioner: '{trust-domain}-{level}'
            implementation: generic-worker
            os: linux
            worker-type: '{alias}'

When the on-prem machines are available, this entry would change to something like:

        b-linux-v100-gpu-4-1tb:
            provisioner: '{trust-domain}-onprem'
            implementation: generic-worker
            os: linux
            worker-type: snakepit

(The b-linux-v100-gpu-4-1tb key is what the kinds use to look up workers, while the provisioner and worker-type are what map it back to the machines doing the work.)

bhearsum self-assigned this on Nov 27, 2023