Allow training actions to be performed in PRs #4

Open · wants to merge 24 commits into main

Conversation

bhearsum (Collaborator) commented May 4, 2023

No description provided.

bhearsum added 24 commits April 27, 2023 15:38
It contains a couple of changes that have yet to be upstreamed to taskgraph. This also requires that we disable pip hash verification for the moment.
…dels.

Most of these are straightforward download-and-compile jobs, but there are a few callouts:
- The CLI tools (marian, fast-align, etc.) already have build scripts used by the existing pipeline. For the most part I'm just replacing them with my own versions, because they're simple unpack/make/cmake steps. The exception is Marian, which has a bit more going on with its cmake definitions. Maybe I should just copy those in here though?
- Some Python modules don't have binary wheels available; we build them here to avoid needing to compile them at the start of training tasks.
- CUDA (the NVIDIA toolkit) is a huge pain. NVIDIA doesn't advertise any real way to just dump the files you want into a directory (they want you to run an installer). I _think_ I managed to get this to work, but it's possible it will need a tweak if a future task has trouble with the current toolchain.

This also necessitated switching the Docker images to Ubuntu, because some tools could not reasonably be made to work on Alpine.
I initially tried to implement these as `fetch` tasks. This failed because the way we get many of these datasets is simply not compatible with the idea of a static URL or one file per task. E.g., many of these datasets are fetched by running a Python script. (In theory this could be reverse-engineered, but I just don't think it's worth it, especially if URLs or metadata end up changing in the future.) Instead, we're making use of the existing pipeline scripts that know how to fetch these.

As you can see, the kind generates tasks named after the provider, dataset, and locale pair. I'm not certain this is what we want to do long term (there's going to be an absurd number of tasks after we finish adding all of the datasets and language pairs), but I think it's OK for now. We probably ought to revisit this before we start running full training pipelines; if we change it after that, we'll end up rebuilding tasks because there will be no cached tasks for the new names.

This revision also builds out a couple of transforms that are used here, and will be used elsewhere (see the sketch after this list):
* One that can substitute the provider name, dataset (in a few forms), and locale pair into tasks. This is necessary to avoid repeating things such as commands, Treeherder symbols, etc.
* Another that configures caching, using attributes defined in the kind. Eventually we're going to use all sorts of action task parameters as part of the cache digest, so it's important that we can specify these things per task.
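For illustration, here's a minimal sketch of what such a substitution transform could look like using taskgraph's `TransformSequence` API. The field names, attribute names, and placeholder keys are assumptions for the sake of the example, not necessarily what this PR implements:

```python
from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

# Fields that may contain {provider}/{dataset}/{src_locale}/{trg_locale}
# placeholders. These names are illustrative, not the PR's actual field list.
SUBSTITUTABLE_FIELDS = ("description", "treeherder.symbol", "run.command")


@transforms.add
def substitute_dataset_fields(config, tasks):
    for task in tasks:
        # Assumes the kind puts provider/dataset/locale info into attributes.
        subs = {
            "provider": task["attributes"]["provider"],
            "dataset": task["attributes"]["dataset"],
            "src_locale": task["attributes"]["src_locale"],
            "trg_locale": task["attributes"]["trg_locale"],
        }
        for field in SUBSTITUTABLE_FIELDS:
            container, key = task, field
            # Walk dotted paths like "treeherder.symbol".
            while "." in key:
                prefix, key = key.split(".", 1)
                container = container.get(prefix, {})
            value = container.get(key)
            if isinstance(value, str):
                container[key] = value.format(**subs)
            elif isinstance(value, list):
                container[key] = [v.format(**subs) for v in value]
        yield task
```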
This is largely built on the earlier work done on the `dataset` kind.
This does a few things:
1) Mark all the shell scripts as +x
2) Switch out pigz/gz for zstdmt/zst (in progress)
3) Add support for `auto` where `threads` is an argument, which uses `nproc` to decide how many threads to use (in progress).
4) Use `curl` instead of `wget` (in progress)
We need this for action tasks to be triggerable through Treeherder, and it's also generally nice to have.
This is very rough for now, but it enables us to kick off certain parts of the pipeline. I intend to look into using the existing config format (e.g. https://github.com/mozilla/firefox-translations-training/blob/main/configs/config.test.yml) as the schema here later; there's also various input checking and other enhancements that need to be implemented.
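For illustration only, a rough sketch of how a training action could be registered through taskgraph's actions registry. The action name, schema, and callback body below are placeholders, and the exact `register_callback_action` arguments may differ from what this PR actually uses:

```python
from taskgraph.actions.registry import register_callback_action


@register_callback_action(
    name="train",
    title="Train",
    symbol="train",
    description="Kick off part of the training pipeline for a language pair.",
    order=500,
    context=[],  # a decision-task-level action, not one tied to a specific task
    schema={
        "type": "object",
        "properties": {
            # Placeholder input; the eventual goal is to accept the existing
            # training config format (configs/config.test.yml) here instead.
            "src_locale": {"type": "string"},
            "trg_locale": {"type": "string"},
            "stage": {"type": "string"},
        },
        "required": ["src_locale", "trg_locale"],
    },
)
def train_action(parameters, graph_config, input, task_group_id, task_id):
    # A real callback would fold the action's input into the parameters,
    # regenerate the task graph, and schedule the requested tasks.
    ...
```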
These are an additional dependency for the `bicleaner` stage of the pipeline.
Very similar to the `clean` and `dataset` stages that have already been implemented. The notable differences are:
- The `bicleaner` tool that eventually gets called has a bunch of Python dependencies. Most of these are handled by the requirements file I'm adding, but two of them don't have binary wheels available, so we're grabbing them from our toolchain builds and using those. (In fact, `kenlm` isn't even declared as a dependency by `bicleaner`, so we'd have to install it by hand one way or another...)
- At the moment, this is using a new `split_by_provider` transform that avoids us needing to list out each provider in the kind. This probably needs to go away, because I recently learned that many pipeline steps (such as this one) don't run for all providers.
- An enhancement to the cache transform to allow specifying `parameters` that should contribute to the cache digest (see the sketch after this list)
- A similar enhancement to the substitution transform to allow substituting parameters
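Purely as illustration, a minimal sketch of how per-task cache settings plus selected parameters could be folded into the digest via taskgraph's cached-task helper. The `cache`, `digest-data`, and `parameters` key names are assumptions for the example, not necessarily the PR's actual schema:

```python
import json

from taskgraph.transforms.base import TransformSequence
from taskgraph.util.cached_tasks import add_optimization

transforms = TransformSequence()


@transforms.add
def cache_task(config, tasks):
    for task in tasks:
        cache = task.pop("cache", None)
        if cache is None:
            yield task
            continue

        # Start with whatever digest data the kind declared for this task...
        digest_data = list(cache.get("digest-data", []))

        # ...then mix in the parameters that should invalidate the cache when
        # they change ("parameters" being the per-task list described above).
        for name in cache.get("parameters", []):
            value = json.dumps(config.params.get(name), sort_keys=True, default=str)
            digest_data.append(f"{name}={value}")

        add_optimization(
            config,
            task,
            cache_type=cache["type"],
            cache_name=cache["name"],
            digest_data=digest_data,
        )
        yield task
```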
Most of this diff is just indentation changes.
@bhearsum force-pushed the main branch 6 times, most recently from c1ceb2d to 86cdd5e on May 12, 2023 at 20:50
@bhearsum force-pushed the main branch 6 times, most recently from 0ba83fd to 58cc0f4 on May 17, 2023 at 16:20
@bhearsum force-pushed the main branch 17 times, most recently from bb2da96 to 9e8641b on November 29, 2024 at 01:53
@bhearsum force-pushed the main branch 2 times, most recently from 1201a66 to f44726c on December 20, 2024 at 15:38