forked from mozilla/translations
Allow training actions to be performed in PRs #4
Open: bhearsum wants to merge 24 commits into main from train-in-pr
It contains a couple of changes that have yet to be upstreamed to taskgraph. This also requires that we disable pip hash verification for the moment.
…rmat yaml in ci/config.yml
…dels. Most of these are straightforward download-and-compile builds, but there are a few callouts:
- The CLI tools (marian, fast-align, etc.) already have build scripts used by the existing pipeline. For the most part, I'm just replacing them with my own versions because they're just unpack/make/cmake. The exception is Marian, which has a little bit more going on with cmake definitions. Maybe I should just copy those in here though?
- Some Python modules don't have binary wheels available, so we ought to build them to avoid needing to compile them at the start of training tasks.
- CUDA (the NVIDIA toolkit) is a huge pain. They don't have any real advertised way to just dump the files you want into a directory (they want you to run an installer). I _think_ I managed to get this to work, but it's possible this will need a tweak if a future task has trouble with the current toolchain. This also necessitated switching Docker images to Ubuntu, because some tools were not reasonably possible to make work on Alpine.
I initially tried to implement these as `fetch` tasks. This failed because the way that we get so many of these is just not compatible with the idea of a static URL or having one file per task. Eg: many of these datasets are fetched by running a Python script. (In theory this could be reverse engineered, but I just don't think it's worth it...especially if URLs or metadata end up changing in the future.) Instead, we're making use of the existing pipeline scripts that know how to fetch these.

As you can see, the kind generates tasks named after the provider, dataset, and locale pair. I'm not certain this is what we want to do long term (there's going to be an absurd number of tasks after we finish adding all of the datasets and language pairs)...but I think it's OK for now. We probably ought to revisit this before we start running full training pipelines: if we change it after that, we'll end up rebuilding tasks because there will be no cached tasks under the new names.

This revision also builds out a couple of transforms that are used here, and will be used elsewhere:
* One that can substitute provider name, dataset (in a few forms), and locale pairs into tasks. This is necessary to avoid needing to repeat things such as commands, treeherder symbols, etc.
* Another one that configures caching, using attributes defined in the kind. Eventually we're going to be using all sorts of action task parameters as part of the cache digest -- so it's important that we can specify these things per-task.
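To illustrate the substitution idea described above, here is a minimal sketch in plain Python. It is not the transform from this PR and does not use taskgraph's APIs; the field names (`provider`, `dataset`, `dataset_sanitized`, `src_locale`, `trg_locale`), the `fetch-dataset` command, and the sanitizing rule are all assumptions made up for the example.

```python
def substitute(value, **fields):
    """Recursively format every string found in nested dicts and lists."""
    if isinstance(value, str):
        # str.format treats literal braces as placeholders, so templates
        # containing shell-style ${VAR} would need escaping in this sketch.
        return value.format(**fields)
    if isinstance(value, dict):
        return {key: substitute(item, **fields) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute(item, **fields) for item in value]
    return value


def expand_tasks(template, provider, datasets, src_locale, trg_locale):
    """Yield one concrete task per dataset from a single template."""
    for dataset in datasets:
        yield substitute(
            template,
            provider=provider,
            dataset=dataset,
            # Hypothetical "sanitized" form, safe to use in a task name.
            dataset_sanitized=dataset.replace("/", "_").replace(".", "_"),
            src_locale=src_locale,
            trg_locale=trg_locale,
        )


# Hypothetical template: the command and naming scheme are illustrative only.
template = {
    "name": "{provider}-{dataset_sanitized}-{src_locale}-{trg_locale}",
    "run": {
        "command": [
            "fetch-dataset", "{provider}", "{dataset}",
            "{src_locale}", "{trg_locale}",
        ],
    },
}

tasks = list(expand_tasks(template, "opus", ["ada83/v1"], "en", "ru"))
```

Writing the command and name once in the template and generating the per-dataset variants keeps the kind definition short, at the cost of producing one task per provider, dataset, and locale pair.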
This is largely built on the earlier work done on the `dataset` kind.
This covers a few things:
1) Mark all the shell scripts as +x
2) Switch out pigz/gz for zstdmt/zst (in progress)
3) Add support for `auto` where `threads` is an argument, which uses `nproc` to decide how many threads to use (in progress)
4) Use `curl` instead of `wget` (in progress)
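The pipeline scripts themselves are shell and use `nproc`; purely as an illustration of the `auto` behaviour, the same logic might look like this in Python. The function name `resolve_threads` is hypothetical, and using the CPU affinity mask as the closest analogue of `nproc` is an assumption of this sketch.

```python
import os


def resolve_threads(value):
    """Turn a `threads` argument into a positive integer thread count."""
    if value == "auto":
        # On Linux the affinity mask reflects the CPUs this process may use,
        # which is roughly what `nproc` reports; fall back to cpu_count().
        if hasattr(os, "sched_getaffinity"):
            return len(os.sched_getaffinity(0))
        return os.cpu_count() or 1
    threads = int(value)
    if threads < 1:
        raise ValueError(f"threads must be positive, got {value!r}")
    return threads
```

Explicit values such as `"4"` pass through unchanged, so existing invocations keep working while `auto` adapts to whatever worker the task lands on.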
We need this for action tasks to be triggerable through Treeherder, and it's also generally nice to have.
This is very rough for now, but it enables us to kick off certain parts of the pipeline. I intend to look into the possibility of using the existing config format (eg: https://github.com/mozilla/firefox-translations-training/blob/main/configs/config.test.yml) as the schema here later. There is also various input checking that needs to be implemented, along with other enhancements.
These are an additional dependency for the `bicleaner` stage of the pipeline.
Very similar to the `clean` and `dataset` stages that have already been implemented. The notable differences are:
- The `bicleaner` tool that eventually gets called has a bunch of Python dependencies. Most of these are handled by the requirements file I'm adding, but there are two extra ones that don't have binary wheels available -- so we're grabbing them from our toolchain builds and using those. (In fact, `kenlm` isn't even declared as a dependency by `bicleaner`...so we'd have to install it by hand one way or another...)
- At the moment, this is using a new `split_by_provider` transform that avoids us needing to list out each provider in the kind. This probably needs to go away, because I recently learned that many pipeline steps (such as this one) don't run for all providers.
- An enhancement to the cache transform to allow specifying `parameters` that should contribute to the cache digest
- A similar enhancement to the substitution transform to allow substituting parameters
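As a rough sketch of what "parameters that contribute to the cache digest" could mean, the snippet below hashes a task's resource fingerprints together with only the parameters the task declares. This is an assumption about the general technique, not the cache transform in this PR; `cache_digest` and the parameter names are hypothetical.

```python
import hashlib
import json


def cache_digest(resources, parameters, wanted):
    """Hash resource fingerprints plus the selected parameters.

    resources:  strings identifying the task's inputs (e.g. file hashes)
    parameters: all parameters available to the task graph
    wanted:     names of the parameters this task declares as cache-relevant
    """
    payload = {
        "resources": sorted(resources),
        # Only declared parameters are included, so unrelated parameter
        # changes do not invalidate this task's cache entry.
        "parameters": {name: parameters[name] for name in sorted(wanted)},
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Hypothetical parameter names, for illustration only.
base = cache_digest(
    ["script-abc123"],
    {"bicleaner_threshold": 0.5, "owner": "a@example.com"},
    ["bicleaner_threshold"],
)
other_owner = cache_digest(
    ["script-abc123"],
    {"bicleaner_threshold": 0.5, "owner": "b@example.com"},
    ["bicleaner_threshold"],
)
other_threshold = cache_digest(
    ["script-abc123"],
    {"bicleaner_threshold": 0.7, "owner": "a@example.com"},
    ["bicleaner_threshold"],
)
```

Here `base` and `other_owner` are equal because `owner` is not declared as cache-relevant, while `other_threshold` differs, which is the per-task control the commit message is after.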
Most of this diff is just indentation changes.
bhearsum force-pushed the main branch 6 times, most recently from c1ceb2d to 86cdd5e (May 12, 2023 20:50)
bhearsum force-pushed the main branch 6 times, most recently from 0ba83fd to 58cc0f4 (May 17, 2023 16:20)
bhearsum force-pushed the main branch 17 times, most recently from bb2da96 to 9e8641b (November 29, 2024 01:53)
bhearsum force-pushed the main branch 2 times, most recently from 1201a66 to f44726c (December 20, 2024 15:38)
No description provided.