Add the ability to run starting from a specific task (fixes #227) #377

Merged 2 commits on Feb 14, 2024

Update poetry dependencies to pull in newer taskgraph version

37fbf27

firefoxci-taskcluster / dataset-news-crawl-news_2008-ru succeeded Feb 13, 2024 in 11m 48s

FirefoxCI (pull_request)

Fetch news-crawl dataset for {src_locale}

[taskcluster 2024-02-13 19:56:25.259Z] Task ID: bqNTthVBTSS0un1EjVRclQ
[taskcluster 2024-02-13 19:56:25.260Z] Worker ID: 3179979791804037441
[taskcluster 2024-02-13 19:56:25.260Z] Worker Group: us-central1
[taskcluster 2024-02-13 19:56:25.260Z] Worker Node Type: projects/887720501152/machineTypes/n2-highmem-32
[taskcluster 2024-02-13 19:56:25.260Z] Worker Pool: translations-1/b-linux-large-gcp
[taskcluster 2024-02-13 19:56:25.260Z] Worker Version: 38.0.5
[taskcluster 2024-02-13 19:56:25.260Z] Public IP: 34.122.95.34
[taskcluster 2024-02-13 19:56:25.260Z] Hostname: translations-1-b-linux-large-gcp-hqyisvsorvolrcxzzwk1lw
[taskcluster 2024-02-13 19:56:25.260Z] using cache "translations-level-1-checkouts-v3-7afeb851dd97df8f3607-KTThW1rRQEWlZgGQ0sWCPQ" -> /builds/worker/checkouts

[taskcluster 2024-02-13 19:56:25.811Z] Downloading artifact "public/image.tar.zst" from task ID: KTThW1rRQEWlZgGQ0sWCPQ.
[taskcluster 2024-02-13 19:56:30.813Z] Download Progress: 73.95%
[taskcluster 2024-02-13 19:56:32.382Z] Downloaded artifact successfully.
[taskcluster 2024-02-13 19:56:32.382Z] Downloaded 776.725 mb
[taskcluster 2024-02-13 19:56:32.383Z] Decompressing downloaded image
[taskcluster 2024-02-13 19:56:38.933Z] Loading docker image from downloaded archive.
[taskcluster 2024-02-13 19:57:12.741Z] Image 'public/image.tar.zst' from task 'KTThW1rRQEWlZgGQ0sWCPQ' loaded.  Using image ID sha256:53facad048ff33f5c58e9d52d6e58e6cd4fcdd5a8e5788c85f46e559dd9deed5.
[taskcluster 2024-02-13 19:57:13.229Z] === Task Starting ===
[setup 2024-02-13T19:57:18.515Z] run-task started in /builds/worker
[setup 2024-02-13T19:57:18.515Z] Invoked by command: --firefox_translations_training-checkout=/builds/worker/checkouts/vcs/ -- bash -c $VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst
[setup 2024-02-13T19:57:18.515Z] Python version: 3.10.12
[cache 2024-02-13T19:57:18.517Z] cache /builds/worker/checkouts is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2024-02-13T19:57:18.517Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2024-02-13T19:57:18.517Z] volume /builds/worker/checkouts is a cache
[setup 2024-02-13T19:57:18.517Z] running as worker:worker
[vcs 2024-02-13T19:57:18.518Z] executing ['git', 'config', '--global', '--add', 'safe.directory', '/builds/worker/checkouts/vcs']
[vcs 2024-02-13T19:57:18.520Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/builds/worker/checkouts/vcs']
[vcs 2024-02-13T19:57:18.522Z] Cloning into '/builds/worker/checkouts/vcs'...
[vcs 2024-02-13T19:57:19.275Z] executing ['git', 'fetch', '--no-tags', 'https://github.com/bhearsum/firefox-translations-training', 'start-specific']
[vcs 2024-02-13T19:57:19.498Z] From https://github.com/bhearsum/firefox-translations-training
[vcs 2024-02-13T19:57:19.498Z]  * branch            start-specific -> FETCH_HEAD
[vcs 2024-02-13T19:57:19.504Z] executing ['git', 'checkout', '-f', '-B', 'start-specific', '37fbf272d7eb316897377144111a3ef057becfd4']
[vcs 2024-02-13T19:57:19.568Z] Switched to a new branch 'start-specific'
[vcs 2024-02-13T19:57:19.569Z] executing ['git', 'submodule', 'init']
[vcs 2024-02-13T19:57:19.592Z] Submodule '3rd_party/browsermt-marian-dev' (https://github.com/browsermt/marian-dev) registered for path '3rd_party/browsermt-marian-dev'
[vcs 2024-02-13T19:57:19.592Z] Submodule 'extract-lex' (https://github.com/marian-nmt/extract-lex) registered for path '3rd_party/extract-lex'
[vcs 2024-02-13T19:57:19.593Z] Submodule 'fast_align' (https://github.com/clab/fast_align) registered for path '3rd_party/fast_align'
[vcs 2024-02-13T19:57:19.593Z] Submodule '3rd_party/kenlm' (https://github.com/kpu/kenlm) registered for path '3rd_party/kenlm'
[vcs 2024-02-13T19:57:19.593Z] Submodule '3rd_party/marian-dev' (https://github.com/marian-nmt/marian-dev) registered for path '3rd_party/marian-dev'
[vcs 2024-02-13T19:57:19.593Z] Submodule '3rd_party/preprocess' (https://github.com/kpu/preprocess.git) registered for path '3rd_party/preprocess'
[vcs 2024-02-13T19:57:19.594Z] executing ['git', 'submodule', 'update', '--force']
[vcs 2024-02-13T19:57:19.619Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/browsermt-marian-dev'...
[vcs 2024-02-13T19:57:20.840Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/extract-lex'...
[vcs 2024-02-13T19:57:21.131Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/fast_align'...
[vcs 2024-02-13T19:57:21.460Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/kenlm'...
[vcs 2024-02-13T19:57:22.108Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/marian-dev'...
[vcs 2024-02-13T19:57:23.491Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/preprocess'...
[vcs 2024-02-13T19:57:24.017Z] Submodule path '3rd_party/browsermt-marian-dev': checked out '11c6ae7c46be21ef96ed10c60f28022fa968939f'
[vcs 2024-02-13T19:57:24.029Z] Submodule path '3rd_party/extract-lex': checked out '42fa605b53f32eaf6c6e0b5677255c21c91b3d49'
[vcs 2024-02-13T19:57:24.041Z] Submodule path '3rd_party/fast_align': checked out 'cab1e9aac8d3bb02ff5ae58218d8d225a039fa11'
[vcs 2024-02-13T19:57:24.069Z] Submodule path '3rd_party/kenlm': checked out 'bbf4fc511266c5d4515047055d7bdec659a6e158'
[vcs 2024-02-13T19:57:24.187Z] Submodule path '3rd_party/marian-dev': checked out 'e8a1a2530fb84cbff7383302ebca393e5875c441'
[vcs 2024-02-13T19:57:24.208Z] Submodule path '3rd_party/preprocess': checked out '64307314b4d5a9a0bd529b5c1036b0710d995eec'
[vcs 2024-02-13T19:57:24.208Z] cleaning git checkout...
[vcs 2024-02-13T19:57:24.208Z] executing ['git', 'clean', '-nxdff']
[vcs 2024-02-13T19:57:24.212Z] removing []
[vcs 2024-02-13T19:57:24.212Z] successfully cleaned git checkout!
[vcs 2024-02-13T19:57:24.214Z] TinderboxPrint:<a href='https://github.com/bhearsum/firefox-translations-training/commit/37fbf272d7eb316897377144111a3ef057becfd4' title='Built from firefox-translations-training commit 37fbf272d7eb316897377144111a3ef057becfd4'>37fbf272d7eb316897377144111a3ef057becfd4</a>
[task 2024-02-13T19:57:24.214Z] executing ['bash', '-c', '$VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst']
[task 2024-02-13T19:57:24.216Z] + set -euo pipefail
[task 2024-02-13T19:57:24.216Z] + dataset=news-crawl_news.2008
[task 2024-02-13T19:57:24.216Z] + lang=ru
[task 2024-02-13T19:57:24.216Z] + max_sent=10000
[task 2024-02-13T19:57:24.216Z] + output_path=/builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-13T19:57:24.216Z] + coef=0.1
[task 2024-02-13T19:57:24.216Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-13T19:57:24.216Z] + ARTIFACT_EXT=zst
[task 2024-02-13T19:57:24.216Z] + echo '###### Downloading monolingual data for language ru dataset news-crawl_news.2008'
[task 2024-02-13T19:57:24.216Z] ###### Downloading monolingual data for language ru dataset news-crawl_news.2008
[task 2024-02-13T19:57:24.217Z] ++ dirname /builds/worker/checkouts/vcs/pipeline/data/download-mono.sh
[task 2024-02-13T19:57:24.217Z] + cd /builds/worker/checkouts/vcs/pipeline/data
[task 2024-02-13T19:57:24.218Z] ++ dirname /builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-13T19:57:24.218Z] + tmp=/builds/worker/artifacts/original
[task 2024-02-13T19:57:24.218Z] + mkdir -p /builds/worker/artifacts/original
[task 2024-02-13T19:57:24.220Z] + echo '### Downloading dataset'
[task 2024-02-13T19:57:24.220Z] ### Downloading dataset
[task 2024-02-13T19:57:24.220Z] + original_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-13T19:57:24.220Z] + name=news.2008
[task 2024-02-13T19:57:24.220Z] + type=news-crawl
[task 2024-02-13T19:57:24.220Z] + test -s /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-13T19:57:24.220Z] + bash importers/mono/news-crawl.sh ru /builds/worker/artifacts/original/news-crawl_news.2008.original.ru news.2008
[task 2024-02-13T19:57:24.221Z] + set -euo pipefail
[task 2024-02-13T19:57:24.221Z] + lang=ru
[task 2024-02-13T19:57:24.221Z] + output_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-13T19:57:24.221Z] + dataset=news.2008
[task 2024-02-13T19:57:24.221Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-13T19:57:24.221Z] + ARTIFACT_EXT=zst
[task 2024-02-13T19:57:24.221Z] + WGET=wget
[task 2024-02-13T19:57:24.221Z] + echo '###### Downloading WMT newscrawl monolingual data'
[task 2024-02-13T19:57:24.221Z] ###### Downloading WMT newscrawl monolingual data
[task 2024-02-13T19:57:24.221Z] + wget -O - http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-13T19:57:24.222Z] + gunzip
[task 2024-02-13T19:57:24.222Z] + zstdmt -c
[task 2024-02-13T19:57:24.224Z] --2024-02-13 19:57:24--  http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-13T19:57:24.284Z] Resolving data.statmt.org (data.statmt.org)... 129.215.32.28
[task 2024-02-13T19:57:24.392Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:80... connected.
[task 2024-02-13T19:57:24.499Z] HTTP request sent, awaiting response... 301 Moved Permanently
[task 2024-02-13T19:57:24.499Z] Location: https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz [following]
[task 2024-02-13T19:57:24.499Z] --2024-02-13 19:57:24--  https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-13T19:57:24.608Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:443... connected.
[task 2024-02-13T19:57:24.946Z] HTTP request sent, awaiting response... 200 OK
[task 2024-02-13T19:57:24.946Z] Length: 2312968 (2.2M) [application/x-gzip]
[task 2024-02-13T19:57:24.946Z] Saving to: ‘STDOUT’
[task 2024-02-13T19:57:24.946Z] 
[task 2024-02-13T19:57:25.162Z]      0K .......... .......... .......... .......... ..........  2%  232K 10s
[task 2024-02-13T19:57:25.270Z]     50K .......... .......... .......... .......... ..........  4%  462K 7s
[task 2024-02-13T19:57:25.270Z]    100K .......... .......... .......... .......... ..........  6%  101M 5s
[task 2024-02-13T19:57:25.378Z]    150K .......... .......... .......... .......... ..........  8%  465K 4s
[task 2024-02-13T19:57:25.378Z]    200K .......... .......... .......... .......... .......... 11% 88.5M 3s
[task 2024-02-13T19:57:25.379Z]    250K .......... .......... .......... .......... .......... 13%  113M 3s
[task 2024-02-13T19:57:25.385Z]    300K .......... .......... .......... .......... .......... 15% 7.92M 2s
[task 2024-02-13T19:57:25.484Z]    350K .......... .......... .......... .......... .......... 17%  503K 3s
[task 2024-02-13T19:57:25.485Z]    400K .......... .......... .......... .......... .......... 19% 69.4M 2s
[task 2024-02-13T19:57:25.487Z]    450K .......... .......... .......... .......... .......... 22% 29.0M 2s
[task 2024-02-13T19:57:25.487Z]    500K .......... .......... .......... .......... .......... 24%  145M 2s
[task 2024-02-13T19:57:25.493Z]    550K .......... .......... .......... .......... .......... 26% 8.06M 2s
[task 2024-02-13T19:57:25.493Z]    600K .......... .......... .......... .......... .......... 28%  192M 1s
[task 2024-02-13T19:57:25.494Z]    650K .......... .......... .......... .......... .......... 30%  224M 1s
[task 2024-02-13T19:57:25.494Z]    700K .......... .......... .......... .......... .......... 33%  198M 1s
[task 2024-02-13T19:57:25.591Z]    750K .......... .......... .......... .......... .......... 35%  515K 1s
[task 2024-02-13T19:57:25.591Z]    800K .......... .......... .......... .......... .......... 37%  152M 1s
[task 2024-02-13T19:57:25.592Z]    850K .......... .......... .......... .......... .......... 39%  142M 1s
[task 2024-02-13T19:57:25.592Z]    900K .......... .......... .......... .......... .......... 42%  113M 1s
[task 2024-02-13T19:57:25.593Z]    950K .......... .......... .......... .......... .......... 44% 41.9M 1s
[task 2024-02-13T19:57:25.594Z]   1000K .......... .......... .......... .......... .......... 46%  114M 1s
[task 2024-02-13T19:57:25.599Z]   1050K .......... .......... .......... .......... .......... 48% 8.84M 1s
[task 2024-02-13T19:57:25.599Z]   1100K .......... .......... .......... .......... .......... 50%  216M 1s
[task 2024-02-13T19:57:25.600Z]   1150K .......... .......... .......... .......... .......... 53%  232M 1s
[task 2024-02-13T19:57:25.600Z]   1200K .......... .......... .......... .......... .......... 55%  201M 1s
[task 2024-02-13T19:57:25.600Z]   1250K .......... .......... .......... .......... .......... 57%  247M 0s
[task 2024-02-13T19:57:25.606Z]   1300K .......... .......... .......... .......... .......... 59% 8.70M 0s
[task 2024-02-13T19:57:25.606Z]   1350K .......... .......... .......... .......... .......... 61%  213M 0s
[task 2024-02-13T19:57:25.606Z]   1400K .......... .......... .......... .......... .......... 64%  226M 0s
[task 2024-02-13T19:57:25.606Z]   1450K .......... .......... .......... .......... .......... 66%  198M 0s
[task 2024-02-13T19:57:25.607Z]   1500K .......... .......... .......... .......... .......... 68%  239M 0s
[task 2024-02-13T19:57:25.699Z]   1550K .......... .......... .......... .......... .......... 70%  540K 0s
[task 2024-02-13T19:57:25.700Z]   1600K .......... .......... .......... .......... .......... 73%  150M 0s
[task 2024-02-13T19:57:25.700Z]   1650K .......... .......... .......... .......... .......... 75%  155M 0s
[task 2024-02-13T19:57:25.700Z]   1700K .......... .......... .......... .......... .......... 77% 93.7M 0s
[task 2024-02-13T19:57:25.701Z]   1750K .......... .......... .......... .......... .......... 79%  123M 0s
[task 2024-02-13T19:57:25.701Z]   1800K .......... .......... .......... .......... .......... 81%  109M 0s
[task 2024-02-13T19:57:25.707Z]   1850K .......... .......... .......... .......... .......... 84% 9.18M 0s
[task 2024-02-13T19:57:25.707Z]   1900K .......... .......... .......... .......... .......... 86%  231M 0s
[task 2024-02-13T19:57:25.707Z]   1950K .......... .......... .......... .......... .......... 88%  203M 0s
[task 2024-02-13T19:57:25.707Z]   2000K .......... .......... .......... .......... .......... 90%  227M 0s
[task 2024-02-13T19:57:25.708Z]   2050K .......... .......... .......... .......... .......... 92%  230M 0s
[task 2024-02-13T19:57:25.713Z]   2100K .......... .......... .......... .......... .......... 95% 8.69M 0s
[task 2024-02-13T19:57:25.713Z]   2150K .......... .......... .......... .......... .......... 97%  216M 0s
[task 2024-02-13T19:57:25.714Z]   2200K .......... .......... .......... .......... .......... 99%  200M 0s
[task 2024-02-13T19:57:25.714Z]   2250K ........                                              100%  235M=0.8s
[task 2024-02-13T19:57:25.714Z] 
[task 2024-02-13T19:57:25.714Z] 2024-02-13 19:57:25 (2.87 MB/s) - written to stdout [2312968/2312968]
[task 2024-02-13T19:57:25.714Z] 
[task 2024-02-13T19:57:25.782Z] + echo '###### Done: Downloading WMT newscrawl monolingual data'
[task 2024-02-13T19:57:25.782Z] ###### Done: Downloading WMT newscrawl monolingual data
[task 2024-02-13T19:57:25.783Z] + echo '### Sampling dataset'
[task 2024-02-13T19:57:25.783Z] ### Sampling dataset
[task 2024-02-13T19:57:25.783Z] + set +o pipefail
[task 2024-02-13T19:57:25.783Z] + zstdmt -dc /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-13T19:57:25.783Z] + perl -ne 'print if(split(/\s/, $_) < 100)'
[task 2024-02-13T19:57:25.783Z] + head -n 10000
[task 2024-02-13T19:57:25.784Z] ++ bc -l
[task 2024-02-13T19:57:25.784Z] + zstdmt
[task 2024-02-13T19:57:25.785Z] + shuf -n 11000
[task 2024-02-13T19:57:25.856Z] + set -o pipefail
[task 2024-02-13T19:57:25.856Z] + rm -rf /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-13T19:57:25.858Z] + echo '###### Done: Downloading monolingual data'
[task 2024-02-13T19:57:25.858Z] ###### Done: Downloading monolingual data
[taskcluster 2024-02-13 19:57:26.175Z] === Task Finished ===
[taskcluster 2024-02-13 19:57:26.419Z] Successful task run with exit code: 0 completed in 61.161 seconds
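
For readers following the "Sampling dataset" step in the trace above, here is a minimal sketch of what that pipeline appears to do. It is reconstructed from the xtrace output only, not copied from download-mono.sh. xtrace prints pipeline stages in the order they happen to start, so the stage order below is an assumption. The function name sample_mono is hypothetical, and awk stands in for both the bc call and the perl word-count filter seen in the log, so edge cases around leading or repeated whitespace may differ from the original.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): read plain text on stdin
# and write at most max_sent randomly chosen lines of fewer than 100 fields.
sample_mono() {
  local max_sent="$1"
  local coef="${2:-0.1}"
  local oversample

  # Oversample so enough lines survive the length filter.
  # With max_sent=10000 and coef=0.1 this gives 11000, the value passed
  # to `shuf -n` in the trace.
  oversample="$(awk -v n="$max_sent" -v c="$coef" 'BEGIN { printf "%.0f", n + n * c }')"

  # head exits as soon as it has max_sent lines, which can kill an upstream
  # stage with SIGPIPE. The trace turns pipefail off around this pipeline,
  # presumably for that reason; a subshell keeps the change local here.
  (
    set +o pipefail
    shuf -n "$oversample" |
      awk 'NF < 100' |
      head -n "$max_sent"
  )
}
```

With the values from this run it could be invoked as `zstdmt -dc original.ru.zst | sample_mono 10000 0.1 | zstdmt -c > news_2008.ru.zst`, where both file names are placeholders.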