Add the ability to run starting from a specific task (fixes #227) #377
firefoxci-taskcluster / dataset-news-crawl-news_2008-ru
succeeded
Feb 13, 2024 in 11m 48s
FirefoxCI (pull_request)
Fetch news-crawl dataset for {src_locale}
[taskcluster 2024-02-13 19:56:25.259Z] Task ID: bqNTthVBTSS0un1EjVRclQ
[taskcluster 2024-02-13 19:56:25.260Z] Worker ID: 3179979791804037441
[taskcluster 2024-02-13 19:56:25.260Z] Worker Group: us-central1
[taskcluster 2024-02-13 19:56:25.260Z] Worker Node Type: projects/887720501152/machineTypes/n2-highmem-32
[taskcluster 2024-02-13 19:56:25.260Z] Worker Pool: translations-1/b-linux-large-gcp
[taskcluster 2024-02-13 19:56:25.260Z] Worker Version: 38.0.5
[taskcluster 2024-02-13 19:56:25.260Z] Public IP: 34.122.95.34
[taskcluster 2024-02-13 19:56:25.260Z] Hostname: translations-1-b-linux-large-gcp-hqyisvsorvolrcxzzwk1lw
[taskcluster 2024-02-13 19:56:25.260Z] using cache "translations-level-1-checkouts-v3-7afeb851dd97df8f3607-KTThW1rRQEWlZgGQ0sWCPQ" -> /builds/worker/checkouts
[taskcluster 2024-02-13 19:56:25.811Z] Downloading artifact "public/image.tar.zst" from task ID: KTThW1rRQEWlZgGQ0sWCPQ.
[taskcluster 2024-02-13 19:56:30.813Z] Download Progress: 73.95%
[taskcluster 2024-02-13 19:56:32.382Z] Downloaded artifact successfully.
[taskcluster 2024-02-13 19:56:32.382Z] Downloaded 776.725 mb
[taskcluster 2024-02-13 19:56:32.383Z] Decompressing downloaded image
[taskcluster 2024-02-13 19:56:38.933Z] Loading docker image from downloaded archive.
[taskcluster 2024-02-13 19:57:12.741Z] Image 'public/image.tar.zst' from task 'KTThW1rRQEWlZgGQ0sWCPQ' loaded. Using image ID sha256:53facad048ff33f5c58e9d52d6e58e6cd4fcdd5a8e5788c85f46e559dd9deed5.
[taskcluster 2024-02-13 19:57:13.229Z] === Task Starting ===
[setup 2024-02-13T19:57:18.515Z] run-task started in /builds/worker
[setup 2024-02-13T19:57:18.515Z] Invoked by command: --firefox_translations_training-checkout=/builds/worker/checkouts/vcs/ -- bash -c $VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst
[setup 2024-02-13T19:57:18.515Z] Python version: 3.10.12
[cache 2024-02-13T19:57:18.517Z] cache /builds/worker/checkouts is empty; writing requirements: gid=1000 uid=1000 version=1
[volume 2024-02-13T19:57:18.517Z] changing ownership of volume /builds/worker/.cache to 1000:1000
[volume 2024-02-13T19:57:18.517Z] volume /builds/worker/checkouts is a cache
[setup 2024-02-13T19:57:18.517Z] running as worker:worker
[vcs 2024-02-13T19:57:18.518Z] executing ['git', 'config', '--global', '--add', 'safe.directory', '/builds/worker/checkouts/vcs']
[vcs 2024-02-13T19:57:18.520Z] executing ['git', 'clone', 'https://github.com/mozilla/firefox-translations-training', '/builds/worker/checkouts/vcs']
[vcs 2024-02-13T19:57:18.522Z] Cloning into '/builds/worker/checkouts/vcs'...
[vcs 2024-02-13T19:57:19.275Z] executing ['git', 'fetch', '--no-tags', 'https://github.com/bhearsum/firefox-translations-training', 'start-specific']
[vcs 2024-02-13T19:57:19.498Z] From https://github.com/bhearsum/firefox-translations-training
[vcs 2024-02-13T19:57:19.498Z] * branch start-specific -> FETCH_HEAD
[vcs 2024-02-13T19:57:19.504Z] executing ['git', 'checkout', '-f', '-B', 'start-specific', '37fbf272d7eb316897377144111a3ef057becfd4']
[vcs 2024-02-13T19:57:19.568Z] Switched to a new branch 'start-specific'
[vcs 2024-02-13T19:57:19.569Z] executing ['git', 'submodule', 'init']
[vcs 2024-02-13T19:57:19.592Z] Submodule '3rd_party/browsermt-marian-dev' (https://github.com/browsermt/marian-dev) registered for path '3rd_party/browsermt-marian-dev'
[vcs 2024-02-13T19:57:19.592Z] Submodule 'extract-lex' (https://github.com/marian-nmt/extract-lex) registered for path '3rd_party/extract-lex'
[vcs 2024-02-13T19:57:19.593Z] Submodule 'fast_align' (https://github.com/clab/fast_align) registered for path '3rd_party/fast_align'
[vcs 2024-02-13T19:57:19.593Z] Submodule '3rd_party/kenlm' (https://github.com/kpu/kenlm) registered for path '3rd_party/kenlm'
[vcs 2024-02-13T19:57:19.593Z] Submodule '3rd_party/marian-dev' (https://github.com/marian-nmt/marian-dev) registered for path '3rd_party/marian-dev'
[vcs 2024-02-13T19:57:19.593Z] Submodule '3rd_party/preprocess' (https://github.com/kpu/preprocess.git) registered for path '3rd_party/preprocess'
[vcs 2024-02-13T19:57:19.594Z] executing ['git', 'submodule', 'update', '--force']
[vcs 2024-02-13T19:57:19.619Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/browsermt-marian-dev'...
[vcs 2024-02-13T19:57:20.840Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/extract-lex'...
[vcs 2024-02-13T19:57:21.131Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/fast_align'...
[vcs 2024-02-13T19:57:21.460Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/kenlm'...
[vcs 2024-02-13T19:57:22.108Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/marian-dev'...
[vcs 2024-02-13T19:57:23.491Z] Cloning into '/builds/worker/checkouts/vcs/3rd_party/preprocess'...
[vcs 2024-02-13T19:57:24.017Z] Submodule path '3rd_party/browsermt-marian-dev': checked out '11c6ae7c46be21ef96ed10c60f28022fa968939f'
[vcs 2024-02-13T19:57:24.029Z] Submodule path '3rd_party/extract-lex': checked out '42fa605b53f32eaf6c6e0b5677255c21c91b3d49'
[vcs 2024-02-13T19:57:24.041Z] Submodule path '3rd_party/fast_align': checked out 'cab1e9aac8d3bb02ff5ae58218d8d225a039fa11'
[vcs 2024-02-13T19:57:24.069Z] Submodule path '3rd_party/kenlm': checked out 'bbf4fc511266c5d4515047055d7bdec659a6e158'
[vcs 2024-02-13T19:57:24.187Z] Submodule path '3rd_party/marian-dev': checked out 'e8a1a2530fb84cbff7383302ebca393e5875c441'
[vcs 2024-02-13T19:57:24.208Z] Submodule path '3rd_party/preprocess': checked out '64307314b4d5a9a0bd529b5c1036b0710d995eec'
[vcs 2024-02-13T19:57:24.208Z] cleaning git checkout...
[vcs 2024-02-13T19:57:24.208Z] executing ['git', 'clean', '-nxdff']
[vcs 2024-02-13T19:57:24.212Z] removing []
[vcs 2024-02-13T19:57:24.212Z] successfully cleaned git checkout!
[vcs 2024-02-13T19:57:24.214Z] TinderboxPrint:<a href='https://github.com/bhearsum/firefox-translations-training/commit/37fbf272d7eb316897377144111a3ef057becfd4' title='Built from firefox-translations-training commit 37fbf272d7eb316897377144111a3ef057becfd4'>37fbf272d7eb316897377144111a3ef057becfd4</a>
[task 2024-02-13T19:57:24.214Z] executing ['bash', '-c', '$VCS_PATH/pipeline/data/download-mono.sh news-crawl_news.2008 ru 10000 $TASK_WORKDIR/artifacts/news_2008.ru.zst']
[task 2024-02-13T19:57:24.216Z] + set -euo pipefail
[task 2024-02-13T19:57:24.216Z] + dataset=news-crawl_news.2008
[task 2024-02-13T19:57:24.216Z] + lang=ru
[task 2024-02-13T19:57:24.216Z] + max_sent=10000
[task 2024-02-13T19:57:24.216Z] + output_path=/builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-13T19:57:24.216Z] + coef=0.1
[task 2024-02-13T19:57:24.216Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-13T19:57:24.216Z] + ARTIFACT_EXT=zst
[task 2024-02-13T19:57:24.216Z] + echo '###### Downloading monolingual data for language ru dataset news-crawl_news.2008'
[task 2024-02-13T19:57:24.216Z] ###### Downloading monolingual data for language ru dataset news-crawl_news.2008
[task 2024-02-13T19:57:24.217Z] ++ dirname /builds/worker/checkouts/vcs/pipeline/data/download-mono.sh
[task 2024-02-13T19:57:24.217Z] + cd /builds/worker/checkouts/vcs/pipeline/data
[task 2024-02-13T19:57:24.218Z] ++ dirname /builds/worker/artifacts/news_2008.ru.zst
[task 2024-02-13T19:57:24.218Z] + tmp=/builds/worker/artifacts/original
[task 2024-02-13T19:57:24.218Z] + mkdir -p /builds/worker/artifacts/original
[task 2024-02-13T19:57:24.220Z] + echo '### Downloading dataset'
[task 2024-02-13T19:57:24.220Z] ### Downloading dataset
[task 2024-02-13T19:57:24.220Z] + original_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-13T19:57:24.220Z] + name=news.2008
[task 2024-02-13T19:57:24.220Z] + type=news-crawl
[task 2024-02-13T19:57:24.220Z] + test -s /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-13T19:57:24.220Z] + bash importers/mono/news-crawl.sh ru /builds/worker/artifacts/original/news-crawl_news.2008.original.ru news.2008
[task 2024-02-13T19:57:24.221Z] + set -euo pipefail
[task 2024-02-13T19:57:24.221Z] + lang=ru
[task 2024-02-13T19:57:24.221Z] + output_prefix=/builds/worker/artifacts/original/news-crawl_news.2008.original.ru
[task 2024-02-13T19:57:24.221Z] + dataset=news.2008
[task 2024-02-13T19:57:24.221Z] + COMPRESSION_CMD=zstdmt
[task 2024-02-13T19:57:24.221Z] + ARTIFACT_EXT=zst
[task 2024-02-13T19:57:24.221Z] + WGET=wget
[task 2024-02-13T19:57:24.221Z] + echo '###### Downloading WMT newscrawl monolingual data'
[task 2024-02-13T19:57:24.221Z] ###### Downloading WMT newscrawl monolingual data
[task 2024-02-13T19:57:24.221Z] + wget -O - http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-13T19:57:24.222Z] + gunzip
[task 2024-02-13T19:57:24.222Z] + zstdmt -c
[task 2024-02-13T19:57:24.224Z] --2024-02-13 19:57:24-- http://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-13T19:57:24.284Z] Resolving data.statmt.org (data.statmt.org)... 129.215.32.28
[task 2024-02-13T19:57:24.392Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:80... connected.
[task 2024-02-13T19:57:24.499Z] HTTP request sent, awaiting response... 301 Moved Permanently
[task 2024-02-13T19:57:24.499Z] Location: https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz [following]
[task 2024-02-13T19:57:24.499Z] --2024-02-13 19:57:24-- https://data.statmt.org/news-crawl/ru/news.2008.ru.shuffled.deduped.gz
[task 2024-02-13T19:57:24.608Z] Connecting to data.statmt.org (data.statmt.org)|129.215.32.28|:443... connected.
[task 2024-02-13T19:57:24.946Z] HTTP request sent, awaiting response... 200 OK
[task 2024-02-13T19:57:24.946Z] Length: 2312968 (2.2M) [application/x-gzip]
[task 2024-02-13T19:57:24.946Z] Saving to: ‘STDOUT’
[task 2024-02-13T19:57:24.946Z]
[task 2024-02-13T19:57:25.162Z] 0K .......... .......... .......... .......... .......... 2% 232K 10s
[task 2024-02-13T19:57:25.270Z] 50K .......... .......... .......... .......... .......... 4% 462K 7s
[task 2024-02-13T19:57:25.270Z] 100K .......... .......... .......... .......... .......... 6% 101M 5s
[task 2024-02-13T19:57:25.378Z] 150K .......... .......... .......... .......... .......... 8% 465K 4s
[task 2024-02-13T19:57:25.378Z] 200K .......... .......... .......... .......... .......... 11% 88.5M 3s
[task 2024-02-13T19:57:25.379Z] 250K .......... .......... .......... .......... .......... 13% 113M 3s
[task 2024-02-13T19:57:25.385Z] 300K .......... .......... .......... .......... .......... 15% 7.92M 2s
[task 2024-02-13T19:57:25.484Z] 350K .......... .......... .......... .......... .......... 17% 503K 3s
[task 2024-02-13T19:57:25.485Z] 400K .......... .......... .......... .......... .......... 19% 69.4M 2s
[task 2024-02-13T19:57:25.487Z] 450K .......... .......... .......... .......... .......... 22% 29.0M 2s
[task 2024-02-13T19:57:25.487Z] 500K .......... .......... .......... .......... .......... 24% 145M 2s
[task 2024-02-13T19:57:25.493Z] 550K .......... .......... .......... .......... .......... 26% 8.06M 2s
[task 2024-02-13T19:57:25.493Z] 600K .......... .......... .......... .......... .......... 28% 192M 1s
[task 2024-02-13T19:57:25.494Z] 650K .......... .......... .......... .......... .......... 30% 224M 1s
[task 2024-02-13T19:57:25.494Z] 700K .......... .......... .......... .......... .......... 33% 198M 1s
[task 2024-02-13T19:57:25.591Z] 750K .......... .......... .......... .......... .......... 35% 515K 1s
[task 2024-02-13T19:57:25.591Z] 800K .......... .......... .......... .......... .......... 37% 152M 1s
[task 2024-02-13T19:57:25.592Z] 850K .......... .......... .......... .......... .......... 39% 142M 1s
[task 2024-02-13T19:57:25.592Z] 900K .......... .......... .......... .......... .......... 42% 113M 1s
[task 2024-02-13T19:57:25.593Z] 950K .......... .......... .......... .......... .......... 44% 41.9M 1s
[task 2024-02-13T19:57:25.594Z] 1000K .......... .......... .......... .......... .......... 46% 114M 1s
[task 2024-02-13T19:57:25.599Z] 1050K .......... .......... .......... .......... .......... 48% 8.84M 1s
[task 2024-02-13T19:57:25.599Z] 1100K .......... .......... .......... .......... .......... 50% 216M 1s
[task 2024-02-13T19:57:25.600Z] 1150K .......... .......... .......... .......... .......... 53% 232M 1s
[task 2024-02-13T19:57:25.600Z] 1200K .......... .......... .......... .......... .......... 55% 201M 1s
[task 2024-02-13T19:57:25.600Z] 1250K .......... .......... .......... .......... .......... 57% 247M 0s
[task 2024-02-13T19:57:25.606Z] 1300K .......... .......... .......... .......... .......... 59% 8.70M 0s
[task 2024-02-13T19:57:25.606Z] 1350K .......... .......... .......... .......... .......... 61% 213M 0s
[task 2024-02-13T19:57:25.606Z] 1400K .......... .......... .......... .......... .......... 64% 226M 0s
[task 2024-02-13T19:57:25.606Z] 1450K .......... .......... .......... .......... .......... 66% 198M 0s
[task 2024-02-13T19:57:25.607Z] 1500K .......... .......... .......... .......... .......... 68% 239M 0s
[task 2024-02-13T19:57:25.699Z] 1550K .......... .......... .......... .......... .......... 70% 540K 0s
[task 2024-02-13T19:57:25.700Z] 1600K .......... .......... .......... .......... .......... 73% 150M 0s
[task 2024-02-13T19:57:25.700Z] 1650K .......... .......... .......... .......... .......... 75% 155M 0s
[task 2024-02-13T19:57:25.700Z] 1700K .......... .......... .......... .......... .......... 77% 93.7M 0s
[task 2024-02-13T19:57:25.701Z] 1750K .......... .......... .......... .......... .......... 79% 123M 0s
[task 2024-02-13T19:57:25.701Z] 1800K .......... .......... .......... .......... .......... 81% 109M 0s
[task 2024-02-13T19:57:25.707Z] 1850K .......... .......... .......... .......... .......... 84% 9.18M 0s
[task 2024-02-13T19:57:25.707Z] 1900K .......... .......... .......... .......... .......... 86% 231M 0s
[task 2024-02-13T19:57:25.707Z] 1950K .......... .......... .......... .......... .......... 88% 203M 0s
[task 2024-02-13T19:57:25.707Z] 2000K .......... .......... .......... .......... .......... 90% 227M 0s
[task 2024-02-13T19:57:25.708Z] 2050K .......... .......... .......... .......... .......... 92% 230M 0s
[task 2024-02-13T19:57:25.713Z] 2100K .......... .......... .......... .......... .......... 95% 8.69M 0s
[task 2024-02-13T19:57:25.713Z] 2150K .......... .......... .......... .......... .......... 97% 216M 0s
[task 2024-02-13T19:57:25.714Z] 2200K .......... .......... .......... .......... .......... 99% 200M 0s
[task 2024-02-13T19:57:25.714Z] 2250K ........ 100% 235M=0.8s
[task 2024-02-13T19:57:25.714Z]
[task 2024-02-13T19:57:25.714Z] 2024-02-13 19:57:25 (2.87 MB/s) - written to stdout [2312968/2312968]
[task 2024-02-13T19:57:25.714Z]
[task 2024-02-13T19:57:25.782Z] + echo '###### Done: Downloading WMT newscrawl monolingual data'
[task 2024-02-13T19:57:25.782Z] ###### Done: Downloading WMT newscrawl monolingual data
[task 2024-02-13T19:57:25.783Z] + echo '### Sampling dataset'
[task 2024-02-13T19:57:25.783Z] ### Sampling dataset
[task 2024-02-13T19:57:25.783Z] + set +o pipefail
[task 2024-02-13T19:57:25.783Z] + zstdmt -dc /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-13T19:57:25.783Z] + perl -ne 'print if(split(/\s/, $_) < 100)'
[task 2024-02-13T19:57:25.783Z] + head -n 10000
[task 2024-02-13T19:57:25.784Z] ++ bc -l
[task 2024-02-13T19:57:25.784Z] + zstdmt
[task 2024-02-13T19:57:25.785Z] + shuf -n 11000
[task 2024-02-13T19:57:25.856Z] + set -o pipefail
[task 2024-02-13T19:57:25.856Z] + rm -rf /builds/worker/artifacts/original/news-crawl_news.2008.original.ru.zst
[task 2024-02-13T19:57:25.858Z] + echo '###### Done: Downloading monolingual data'
[task 2024-02-13T19:57:25.858Z] ###### Done: Downloading monolingual data
[taskcluster 2024-02-13 19:57:26.175Z] === Task Finished ===
[taskcluster 2024-02-13 19:57:26.419Z] Successful task run with exit code: 0 completed in 61.161 seconds
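
The "Sampling dataset" step in the trace above can be read as: shuffle and oversample to 11000 lines (max_sent=10000 plus coef=0.1 of that), keep lines that split into fewer than 100 whitespace-separated fields, then take the first 10000. The following is a minimal Python sketch of that logic, not the pipeline's actual code. The stage order is an assumption, since xtrace prints pipeline stages in no guaranteed order, and the names `count_fields` and `sample_lines` are hypothetical.

```python
import random
import re


def count_fields(line):
    # Approximates Perl's scalar split(/\s/, $_): split on each single
    # whitespace character and drop trailing empty fields (assumption:
    # close enough to the perl filter shown in the trace).
    fields = re.split(r"\s", line)
    while fields and fields[-1] == "":
        fields.pop()
    return len(fields)


def sample_lines(lines, max_sent, coef=0.1, max_fields=100, seed=None):
    # Oversample first, like `shuf -n 11000` for max_sent=10000, coef=0.1.
    rng = random.Random(seed)
    oversample = max_sent + round(max_sent * coef)
    pool = list(lines)
    picked = rng.sample(pool, min(oversample, len(pool)))
    # Keep only lines with fewer than max_fields fields, like the perl filter.
    short = [ln for ln in picked if count_fields(ln) < max_fields]
    # Truncate to the requested size, like `head -n 10000`.
    return short[:max_sent]


if __name__ == "__main__":
    corpus = ["short sentence number %d\n" % i for i in range(50)]
    sample = sample_lines(corpus, max_sent=10, seed=0)
    print(len(sample))
```

Oversampling before filtering leaves some slack so that dropping long lines is less likely to leave fewer than max_sent sentences in the output.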