Examine strategies for more efficient alignments #663

Closed
Tracked by #453 ...
gregtatum opened this issue Jun 3, 2024 · 2 comments
Labels
cost & perf Speeding up and lowering cost for the pipeline

Comments

@gregtatum
Member

I haven't looked into this too deeply, but we are failing with an OOM (out of memory) error when computing alignments with eflomal.

https://firefox-ci-tc.services.mozilla.com/tasks/WoiZo-oDQAuRuN_yTu2EKw

Perhaps there is a more efficient way to do this, or we need chunking. Right now we are just increasing the machine memory size. There could also be a memory leak in the implementation. It might be worth looking into, especially when we start training high-resource languages.

[task 2024-06-02T17:19:20.158Z] /fetches/mono.en.zst : 25511 MB...     
[task 2024-06-02T17:19:20.158Z]                                                                                
[task 2024-06-02T17:19:20.158Z] /builds/worker/fetches/mono.en.zst: 26827774098 bytes 
[task 2024-06-02T17:19:21.064Z] [alignments] Using provided priors: /builds/worker/fetches/corpus.priors
[task 2024-06-02T17:19:21.064Z] [alignments] Calculating alignments...
[task 2024-06-02T18:15:25.545Z] [eflomal] Prepared 200000000 sentences for alignment
[task 2024-06-02T18:15:25.545Z] [eflomal] Reading lexical priors...
[task 2024-06-02T18:17:15.950Z] [eflomal] 15689941 (of 25390768) pairs of lexical priors used
[task 2024-06-02T18:18:28.093Z] /builds/worker/.local/lib/python3.10/site-packages/eflomal/bin/eflomal -m 3 -s /tmp/tmpoojka2as -t /tmp/tmpd_2c0qs7 -n 3 -N 0.2 -1 2 -2 1 -3 2 -f /builds/worker/artifacts/tmp/aln.fwd -r /builds/worker/artifacts/tmp/aln.rev -p /tmp/tmpgxzt5ktb
[task 2024-06-02T18:28:18.390Z] Read texts (200000000 sentences): 590.297 s
[task 2024-06-02T18:28:18.390Z] Vocabulary sizes are 21977072 (source), 14776735 (target)
[task 2024-06-02T18:29:23.088Z] Created alignment structures: 64.692 s
[task 2024-06-02T18:29:45.552Z] Created alignment structures: 87.154 s
[task 2024-06-02T18:30:12.480Z] Randomized alignment: 49.392 s
[task 2024-06-02T18:30:12.480Z] Aligning with model 1 (2 iterations)
[task 2024-06-02T18:30:32.341Z] Randomized alignment: 46.788 s
[task 2024-06-02T18:30:32.341Z] Aligning with model 1 (2 iterations)
[task 2024-06-02T18:38:40.182Z] Traceback (most recent call last):
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 227, in <module>
[task 2024-06-02T18:38:40.254Z]     main()
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 216, in main
[task 2024-06-02T18:38:40.254Z]     run(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 53, in run
[task 2024-06-02T18:38:40.254Z]     fwd_path, rev_path = align(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/checkouts/vcs/pipeline/alignments/align.py", line 97, in align
[task 2024-06-02T18:38:40.254Z]     aligner.align(
[task 2024-06-02T18:38:40.254Z]   File "/builds/worker/.local/lib/python3.10/site-packages/eflomal/__init__.py", line 72, in align
[task 2024-06-02T18:38:40.271Z]     align(srcf.name, trgf.name,
[task 2024-06-02T18:38:40.271Z]   File "python/eflomal/eflomal.pyx", line 161, in eflomal.cython.align
[task 2024-06-02T18:38:40.502Z]   File "/usr/lib/python3.10/subprocess.py", line 526, in run
[task 2024-06-02T18:38:40.575Z]     raise CalledProcessError(retcode, process.args,
[task 2024-06-02T18:38:40.576Z] subprocess.CalledProcessError: Command '['/builds/worker/.local/lib/python3.10/site-packages/eflomal/bin/eflomal', '-m', '3', '-s', '/tmp/tmpoojka2as', '-t', '/tmp/tmpd_2c0qs7', '-n', '3', '-N', '0.2', '-1', '2', '-2', '1', '-3', '2', '-f', '/builds/worker/artifacts/tmp/aln.fwd', '-r', '/builds/worker/artifacts/tmp/aln.rev', '-p', '/tmp/tmpgxzt5ktb']' died with <Signals.SIGKILL: 9>.
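
Note on reading the traceback above: `died with <Signals.SIGKILL: 9>` is how Python's `subprocess.CalledProcessError` reports a child process that was terminated by a signal (the return code is the negative signal number). A SIGKILL on a memory-heavy task is consistent with the kernel OOM killer, though the log alone does not prove it. A minimal sketch of how a wrapper could surface that case more explicitly; `run_and_explain` is a hypothetical helper, not part of the pipeline or of eflomal, and it assumes a POSIX system:

```python
import signal
import subprocess


def run_and_explain(cmd: list[str]) -> str:
    """Run a command and describe how it ended (hypothetical helper)."""
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as error:
        if error.returncode < 0:
            # On POSIX, a negative return code means "terminated by signal N".
            sig = signal.Signals(-error.returncode)
            if sig == signal.SIGKILL:
                # Assumption: SIGKILL often, but not always, means the OOM killer.
                return "killed by SIGKILL (possibly the kernel OOM killer)"
            return f"terminated by {sig.name}"
        return f"failed with exit code {error.returncode}"
    return "ok"
```

Checking the return code this way only narrows down the cause; confirming an actual OOM kill would still require the worker's kernel log or memory metrics.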
gregtatum added the "cost & perf" (Speeding up and lowering cost for the pipeline) and "high resource" labels Jun 3, 2024
@eu9ene
Collaborator

eu9ene commented Jun 11, 2024

I think proper tokenization can fix this. See #507.

@eu9ene
Collaborator

eu9ene commented Jul 16, 2024

Proper tokenization is now implemented and seems to be working for the current languages. We might split the alignment job into chunks in the future, for example as in #715, but that's not necessary as long as it works without it. Ideally, we would split it into multiple tasks and run them on preemptible instances, but this increases complexity. I'll add a task about this to the optimization meta issue.
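
For reference, the chunking idea could look roughly like the sketch below. This is not the approach from #715 or the pipeline's actual code; `iter_chunks`, `align_in_chunks`, and the `align_chunk` callback are hypothetical names, and the sketch assumes that each chunk can be aligned independently (for example because fixed priors are provided, as in the log above), so that only one chunk needs to be held in memory at a time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def iter_chunks(pairs: Iterable[Pair], chunk_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive lists of at most chunk_size sentence pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(pairs)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def align_in_chunks(
    pairs: Iterable[Pair],
    align_chunk: Callable[[List[Pair]], List[str]],
    chunk_size: int = 1_000_000,
) -> Iterator[str]:
    """Align a corpus chunk by chunk, yielding one alignment line per pair.

    align_chunk is a hypothetical callback that wraps the real aligner and
    returns one alignment string per input pair, in the same order.
    """
    for chunk in iter_chunks(pairs, chunk_size):
        alignments = align_chunk(chunk)
        if len(alignments) != len(chunk):
            raise ValueError("aligner returned a different number of lines")
        yield from alignments
```

Because each chunk is independent under that assumption, the same split could also be mapped onto separate tasks, which is the multi-task variant mentioned above; the default `chunk_size` here is an arbitrary placeholder, not a tuned value.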

eu9ene closed this as completed Jul 16, 2024