Translate script doesn't support USFM file names using the "<nn>-<book>" naming format #456

Open
mmartin9684-sil opened this issue Jul 15, 2024 · 2 comments
Labels
enhancement New feature or request pipeline 6: infer Issue related to using a trained model to translate.

Comments

@mmartin9684-sil
Collaborator

Paratext-compatible projects downloaded from Door43 (e.g., https://git.door43.org/unfoldingWord/el-x-koine_ugnt) use the "<nn>-<book>" book name format for naming the USFM files in the project. For example:

  • 01-GEN.usfm
  • 02-EXO.usfm
  • 03-LEV.usfm
  • ...

When the translate script is run with one of these projects as the source project, it fails because it doesn't handle this file naming format. The log shows it looking for a file name without the hyphen:

2024-07-13 07:58:52,216 - silnlp.nmt.translate - ERROR - Was not able to translate MIC.
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/translate.py", line 122, in translate_books
    translator.translate_book(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/common/translator.py", line 317, in translate_book
    raise RuntimeError(f"Can't find file {book_path} for book {book}")
RuntimeError: Can't find file /tmp/tmpu35zij1b/hbo_uhb_2024_07_10/33MIC.usfm for book MIC
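
For illustration, a more tolerant file lookup could be sketched as below. This is not the existing code in silnlp/common/translator.py; `find_book_file` and its parameters are hypothetical names, and the list of candidate forms is an assumption based on the file names shown in this issue.

```python
from pathlib import Path
from typing import Optional


def find_book_file(
    project_dir: Path, book_num: int, book_id: str, suffix: str = ".usfm"
) -> Optional[Path]:
    """Return the first existing USFM file for a book, or None.

    Hypothetical helper: tries a few naming forms in order instead of
    assuming a single one. book_num is the number used in the file name.
    """
    candidates = [
        f"{book_num:02d}{book_id}{suffix}",   # e.g. 33MIC.usfm
        f"{book_num:02d}-{book_id}{suffix}",  # e.g. 33-MIC.usfm
        f"{book_id}{suffix}",                 # e.g. MIC.usfm
    ]
    for name in candidates:
        path = project_dir / name
        if path.is_file():
            return path
    return None
```

A caller would raise the existing "Can't find file" error only when this returns None, so projects using either form would resolve to a file.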
@mmartin9684-sil mmartin9684-sil added bug Something isn't working pipeline 6: infer Issue related to using a trained model to translate. labels Jul 15, 2024
@mmartin9684-sil
Collaborator Author

Note that these projects do not contain a Settings.xml file when they are downloaded. If a minimal Settings.xml file is created for them, it would look like this:

<ScriptureText>
  <Versification>4</Versification>
  <FileNamePostPart>.usfm</FileNamePostPart>
  <FileNameBookNameForm>41-MAT</FileNameBookNameForm>
  <LanguageIsoCode>hbo:::</LanguageIsoCode>
  <BiblicalTermsListSetting>Major::BiblicalTerms.xml</BiblicalTermsListSetting>
  <Naming PrePart="" PostPart=".usfm" BookNameForm="41-MAT" />
</ScriptureText>

However, even though the BookNameForm is set to match the USFM file names in the project folder, the script does not handle this file naming format, and the files can't be opened for translation.
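
As a rough sketch of what honoring this setting could look like, the snippet below builds a file name from the `Naming` element of a Settings.xml like the one above. The function name `book_file_name` and the table of supported forms are assumptions for illustration, not the script's actual implementation. It also assumes two-digit book numbers supplied by the caller.

```python
import xml.etree.ElementTree as ET


def book_file_name(settings_xml: str, book_num: int, book_id: str) -> str:
    """Build a USFM file name from the <Naming> element of Settings.xml.

    Hypothetical helper: book_num is the number used in the file name
    (e.g. 41 for MAT in the example settings above).
    """
    naming = ET.fromstring(settings_xml).find("Naming")
    if naming is None:
        raise ValueError("Settings.xml has no <Naming> element")

    num = f"{book_num:02d}"
    # Map each BookNameForm value to the file name stem it describes.
    # The set of forms handled here is an assumption.
    stems = {
        "41MAT": f"{num}{book_id}",
        "41-MAT": f"{num}-{book_id}",
        "MAT": book_id,
        "41": num,
    }
    form = naming.get("BookNameForm", "")
    if form not in stems:
        raise ValueError(f"Unsupported BookNameForm: {form!r}")

    return naming.get("PrePart", "") + stems[form] + naming.get("PostPart", "")
```

With the settings shown above, `book_file_name(settings, 33, "MIC")` would produce `33-MIC.usfm`, which matches the file actually present in the project folder.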

@ddaspit ddaspit moved this from 🆕 New to 🔖 Ready in SIL-NLP Research Jul 22, 2024
@ddaspit ddaspit added enhancement New feature or request and removed bug Something isn't working labels Jul 22, 2024
@ddaspit
Collaborator

ddaspit commented Jul 22, 2024

Door43 has its own metadata format for translations called resource containers. We should add support for extracting from Door43 resource containers.

Projects
Status: 📋 Backlog