Smt build job #107

johnml1135 · 2024-05-06T19:11:35Z

Just like an NMT build job

This change is

ddaspit

The truecasing model needs to be trained as well. I think this means that we will need to port the UnigramTruecaser class over to Python.

Reviewed 1 of 4 files at r1, all commit messages.
Reviewable status: 1 of 4 files reviewed, all discussions resolved

codecov-commenter · 2024-05-10T13:42:50Z

Codecov Report

Attention: Patch coverage is 75.59395% with 113 lines in your changes missing coverage. Please review.

Project coverage is 87.97%. Comparing base (f8f3fc5) to head (8cbe134).

Files	Patch %	Lines
machine/translation/unigram_truecaser.py	49.57%	60 Missing ⚠️
machine/statistics/frequency_distribution.py	40.00%	21 Missing ⚠️
machine/jobs/local_shared_file_service.py	40.00%	18 Missing ⚠️
...e/statistics/conditional_frequency_distribution.py	76.47%	4 Missing ⚠️
machine/jobs/clearml_shared_file_service.py	60.00%	2 Missing ⚠️
machine/jobs/shared_file_service.py	71.42%	2 Missing ⚠️
machine/jobs/smt_engine_build_job.py	97.10%	2 Missing ⚠️
machine/corpora/usx_file_alignment_collection.py	50.00%	1 Missing ⚠️
machine/translation/thot/thot_smt_model.py	50.00%	1 Missing ⚠️
machine/translation/thot/thot_smt_model_trainer.py	66.66%	1 Missing ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
- Coverage   88.34%   87.97%   -0.37%     
==========================================
  Files         234      243       +9     
  Lines       13822    14238     +416     
==========================================
+ Hits        12211    12526     +315     
- Misses       1611     1712     +101

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ddaspit

Reviewed 1 of 4 files at r1, 2 of 3 files at r2, 6 of 8 files at r3, 5 of 7 files at r4, 9 of 9 files at r5, all commit messages.
Reviewable status: all files reviewed, 26 unresolved discussions (waiting on @johnml1135)

machine/corpora/token_processors.py line 17 at r5 (raw file):

def _get_normalization_form(normalization_form: str) -> Literal["NFC", "NFD", "NFKC", "NFKD"]:

What is this for?

machine/jobs/local_shared_file_service.py line 10 at r5 (raw file):

class LocalSharedFileService(SharedFileService):

You should export this class from the package __init__.py file.

machine/jobs/smt_engine_build_job.py line 25 at r5 (raw file):

class SmtEngineBuildJob:

You should export this class from the package __init__.py file.

machine/jobs/smt_engine_build_job.py line 45 at r5 (raw file):

        with TemporaryDirectory() as temp_dir:

            parameters = ThotSmtParameters(

The model should be initialized from a blank model. See the data/thot-new-model directory in the SIL.Machine.AspNetCore project.

machine/jobs/smt_engine_build_job.py line 77 at r5 (raw file):

            with UnigramTruecaserTrainer(
                corpus=target_corpus, model_path=os.path.join(temp_dir, "truecase.txt"), tokenizer=tokenizer

The filename that we use in the Machine engine server is unigram-casing-model.txt.

machine/statistics/conditional_frequency_distribution.py line 7 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass. dataclass is usually used for models/DTOs/records and not for normal classes.

machine/statistics/conditional_frequency_distribution.py line 8 at r5 (raw file):

@dataclass
class ConditionalFrequencyDistribution:

You should export this class from the package __init__.py file.

machine/statistics/conditional_frequency_distribution.py line 11 at r5 (raw file):

    _freq_dist: Dict[str, FrequencyDistribution] = field(default_factory=dict)

    def get_conditions(self):

Don't forget the return type.

machine/statistics/frequency_distribution.py line 5 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass.

machine/statistics/frequency_distribution.py line 6 at r5 (raw file):

@dataclass
class FrequencyDistribution:

You should export this class from the package __init__.py file.

machine/tokenization/__init__.py line 43 at r5 (raw file):

def get_tokenizer_detokenizer(name: str = "") -> Tuple[Tokenizer, Detokenizer]:

I would split this function into two functions: create_tokenizer and create_detokenizer and move them to the tokenization_utils.py module.

machine/tokenization/__init__.py line 45 at r5 (raw file):

def get_tokenizer_detokenizer(name: str = "") -> Tuple[Tokenizer, Detokenizer]:
    name_lower = name.lower()
    if "latin" in name_lower or name == "":

I would perform an equals check.

machine/translation/unigram_truecaser.py line 5 at r5 (raw file):

from typing import Dict, Sequence, Tuple

from machine.corpora.text_corpus import TextCorpus

You should use relative imports.

machine/translation/unigram_truecaser.py line 12 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass.

machine/translation/unigram_truecaser.py line 13 at r5 (raw file):

@dataclass
class UnigramTruecaser(Truecaser):

You should export this class from the package __init__.py file.

machine/translation/unigram_truecaser.py line 33 at r5 (raw file):

    def create_trainer(self, corpus: TextCorpus) -> Trainer:
        from machine.translation.unigram_truecaser_trainer import SubclassUnigramTruecaserTrainer

In order to avoid the circular reference, I think it would be fine to move the trainer classes into this file.

machine/translation/unigram_truecaser.py line 87 at r5 (raw file):

    def _reset(self):
        self._casing = ConditionalFrequencyDistribution()

You can just call reset here.

machine/translation/unigram_truecaser.py line 88 at r5 (raw file):

    def _reset(self):
        self._casing = ConditionalFrequencyDistribution()
        self._bestTokens = {}

You can just call clear here.

machine/translation/unigram_truecaser_trainer.py line 4 at r5 (raw file):

from typing import Callable, Optional

from machine.tokenization.tokenizer import Tokenizer

You should use relative imports.

machine/translation/unigram_truecaser_trainer.py line 13 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass.

machine/translation/unigram_truecaser_trainer.py line 14 at r5 (raw file):

@dataclass
class UnigramTruecaserTrainer(Trainer):

You should export this class from the package __init__.py file.

machine/translation/unigram_truecaser_trainer.py line 30 at r5 (raw file):

            step_count = self.corpus.count(include_empty=False)
        current_step = 0
        for row in self.corpus.tokenize(tokenizer=self.tokenizer).filter_nonempty():

When iterating over a corpus, you should use the context manager to ensure that files are closed properly:

with self.corpus.tokenize(tokenizer=self.tokenizer).filter_nonempty().get_rows() as rows:
    for row in rows:

machine/translation/unigram_truecaser_trainer.py line 44 at r5 (raw file):

class SubclassUnigramTruecaserTrainer(UnigramTruecaserTrainer):

I would prefix this with an underscore. I think _UnigramTruecaserTrainer should be fine.

machine/translation/thot/thot_smt_model.py line 89 at r5 (raw file):

        self.truecaser = truecaser
        if self.truecaser is None:
            self.truecaser = UnigramTruecaser()

Leave this with a default of None.

machine/translation/thot/thot_word_alignment_model_type.py line 13 at r5 (raw file):

def getThotWordAlignmentModelType(str) -> ThotWordAlignmentModelType:

Instead of using this function, update the ThotSmtModel and ThotSmtModelTrainerclasses to accept astrorThotWordAlignmentModelTypefor theword_alignment_model_type` parameter.

tests/jobs/test_smt_engine_build_job.py line 6 at r5 (raw file):

from pytest import raises

from machine.jobs.build_smt_engine import SmtEngineBuildJob

Instead of importing from the particular module, you should import from the top-level package.

johnml1135 · 2024-05-10T20:52:04Z

machine/corpora/token_processors.py line 17 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is this for?

Upgrading pyright threw and error - a difference between str and Literal. This got rid of the error. There could potentially be a better solution.

johnml1135 · 2024-05-20T13:38:55Z

machine/jobs/local_shared_file_service.py line 10 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

johnml1135 · 2024-05-20T13:42:14Z

machine/jobs/smt_engine_build_job.py line 25 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

johnml1135 · 2024-05-20T13:59:15Z

machine/jobs/smt_engine_build_job.py line 45 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The model should be initialized from a blank model. See the data/thot-new-model directory in the SIL.Machine.AspNetCore project.

Added new thot model

johnml1135 · 2024-05-20T13:59:57Z

machine/jobs/smt_engine_build_job.py line 77 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The filename that we use in the Machine engine server is unigram-casing-model.txt.

updated.

johnml1135 · 2024-05-20T14:04:16Z

machine/statistics/conditional_frequency_distribution.py line 7 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would prefer that this class is not a dataclass. dataclass is usually used for models/DTOs/records and not for normal classes.

done

johnml1135 · 2024-05-20T14:06:24Z

machine/statistics/conditional_frequency_distribution.py line 8 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

johnml1135 · 2024-05-20T14:07:01Z

machine/statistics/conditional_frequency_distribution.py line 11 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Don't forget the return type.

Done.

johnml1135 · 2024-05-20T14:07:08Z

machine/statistics/frequency_distribution.py line 5 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would prefer that this class is not a dataclass.

Done.

johnml1135 · 2024-05-20T14:07:11Z

machine/statistics/frequency_distribution.py line 6 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

johnml1135 · 2024-05-20T14:14:29Z

machine/tokenization/__init__.py line 43 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would split this function into two functions: create_tokenizer and create_detokenizer and move them to the tokenization_utils.py module.

Done.

johnml1135 · 2024-05-20T14:15:11Z

machine/tokenization/__init__.py line 45 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would perform an equals check.

Done.

johnml1135 · 2024-05-20T14:15:45Z

machine/translation/unigram_truecaser.py line 5 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should use relative imports.

Done.

johnml1135 · 2024-05-20T14:15:49Z

machine/translation/unigram_truecaser.py line 12 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would prefer that this class is not a dataclass.

Done.

johnml1135 · 2024-05-20T14:16:34Z

machine/translation/unigram_truecaser.py line 13 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

johnml1135 · 2024-05-20T14:50:17Z

machine/translation/unigram_truecaser.py line 33 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

In order to avoid the circular reference, I think it would be fine to move the trainer classes into this file.

Done.

johnml1135 · 2024-05-20T14:51:07Z

machine/translation/unigram_truecaser.py line 88 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can just call clear here.

Done.

johnml1135 · 2024-05-20T14:51:09Z

machine/translation/unigram_truecaser.py line 87 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can just call reset here.

Done.

johnml1135 · 2024-05-20T15:25:47Z

machine/translation/thot/thot_word_alignment_model_type.py line 13 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Instead of using this function, update the ThotSmtModel and ThotSmtModelTrainerclasses to accept astrorThotWordAlignmentModelTypefor theword_alignment_model_type` parameter.

Done.

johnml1135 · 2024-05-20T15:26:38Z

tests/jobs/test_smt_engine_build_job.py line 6 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Instead of importing from the particular module, you should import from the top-level package.

Done.

johnml1135 · 2024-05-20T15:26:56Z

machine/translation/unigram_truecaser_trainer.py line 4 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should use relative imports.

Done.

johnml1135 · 2024-05-21T11:35:44Z

machine/corpora/token_processors.py line 17 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What if you made the type for the normalization_form of the normalize function: Literal["NFC", "NFD", "NFKC", "NFKD"]?

If we do, it throws off the normalize function below - there are some instances where an arbitrary string is passed to it rather than a fixed string - and this conversion needs to happen somewhere. Is there a better place for it?

johnml1135 · 2024-05-21T11:36:34Z

machine/jobs/build_smt_engine.py line 41 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I understand that nonlocal is not a common keyword, but I feel that this comment is unnecessary. The nonlocal keyword is sufficient documentation for the behavior.

done - you can thank github copilot for the comment - and me for thinking it would be ok.

johnml1135 · 2024-05-21T11:37:18Z

machine/jobs/smt_engine_build_job.py line 21 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is a constant, so it should be in all caps.

Done.

johnml1135 · 2024-05-21T11:37:57Z

machine/statistics/conditional_frequency_distribution.py line 11 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would make the type Collection.

Done.

johnml1135 · 2024-05-21T11:38:55Z

machine/statistics/conditional_frequency_distribution.py line 12 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

There isn't a need to create a list instance.

Done.

johnml1135 · 2024-05-21T11:39:43Z

machine/tokenization/tokenization_utils.py line 31 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Naming it create_tokenizer and create_detokenizer makes it consistent with other similar functions in Machine, such as create_thot_word_alignment_model.

Done.

johnml1135 · 2024-05-21T11:55:46Z

machine/translation/unigram_truecaser.py line 21 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This code should be moved to the constructor.

Done.

johnml1135 · 2024-05-21T12:01:22Z

machine/translation/unigram_truecaser.py line 36 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We don't want to update the state of the current UnigramTruecaser instance during training, so that we can continue to use it. The state of the instance should be updated when Save is called on the trainer. See the UnigramTruecaser.Trainer class in Machine for .NET.

I don't understand exactly what you mean and what needs to be changed.

johnml1135 · 2024-05-21T12:17:24Z

machine/translation/thot/thot_smt_model_trainer.py line 178 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Please don't ignore type warnings.

Done.

johnml1135 · 2024-05-21T12:20:33Z

tests/jobs/test_smt_engine_build_job.py line 7 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should only import from the top-level packages, i.e. machine.jobs, machine.utils, etc.

Done.

johnml1135 · 2024-05-21T12:29:35Z

machine/translation/thot/thot_smt_model_trainer.py line 40 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be snake case and prefixed with an underscore.

Added pep8-naming to dev dependencies to catch these in the future.

johnml1135 · 2024-05-21T12:30:27Z

tests/translation/test_unigram_truecaser.py line 3 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should only import from the top-level packages, i.e. machine.corpora, machine.translation, etc.

Done.

johnml1135 · 2024-05-21T12:30:41Z

tests/translation/test_unigram_truecaser.py line 9 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is a constant, so it should be all caps.

Done.

johnml1135 · 2024-05-21T12:32:26Z

tests/translation/test_unigram_truecaser.py line 72 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure I understand what this is testing.

Renamed function - testing that the trainer trains properly.

johnml1135 · 2024-05-23T12:00:19Z

dockerfile.cpu_only line 1 at r7 (raw file):

#compatability with Tensorflow 2.6.0 as per https://www.tensorflow.org/install/source#gpu

This has been untested - and unverified to build. The CI routines need to be updated to create the image, push it to the GitHub container repo and pull it when doing CPU only jobs, and then verify it can actually work on the CPU shares.

johnml1135 · 2024-05-23T12:03:05Z

machine/translation/thot/thot_smt_model_trainer.py line 156 at r7 (raw file):

class ThotSmtModelTrainer(Trainer):

This is around 10x slower than the dotnet implementation - why? You can do comparative timing with SMT jobs being sent to either hangfire or ClearML. For some reason, it's not multi threading as well as only using about 10% of one CPU. I can't find any limiting happening anywhere (docker, ClearML, OpenMP, python, etc.). Something with dynamically loaded libraries? I don't know.

* Add unigram truecaser * Add CPU only docker image * Add Latin default tokenizer * Add vim to docker image for rebasing * Add SMT integration test * Update CI packages * better logging * Added pep8-naming * Add async progress updates for SMT job * Make startup faster, preinstall apt packages in docker

…uld be auto-deleted when build is complete.

ddaspit

Reviewed 7 of 19 files at r7, 70 of 76 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

machine/jobs/thot/thot_smt_model_factory.py line 95 at r9 (raw file):

            self._config.data_dir, self._config.shared_file_folder, "builds", self._config.build_id, "model"
        )
        return Path(shutil.make_archive(str(tar_file_basename), "zip", self._model_dir))

This should be saved as a .tar.gz file, so that it is consistent with the NMT factory.

machine/translation/unigram_truecaser.py line 36 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I don't understand exactly what you mean and what needs to be changed.

Done

machine/translation/thot/thot_smt_model_trainer.py line 156 at r7 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is around 10x slower than the dotnet implementation - why? You can do comparative timing with SMT jobs being sent to either hangfire or ClearML. For some reason, it's not multi threading as well as only using about 10% of one CPU. I can't find any limiting happening anywhere (docker, ClearML, OpenMP, python, etc.). Something with dynamically loaded libraries? I don't know.

Done

johnml1135 · 2024-06-10T20:46:32Z

machine/jobs/thot/thot_smt_model_factory.py line 95 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be saved as a .tar.gz file, so that it is consistent with the NMT factory.

Done.

ddaspit

Reviewed 1 of 1 files at r10, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

johnml1135 requested a review from ddaspit May 6, 2024 19:11

ddaspit reviewed May 6, 2024

View reviewed changes

johnml1135 force-pushed the smt_build_job branch 4 times, most recently from 664c14f to bea466d Compare May 10, 2024 13:40

johnml1135 force-pushed the smt_build_job branch from ad26d92 to 87941fc Compare May 10, 2024 18:02

ddaspit requested changes May 10, 2024

View reviewed changes

johnml1135 force-pushed the smt_build_job branch from d5ed685 to ba850cf Compare May 21, 2024 12:36

ddaspit force-pushed the smt_build_job branch 3 times, most recently from 12334e6 to d19a762 Compare May 31, 2024 22:35

johnml1135 force-pushed the smt_build_job branch from 8f7360b to 6a154a0 Compare June 7, 2024 14:37

johnml1135 added 2 commits June 7, 2024 16:42

Update SMT saving location to be in build - it always happens and sho…

e425bf5

…uld be auto-deleted when build is complete.

dockerfile fix

91b0f42

ddaspit reviewed Jun 10, 2024

View reviewed changes

revert to tar.gz

8cbe134

ddaspit approved these changes Jun 10, 2024

View reviewed changes

johnml1135 merged commit 52964cf into main Jun 11, 2024
14 checks passed

johnml1135 deleted the smt_build_job branch June 11, 2024 12:42

Smt build job #107

Smt build job #107

Conversation

johnml1135 commented May 6, 2024 • edited by ddaspit Loading

ddaspit left a comment

Choose a reason for hiding this comment

codecov-commenter commented May 10, 2024 • edited Loading

Codecov Report

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented May 10, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 20, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 21, 2024

johnml1135 commented May 23, 2024

johnml1135 commented May 23, 2024

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Jun 10, 2024

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented May 6, 2024 •

edited by ddaspit

Loading

codecov-commenter commented May 10, 2024 •

edited

Loading