Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smt build job #107

Merged
merged 4 commits into from
Jun 11, 2024
Merged

Smt build job #107

merged 4 commits into from
Jun 11, 2024

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented May 6, 2024

Just like an NMT build job


This change is Reviewable

@johnml1135 johnml1135 requested a review from ddaspit May 6, 2024 19:11
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The truecasing model needs to be trained as well. I think this means that we will need to port the UnigramTruecaser class over to Python.

Reviewed 1 of 4 files at r1, all commit messages.
Reviewable status: 1 of 4 files reviewed, all discussions resolved

@johnml1135 johnml1135 force-pushed the smt_build_job branch 4 times, most recently from 664c14f to bea466d Compare May 10, 2024 13:40
@codecov-commenter
Copy link

codecov-commenter commented May 10, 2024

Codecov Report

Attention: Patch coverage is 75.59395% with 113 lines in your changes missing coverage. Please review.

Project coverage is 87.97%. Comparing base (f8f3fc5) to head (8cbe134).

Files Patch % Lines
machine/translation/unigram_truecaser.py 49.57% 60 Missing ⚠️
machine/statistics/frequency_distribution.py 40.00% 21 Missing ⚠️
machine/jobs/local_shared_file_service.py 40.00% 18 Missing ⚠️
...e/statistics/conditional_frequency_distribution.py 76.47% 4 Missing ⚠️
machine/jobs/clearml_shared_file_service.py 60.00% 2 Missing ⚠️
machine/jobs/shared_file_service.py 71.42% 2 Missing ⚠️
machine/jobs/smt_engine_build_job.py 97.10% 2 Missing ⚠️
machine/corpora/usx_file_alignment_collection.py 50.00% 1 Missing ⚠️
machine/translation/thot/thot_smt_model.py 50.00% 1 Missing ⚠️
machine/translation/thot/thot_smt_model_trainer.py 66.66% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
- Coverage   88.34%   87.97%   -0.37%     
==========================================
  Files         234      243       +9     
  Lines       13822    14238     +416     
==========================================
+ Hits        12211    12526     +315     
- Misses       1611     1712     +101     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r1, 2 of 3 files at r2, 6 of 8 files at r3, 5 of 7 files at r4, 9 of 9 files at r5, all commit messages.
Reviewable status: all files reviewed, 26 unresolved discussions (waiting on @johnml1135)


machine/corpora/token_processors.py line 17 at r5 (raw file):

def _get_normalization_form(normalization_form: str) -> Literal["NFC", "NFD", "NFKC", "NFKD"]:

What is this for?


machine/jobs/local_shared_file_service.py line 10 at r5 (raw file):

class LocalSharedFileService(SharedFileService):

You should export this class from the package __init__.py file.


machine/jobs/smt_engine_build_job.py line 25 at r5 (raw file):

class SmtEngineBuildJob:

You should export this class from the package __init__.py file.


machine/jobs/smt_engine_build_job.py line 45 at r5 (raw file):

        with TemporaryDirectory() as temp_dir:

            parameters = ThotSmtParameters(

The model should be initialized from a blank model. See the data/thot-new-model directory in the SIL.Machine.AspNetCore project.


machine/jobs/smt_engine_build_job.py line 77 at r5 (raw file):

            with UnigramTruecaserTrainer(
                corpus=target_corpus, model_path=os.path.join(temp_dir, "truecase.txt"), tokenizer=tokenizer

The filename that we use in the Machine engine server is unigram-casing-model.txt.


machine/statistics/conditional_frequency_distribution.py line 7 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass. dataclass is usually used for models/DTOs/records and not for normal classes.


machine/statistics/conditional_frequency_distribution.py line 8 at r5 (raw file):

@dataclass
class ConditionalFrequencyDistribution:

You should export this class from the package __init__.py file.


machine/statistics/conditional_frequency_distribution.py line 11 at r5 (raw file):

    _freq_dist: Dict[str, FrequencyDistribution] = field(default_factory=dict)

    def get_conditions(self):

Don't forget the return type.


machine/statistics/frequency_distribution.py line 5 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass.


machine/statistics/frequency_distribution.py line 6 at r5 (raw file):

@dataclass
class FrequencyDistribution:

You should export this class from the package __init__.py file.


machine/tokenization/__init__.py line 43 at r5 (raw file):

def get_tokenizer_detokenizer(name: str = "") -> Tuple[Tokenizer, Detokenizer]:

I would split this function into two functions: create_tokenizer and create_detokenizer and move them to the tokenization_utils.py module.


machine/tokenization/__init__.py line 45 at r5 (raw file):

def get_tokenizer_detokenizer(name: str = "") -> Tuple[Tokenizer, Detokenizer]:
    name_lower = name.lower()
    if "latin" in name_lower or name == "":

I would perform an equals check.


machine/translation/unigram_truecaser.py line 5 at r5 (raw file):

from typing import Dict, Sequence, Tuple

from machine.corpora.text_corpus import TextCorpus

You should use relative imports.


machine/translation/unigram_truecaser.py line 12 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass.


machine/translation/unigram_truecaser.py line 13 at r5 (raw file):

@dataclass
class UnigramTruecaser(Truecaser):

You should export this class from the package __init__.py file.


machine/translation/unigram_truecaser.py line 33 at r5 (raw file):

    def create_trainer(self, corpus: TextCorpus) -> Trainer:
        from machine.translation.unigram_truecaser_trainer import SubclassUnigramTruecaserTrainer

In order to avoid the circular reference, I think it would be fine to move the trainer classes into this file.


machine/translation/unigram_truecaser.py line 87 at r5 (raw file):

    def _reset(self):
        self._casing = ConditionalFrequencyDistribution()

You can just call reset here.


machine/translation/unigram_truecaser.py line 88 at r5 (raw file):

    def _reset(self):
        self._casing = ConditionalFrequencyDistribution()
        self._bestTokens = {}

You can just call clear here.


machine/translation/unigram_truecaser_trainer.py line 4 at r5 (raw file):

from typing import Callable, Optional

from machine.tokenization.tokenizer import Tokenizer

You should use relative imports.


machine/translation/unigram_truecaser_trainer.py line 13 at r5 (raw file):

@dataclass

I would prefer that this class is not a dataclass.


machine/translation/unigram_truecaser_trainer.py line 14 at r5 (raw file):

@dataclass
class UnigramTruecaserTrainer(Trainer):

You should export this class from the package __init__.py file.


machine/translation/unigram_truecaser_trainer.py line 30 at r5 (raw file):

            step_count = self.corpus.count(include_empty=False)
        current_step = 0
        for row in self.corpus.tokenize(tokenizer=self.tokenizer).filter_nonempty():

When iterating over a corpus, you should use the context manager to ensure that files are closed properly:

with self.corpus.tokenize(tokenizer=self.tokenizer).filter_nonempty().get_rows() as rows:
    for row in rows:

machine/translation/unigram_truecaser_trainer.py line 44 at r5 (raw file):

class SubclassUnigramTruecaserTrainer(UnigramTruecaserTrainer):

I would prefix this with an underscore. I think _UnigramTruecaserTrainer should be fine.


machine/translation/thot/thot_smt_model.py line 89 at r5 (raw file):

        self.truecaser = truecaser
        if self.truecaser is None:
            self.truecaser = UnigramTruecaser()

Leave this with a default of None.


machine/translation/thot/thot_word_alignment_model_type.py line 13 at r5 (raw file):

def getThotWordAlignmentModelType(str) -> ThotWordAlignmentModelType:

Instead of using this function, update the ThotSmtModel and ThotSmtModelTrainerclasses to accept astrorThotWordAlignmentModelTypefor theword_alignment_model_type` parameter.


tests/jobs/test_smt_engine_build_job.py line 6 at r5 (raw file):

from pytest import raises

from machine.jobs.build_smt_engine import SmtEngineBuildJob

Instead of importing from the particular module, you should import from the top-level package.

@johnml1135
Copy link
Collaborator Author

machine/corpora/token_processors.py line 17 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is this for?

Upgrading pyright threw and error - a difference between str and Literal. This got rid of the error. There could potentially be a better solution.

@johnml1135
Copy link
Collaborator Author

machine/jobs/local_shared_file_service.py line 10 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

@johnml1135
Copy link
Collaborator Author

machine/jobs/smt_engine_build_job.py line 25 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

@johnml1135
Copy link
Collaborator Author

machine/jobs/smt_engine_build_job.py line 45 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The model should be initialized from a blank model. See the data/thot-new-model directory in the SIL.Machine.AspNetCore project.

Added new thot model

@johnml1135
Copy link
Collaborator Author

machine/jobs/smt_engine_build_job.py line 77 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The filename that we use in the Machine engine server is unigram-casing-model.txt.

updated.

@johnml1135
Copy link
Collaborator Author

machine/statistics/conditional_frequency_distribution.py line 7 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would prefer that this class is not a dataclass. dataclass is usually used for models/DTOs/records and not for normal classes.

done

@johnml1135
Copy link
Collaborator Author

machine/statistics/conditional_frequency_distribution.py line 8 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

@johnml1135
Copy link
Collaborator Author

machine/statistics/conditional_frequency_distribution.py line 11 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Don't forget the return type.

Done.

@johnml1135
Copy link
Collaborator Author

machine/statistics/frequency_distribution.py line 5 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would prefer that this class is not a dataclass.

Done.

@johnml1135
Copy link
Collaborator Author

machine/statistics/frequency_distribution.py line 6 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

@johnml1135
Copy link
Collaborator Author

machine/tokenization/__init__.py line 43 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would split this function into two functions: create_tokenizer and create_detokenizer and move them to the tokenization_utils.py module.

Done.

@johnml1135
Copy link
Collaborator Author

machine/tokenization/__init__.py line 45 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would perform an equals check.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 5 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should use relative imports.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 12 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would prefer that this class is not a dataclass.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 13 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should export this class from the package __init__.py file.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 33 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

In order to avoid the circular reference, I think it would be fine to move the trainer classes into this file.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 88 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can just call clear here.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 87 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can just call reset here.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/thot/thot_word_alignment_model_type.py line 13 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Instead of using this function, update the ThotSmtModel and ThotSmtModelTrainerclasses to accept astrorThotWordAlignmentModelTypefor theword_alignment_model_type` parameter.

Done.

@johnml1135
Copy link
Collaborator Author

tests/jobs/test_smt_engine_build_job.py line 6 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Instead of importing from the particular module, you should import from the top-level package.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser_trainer.py line 4 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should use relative imports.

Done.

@johnml1135
Copy link
Collaborator Author

machine/corpora/token_processors.py line 17 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What if you made the type for the normalization_form of the normalize function: Literal["NFC", "NFD", "NFKC", "NFKD"]?

If we do, it throws off the normalize function below - there are some instances where an arbitrary string is passed to it rather than a fixed string - and this conversion needs to happen somewhere. Is there a better place for it?

@johnml1135
Copy link
Collaborator Author

machine/jobs/build_smt_engine.py line 41 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I understand that nonlocal is not a common keyword, but I feel that this comment is unnecessary. The nonlocal keyword is sufficient documentation for the behavior.

done - you can thank github copilot for the comment - and me for thinking it would be ok.

@johnml1135
Copy link
Collaborator Author

machine/jobs/smt_engine_build_job.py line 21 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is a constant, so it should be in all caps.

Done.

@johnml1135
Copy link
Collaborator Author

machine/statistics/conditional_frequency_distribution.py line 11 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I would make the type Collection.

Done.

@johnml1135
Copy link
Collaborator Author

machine/statistics/conditional_frequency_distribution.py line 12 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

There isn't a need to create a list instance.

Done.

@johnml1135
Copy link
Collaborator Author

machine/tokenization/tokenization_utils.py line 31 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Naming it create_tokenizer and create_detokenizer makes it consistent with other similar functions in Machine, such as create_thot_word_alignment_model.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 21 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This code should be moved to the constructor.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/unigram_truecaser.py line 36 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We don't want to update the state of the current UnigramTruecaser instance during training, so that we can continue to use it. The state of the instance should be updated when Save is called on the trainer. See the UnigramTruecaser.Trainer class in Machine for .NET.

I don't understand exactly what you mean and what needs to be changed.

@johnml1135
Copy link
Collaborator Author

machine/translation/thot/thot_smt_model_trainer.py line 178 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Please don't ignore type warnings.

Done.

@johnml1135
Copy link
Collaborator Author

tests/jobs/test_smt_engine_build_job.py line 7 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should only import from the top-level packages, i.e. machine.jobs, machine.utils, etc.

Done.

@johnml1135
Copy link
Collaborator Author

machine/translation/thot/thot_smt_model_trainer.py line 40 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be snake case and prefixed with an underscore.

Added pep8-naming to dev dependencies to catch these in the future.

@johnml1135
Copy link
Collaborator Author

tests/translation/test_unigram_truecaser.py line 3 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should only import from the top-level packages, i.e. machine.corpora, machine.translation, etc.

Done.

@johnml1135
Copy link
Collaborator Author

tests/translation/test_unigram_truecaser.py line 9 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This is a constant, so it should be all caps.

Done.

@johnml1135
Copy link
Collaborator Author

tests/translation/test_unigram_truecaser.py line 72 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure I understand what this is testing.

Renamed function - testing that the trainer trains properly.

@johnml1135
Copy link
Collaborator Author

dockerfile.cpu_only line 1 at r7 (raw file):

#compatability with Tensorflow 2.6.0 as per https://www.tensorflow.org/install/source#gpu

This has been untested - and unverified to build. The CI routines need to be updated to create the image, push it to the GitHub container repo and pull it when doing CPU only jobs, and then verify it can actually work on the CPU shares.

@johnml1135
Copy link
Collaborator Author

machine/translation/thot/thot_smt_model_trainer.py line 156 at r7 (raw file):

class ThotSmtModelTrainer(Trainer):

This is around 10x slower than the dotnet implementation - why? You can do comparative timing with SMT jobs being sent to either hangfire or ClearML. For some reason, it's not multi threading as well as only using about 10% of one CPU. I can't find any limiting happening anywhere (docker, ClearML, OpenMP, python, etc.). Something with dynamically loaded libraries? I don't know.

@ddaspit ddaspit force-pushed the smt_build_job branch 3 times, most recently from 12334e6 to d19a762 Compare May 31, 2024 22:35
* Add unigram truecaser
* Add CPU only docker image
* Add Latin default tokenizer
* Add vim to docker image for rebasing
* Add SMT integration test
* Update CI packages
* better logging
* Added pep8-naming
* Add async progress updates for SMT job
* Make startup faster, preinstall apt packages in docker
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 19 files at r7, 70 of 76 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)


machine/jobs/thot/thot_smt_model_factory.py line 95 at r9 (raw file):

            self._config.data_dir, self._config.shared_file_folder, "builds", self._config.build_id, "model"
        )
        return Path(shutil.make_archive(str(tar_file_basename), "zip", self._model_dir))

This should be saved as a .tar.gz file, so that it is consistent with the NMT factory.


machine/translation/unigram_truecaser.py line 36 at r6 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I don't understand exactly what you mean and what needs to be changed.

Done


machine/translation/thot/thot_smt_model_trainer.py line 156 at r7 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is around 10x slower than the dotnet implementation - why? You can do comparative timing with SMT jobs being sent to either hangfire or ClearML. For some reason, it's not multi threading as well as only using about 10% of one CPU. I can't find any limiting happening anywhere (docker, ClearML, OpenMP, python, etc.). Something with dynamically loaded libraries? I don't know.

Done

@johnml1135
Copy link
Collaborator Author

machine/jobs/thot/thot_smt_model_factory.py line 95 at r9 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be saved as a .tar.gz file, so that it is consistent with the NMT factory.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r10, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 johnml1135 merged commit 52964cf into main Jun 11, 2024
14 checks passed
@johnml1135 johnml1135 deleted the smt_build_job branch June 11, 2024 12:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants