-
-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smt build job #107
Smt build job #107
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The truecasing model needs to be trained as well. I think this means that we will need to port the UnigramTruecaser
class over to Python.
Reviewed 1 of 4 files at r1, all commit messages.
Reviewable status: 1 of 4 files reviewed, all discussions resolved
664c14f
to
bea466d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 4 files at r1, 2 of 3 files at r2, 6 of 8 files at r3, 5 of 7 files at r4, 9 of 9 files at r5, all commit messages.
Reviewable status: all files reviewed, 26 unresolved discussions (waiting on @johnml1135)
machine/corpora/token_processors.py
line 17 at r5 (raw file):
def _get_normalization_form(normalization_form: str) -> Literal["NFC", "NFD", "NFKC", "NFKD"]:
What is this for?
machine/jobs/local_shared_file_service.py
line 10 at r5 (raw file):
class LocalSharedFileService(SharedFileService):
You should export this class from the package __init__.py
file.
machine/jobs/smt_engine_build_job.py
line 25 at r5 (raw file):
class SmtEngineBuildJob:
You should export this class from the package __init__.py
file.
machine/jobs/smt_engine_build_job.py
line 45 at r5 (raw file):
with TemporaryDirectory() as temp_dir: parameters = ThotSmtParameters(
The model should be initialized from a blank model. See the data/thot-new-model
directory in the SIL.Machine.AspNetCore
project.
machine/jobs/smt_engine_build_job.py
line 77 at r5 (raw file):
with UnigramTruecaserTrainer( corpus=target_corpus, model_path=os.path.join(temp_dir, "truecase.txt"), tokenizer=tokenizer
The filename that we use in the Machine engine server is unigram-casing-model.txt
.
machine/statistics/conditional_frequency_distribution.py
line 7 at r5 (raw file):
@dataclass
I would prefer that this class is not a dataclass
. dataclass
is usually used for models/DTOs/records and not for normal classes.
machine/statistics/conditional_frequency_distribution.py
line 8 at r5 (raw file):
@dataclass class ConditionalFrequencyDistribution:
You should export this class from the package __init__.py
file.
machine/statistics/conditional_frequency_distribution.py
line 11 at r5 (raw file):
_freq_dist: Dict[str, FrequencyDistribution] = field(default_factory=dict) def get_conditions(self):
Don't forget the return type.
machine/statistics/frequency_distribution.py
line 5 at r5 (raw file):
@dataclass
I would prefer that this class is not a dataclass
.
machine/statistics/frequency_distribution.py
line 6 at r5 (raw file):
@dataclass class FrequencyDistribution:
You should export this class from the package __init__.py
file.
machine/tokenization/__init__.py
line 43 at r5 (raw file):
def get_tokenizer_detokenizer(name: str = "") -> Tuple[Tokenizer, Detokenizer]:
I would split this function into two functions: create_tokenizer
and create_detokenizer
and move them to the tokenization_utils.py
module.
machine/tokenization/__init__.py
line 45 at r5 (raw file):
def get_tokenizer_detokenizer(name: str = "") -> Tuple[Tokenizer, Detokenizer]: name_lower = name.lower() if "latin" in name_lower or name == "":
I would perform an equals check.
machine/translation/unigram_truecaser.py
line 5 at r5 (raw file):
from typing import Dict, Sequence, Tuple from machine.corpora.text_corpus import TextCorpus
You should use relative imports.
machine/translation/unigram_truecaser.py
line 12 at r5 (raw file):
@dataclass
I would prefer that this class is not a dataclass
.
machine/translation/unigram_truecaser.py
line 13 at r5 (raw file):
@dataclass class UnigramTruecaser(Truecaser):
You should export this class from the package __init__.py
file.
machine/translation/unigram_truecaser.py
line 33 at r5 (raw file):
def create_trainer(self, corpus: TextCorpus) -> Trainer: from machine.translation.unigram_truecaser_trainer import SubclassUnigramTruecaserTrainer
In order to avoid the circular reference, I think it would be fine to move the trainer classes into this file.
machine/translation/unigram_truecaser.py
line 87 at r5 (raw file):
def _reset(self): self._casing = ConditionalFrequencyDistribution()
You can just call reset
here.
machine/translation/unigram_truecaser.py
line 88 at r5 (raw file):
def _reset(self): self._casing = ConditionalFrequencyDistribution() self._bestTokens = {}
You can just call clear
here.
machine/translation/unigram_truecaser_trainer.py
line 4 at r5 (raw file):
from typing import Callable, Optional from machine.tokenization.tokenizer import Tokenizer
You should use relative imports.
machine/translation/unigram_truecaser_trainer.py
line 13 at r5 (raw file):
@dataclass
I would prefer that this class is not a dataclass
.
machine/translation/unigram_truecaser_trainer.py
line 14 at r5 (raw file):
@dataclass class UnigramTruecaserTrainer(Trainer):
You should export this class from the package __init__.py
file.
machine/translation/unigram_truecaser_trainer.py
line 30 at r5 (raw file):
step_count = self.corpus.count(include_empty=False) current_step = 0 for row in self.corpus.tokenize(tokenizer=self.tokenizer).filter_nonempty():
When iterating over a corpus, you should use the context manager to ensure that files are closed properly:
with self.corpus.tokenize(tokenizer=self.tokenizer).filter_nonempty().get_rows() as rows:
for row in rows:
machine/translation/unigram_truecaser_trainer.py
line 44 at r5 (raw file):
class SubclassUnigramTruecaserTrainer(UnigramTruecaserTrainer):
I would prefix this with an underscore. I think _UnigramTruecaserTrainer
should be fine.
machine/translation/thot/thot_smt_model.py
line 89 at r5 (raw file):
self.truecaser = truecaser if self.truecaser is None: self.truecaser = UnigramTruecaser()
Leave this with a default of None
.
machine/translation/thot/thot_word_alignment_model_type.py
line 13 at r5 (raw file):
def getThotWordAlignmentModelType(str) -> ThotWordAlignmentModelType:
Instead of using this function, update the ThotSmtModel
and ThotSmtModelTrainerclasses to accept a
stror
ThotWordAlignmentModelTypefor the
word_alignment_model_type` parameter.
tests/jobs/test_smt_engine_build_job.py
line 6 at r5 (raw file):
from pytest import raises from machine.jobs.build_smt_engine import SmtEngineBuildJob
Instead of importing from the particular module, you should import from the top-level package.
Previously, ddaspit (Damien Daspit) wrote…
Upgrading pyright threw and error - a difference between str and Literal. This got rid of the error. There could potentially be a better solution. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Added new thot model |
Previously, ddaspit (Damien Daspit) wrote…
updated. |
Previously, ddaspit (Damien Daspit) wrote…
done |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
If we do, it throws off the normalize function below - there are some instances where an arbitrary string is passed to it rather than a fixed string - and this conversion needs to happen somewhere. Is there a better place for it? |
Previously, ddaspit (Damien Daspit) wrote…
done - you can thank github copilot for the comment - and me for thinking it would be ok. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
I don't understand exactly what you mean and what needs to be changed. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Added pep8-naming to dev dependencies to catch these in the future. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Renamed function - testing that the trainer trains properly. |
This has been untested - and unverified to build. The CI routines need to be updated to create the image, push it to the GitHub container repo and pull it when doing CPU only jobs, and then verify it can actually work on the CPU shares. |
This is around 10x slower than the dotnet implementation - why? You can do comparative timing with SMT jobs being sent to either hangfire or ClearML. For some reason, it's not multi threading as well as only using about 10% of one CPU. I can't find any limiting happening anywhere (docker, ClearML, OpenMP, python, etc.). Something with dynamically loaded libraries? I don't know. |
12334e6
to
d19a762
Compare
* Add unigram truecaser * Add CPU only docker image * Add Latin default tokenizer * Add vim to docker image for rebasing * Add SMT integration test * Update CI packages * better logging * Added pep8-naming * Add async progress updates for SMT job * Make startup faster, preinstall apt packages in docker
…uld be auto-deleted when build is complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 19 files at r7, 70 of 76 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)
machine/jobs/thot/thot_smt_model_factory.py
line 95 at r9 (raw file):
self._config.data_dir, self._config.shared_file_folder, "builds", self._config.build_id, "model" ) return Path(shutil.make_archive(str(tar_file_basename), "zip", self._model_dir))
This should be saved as a .tar.gz
file, so that it is consistent with the NMT factory.
machine/translation/unigram_truecaser.py
line 36 at r6 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I don't understand exactly what you mean and what needs to be changed.
Done
machine/translation/thot/thot_smt_model_trainer.py
line 156 at r7 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This is around 10x slower than the dotnet implementation - why? You can do comparative timing with SMT jobs being sent to either hangfire or ClearML. For some reason, it's not multi threading as well as only using about 10% of one CPU. I can't find any limiting happening anywhere (docker, ClearML, OpenMP, python, etc.). Something with dynamically loaded libraries? I don't know.
Done
Previously, ddaspit (Damien Daspit) wrote…
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r10, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
Just like an NMT build job
This change is