Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate accidentally removed logic from https://github.com/sillsdev/machine/pull/207 #467

Merged
merged 6 commits into from
Aug 28, 2024

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Aug 26, 2024

Not sure how this logic got removed. I think I was away for all of this, so if I'm missing something, let me know.

Fixes: #460


This change is Reviewable

@Enkidu93 Enkidu93 requested review from ddaspit and johnml1135 August 26, 2024 16:20
@Enkidu93
Copy link
Collaborator Author

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 379 at r1 (raw file):

    private static IEnumerable<Row> AlignPretranslateCorpus(Corpus corpus, ITextCorpus srcCorpus, ITextCorpus trgCorpus)
    {
        IEnumerable<string>? textIds = corpus.PretranslateChapters is not null

I can move this logic out of the method if you'd prefer to avoid passing corpus along, but I thought I'd try to be as consistent as possible with how it previously was for an initial pass. Lmk.

@johnml1135
Copy link
Collaborator

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 263 at r1 (raw file):

    private static IEnumerable<Row?[]> AlignTrainCorpus(IReadOnlyList<ITextCorpus> srcCorpora, ITextCorpus trgCorpus)
    {

Does it need this code as well:

        IEnumerable<string>? textIds = corpus.TrainOnChapters is not null
            ? corpus.TrainOnChapters.Keys
            : corpus.TrainOnTextIds;
        srcCorpora = srcCorpora.Select(sc => sc.FilterTexts(textIds)).ToArray();
        trgCorpus = trgCorpus.FilterTexts(textIds);

removed from the same SMT job fix (that I did)?

@johnml1135
Copy link
Collaborator

We also need a test to make sure that this doesn't happen again. One way could be to stub out the parsing routine with just throwing an error and have the textIds filter out all text. Make sure that the error is not thrown. This would need to be both for training and pretranslating alignments.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not sure how the logic got removed. It looks like it happened when we moved SMT to ClearML.

Texts should also be filtered out in AlignTrainCorpus as well. Check out the code in the original PR.

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 379 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I can move this logic out of the method if you'd prefer to avoid passing corpus along, but I thought I'd try to be as consistent as possible with how it previously was for an initial pass. Lmk.

This looks fine.

@johnml1135
Copy link
Collaborator

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 379 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This looks fine.

If you can think of a better way to do that, more power to you. It would also need to be replicated in AlignTrainCorpus whichever way you decide.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)

@johnml1135
Copy link
Collaborator

:Ahem: I think it was my fault :-(.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, yes, I forgot - thanks, Damien. And yes, John, I'll try and find a way to test this.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yeah, I think it happened when we move stuff from Nmt... to just plain old PreprocessBuildJob - easy mistake.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 263 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Does it need this code as well:

        IEnumerable<string>? textIds = corpus.TrainOnChapters is not null
            ? corpus.TrainOnChapters.Keys
            : corpus.TrainOnTextIds;
        srcCorpora = srcCorpora.Select(sc => sc.FilterTexts(textIds)).ToArray();
        trgCorpus = trgCorpus.FilterTexts(textIds);

removed from the same SMT job fix (that I did)?

Done.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@codecov-commenter
Copy link

codecov-commenter commented Aug 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.67%. Comparing base (723e74c) to head (61c4106).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #467      +/-   ##
==========================================
+ Coverage   56.65%   56.67%   +0.02%     
==========================================
  Files         275      275              
  Lines       14133    14143      +10     
  Branches     1895     1897       +2     
==========================================
+ Hits         8007     8016       +9     
  Misses       5541     5541              
- Partials      585      586       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests have been added.

Reviewable status: 1 of 2 files reviewed, all discussions resolved (waiting on @ddaspit and @johnml1135)

@ddaspit ddaspit requested a review from johnml1135 August 27, 2024 16:15
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 1 at r3 (raw file):

using System.Collections;

These should be added to the global usings.


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 266 at r3 (raw file):

                }
            );
        await env.RunBuildJobAsync(corpus);

You should assert that the call does not throw an exception, so it is explicit what is being tested.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 3 files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 1 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

These should be added to the global usings.

Done.


src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs line 266 at r3 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You should assert that the call does not throw an exception, so it is explicit what is being tested.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit 237b117 into main Aug 28, 2024
3 checks passed
@johnml1135 johnml1135 deleted the do_not_parse_unused_data branch August 28, 2024 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't parse books that aren't selected
4 participants