Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A better fix for #516. #521

Merged
merged 5 commits into from
Oct 29, 2024
Merged

A better fix for #516. #521

merged 5 commits into from
Oct 29, 2024

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented Oct 28, 2024

#516 - this should be applied independently for pretranslate and trainOn.


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 98.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 56.86%. Comparing base (ac1193f) to head (5e6d647).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...l/src/Serval.Translation/Services/EngineService.cs 98.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #521      +/-   ##
==========================================
+ Coverage   56.75%   56.86%   +0.11%     
==========================================
  Files         299      299              
  Lines       15647    15686      +39     
  Branches     2159     2165       +6     
==========================================
+ Hits         8880     8920      +40     
  Misses       6114     6114              
+ Partials      653      652       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)


src/Serval/src/Serval.Translation/Services/EngineService.cs line 632 at r1 (raw file):

        TrainingCorpus? trainingCorpus,
        PretranslateCorpus? pretranslateCorpus,
        bool noTrainingCorpusDefined,

Could we rename these parameters to trainOnAll and pretranslateAll (or trainOnAllCorpora and pretranslateAllCorpora)? That helps the readability of the code.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, John! I'm fine with this as long as it's the behavior we want. It depends on how interdependent you'd like trainOn and pretranslate to be and what kinds of filters you want to make easier to compose and which more difficult - i.e., something is implicit now which wasn't implicit before and something would have to be made explicit which was previously implicit.

I apologize if I did this logic last week differently than you expected; I thought you were expecting it to work differently. This underscores the issue I've mentioned before which is that we need to have a clearer definition of intended behavior across all filtering arrangements. There are so many fields that can be null, empty, or set or situations in which corpora can specified multiple times in different places. It might be good to work on a document to describe this more fully & then make sure our tests cover that.

Reviewed 2 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)


src/Serval/test/Serval.Translation.Tests/Services/EngineServiceTests.cs line 470 at r1 (raw file):

    [Test]
    public async Task StartBuildAsync_OneEachOfMultipleCorpora()

There's already a test named StartBuildAsync_ParallelCorpus_OneOfMultipleCorpora, I believe. I think a more descriptive name for each of them is in order to differentiate the two (or merge them into one).

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.Translation/Services/EngineService.cs line 632 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Could we rename these parameters to trainOnAll and pretranslateAll (or trainOnAllCorpora and pretranslateAllCorpora)? That helps the readability of the code.

How about noTrainingFilter and noPretranslationFilter? That allowed the logic that it is used within to be clearer - either there is "noTrainingFilter" or "the source filter is null" . It is used to combine the logic for when to "trainOnAll."

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.Translation/Services/EngineService.cs line 632 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

How about noTrainingFilter and noPretranslationFilter? That allowed the logic that it is used within to be clearer - either there is "noTrainingFilter" or "the source filter is null" . It is used to combine the logic for when to "trainOnAll."

or "noTrainingCorpusFilter".

@johnml1135
Copy link
Collaborator Author

src/Serval/test/Serval.Translation.Tests/Services/EngineServiceTests.cs line 470 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

There's already a test named StartBuildAsync_ParallelCorpus_OneOfMultipleCorpora, I believe. I think a more descriptive name for each of them is in order to differentiate the two (or merge them into one).

TrainOnOnePretranslateOnTheOther

@johnml1135
Copy link
Collaborator Author

Updated documentation.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where?

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Serval/test/Serval.Translation.Tests/Services/EngineServiceTests.cs line 470 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

TrainOnOnePretranslateOnTheOther

Or TrainOnOnePretranslateTheOther?

@johnml1135
Copy link
Collaborator Author

Coming...

@johnml1135
Copy link
Collaborator Author

Updated.

@johnml1135
Copy link
Collaborator Author

src/Serval/test/Serval.Translation.Tests/Services/EngineServiceTests.cs line 470 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Or TrainOnOnePretranslateTheOther?

better.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r1, 3 of 4 files at r2, all commit messages.
Reviewable status: 4 of 5 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)


src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 994 at r2 (raw file):

    /// <remarks>
    /// Specify the corpora and textIds/scriptureRanges within those corpora to train on. Only one type of corpus may be used: either (legacy) corpora (see /translation/engines/{id}/corpora) or parallel corpora (see /translation/engines/{id}/parallel-corpora).
    /// A (legacy) corpus is selected by specifying CorpusId and a parallel corpus is selected for training by specifying the appropriate CorpusId in SourceFilters or TargetFilters.

I would make the language consistent across the legacy corpus and the parallel corpus. I'd also mention the fact that the parallel corpus id does need to be specified. Something like: "A (legacy) corpus is selected by specifying CorpusId and a parallel corpus is selected by specifying ParallelCorpusId, A parallel corpus can be further filtered by specifying particular CorpusIds in SourceFilters or TargetFilters"


src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 998 at r2 (raw file):

    /// Filters can also be supplied via scriptureRange parameter as ranges of biblical text. See [here](https://github.com/sillsdev/serval/wiki/Filtering-Paratext-Project-Data-with-a-Scripture-Range)
    /// All Paratext project filtering follows original versification. See [here](https://github.com/sillsdev/serval/wiki/Versification-in-Serval) for more information.
    /// If `trainOn` or `pretranslate` is not provided, all corpora will be used for the respective task.

It might be more clear to say "...provided, all corpora will be used for training or pretranslation respectively", but this is fine too


src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 999 at r2 (raw file):

    /// All Paratext project filtering follows original versification. See [here](https://github.com/sillsdev/serval/wiki/Versification-in-Serval) for more information.
    /// If `trainOn` or `pretranslate` is not provided, all corpora will be used for the respective task.
    /// If a corpus is selected for training or pretranslation and neither scriptureRange or textIds are defined, all of the selected corpus will be used.

Not necessary but I'd suggest: "...neither...nor..."


src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 1001 at r2 (raw file):

    /// If a corpus is selected for training or pretranslation and neither scriptureRange or textIds are defined, all of the selected corpus will be used.
    /// If a corpus is selected for training or pretranslation and an empty scriptureRange or textIds is defined, none of the selected corpus will be used.
    /// If a corpus is selected for training or pretranslation, all corpora that are not selected will not be used for the respective task.

I might further qualify it; something like: "If a corpus is selected for training or pretranslation but no further filters are provided, all selected corpora will be used for training or pretranslation respectively"

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 994 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I would make the language consistent across the legacy corpus and the parallel corpus. I'd also mention the fact that the parallel corpus id does need to be specified. Something like: "A (legacy) corpus is selected by specifying CorpusId and a parallel corpus is selected by specifying ParallelCorpusId, A parallel corpus can be further filtered by specifying particular CorpusIds in SourceFilters or TargetFilters"

Updated.

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 998 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

It might be more clear to say "...provided, all corpora will be used for training or pretranslation respectively", but this is fine too

updated

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 999 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Not necessary but I'd suggest: "...neither...nor..."

done

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.Translation/Controllers/TranslationEnginesController.cs line 1001 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I might further qualify it; something like: "If a corpus is selected for training or pretranslation but no further filters are provided, all selected corpora will be used for training or pretranslation respectively"

Done.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to see Damien's approval too, but it looks good to me. Thank you for doing this, John!

Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Peter asked us to mark the deprecated endpoints and properties. Should we do that in this PR or a separate PR? We just need to add the Obsolete attribute to the appropriate controller methods and DTO properties.

Reviewed 2 of 4 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


src/Serval/src/Serval.Translation/Services/EngineService.cs line 632 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

or "noTrainingCorpusFilter".

Although noTrainingCorpusFilter and noPretranslateCorpusFilter accurately describe the state of the data model, I would prefer to use names that represent what those states actually mean. When no training corpora is defined, it means that we want to train on all corpora. This makes the code clearer and more readable.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to do that in my original mixed source PR, and it yelled at me, so it might be more complicated than it seems and worth separating into a different PR.

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

@johnml1135
Copy link
Collaborator Author

I agree, let's make it a separate PR.

@johnml1135
Copy link
Collaborator Author

src/Serval/src/Serval.Translation/Services/EngineService.cs line 632 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Although noTrainingCorpusFilter and noPretranslateCorpusFilter accurately describe the state of the data model, I would prefer to use names that represent what those states actually mean. When no training corpora is defined, it means that we want to train on all corpora. This makes the code clearer and more readable.

Updated naming.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit f7060c7 into main Oct 29, 2024
2 of 3 checks passed
@johnml1135 johnml1135 deleted the fix_pretranslate_filtering_2 branch October 29, 2024 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants