-
-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SF-2900 Mixed Source: Selecting books from each source #2825
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2825 +/- ##
==========================================
+ Coverage 79.19% 79.75% +0.56%
==========================================
Files 533 534 +1
Lines 30962 30952 -10
Branches 5052 5030 -22
==========================================
+ Hits 24520 24687 +167
+ Misses 5668 5511 -157
+ Partials 774 754 -20 ☔ View full report in Codecov by Sentry. |
fba511b
to
cab2906
Compare
cab2906
to
ecef1d9
Compare
48b32dd
to
012546d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! I can definitely see what you meant when you mentioned waiting on SF-2525 for these changes.
Reviewed 6 of 9 files at r1, 23 of 23 files at r2, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @pmachapman)
src/SIL.XForge.Scripture/Models/ServalAdditionalTrainingData.cs
line 7 at r2 (raw file):
/// <summary> /// Serval Configuration for Additional Training Data. /// </summary>
Is there a link or class in Serval for reference to this?
Code quote:
/// </summary>
src/SIL.XForge.Scripture/Services/MachineApiService.cs
line 729 at r2 (raw file):
); }
Can you add a comment here to specify Translation section and above (line 702) for Training Section? The code is so similar I thought it was duplicated.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 394 at r2 (raw file):
// If there is an alternate source, ensure that writing system and RTL is correct if (projectDoc.Data.TranslateConfig.DraftConfig.AlternateSource is not null)
We might consider updating the settings.FullName
, as that can also be changed in the Paratext registry.
Code quote:
if (projectDoc.Data.TranslateConfig.DraftConfig.AlternateSource is not null)
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 408 at r2 (raw file):
alternateSourceSettings.IsRightToLeft ); if (alternateSourceSettings.LanguageTag is not null)
While unlikely, should we verify the primary source and alternate/additional source language tags match and error if they do not?
Code quote:
if (alternateSourceSettings.LanguageTag is not null)
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 587 at r2 (raw file):
/// <returns>The parallel corpus identifier.</returns> /// <remarks>This can be mocked in unit tests.</remarks> protected internal virtual async Task<string> CreateOrUpdateParallelCorpusAsync(
Curious to the reasoning behind making this method's virtual?
Code quote:
internal virtual async
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 860 at r2 (raw file):
} // Update the source writing system tag
It might be an unlikely scenario, but I wonder if we may want to check if RTL has changed for the source?
Code quote:
// Update the source writing system tag
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1162 at r2 (raw file):
cancellationToken ); bool recreateTranslationEngine = false;
If RTL has changed should we also recreate the translation engine?
Code quote:
recreateTranslationEngine
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1346 at r2 (raw file):
// Upload the source texts List<ServalCorpusFile> sourceCorpusFiles = [.. additionalTrainingData.CorpusFiles];
Seen the Range operator used a few times with this PR, what are some of the benefits to using this over simply having = additionalTrainingData.CorpusFiles;
?
Code quote:
[.. additionalTrainingData.CorpusFiles]
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1701 at r2 (raw file):
translationParallelCorpusId ); }
Do we need the if/else here?
I wonder if we can differentiate between NMT and SMT in the few places like this:
string sourceProjectId = preTranslate ? hasAlternateSource
? project.TranslateConfig.DraftConfig.AlternateSource.ProjectRef
: project.TranslateConfig.Source.ProjectRef
: project.TranslateConfig.Source.ProjectRef;
And then put the alternate/additional sources in a single if(preTranslate) { ...}
?
Code quote:
else
{
// Set up the parallel corpus for SMT translation
string sourceProjectId = project.TranslateConfig.Source.ProjectRef;
List<ServalCorpusFile> sourceCorpora = [servalCorpusFiles.Single(f => f.ProjectId == sourceProjectId)];
List<ServalCorpusFile> targetCorpora = [servalCorpusFiles.Single(f => f.ProjectId == project.Id)];
List<string> sourceCorpusIds = sourceCorpora.Select(f => f.CorpusId).ToList();
List<string> targetCorpusIds = targetCorpora.Select(f => f.CorpusId).ToList();
// Get the pre-translate parallel corpus id (might be null)
translationParallelCorpusId = projectSecret.ServalData.ParallelCorpusIdForSmt;
// Create or update the pre-translate parallel corpora
translationParallelCorpusId = await CreateOrUpdateParallelCorpusAsync(
translationEngineId,
translationParallelCorpusId,
name: "SmtTranslation",
sourceCorpusIds,
targetCorpusIds,
cancellationToken
);
// Record the corpus sync info
corporaSyncInfo = RecordServalCorpusSyncInfo(
corporaSyncInfo,
sourceCorpora,
targetCorpora,
translationParallelCorpusId
);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 25 of 30 files reviewed, 8 unresolved discussions (waiting on @kylebuss)
src/SIL.XForge.Scripture/Models/ServalAdditionalTrainingData.cs
line 7 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
Is there a link or class in Serval for reference to this?
No, because this is the class where we keep our configuration data for specifying the additional training data on Serval. I have updated the comment to be clearer.
src/SIL.XForge.Scripture/Services/MachineApiService.cs
line 729 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
Can you add a comment here to specify Translation section and above (line 702) for Training Section? The code is so similar I thought it was duplicated.
Done.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 394 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
We might consider updating the
settings.FullName
, as that can also be changed in the Paratext registry.
Done. Good point.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 408 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
While unlikely, should we verify the primary source and alternate/additional source language tags match and error if they do not?
I think our checks before we start are sufficient, and Serval will return if they are not compatible.
There is also the complexity of en_GB vs en_NZ - both are English, and so we may incorrectly mark as not equal. I think it's best not to do the checking here, and let Serval notify us if there is something wrong.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 587 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
Curious to the reasoning behind making this method's virtual?
So it can be overridden in unit tests.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 860 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
It might be an unlikely scenario, but I wonder if we may want to check if RTL has changed for the source?
This code is for a specific bug with back translations not being in the registry (see the big comment above the block). IsRighttoLeft was not affected by it, as far as I am aware. I'll make a note to investigate further if it is, as it would affect other areas of Scripture Forge if it is applicable.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1162 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
If RTL has changed should we also recreate the translation engine?
No. We don't send RTL to Serval.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1346 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
Seen the Range operator used a few times with this PR, what are some of the benefits to using this over simply having
= additionalTrainingData.CorpusFiles;
?
This syntax passes the items in the list, not a reference to the list. The older (pre-.NET 8.0) syntax would be:
List<ServalCorpusFile> sourceCorpusFiles = new List<ServalCorpusFile>(additionalTrainingData.CorpusFiles);
Passing a reference to the list would cause the original list to get trashed and all sorts of other bugs.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1701 at r2 (raw file):
Previously, kylebuss (Kyle Buss) wrote…
Do we need the if/else here?
I wonder if we can differentiate between NMT and SMT in the few places like this:
string sourceProjectId = preTranslate ? hasAlternateSource ? project.TranslateConfig.DraftConfig.AlternateSource.ProjectRef : project.TranslateConfig.Source.ProjectRef : project.TranslateConfig.Source.ProjectRef;
And then put the alternate/additional sources in a single
if(preTranslate) { ...}
?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for the explanations and clean up. I especially like the GetPreTranslationParametersAsync
that was added!
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @pmachapman)
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 587 at r2 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
So it can be overridden in unit tests.
Thanks for clarifying. Looks like Raymond had it correct in our discussion!
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1346 at r2 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
This syntax passes the items in the list, not a reference to the list. The older (pre-.NET 8.0) syntax would be:
List<ServalCorpusFile> sourceCorpusFiles = new List<ServalCorpusFile>(additionalTrainingData.CorpusFiles);Passing a reference to the list would cause the original list to get trashed and all sorts of other bugs.
Raymond was right again! And thanks for the explanation, I had never seen the [.. ] before this.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1701 at r2 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
Done.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I have a few minor suggestions and a couple blocking comments. Can you add acceptance tests to the Jira issue? It may be a big list of scenarios so I almost don't want to ask hah
Reviewed 5 of 9 files at r1, 20 of 23 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @pmachapman)
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 639 at r3 (raw file):
else { await translationEnginesClient.UpdateParallelCorpusAsync(
Calling UpdateParallelCorpusAsync will update the parallelCorpusId? It doesn't seem like a good idea to do that in an async method. Can you leave a comment stating this?
Code quote:
await translationEnginesClient.UpdateParallelCorpusAsync(
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 838 at r3 (raw file):
string translationEngineId = preTranslate ? projectSecret.ServalData?.PreTranslationEngineId : projectSecret.ServalData?.TranslationEngineId;
I have seen this a number of times. Perhaps we would be worth creating a helper method to get the translation engine Id and keep it more readable.
Code quote:
string translationEngineId = preTranslate
? projectSecret.ServalData?.PreTranslationEngineId
: projectSecret.ServalData?.TranslationEngineId;
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 932 at r3 (raw file):
if (string.IsNullOrWhiteSpace(translationEngineId)) { throw new DataNotFoundException("The translation engine is not specified.");
A better message would be "Failed to create a translation engine"
Code quote:
throw new DataNotFoundException("The translation engine is not specified.");
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1335 at r3 (raw file):
/// <returns>The updated <paramref name="additionalTrainingData"/>.</returns> /// <remarks>This can be mocked in unit tests.</remarks> protected internal virtual async Task<ServalAdditionalTrainingData?> SyncAdditionalTrainingData(
It seems that this method is really doing two separate things. If the buildConfig.TrainingDataFiles exists, then we get it and update the parallel corpus. However, if it trainingDataFiles is empty, then we delete the parallel corpus and return null. Would it be more clear to split this into two methods have the caller can decide which one to call?
Code quote:
protected internal virtual async Task<ServalAdditionalTrainingData?> SyncAdditionalTrainingData(
test/SIL.XForge.Scripture.Tests/Services/MachineProjectServiceTests.cs
line 129 at r3 (raw file):
Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationErrorMessage); Assert.IsNull(env.ProjectSecrets.Get(Project01).ServalData!.PreTranslationJobId); Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationQueuedAt);
Why are we getting the project secrets for both Project01 and Project02 here? I was expecting just Project01.
Code quote:
Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationErrorMessage);
Assert.IsNull(env.ProjectSecrets.Get(Project01).ServalData!.PreTranslationJobId);
Assert.IsNull(env.ProjectSecrets.Get(Project02).ServalData!.PreTranslationQueuedAt);
d6e0a16
to
3ebcd2b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add acceptance tests to the Jira issue?
I will when the changes to Serval are finalized (and before I send to QA). I still have a few questions I am investigating on the full scope of these changes across all systems (i.e. Serval, SF Backend, previous versions of SF Frontend).
Dismissed @Github-advanced-security[bot] from 2 discussions.
Reviewable status: 23 of 30 files reviewed, 3 unresolved discussions (waiting on @kylebuss and @RaymondLuong3)
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 639 at r3 (raw file):
Previously, RaymondLuong3 (Raymond Luong) wrote…
Calling UpdateParallelCorpusAsync will update the parallelCorpusId? It doesn't seem like a good idea to do that in an async method. Can you leave a comment stating this?
I am unsure what you mean? I have updated the code documentation to be a bit more descriptive.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 838 at r3 (raw file):
Previously, RaymondLuong3 (Raymond Luong) wrote…
I have seen this a number of times. Perhaps we would be worth creating a helper method to get the translation engine Id and keep it more readable.
Done.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 932 at r3 (raw file):
Previously, RaymondLuong3 (Raymond Luong) wrote…
A better message would be "Failed to create a translation engine"
Done.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1335 at r3 (raw file):
Previously, RaymondLuong3 (Raymond Luong) wrote…
It seems that this method is really doing two separate things. If the buildConfig.TrainingDataFiles exists, then we get it and update the parallel corpus. However, if it trainingDataFiles is empty, then we delete the parallel corpus and return null. Would it be more clear to split this into two methods have the caller can decide which one to call?
My thinking is whether it was creating, updating, or deleting, it would be keeping the data in sync between Serval, and so named this function "SyncAdditionalTrainingData". I have updated the method code documentation to make this a bit clearer.
test/SIL.XForge.Scripture.Tests/Services/MachineProjectServiceTests.cs
line 129 at r3 (raw file):
Previously, RaymondLuong3 (Raymond Luong) wrote…
Why are we getting the project secrets for both Project01 and Project02 here? I was expecting just Project01.
Fixed. Good spotting - thank you!
3ebcd2b
to
913650d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I will wait for the acceptance tests before giving the full approval.
Reviewed 4 of 7 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 639 at r3 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
I am unsure what you mean? I have updated the code documentation to be a bit more descriptive.
Right, what I mean is that we can run into problems when we use asynchronous code to update parameters provided to the method since other threads could depend on the value of that parameter if my understanding is correct. Since this is awaited, it should be fine. Thanks for the comments.
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 1335 at r3 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
My thinking is whether it was creating, updating, or deleting, it would be keeping the data in sync between Serval, and so named this function "SyncAdditionalTrainingData". I have updated the method code documentation to make this a bit clearer.
Right, that does make sense when you think of Sync that way. Thanks for the additional comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)
src/SIL.XForge.Scripture/Services/MachineProjectService.cs
line 639 at r3 (raw file):
Right, what I mean is that we can run into problems when we use asynchronous code to update parameters provided to the method since other threads could depend on the value of that parameter if my understanding is correct.
Yes, your understanding is correct. This is why, for example, you cannot have a ref
parameter in an async
method. Good thinking!
423980f
to
2df386f
Compare
be93f2a
to
5cbc8aa
Compare
5cbc8aa
to
4ecf7f8
Compare
b6e99ff
to
8bed61c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)
tools/ServalDownloader/Program.cs
line 93 at r7 (raw file):
Previously, RaymondLuong3 (Raymond Luong) wrote…
Should this be moved outside the loop?
Done. Thank you!
tools/ServalDownloader/Program.cs
line 117 at r7 (raw file):
Previously, RaymondLuong3 (Raymond Luong) wrote…
Could this be moved outside the loop?
Done. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good. I was able to follow the acceptance tests on the issue and existing features continued to work as expected. I am moving this to ready to test. Thanks Peter!
Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @pmachapman)
@RaymondLuong3 This PR is not yet ready for testing, as I am awaiting an update to Serval QA. |
Thanks Peter, sounds good. |
This PR adds support for parallel corpora in Serval, and backend support for selecting books from each source.
This is a key requirement for the mixed source logic, which is currently behind a feature flag.
The frontend does not yet allow selection of different books for each source.
Blockers for QA
This change is