-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move preprocess logic to toolkit #512
Conversation
Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together. |
Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once... |
So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here? |
Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work? |
Add a new set of unit tests in service toolkit? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 22 of 22 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ddaspit and @Enkidu93)
Fix build error Update Echo engine to use toolkit Fix async stream issue Fix test: Add ability to specify CorpusService mock
5a61291
to
3ec2bce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a unit test in the toolkit. I think ideally we'll port almost all the current PreprocessBuildTests to the toolkit. I'm wondering though if we should try to get this in first and then follow up with that in a separate PR since this PR will be difficult to merge once other changes have been put through.
Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs
line 98 at r3 (raw file):
}, cancellationToken );
@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.
src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs
line 94 at r3 (raw file):
[Test] public async Task RunAsync_PretranslateTextIdsOverlapWithTrainOnTextIds()
I also fixed a bug regarding this situation and added a test.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 157 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Pretranslations are not universal. Could this be broken up into a few major pieces - and then one Machine specific function inherit from this class to tie the main pieces together? The main processing of "monocorpus" and "parallelCorpus" are universal - and we should be able to have a few routines to slice and dice these things so that the 10 line linq expression only happens once...
Discussed in meeting
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 179 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
So does this follow the logic - Mix the sources, chose the first target? I can't see it clearly - should it be documented here?
That happens elsewhere, but yep, that's the logic. I can add more comments if something seems unclear.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 295 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Align Train and Align Pretranslate. If these will be in the ServiceToolkit, could we change the name to something more generic? AlignTrainCorpus may be good enough, but I would prefer something like "AlignResultsGenerationCorpus" or even more generically, "AlignMixSource" and "AlignChooseFirstSource". Would that work?
Discussed in meeting.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 313 at r3 (raw file):
.SelectMany(sc => trgCorpora.Select(tc => sc.AlignRows(tc, allSourceRows: true))) .ZipMany(rows => rows.Where(r => r.SourceSegment.Count > 0 && r.TargetSegment.Count == 0).FirstOrDefault()
This addresses #514
src/Serval/src/Serval.Shared/Serval.Shared.csproj
line 22 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Pull the most recent from main - we are up to Machine 3.4.0 - and there are 4 locations that all need to be updated together.
Done.
…anslations for segments with target text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 17 of 29 files reviewed, 7 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Echo/src/EchoTranslationEngine/TranslationEngineServiceV1.cs
line 98 at r3 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
@ddaspit, it might just be because it's late, but it's not clear to me how to go about doing this: I seem to be doing something wrong with how I'm passing in this async delegate and it's causing the Echo engine to hang and not respond. Any thoughts? I've not exhausted my debugging (although the debugger keeps crashing on me), but I thought maybe you'd just see it right off.
Nevermind. It didn't have to do with that at all. I just had an outdated version of Machine locally and was hitting the bug John had fixed in the ZipMany
function where it hits and infinite loop.
Were these tests wrong originally? Why did the values change? |
These should be removed from Machine/test, they are no longer needed. |
Agreed. Let's just do this small thing and get it released to solve the immediate issue. |
Previously, Enkidu93 (Eli C. Lowry) wrote…
Great! |
Previously, Enkidu93 (Eli C. Lowry) wrote…
Looks good to me. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #512 +/- ##
==========================================
- Coverage 57.25% 57.03% -0.22%
==========================================
Files 299 302 +3
Lines 15664 15605 -59
Branches 2163 2151 -12
==========================================
- Hits 8968 8900 -68
- Misses 6054 6063 +9
Partials 642 642 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge, everything is working properly now on this branch. The echo engine is only covered by an E2E test, so we should consider either giving up on testing the Echo engine too extensively OR adding an echo-specific suite of tests. I'm not too bothered by this since the logic is now shared via the toolkit (i.e., the tricky part is tested elsewhere). Also, like I mentioned above, the preprocess build job tests should be ported eventually.
Reviewable status: 16 of 31 files reviewed, 10 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Machine/test/Serval.Machine.Shared.Tests/Services/PreprocessBuildJobTests.cs
line 209 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
Were these tests wrong originally? Why did the values change?
Most of these are reversions to the values they had pre-mixed-source-support. I was confused about the intended functionality regarding pretranslation, so I changed the values. But with the changes in this PR, they can be changed back. Let me hand reckon the numbers one more time to make sure.
src/Serval/test/Serval.E2ETests/ServalClientHelper.cs
line 194 at r6 (raw file):
FileFormat.Text, targetLanguage, isTarget: true
@johnml1135, do you mind if I just do it like this? I found the echo-specific logic in the helper to be confusing.
src/ServiceToolkit/test/SIL.ServiceToolkit/Utils/data/source2.txt
line 1 at r4 (raw file):
Previously, johnml1135 (John Lambert) wrote…
These should be removed from Machine/test, they are no longer needed.
Once the other tests are ported, they won't be needed, right. But for now, they're duplicated.
I'll go ahead and rebase this once the machine side is through, but I've made the updates here so you can see how the NParallelTextCorpus will be used. |
Also fixes #536 |
Fixes #533 as well |
This also follows up on sillsdev/machine#263 |
I believe that this is ready for review, @ddaspit & @johnml1135 |
Previously, ddaspit (Damien Daspit) wrote…
I agree - the readability would be improved. |
Oh, I'm sorry, @ddaspit. I don't know how I missed your earlier comments. I'll get right on those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 22 files at r8, 3 of 10 files at r10, 2 of 6 files at r11, 2 of 5 files at r12, 1 of 1 files at r13, 1 of 2 files at r14, 6 of 6 files at r15, 1 of 1 files at r16, 4 of 4 files at r17, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ddaspit and @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 38 of 39 files reviewed, 5 unresolved discussions (waiting on @ddaspit and @johnml1135)
src/Serval/src/Serval.Translation/Services/EngineService.cs
line 687 at r6 (raw file):
Previously, johnml1135 (John Lambert) wrote…
What is the significance of the logic change?
It's been a while, so I might not be answering your question correctly: If a deprecated corpus has no source files or target files, that ought to map to a parallel corpus that has no source corpora or target corpora respectively.
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs
line 43 at r10 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I agree - the readability would be improved.
Done.
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs
line 46 at r10 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
There is no need to perform this
Where
operation ifTrainOnChapters
isnull
. Also, we should callFilterTexts
on the books specified inTrainOnChapters
, so we don't unnecessarily parse books.
Done.
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs
line 56 at r10 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
I would go ahead and pull this out to a separate function as well to improve readability.
Done.
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs
line 63 at r10 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
We exclude verses from the training corpus when
PretranslateTextIds
orPretranslateChapters
is specified. We should also exclude verses from the training corpus when neither is specified.
At the moment, we're doing it the other way around: excluding training content from pretranslation, not pretranslation from training content. Regardless, is this the behavior we want? At the moment, I think the issue you are describing is 'covered' in the PreprocessBuildJob
:
if (row.SourceSegment.Length > 0 && row.TargetSegment.Length == 0)
{
pretranslateWriter.WriteStartObject();
...
pretranslateCount++;
}
If trainOn
is null but so is pretranslate
or if pretranslateTextIds
has a value, what should our behavior be? Do we only want to exclude things at the textId/book level or in these circumstances, wouldn't it be better to look at the row-level? Otherwise, how could you get the train-on-all-parallel-data-pretranslate-all-non-parallel data functionality (which seems like the most straightforward use)?
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs
line 89 at r10 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This check can be moved up higher.
Done.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 3 at r6 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
This should be implemented as a service that we inject using DI. You should create an extension method that adds the required services to the DI in an
IServiceCollectionExtensions
class.
This is something you fixed while I was gone, right?
src/ServiceToolkit/test/SIL.ServiceToolkit/SIL.ServiceToolkit.Tests.csproj
line at r7 (raw file):
Previously, ddaspit (Damien Daspit) wrote…
The directory name should match the assembly name.
This is something you fixed while I was gone, right?
Sorry for beating the dead horse - but I thought that we arrived at this logic for pretranslating: #533 (comment) - On a per ParallelCorpus basis, if there is target training data that passes the filter, those verses will not be pretranslated. Therefore, the following would hold:
|
Previously, johnml1135 (John Lambert) wrote…
You're totally right, John. There's an inconsistency here. I think I misunderstood the #533 comment and missed the 'if there is data' part, so I was thinking that we were only using the filters to include/exclude, but I kept the old behavior with train on all. If that is what we want, then the changes I just made should make it function like that. This also addresses my comment elsewhere to @ddaspit regarding filtering behavior. This is more consistent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 6 files at r11, 4 of 5 files at r12, 2 of 2 files at r14, 5 of 6 files at r15, 1 of 1 files at r16, 4 of 4 files at r17, 2 of 2 files at r19, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
src/ServiceToolkit/src/SIL.ServiceToolkit/Services/ParallelCorpusPreprocessingService.cs
line 63 at r10 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
At the moment, we're doing it the other way around: excluding training content from pretranslation, not pretranslation from training content. Regardless, is this the behavior we want? At the moment, I think the issue you are describing is 'covered' in the
PreprocessBuildJob
:if (row.SourceSegment.Length > 0 && row.TargetSegment.Length == 0) { pretranslateWriter.WriteStartObject(); ... pretranslateCount++; }
If
trainOn
is null but so ispretranslate
or ifpretranslateTextIds
has a value, what should our behavior be? Do we only want to exclude things at the textId/book level or in these circumstances, wouldn't it be better to look at the row-level? Otherwise, how could you get the train-on-all-parallel-data-pretranslate-all-non-parallel data functionality (which seems like the most straightforward use)?
The code in PreprocessBuildJob
should cover it. Yes, we need to check on a row-level if trainOn
and pretranslate
are not specified.
src/ServiceToolkit/src/SIL.ServiceToolkit/Utils/ParallelCorpusPreprocessor.cs
line 3 at r6 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
This is something you fixed while I was gone, right?
Yes, you're correct.
src/ServiceToolkit/test/SIL.ServiceToolkit/SIL.ServiceToolkit.Tests.csproj
line at r7 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
This is something you fixed while I was gone, right?
Yes, I fixed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r19.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
Fixes #396
How should we go about testing the new Echo engine functionality? Just add integration tests?
This change is