-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore ellipses #469
Ignore ellipses #469
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)
src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs
line 294 at r1 (raw file):
// filter out every list that only contains completely empty rows .Where(rows => rows.Any(r =>
We also want to transform the ellipses to an empty string, so that there aren't any ellipses in the training corpus.
Previously, ddaspit (Damien Daspit) wrote…
And add a test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
Sorry for the delay here. Really lost as to why one test passed locally but failed in CI 🤔. |
e8855cc
to
ac9b563
Compare
Do you believe there is sufficient testing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #469 +/- ##
==========================================
- Coverage 56.64% 56.63% -0.01%
==========================================
Files 275 275
Lines 14169 14174 +5
Branches 1897 1902 +5
==========================================
+ Hits 8026 8028 +2
Misses 5557 5557
- Partials 586 589 +3 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs
line 294 at r1 (raw file):
Previously, johnml1135 (John Lambert) wrote…
And add a test.
I believe this is sufficiently tested. I'm not sure why the test was failing previously - now it's passing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'm happy to add more though if you have something in mind that I'm missing.
Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code will only remove ellipses from text file corpora and not Paratext corpora. If you actually look at the generated corpus files in the RunAsync_MixedSource_Paratext
test, you will find that it still contains ellipses. I added a commit that uses the Transform
operation to clean ellipsis segments. It uses a CleanSegment
function, which will give us someplace to do any further cleaning of segments.
We will need better unit tests, since the current unit test didn't catch the error. It would probably be good to have a specific unit test to test stripping ellipsis segments.
Also, if your branch falls behind the master branch, you can just rebase the branch on top of master. This will keep the history graph a lot cleaner.
Reviewed 1 of 1 files at r3, 3 of 3 files at r4.
Reviewable status: 2 of 4 files reviewed, all discussions resolved (waiting on @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use your DummyCorpus
class to create specific unit tests for this issue. Also, I would recommend updating the DummyCorpus
to extend DictionaryTextCorpus
that way you don't have to have implement all of the methods.
Reviewable status: 2 of 4 files reviewed, all discussions resolved (waiting on @johnml1135)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
OK! I see: Yep, OK. The way things are being tested seems a little odd. Is there a reason we can't just compare to the extracted text itself instead of relying on these counts? Or at least make that an option? I get that for some of the terms tests, the expected string would be too large, but for these toy examples, it seems like that would be more fool proof. I'm sorry. I hit rebase in the web gui, but then when I went to the source control gui in vscode, it listed a whole bunch of outgoing commits - not sure what happened - anyways, once again, I should probably just use the command line as convenient as the guis seem haha. I apologize - I did mean to rebase 🫡. OK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, revisiting your idea of having the DummyCorpus
inherit from DictionaryTextCorpus
: I would need to override some of those methods. Can I use the new
keyword and just hide DictionaryTextCorpus
's implementations? Or what did you have in mind? (Just making sure I'm following).
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
For now, I've added a test using the existing file-based strategy. I still need to extend it at least a bit to cover the pretranslation filtering, but is this kind of test acceptable for now? Having more unit-test-style tests would be preferable, but the I'm concerned about using the |
bce36bb
to
3428cc1
Compare
My recommendation is to finish this quickly (this morning) to get it to QA for SF and then prioritize a refactoring of the relevant tests. |
There are now tests although like you say, refactoring at some point might be beneficial - that way, we can easily and more quickly get better coverage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are perfectly fine for now. We can always refactor things later.
Reviewed 1 of 2 files at r5, 2 of 2 files at r7, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)
Fixes #458
Replacement for sillsdev/machine#235.
This change is