Ignore ellipses #469

Enkidu93 · 2024-08-27T19:54:40Z

Fixes #458

ddaspit

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 294 at r1 (raw file):

            // filter out every list that only contains completely empty rows
            .Where(rows =>
                rows.Any(r =>

We also want to transform the ellipses to an empty string, so that there aren't any ellipses in the training corpus.

johnml1135 · 2024-08-28T12:17:11Z

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 294 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We also want to transform the ellipses to an empty string, so that there aren't any ellipses in the training corpus.

And add a test.

johnml1135

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

Enkidu93 · 2024-08-28T18:59:49Z

Sorry for the delay here. Really lost as to why one test passed locally but failed in CI 🤔.

johnml1135 · 2024-08-28T19:26:15Z

Do you believe there is sufficient testing?

johnml1135

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

codecov-commenter · 2024-08-28T19:41:02Z

Codecov Report

Attention: Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

Project coverage is 56.63%. Comparing base (8857819) to head (35a4aa2).

Files with missing lines	Patch %	Lines
...rval.Machine.Shared/Services/PreprocessBuildJob.cs	66.66%	0 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #469      +/-   ##
==========================================
- Coverage   56.64%   56.63%   -0.01%     
==========================================
  Files         275      275              
  Lines       14169    14174       +5     
  Branches     1897     1902       +5     
==========================================
+ Hits         8026     8028       +2     
  Misses       5557     5557              
- Partials      586      589       +3

Enkidu93

Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 294 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

And add a test.

I believe this is sufficiently tested. I'm not sure why the test was failing previously - now it's passing.

Enkidu93

Yes. I'm happy to add more though if you have something in mind that I'm missing.

Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)

johnml1135 · 2024-08-28T21:03:51Z

johnml1135

Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

ddaspit

The current code will only remove ellipses from text file corpora and not Paratext corpora. If you actually look at the generated corpus files in the RunAsync_MixedSource_Paratext test, you will find that it still contains ellipses. I added a commit that uses the Transform operation to clean ellipsis segments. It uses a CleanSegment function, which will give us someplace to do any further cleaning of segments.
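A minimal sketch of the cleaning idea described above, not the code from the commit. The `CleanSegment` body here is a hypothetical stand-in, the sample segments are invented, and LINQ `Select` merely plays the role of the corpus Transform step; the real corpus types are not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the CleanSegment function mentioned above:
// a segment that consists only of an ellipsis becomes an empty string.
static string CleanSegment(string segment)
{
    string trimmed = segment.Trim();
    return trimmed == "..." || trimmed == "\u2026" ? string.Empty : segment;
}

// Invented sample data. Select stands in for a per-segment transform
// applied to every row, whatever kind of corpus the row came from.
string[] segments = { "In the beginning", "...", " \u2026 ", "And so on..." };
List<string> cleaned = segments.Select(s => CleanSegment(s)).ToList();

foreach (string s in cleaned)
    Console.WriteLine($"[{s}]");
```

In this sketch the two ellipsis-only segments come out empty, while the segment that merely ends in an ellipsis is left untouched. Doing the cleaning per segment rather than per source is what lets the same step cover every corpus type.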

We will need better unit tests, since the current unit test didn't catch the error. It would probably be good to have a specific unit test to test stripping ellipsis segments.

Also, if your branch falls behind the main branch, you can just rebase the branch on top of main. This will keep the history graph a lot cleaner.

Reviewed 1 of 1 files at r3, 3 of 3 files at r4.
Reviewable status: 2 of 4 files reviewed, all discussions resolved (waiting on @johnml1135)

ddaspit

You could use your DummyCorpus class to create specific unit tests for this issue. Also, I would recommend updating the DummyCorpus to extend DictionaryTextCorpus, so that you don't have to implement all of the methods.

Reviewable status: 2 of 4 files reviewed, all discussions resolved (waiting on @johnml1135)

johnml1135

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Enkidu93 · 2024-08-29T13:05:46Z

OK! I see: GetTrainCountAsync() wasn't behaving as I expected - that makes sense.

Yep, OK. The way things are being tested seems a little odd. Is there a reason we can't just compare to the extracted text itself instead of relying on these counts? Or at least make that an option? I get that for some of the terms tests, the expected string would be too large, but for these toy examples, it seems like that would be more foolproof.

I'm sorry. I hit rebase in the web GUI, but then when I went to the source control GUI in VS Code, it listed a whole bunch of outgoing commits - not sure what happened. Anyway, once again, I should probably just use the command line, as convenient as the GUIs seem haha. I apologize - I did mean to rebase 🫡.

OK.

Enkidu93

Hmm, revisiting your idea of having the DummyCorpus inherit from DictionaryTextCorpus: I would need to override some of those methods. Can I use the new keyword and just hide DictionaryTextCorpus's implementations? Or what did you have in mind? (Just making sure I'm following).
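As general C# background for the question above (a sketch with hypothetical types, not the actual DictionaryTextCorpus members): a member hidden with `new` is only used when the call goes through the derived type, whereas an `override` is used even through a base-typed reference. Whether hiding would be enough therefore depends on how the corpus is consumed, which this sketch does not attempt to answer.

```csharp
using System;

// Calls made through a base-typed reference ignore members hidden with `new`.
BaseCorpus hidden = new HidingCorpus();
BaseCorpus overridden = new OverridingCorpus();

Console.WriteLine(hidden.Describe());             // base implementation runs
Console.WriteLine(overridden.Describe());         // derived override runs
Console.WriteLine(new HidingCorpus().Describe()); // hiding member runs

// Hypothetical types for illustration only; they do not mirror the
// real corpus classes discussed in this thread.
class BaseCorpus
{
    public virtual string Describe() => "base";
}

class HidingCorpus : BaseCorpus
{
    // `new` hides the base member; no virtual dispatch takes place.
    public new string Describe() => "hidden";
}

class OverridingCorpus : BaseCorpus
{
    // `override` replaces the base member for all callers.
    public override string Describe() => "overridden";
}
```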

Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Enkidu93 · 2024-08-29T13:55:25Z

For now, I've added a test using the existing file-based strategy. I still need to extend it at least a bit to cover the pretranslation filtering, but is this kind of test acceptable for now? Having more unit-test-style tests would be preferable, but I'm concerned about using the DummyCorpus both for that and for the fails on/exception throwing testing. Certainly doable though, @ddaspit, if you'd prefer. Lmk.

johnml1135 · 2024-08-29T14:25:09Z

My recommendation is to finish this quickly (this morning) to get it to QA for SF and then prioritize a refactoring of the relevant tests.

Enkidu93 · 2024-08-29T14:37:48Z

There are now tests, although, like you say, refactoring at some point might be beneficial - that way, we can more easily and quickly get better coverage.

ddaspit

The tests are perfectly fine for now. We can always refactor things later.

Reviewed 1 of 2 files at r5, 2 of 2 files at r7, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

johnml1135

Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Enkidu93 requested review from ddaspit and johnml1135 August 27, 2024 19:54

Enkidu93 mentioned this pull request Aug 27, 2024

Ignore freestanding ellipses sillsdev/machine#235

Closed

ddaspit requested changes Aug 27, 2024

johnml1135 requested a review from ddaspit August 28, 2024 15:27

johnml1135 reviewed Aug 28, 2024

Enkidu93 force-pushed the ignore_ellipses branch from e8855cc to ac9b563 Compare August 28, 2024 19:15

johnml1135 reviewed Aug 28, 2024

Enkidu93 commented Aug 28, 2024

johnml1135 approved these changes Aug 28, 2024

ddaspit reviewed Aug 28, 2024

johnml1135 reviewed Aug 29, 2024

Enkidu93 commented Aug 29, 2024

Filter out ellipses segments

3428cc1

Enkidu93 force-pushed the ignore_ellipses branch from bce36bb to 3428cc1 Compare August 29, 2024 14:14

Test pretranslation content

35a4aa2

ddaspit approved these changes Aug 29, 2024

johnml1135 approved these changes Aug 29, 2024

johnml1135 merged commit a111c67 into main Aug 29, 2024
3 checks passed

ddaspit deleted the ignore_ellipses branch September 6, 2024 16:53
