Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore ellipses #469

Merged
merged 2 commits into from
Aug 29, 2024
Merged

Ignore ellipses #469

merged 2 commits into from
Aug 29, 2024

Conversation

Enkidu93
Copy link
Collaborator

@Enkidu93 Enkidu93 commented Aug 27, 2024

Fixes #458

Replacement for sillsdev/machine#235.


This change is Reviewable

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93 and @johnml1135)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 294 at r1 (raw file):

            // filter out every list that only contains completely empty rows
            .Where(rows =>
                rows.Any(r =>

We also want to transform the ellipses to an empty string, so that there aren't any ellipses in the training corpus.

@johnml1135
Copy link
Collaborator

src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 294 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We also want to transform the ellipses to an empty string, so that there aren't any ellipses in the training corpus.

And add a test.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

@Enkidu93
Copy link
Collaborator Author

Sorry for the delay here. Really lost as to why one test passed locally but failed in CI 🤔.

@johnml1135
Copy link
Collaborator

Do you believe there is sufficient testing?

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

@codecov-commenter
Copy link

codecov-commenter commented Aug 28, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.

Project coverage is 56.63%. Comparing base (8857819) to head (35a4aa2).

Files with missing lines Patch % Lines
...rval.Machine.Shared/Services/PreprocessBuildJob.cs 66.66% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #469      +/-   ##
==========================================
- Coverage   56.64%   56.63%   -0.01%     
==========================================
  Files         275      275              
  Lines       14169    14174       +5     
  Branches     1897     1902       +5     
==========================================
+ Hits         8026     8028       +2     
  Misses       5557     5557              
- Partials      586      589       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


src/Machine/src/Serval.Machine.Shared/Services/PreprocessBuildJob.cs line 294 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

And add a test.

I believe this is sufficiently tested. I'm not sure why the test was failing previously - now it's passing.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'm happy to add more though if you have something in mind that I'm missing.

Reviewable status: 1 of 4 files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)

@johnml1135
Copy link
Collaborator

:lgtm:

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code will only remove ellipses from text file corpora and not Paratext corpora. If you actually look at the generated corpus files in the RunAsync_MixedSource_Paratext test, you will find that it still contains ellipses. I added a commit that uses the Transform operation to clean ellipsis segments. It uses a CleanSegment function, which will give us someplace to do any further cleaning of segments.

We will need better unit tests, since the current unit test didn't catch the error. It would probably be good to have a specific unit test to test stripping ellipsis segments.

Also, if your branch falls behind the master branch, you can just rebase the branch on top of master. This will keep the history graph a lot cleaner.

Reviewed 1 of 1 files at r3, 3 of 3 files at r4.
Reviewable status: 2 of 4 files reviewed, all discussions resolved (waiting on @johnml1135)

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use your DummyCorpus class to create specific unit tests for this issue. Also, I would recommend updating the DummyCorpus to extend DictionaryTextCorpus that way you don't have to have implement all of the methods.

Reviewable status: 2 of 4 files reviewed, all discussions resolved (waiting on @johnml1135)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

OK! I see: GetTrainCountAsync() wasn't behaving as I expected - that makes sense.

Yep, OK. The way things are being tested seems a little odd. Is there a reason we can't just compare to the extracted text itself instead of relying on these counts? Or at least make that an option? I get that for some of the terms tests, the expected string would be too large, but for these toy examples, it seems like that would be more fool proof.

I'm sorry. I hit rebase in the web gui, but then when I went to the source control gui in vscode, it listed a whole bunch of outgoing commits - not sure what happened - anyways, once again, I should probably just use the command line as convenient as the guis seem haha. I apologize - I did mean to rebase 🫡.

OK.

Copy link
Collaborator Author

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, revisiting your idea of having the DummyCorpus inherit from DictionaryTextCorpus: I would need to override some of those methods. Can I use the new keyword and just hide DictionaryTextCorpus's implementations? Or what did you have in mind? (Just making sure I'm following).

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@Enkidu93
Copy link
Collaborator Author

For now, I've added a test using the existing file-based strategy. I still need to extend it at least a bit to cover the pretranslation filtering, but is this kind of test acceptable for now? Having more unit-test-style tests would be preferable, but the I'm concerned about using the DummyCorpus both for that and for the fails on/exception throwing testing. Certainly doable though, @ddaspit, if you'd prefer. Lmk.

@johnml1135
Copy link
Collaborator

My recommendation is to finish this quickly (this morning) to get it to QA for SF and then prioritize a refactoring of the relevant tests.

@Enkidu93
Copy link
Collaborator Author

There are now tests although like you say, refactoring at some point might be beneficial - that way, we can easily and more quickly get better coverage.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are perfectly fine for now. We can always refactor things later.

Reviewed 1 of 2 files at r5, 2 of 2 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Enkidu93)

@johnml1135 johnml1135 merged commit a111c67 into main Aug 29, 2024
3 checks passed
@ddaspit ddaspit deleted the ignore_ellipses branch September 6, 2024 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suppress "..." verse text during preprocessing of the model training data
4 participants