Port USFM code from Machine up to commit a9058ce #111

mshannon-sil · 2024-07-19T14:22:58Z

ddaspit

Is this ready to be reviewed? Does this include all of the commits that need to be ported?

Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @mshannon-sil)

mshannon-sil

Not yet; I'm currently finishing up the commit for non-verse text support and should be done with it by the end of tomorrow. There are still more commits to port after that, but if you'd like to review the PR midway through the porting process, that could be a good point to do so.

Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @mshannon-sil)

codecov-commenter · 2024-08-13T18:13:39Z

Codecov Report

Attention: Patch coverage is 94.52381% with 46 lines in your changes missing coverage. Please review.

Project coverage is 88.12%. Comparing base (2f7f44f) to head (5f411c3).

Files	Patch %	Lines
machine/corpora/scripture_ref.py	84.94%	14 Missing ⚠️
machine/corpora/scripture_element.py	76.92%	9 Missing ⚠️
machine/scripture/verse_ref.py	77.27%	5 Missing ⚠️
machine/corpora/text_corpus.py	75.00%	4 Missing ⚠️
machine/corpora/usfm_text_updater.py	95.55%	4 Missing ⚠️
machine/corpora/standard_parallel_text_corpus.py	91.42%	3 Missing ⚠️
...e/corpora/paratext_project_settings_parser_base.py	66.66%	2 Missing ⚠️
machine/corpora/flatten.py	75.00%	1 Missing ⚠️
machine/corpora/parallel_text_row.py	50.00%	1 Missing ⚠️
machine/corpora/scripture_text.py	96.55%	1 Missing ⚠️
... and 2 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #111      +/-   ##
==========================================
+ Coverage   87.97%   88.12%   +0.15%     
==========================================
  Files         243      247       +4     
  Lines       14239    14799     +560     
==========================================
+ Hits        12527    13042     +515     
- Misses       1712     1757      +45

mshannon-sil

Reviewable status: 0 of 35 files reviewed, 7 unresolved discussions

machine/corpora/scripture_ref_usfm_parser_handler.py line 23 at r6 (raw file):

    def __init__(self) -> None:
        self._cur_verse_ref: VerseRef = VerseRef()
        self._cur_elements_stack: List[ScriptureElement] = []

For variables that were a Stack type in C#, I used a list in Python, since appending to and popping from the end of a list is fast, just like a stack. I added the _stack suffix to these variable names to document that they should be used as stacks.
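
A minimal sketch of the list-as-stack convention described in this comment. The variable name and the string elements are illustrative stand-ins, not code from the PR (the real stack holds ScriptureElement objects).

```python
from typing import List

# Hypothetical stand-in for self._cur_elements_stack.
cur_elements_stack: List[str] = []

# Push: append to the end of the list (amortized constant time).
cur_elements_stack.append("p")
cur_elements_stack.append("f")

# Peek: the top of the stack is the last item.
top = cur_elements_stack[-1]

# Pop: remove and return the last item (constant time).
popped = cur_elements_stack.pop()
```

Pushing and popping at the end of the list is the cheap direction; inserting or removing at index 0 would shift every other element.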

machine/corpora/scripture_ref_usfm_parser_handler.py line 173 at r6 (raw file):

        )
        # No need to reverse unlike in Machine, elements are already added in correct order
        path = [e for e in self._cur_elements_stack if e.position > 0]

In Machine, the stack has to be reversed because of the order in which elements are added to it. However, when I ported the code to machine.py and used a list, the elements were already being added in the correct order, and reversing only caused bugs, so I removed the reversal in machine.py.
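
One likely explanation, offered here as an assumption rather than something confirmed in the thread: enumerating a C# Stack<T> yields items from the top down (most recently pushed first), whereas iterating a Python list yields them in insertion order. A small sketch of the difference, using illustrative marker names:

```python
from typing import List

stack: List[str] = []
for marker in ["c", "p", "f"]:  # pushed from outermost to innermost
    stack.append(marker)

# Iterating a Python list visits elements bottom-to-top, i.e. in insertion order.
python_order = list(stack)

# Enumerating a C# Stack<T> visits elements top-to-bottom; reversed() mimics that order.
csharp_enumeration_order = list(reversed(stack))
```

Under that assumption, the C# code needs a reversal to get outermost-first order, while the Python list already has it.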

machine/corpora/scripture_text.py line 43 at r6 (raw file):

        else:
            yield from self._create_rows_scripture_ref(ref, text, is_sentence_start)

In Machine, this method is overloaded with two separate implementations, one taking a VerseRef and one taking a List of ScriptureRef. Since I can't do exactly the same thing in Python, I created a method with the general create_rows name that checks the type of the input and calls the corresponding create_rows method for that type.
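
A rough sketch of the type-checking dispatch described in this comment. The classes here are simplified stand-ins, the string rows are placeholders, and the name _create_rows_verse_ref is an assumption (only _create_rows_scripture_ref appears in the quoted code).

```python
from typing import Generator, List, Union


class VerseRef:  # simplified stand-in for machine.scripture.VerseRef
    def __init__(self, verse: int) -> None:
        self.verse = verse


class ScriptureRef:  # simplified stand-in
    def __init__(self, verse: int, path: str = "") -> None:
        self.verse = verse
        self.path = path


class ScriptureTextSketch:
    def _create_rows(
        self, ref: Union[VerseRef, List[ScriptureRef]], text: str = ""
    ) -> Generator[str, None, None]:
        # Python has no signature-based overloading, so dispatch on the runtime type.
        if isinstance(ref, VerseRef):
            yield from self._create_rows_verse_ref(ref, text)
        else:
            yield from self._create_rows_scripture_ref(ref, text)

    def _create_rows_verse_ref(self, ref: VerseRef, text: str) -> Generator[str, None, None]:
        yield f"verse {ref.verse}: {text}"

    def _create_rows_scripture_ref(
        self, refs: List[ScriptureRef], text: str
    ) -> Generator[str, None, None]:
        for r in refs:
            yield f"scripture ref {r.verse}/{r.path}: {text}"
```

The standard library's functools.singledispatchmethod is another way to get a similar effect, at the cost of registering each implementation separately.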

machine/corpora/text_corpus.py line 203 at r6 (raw file):

class _FilterTextCorpus(TextCorpus):

The class _FilterTextCorpus in machine.py is called WhereTextCorpus in Machine. I thought this might be due to different naming conventions in Python vs. C#, but _TextFilterTextCorpus in machine.py has the same name in Machine. Is the naming difference here intentional?

machine/scripture/scripture_ref.py line 18 at r6 (raw file):

        self._path: List[ScriptureElement] = path if path is not None else []

    _empty: Optional[ScriptureRef] = None

I was concerned about properties of the _empty ScriptureRef (verse_ref and path) getting modified and affecting all instances so that they are no longer empty. But since the public properties only have getters not setters, it should be fine as long as the corresponding private properties aren't modified.

machine/scripture/verse_ref.py line 380 at r6 (raw file):

        return 0

    def __eq__(self, other: object) -> bool:

The Equals method in SIL.Scripture/VerseRef.cs is the same as the exact_equals method that was in machine.py. I changed the machine.py version to be the __eq__ method instead so that it is called when == is used, since this seemed to me to be what we would want, and it didn't break any tests. Let me know if there's a reason not to make it the default equality comparison.

machine/scripture/verse_ref.py line 488 at r6 (raw file):

def are_overlapping_verse_ranges_vref(verse_ref1: VerseRef, verse_ref2: VerseRef) -> bool:
    if verse_ref1.is_default or verse_ref2.is_default:

I think there's actually a bug in SIL.Scripture/VerseRef.cs. The corresponding section in OverlappingVersesRanges for the VerseRef type input is if (verseRef1.IsDefault || verseRef1.IsDefault) but it should be if (verseRef1.IsDefault || verseRef2.IsDefault).

ddaspit

Reviewed 1 of 1 files at r5, 34 of 34 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @mshannon-sil)

machine/corpora/scripture_ref_usfm_parser_handler.py line 23 at r6 (raw file):

Previously, mshannon-sil wrote…

For variables that were a Stack type in C#, I used a list in Python, since appending to and popping from the end of a list is fast, just like a stack. I added the _stack suffix to these variable names to document that they should be used as stacks.

Makes sense to me.

machine/corpora/scripture_text.py line 43 at r6 (raw file):

Previously, mshannon-sil wrote…

In Machine, this method is overloaded with two separate implementations, one taking a VerseRef and one taking a List of ScriptureRef. Since I can't do exactly the same thing in Python, I created a method with the general create_rows name that checks the type of the input and calls the corresponding create_rows method for that type.

That is a good solution.

machine/corpora/scripture_text.py line 82 at r6 (raw file):

            )

    def _create_row(

Because Python doesn't support overloading, we sometimes have to deviate from the C# code a bit, as you discovered for the _create_rows method. I think it would be good to name this something else, maybe _create_scripture_row, then you wouldn't have to call the base class _create_row method using super(). If you rename this method, then you should also rename the _create_rows method to match.
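
A small sketch of the renaming suggested here, using hypothetical simplified classes and tuple rows. Because the subclass method has its own name, it does not shadow the base class method, so a plain self call reaches the base implementation.

```python
from typing import Tuple


class TextSketch:  # hypothetical stand-in for the base text class
    def _create_row(self, ref: str, text: str) -> Tuple[str, str]:
        return (ref, text)


class ScriptureTextSketch(TextSketch):
    def _create_scripture_row(
        self, verse: int, text: str, is_sentence_start: bool = True
    ) -> Tuple[str, str, bool]:
        # The distinct name means self._create_row still resolves to the
        # base class method, so no super() call is needed.
        ref, row_text = self._create_row(f"v{verse}", text)
        return (ref, row_text, is_sentence_start)
```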

machine/corpora/scripture_text_corpus.py line 111 at r6 (raw file):

def is_scripture(text_corpus: TextCorpus) -> bool:

We want to export this function from the __init__.py of the corpora package.

machine/corpora/standard_parallel_text_corpus.py line 255 at r6 (raw file):

        trg_refs = [] if trg_row is None else [trg_row.ref]

        if len(trg_refs) == 0 and isinstance(self._target_corpus, ScriptureTextCorpus):

This was a bug in the C# code, you should use the is_scripture function.

machine/corpora/text_corpus.py line 203 at r6 (raw file):

Previously, mshannon-sil wrote…

The class _FilterTextCorpus in machine.py is called WhereTextCorpus in Machine. I thought this might be due to different naming conventions in Python vs. C#, but _TextFilterTextCorpus in machine.py has the same name in Machine. Is the naming difference here intentional?

As you guessed, this is because of the difference in function names between Python and C#. Python uses map and filter. C# uses Select and Where. I wanted to match the standard names in each language.
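
For readers less familiar with the Python side of this naming, a short illustration of the built-ins and their LINQ counterparts (the data is made up):

```python
numbers = [1, 2, 3, 4]

# filter() keeps the items that satisfy a predicate; the LINQ counterpart is Where.
evens = list(filter(lambda n: n % 2 == 0, numbers))

# map() transforms each item; the LINQ counterpart is Select.
doubled = list(map(lambda n: n * 2, numbers))
```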

machine/scripture/scripture_element.py line 10 at r6 (raw file):

@total_ordering
class ScriptureElement(Comparable):

This class should be moved to the corpora package. This class should also be exported from __init__.py in the corpora package.

machine/scripture/scripture_ref.py line 13 at r6 (raw file):

@total_ordering
class ScriptureRef(Comparable):

This class should be moved to the corpora package.

machine/scripture/scripture_ref.py line 18 at r6 (raw file):

Previously, mshannon-sil wrote…

I was concerned about properties of the _empty ScriptureRef (verse_ref and path) getting modified and affecting all instances so that they are no longer empty. But since the public properties only have getters not setters, it should be fine as long as the corresponding private properties aren't modified.

You can make this a top-level constant, i.e. EMPTY_SCRIPTURE_REF.
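
A minimal sketch of the top-level constant approach, assuming a simplified class. ScriptureRefSketch, its integer verse field, and the is_empty property are illustrative; the real ScriptureRef holds a VerseRef and a list of ScriptureElement.

```python
from typing import List, Optional


class ScriptureRefSketch:
    def __init__(self, verse: int = 0, path: Optional[List[str]] = None) -> None:
        self._verse = verse
        self._path: List[str] = path if path is not None else []

    @property
    def verse(self) -> int:  # getter only, so assignment raises AttributeError
        return self._verse

    @property
    def path(self) -> List[str]:
        # Returns the underlying list itself; callers are expected not to mutate it.
        return self._path

    @property
    def is_empty(self) -> bool:
        return self._verse == 0 and len(self._path) == 0


# Module-level constant shared by all callers, replacing a lazily created class attribute.
EMPTY_SCRIPTURE_REF = ScriptureRefSketch()
```

The getter-only properties block reassignment, but a shared list can still be mutated in place, which is the residual risk raised in the original comment.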

machine/scripture/verse_ref.py line 380 at r6 (raw file):

Previously, mshannon-sil wrote…

The Equals method in SIL.Scripture/VerseRef.cs is the same as the exact_equals method that was in machine.py. I changed the machine.py version to be the __eq__ method instead so that it is called when == is used, since this seemed to me to be what we would want, and it didn't break any tests. Let me know if there's a reason not to make it the default equality comparison.

The Comparable base class provides an implementation of __eq__ based on the compare_to method. This makes all of the comparison methods consistent with each other. exact_equals does not convert the versification, so it serves a different purpose. This does deviate from the C# implementation, but it has more consistent behavior.
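
A rough sketch of the pattern described in this reply, not the actual machine.py classes. ComparableSketch and VerseSketch are hypothetical, and ignoring the versification in compare_to stands in for the real behavior of converting the other reference to a common versification before comparing.

```python
from abc import ABC, abstractmethod
from functools import total_ordering


class ComparableSketch(ABC):
    @abstractmethod
    def compare_to(self, other: object) -> int: ...

    # Equality and ordering both derive from compare_to, so they cannot disagree.
    def __eq__(self, other: object) -> bool:
        return self.compare_to(other) == 0

    def __lt__(self, other: object) -> bool:
        return self.compare_to(other) < 0


@total_ordering  # fills in <=, > and >= from __eq__ and __lt__
class VerseSketch(ComparableSketch):
    def __init__(self, verse: int, versification: str = "English") -> None:
        self.verse = verse
        self.versification = versification

    def compare_to(self, other: object) -> int:
        if not isinstance(other, VerseSketch):
            raise TypeError("other is not a VerseSketch")
        # Assumption: a real implementation would convert versifications here.
        return (self.verse > other.verse) - (self.verse < other.verse)

    def exact_equals(self, other: "VerseSketch") -> bool:
        # Stricter check: no versification conversion, every field must match.
        return self.verse == other.verse and self.versification == other.versification
```

In this sketch two references with the same verse number but different versifications compare equal with ==, while exact_equals reports them as different, which is the distinction the reply draws.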

machine/scripture/verse_ref.py line 488 at r6 (raw file):

Previously, mshannon-sil wrote…

I think there's actually a bug in SIL.Scripture/VerseRef.cs. The corresponding section in OverlappingVersesRanges for the VerseRef type input is if (verseRef1.IsDefault || verseRef1.IsDefault) but it should be if (verseRef1.IsDefault || verseRef2.IsDefault).

Yeah, that looks like a bug in the C# code.

tests/corpora/test_scripture_ref.py line 6 at r6 (raw file):

class TestScriptureRef(unittest.TestCase):

Is there some reason why you used a test case here? I would prefer it to be consistent with other test fixtures. If it is because you need to specify a description for each assert, you can do so with a normal assert.
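
As an illustration of the last point, a plain assert accepts an optional message after a comma, so a pytest-style test function can describe each case without unittest.TestCase. The helper parse_verse and the test data are hypothetical.

```python
def parse_verse(ref: str) -> int:
    """Hypothetical helper: extract the verse number from a reference like 'MAT 1:2'."""
    return int(ref.split(":")[1])


def test_parse_verse() -> None:
    cases = [("MAT 1:1", 1), ("MAT 1:12", 12)]
    for ref, expected in cases:
        actual = parse_verse(ref)
        # The message after the comma is shown only when the assertion fails.
        assert actual == expected, f"{ref}: expected verse {expected}, got {actual}"
```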

mshannon-sil · 2024-08-14T18:54:26Z

machine/corpora/scripture_text.py line 82 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Because Python doesn't support overloading, we sometimes have to deviate from the C# code a bit, as you discovered for the _create_rows method. I think it would be good to name this something else, maybe _create_scripture_row, then you wouldn't have to call the base class _create_row method using super(). If you rename this method, then you should also rename the _create_rows method to match.

Should I do this for the get_rows() method as well? That is, should I rename it to get_scripture_rows() so that, inside the method, I call self.get_rows() rather than super().get_rows()?

mshannon-sil

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit)

machine/scripture/scripture_element.py line 10 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This class should be moved to the corpora package. This class should also be exported from __init__.py in the corpora package.

I realize I didn't export most of the classes, I'll go ahead and do that.

machine/scripture/scripture_ref.py line 13 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This class should be moved to the corpora package.

Okay. I had put it in the scripture folder since it seemed to parallel the VerseRef class in some ways. Is the idea of the scripture folder just to include Python equivalents of SIL.Scripture?

machine/scripture/verse_ref.py line 488 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yeah, that looks like a bug in the C# code.

Do we need to submit a PR to SIL.Scripture to fix it?

tests/corpora/test_scripture_ref.py line 6 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Is there some reason why you used a test case here? I would prefer it to be consistent with other test fixtures. If it is because you need to specify a description for each assert, you can do so with a normal assert.

I was trying to follow the format of the corresponding test file in Machine, since it uses test cases unlike the other corpora test files. But I agree, it makes sense to keep it consistent. I'll go ahead and change it to match the other files in machine.py.

ddaspit

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @mshannon-sil)

machine/corpora/scripture_text.py line 82 at r6 (raw file):

Previously, mshannon-sil wrote…

Should I do this for the get_rows() method as well? That is, should I rename it to get_scripture_rows() so that, inside the method, I call self.get_rows() rather than super().get_rows()?

In the case of get_rows, we are overriding the method, so we don't want to change the name. It is also pretty normal to call the base class method when overriding, because you are often adding behavior to the base class implementation.
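
A short sketch of the override-and-extend pattern described here, with hypothetical classes and string rows. The subclass keeps the method name so callers of get_rows get the specialized behavior, and it calls the base implementation to reuse it.

```python
from typing import Generator


class BaseTextSketch:  # hypothetical base class
    def get_rows(self) -> Generator[str, None, None]:
        yield from ["a", "b"]


class ScriptureTextSketch(BaseTextSketch):
    def get_rows(self) -> Generator[str, None, None]:
        # Same name and signature as the base method, so this is a true override;
        # super() reuses the base rows and this method adds behavior on top.
        for row in super().get_rows():
            yield row.upper()
```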

machine/scripture/scripture_ref.py line 13 at r6 (raw file):

Previously, mshannon-sil wrote…

Okay. I had put it in the scripture folder since it seemed to parallel the VerseRef class in some ways. Is the idea of the scripture folder just to include Python equivalents of SIL.Scripture?

Yes, that is correct.

machine/scripture/verse_ref.py line 488 at r6 (raw file):

Previously, mshannon-sil wrote…

Do we need to submit a PR to SIL.Scripture to fix it?

It would be the neighborly thing to do, but probably not a high priority.

mshannon-sil

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @ddaspit)

machine/corpora/scripture_text.py line 82 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

In the case of get_rows, we are overriding the method, so we don't want to change the name. It is also pretty normal to call the base class method when overriding, because you are often adding behavior to the base class implementation.

Done.

machine/corpora/scripture_text_corpus.py line 111 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We want to export this function from the __init__.py of the corpora package.

Done.

machine/corpora/standard_parallel_text_corpus.py line 255 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This was a bug in the C# code, you should use the is_scripture function.

Done.

machine/scripture/scripture_element.py line 10 at r6 (raw file):

Previously, mshannon-sil wrote…

I realize I didn't export most of the classes, I'll go ahead and do that.

Done.

machine/scripture/scripture_ref.py line 13 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, that is correct.

Done.

machine/scripture/scripture_ref.py line 18 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can make this a top-level constant, i.e. EMPTY_SCRIPTURE_REF.

Done.

machine/scripture/verse_ref.py line 380 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The Comparable base class provides an implementation of __eq__ based on the compare_to method. This makes all of the comparison methods consistent with each other. exact_equals does not convert the versification, so it serves a different purpose. This does deviate from the C# implementation, but it has more consistent behavior.

Okay, I changed it back.

tests/corpora/test_scripture_ref.py line 6 at r6 (raw file):

Previously, mshannon-sil wrote…

I was trying to follow the format of the corresponding test file in Machine, since it uses test cases unlike the other corpora test files. But I agree, it makes sense to keep it consistent. I'll go ahead and change it to match the other files in machine.py.

Done.

ddaspit

Reviewed 19 of 19 files at r7, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

mshannon-sil added the enhancement label Jul 19, 2024

mshannon-sil self-assigned this Jul 19, 2024

mshannon-sil marked this pull request as draft July 19, 2024 14:23

ddaspit reviewed Jul 23, 2024

mshannon-sil commented Jul 23, 2024

mshannon-sil added 5 commits August 13, 2024 14:09

port commit 436a67d that moved logic to parallel text corpus

66f5e58

port commit 436a67d that adds test cases for scripture text corpus

795378e

port commit 29c9799 that adds support for mixed source corpora

b38f975

port commit fa65835, default to major biblical terms

64dba67

port commit a9058ce, support for non-verse text segments in Scripture corpora

fb630bf

mshannon-sil force-pushed the #102_usfm branch from 98e645f to fb630bf Compare August 13, 2024 18:09

mshannon-sil commented Aug 13, 2024

ddaspit requested changes Aug 14, 2024

mshannon-sil commented Aug 14, 2024

ddaspit reviewed Aug 14, 2024

handle overloading, update __init__.py, use is_scripture, move files to corpora folder, use top level constant for empty ScriptureRef, change __eq__ back to exact_equals, keep test files consistent

5f411c3

mshannon-sil commented Aug 15, 2024

ddaspit approved these changes Aug 15, 2024

mshannon-sil marked this pull request as ready for review August 15, 2024 19:05

mshannon-sil changed the title ~~Port USFM code from Machine #102~~ Port USFM code from Machine, Part 1 Aug 15, 2024

mshannon-sil changed the title ~~Port USFM code from Machine, Part 1~~ Port USFM code from Machine up to commit a9058ce Aug 15, 2024

mshannon-sil merged commit 953f203 into main Aug 15, 2024
14 checks passed

mshannon-sil deleted the #102_usfm branch August 15, 2024 20:51
