Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port USFM code from Machine up to commit a9058ce #111

Merged
merged 6 commits into from
Aug 15, 2024
Merged

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Jul 19, 2024

This change is Reviewable

@mshannon-sil mshannon-sil added the enhancement New feature or request label Jul 19, 2024
@mshannon-sil mshannon-sil self-assigned this Jul 19, 2024
@mshannon-sil mshannon-sil marked this pull request as draft July 19, 2024 14:23
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ready to be reviewed? Does this include all of the commits that need to be ported?

Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @mshannon-sil)

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, currently finishing up the commit for non-verse text support. Should be done with that by the end of tomorrow. There's still more commits to port after that, but if you'd like to review the PR midway through the porting process, that could be a good point to do so.

Reviewable status: 0 of 4 files reviewed, all discussions resolved (waiting on @mshannon-sil)

@codecov-commenter
Copy link

codecov-commenter commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 94.52381% with 46 lines in your changes missing coverage. Please review.

Project coverage is 88.12%. Comparing base (2f7f44f) to head (5f411c3).

Files Patch % Lines
machine/corpora/scripture_ref.py 84.94% 14 Missing ⚠️
machine/corpora/scripture_element.py 76.92% 9 Missing ⚠️
machine/scripture/verse_ref.py 77.27% 5 Missing ⚠️
machine/corpora/text_corpus.py 75.00% 4 Missing ⚠️
machine/corpora/usfm_text_updater.py 95.55% 4 Missing ⚠️
machine/corpora/standard_parallel_text_corpus.py 91.42% 3 Missing ⚠️
...e/corpora/paratext_project_settings_parser_base.py 66.66% 2 Missing ⚠️
machine/corpora/flatten.py 75.00% 1 Missing ⚠️
machine/corpora/parallel_text_row.py 50.00% 1 Missing ⚠️
machine/corpora/scripture_text.py 96.55% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #111      +/-   ##
==========================================
+ Coverage   87.97%   88.12%   +0.15%     
==========================================
  Files         243      247       +4     
  Lines       14239    14799     +560     
==========================================
+ Hits        12527    13042     +515     
- Misses       1712     1757      +45     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 35 files reviewed, 7 unresolved discussions


machine/corpora/scripture_ref_usfm_parser_handler.py line 23 at r6 (raw file):

    def __init__(self) -> None:
        self._cur_verse_ref: VerseRef = VerseRef()
        self._cur_elements_stack: List[ScriptureElement] = []

For variables that were a Stack type in C#, I used a list in python, since it's fast to append and pop items from a list like a stack. I added _stack to the end of these variables to document that they should be used as a stack.


machine/corpora/scripture_ref_usfm_parser_handler.py line 173 at r6 (raw file):

        )
        # No need to reverse unlike in Machine, elements are already added in correct order
        path = [e for e in self._cur_elements_stack if e.position > 0]

In Machine, there's a need to reverse the order of the stack based on how elements are added to the stack. However when I ported it to machine.py and used a list, all the elements were actually being added in the correct order, and reversing just caused bugs, so I removed that in machine.py.


machine/corpora/scripture_text.py line 43 at r6 (raw file):

        else:
            yield from self._create_rows_scripture_ref(ref, text, is_sentence_start)

In Machine, this method was overloaded to have two separate implementations for either a VerseRef or a List of ScriptureRef as input. Since I can't do exactly the same thing in python, I created a function with the general create_rows name, which checks the type of the input and calls the corresponding create_rows method for that type.


machine/corpora/text_corpus.py line 203 at r6 (raw file):

class _FilterTextCorpus(TextCorpus):

The class _FilterTextCorpus in machine.py is called WhereTextCorpus in Machine. I thought this might be due to different naming conventions in python vs c#, but then _TextFilterTextCorpus in machine.py has the same name in Machine. Is the naming difference here intentional?


machine/scripture/scripture_ref.py line 18 at r6 (raw file):

        self._path: List[ScriptureElement] = path if path is not None else []

    _empty: Optional[ScriptureRef] = None

I was concerned about properties of the _empty ScriptureRef (verse_ref and path) getting modified and affecting all instances so that they are no longer empty. But since the public properties only have getters not setters, it should be fine as long as the corresponding private properties aren't modified.


machine/scripture/verse_ref.py line 380 at r6 (raw file):

        return 0

    def __eq__(self, other: object) -> bool:

The Equals method in SIL.Scripture/VerseRef.cs is the same as the exact_equals method that was in machine.py. I changed the machine.py version to be the eq method instead so that it is called when == is used, since this seemed to me to be what we would want, and it didn't break any tests. Let me know if there's a reason to not make it the default equality comparison.


machine/scripture/verse_ref.py line 488 at r6 (raw file):

def are_overlapping_verse_ranges_vref(verse_ref1: VerseRef, verse_ref2: VerseRef) -> bool:
    if verse_ref1.is_default or verse_ref2.is_default:

I think there's actually a bug in SIL.Scripture/VerseRef.cs. The corresponding section in OverlappingVersesRanges for the VerseRef type input is if (verseRef1.IsDefault || verseRef1.IsDefault) but it should be if (verseRef1.IsDefault || verseRef2.IsDefault).

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r5, 34 of 34 files at r6, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @mshannon-sil)


machine/corpora/scripture_ref_usfm_parser_handler.py line 23 at r6 (raw file):

Previously, mshannon-sil wrote…

For variables that were a Stack type in C#, I used a list in python, since it's fast to append and pop items from a list like a stack. I added _stack to the end of these variables to document that they should be used as a stack.

Makes sense to me.


machine/corpora/scripture_text.py line 43 at r6 (raw file):

Previously, mshannon-sil wrote…

In Machine, this method was overloaded to have two separate implementations for either a VerseRef or a List of ScriptureRef as input. Since I can't do exactly the same thing in python, I created a function with the general create_rows name, which checks the type of the input and calls the corresponding create_rows method for that type.

That is a good solution.


machine/corpora/scripture_text.py line 82 at r6 (raw file):

            )

    def _create_row(

Because Python doesn't support overloading, we sometimes have to deviate from the C# code a bit, as you discovered for the _create_rows method. I think it would be good to name this something else, maybe _create_scripture_row, then you wouldn't have to call the base class _create_row method using super(). If you rename this method, then you should also rename the _create_rows method to match.


machine/corpora/scripture_text_corpus.py line 111 at r6 (raw file):

def is_scripture(text_corpus: TextCorpus) -> bool:

We want to export this function from the __init__.py of the corpora package.


machine/corpora/standard_parallel_text_corpus.py line 255 at r6 (raw file):

        trg_refs = [] if trg_row is None else [trg_row.ref]

        if len(trg_refs) == 0 and isinstance(self._target_corpus, ScriptureTextCorpus):

This was a bug in the C# code, you should use the is_scripture function.


machine/corpora/text_corpus.py line 203 at r6 (raw file):

Previously, mshannon-sil wrote…

The class _FilterTextCorpus in machine.py is called WhereTextCorpus in Machine. I thought this might be due to different naming conventions in python vs c#, but then _TextFilterTextCorpus in machine.py has the same name in Machine. Is the naming difference here intentional?

As you guessed, this is because of the difference in function names between Python and C#. Python uses map and filter. C# uses Select and Where. I wanted to match the standard names in each language.


machine/scripture/scripture_element.py line 10 at r6 (raw file):

@total_ordering
class ScriptureElement(Comparable):

This class should be moved to the corpora package. This class should also be exported from __init__.py in the corpora package.


machine/scripture/scripture_ref.py line 13 at r6 (raw file):

@total_ordering
class ScriptureRef(Comparable):

This class should be moved to the corpora package.


machine/scripture/scripture_ref.py line 18 at r6 (raw file):

Previously, mshannon-sil wrote…

I was concerned about properties of the _empty ScriptureRef (verse_ref and path) getting modified and affecting all instances so that they are no longer empty. But since the public properties only have getters not setters, it should be fine as long as the corresponding private properties aren't modified.

You can make this a top-level constant, i.e. EMPTY_SCRIPTURE_REF.


machine/scripture/verse_ref.py line 380 at r6 (raw file):

Previously, mshannon-sil wrote…

The Equals method in SIL.Scripture/VerseRef.cs is the same as the exact_equals method that was in machine.py. I changed the machine.py version to be the eq method instead so that it is called when == is used, since this seemed to me to be what we would want, and it didn't break any tests. Let me know if there's a reason to not make it the default equality comparison.

The Comparable base class provides an implementation of __eq__ based on the compare_to method. This makes all of the comparison methods consistent with each other. exact_equals does not convert the versification, so it serves a different purpose. This does deviate from the C# implementation, but it has more consistent behavior.


machine/scripture/verse_ref.py line 488 at r6 (raw file):

Previously, mshannon-sil wrote…

I think there's actually a bug in SIL.Scripture/VerseRef.cs. The corresponding section in OverlappingVersesRanges for the VerseRef type input is if (verseRef1.IsDefault || verseRef1.IsDefault) but it should be if (verseRef1.IsDefault || verseRef2.IsDefault).

Yeah, that looks like a bug in the C# code.


tests/corpora/test_scripture_ref.py line 6 at r6 (raw file):

class TestScriptureRef(unittest.TestCase):

Is there some reason why you used a test case here? I would prefer it to be consistent with other test fixtures. If it is because you need to specify a description for each assert, you can do so with a normal assert.

@mshannon-sil
Copy link
Collaborator Author

machine/corpora/scripture_text.py line 82 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Because Python doesn't support overloading, we sometimes have to deviate from the C# code a bit, as you discovered for the _create_rows method. I think it would be good to name this something else, maybe _create_scripture_row, then you wouldn't have to call the base class _create_row method using super(). If you rename this method, then you should also rename the _create_rows method to match.

Should I do this for the get_rows() method as well? Renaming it to get_scripture_rows() so that in the method, rather than calling super().get_rows(), I just call self.get_rows()?

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @ddaspit)


machine/scripture/scripture_element.py line 10 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This class should be moved to the corpora package. This class should also be exported from __init__.py in the corpora package.

I realize I didn't export most of the classes, I'll go ahead and do that.


machine/scripture/scripture_ref.py line 13 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This class should be moved to the corpora package.

Okay. I had put it in the scripture folder since it seemed to parallel the VerseRef class in some ways. Is the idea of the scripture folder just to include python equivalents of SIL.Scripture?


machine/scripture/verse_ref.py line 488 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yeah, that looks like a bug in the C# code.

Do we need to submit a PR to SIL.Scripture to fix it?


tests/corpora/test_scripture_ref.py line 6 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Is there some reason why you used a test case here? I would prefer it to be consistent with other test fixtures. If it is because you need to specify a description for each assert, you can do so with a normal assert.

I was trying to follow the format of the corresponding test file in Machine, since it uses test cases unlike the other corpora test files. But I agree, it makes sense to keep it consistent. I'll go ahead and change it to match the other files in machine.py.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @mshannon-sil)


machine/corpora/scripture_text.py line 82 at r6 (raw file):

Previously, mshannon-sil wrote…

Should I do this for the get_rows() method as well? Renaming it to get_scripture_rows() so that in the method, rather than calling super().get_rows(), I just call self.get_rows()?

In the case of get_rows, we are overriding the method, so we don't want to change the name. It is also pretty normal to call the base class method when overriding, because you are often adding behavior to the base class implementation.


machine/scripture/scripture_ref.py line 13 at r6 (raw file):

Previously, mshannon-sil wrote…

Okay. I had put it in the scripture folder since it seemed to parallel the VerseRef class in some ways. Is the idea of the scripture folder just to include python equivalents of SIL.Scripture?

Yes, that is correct.


machine/scripture/verse_ref.py line 488 at r6 (raw file):

Previously, mshannon-sil wrote…

Do we need to submit a PR to SIL.Scripture to fix it?

It would be the neighborly thing to do, but probably not a high priority.

…to corpora folder, use top level constant for empty ScriptureRef, change __eq__ back to exact_equals, keep test files consistent
Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @ddaspit)


machine/corpora/scripture_text.py line 82 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

In the case of get_rows, we are overriding the method, so we don't want to change the name. It is also pretty normal to call the base class method when overriding, because you are often adding behavior to the base class implementation.

Done.


machine/corpora/scripture_text_corpus.py line 111 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We want to export this function from the __init__.py of the corpora package.

Done.


machine/corpora/standard_parallel_text_corpus.py line 255 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This was a bug in the C# code, you should use the is_scripture function.

Done.


machine/scripture/scripture_element.py line 10 at r6 (raw file):

Previously, mshannon-sil wrote…

I realize I didn't export most of the classes, I'll go ahead and do that.

Done.


machine/scripture/scripture_ref.py line 13 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Yes, that is correct.

Done.


machine/scripture/scripture_ref.py line 18 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You can make this a top-level constant, i.e. EMPTY_SCRIPTURE_REF.

Done.


machine/scripture/verse_ref.py line 380 at r6 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The Comparable base class provides an implementation of __eq__ based on the compare_to method. This makes all of the comparison methods consistent with each other. exact_equals does not convert the versification, so it serves a different purpose. This does deviate from the C# implementation, but it has more consistent behavior.

Okay, I changed it back.


tests/corpora/test_scripture_ref.py line 6 at r6 (raw file):

Previously, mshannon-sil wrote…

I was trying to follow the format of the corresponding test file in Machine, since it uses test cases unlike the other corpora test files. But I agree, it makes sense to keep it consistent. I'll go ahead and change it to match the other files in machine.py.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 19 of 19 files at r7, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @mshannon-sil)

@mshannon-sil mshannon-sil marked this pull request as ready for review August 15, 2024 19:05
@mshannon-sil mshannon-sil changed the title Port USFM code from Machine #102 Port USFM code from Machine, Part 1 Aug 15, 2024
@mshannon-sil mshannon-sil changed the title Port USFM code from Machine, Part 1 Port USFM code from Machine up to commit a9058ce Aug 15, 2024
@mshannon-sil mshannon-sil merged commit 953f203 into main Aug 15, 2024
14 checks passed
@mshannon-sil mshannon-sil deleted the #102_usfm branch August 15, 2024 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

3 participants