PP-149 Refactor Acquisition feeds and Annotators #1308

RishiDiwanTT · 2023-08-04T10:51:36Z

Currently, please ignore all OPDS2 classes, they are just POCs.

Description

This aims to remove all XML directives from within the feed and annotator workflows And also to simplify the workflow making it more linear and less circular Only a Serializer should write XML to the feed.
The feed workflow should always be AcquisitionFeed -> Annotator -> Serializer, as opposed to the back and forth that was occurring earlier.
About 70% of the time that went into this PR refactoring tests and understanding the different paradigms used in the existing code.
This is not a complete rewrite, all the logic has been left as-is without any changes.

Feed Class

Only bothers with the outer shell of the feed, which is everything except the publications in OPDS2 and <entry>'s in OPDS1.
Eg. The facet links and other metadata.

Annotator Class

Populates the data required for each work entry and may also potentially add some additional metadata into the outer feed.

Serializer Class

Accepts pydantic models and outputs the OPDS feed in whichever format it is responsible for.

Pydantic Models

We use pydantic models (FeedEntryType's) to store the annotated information for the feed as well as for each entry.
The FeedEntryType model is the base model for any nestable attribute of the OPDS feeds.
Eg. If we have an acquisition link, we know this is nestable with indirectAcquisitions, so all links are inherited from the FeedEntryType models so they can have arbitrary information added into them.

⚠️ XML cache fields ⚠️

The model fields Work.simple_opds_entry and Work.verbose_opds_entry are not used in this implementation because the implementation is specifically not targetting XML.
This means we no longer have an internal partial cache of XML entries per work record.

Motivation and Context

In order to allow all API feeds to be served in both OPDS1 and OPDS2 protocols all the XML generation had to be extricated from the Feed and Annotator classes, and pushed into a separate serializer class.
JIRA

How Has This Been Tested?

All the Feed and Annotator tests were copied and refactored to work with the new style classes.
All the Feeds and Annotators have unit tests that test the equivalence between the old and new style classes, thereby ensuring we are creating exactly the same feeds.
Manually tested whichever endpoints came to mind.

Checklist

I have updated the documentation accordingly.
All new and existing tests passed.

codecov · 2023-08-17T11:58:54Z

Codecov Report

Patch coverage: 92.09% and project coverage change: +0.13% 🎉

Comparison is base (4cfb3be) 90.02% compared to head (7ada55c) 90.15%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1308      +/-   ##
==========================================
+ Coverage   90.02%   90.15%   +0.13%     
==========================================
  Files         204      219      +15     
  Lines       28051    30067    +2016     
  Branches     6458     6959     +501     
==========================================
+ Hits        25252    27107    +1855     
- Misses       1821     1893      +72     
- Partials      978     1067      +89

Files Changed	Coverage Δ
core/opds_schema.py	`76.78% <0.00%> (-2.85%)`	⬇️
core/feed/annotator/circulation.py	`89.30% <89.30%> (ø)`
core/feed/annotator/admin.py	`89.47% <89.47%> (ø)`
core/feed/serializer/opds.py	`89.75% <89.75%> (ø)`
core/feed/types.py	`91.41% <91.41%> (ø)`
core/feed/annotator/base.py	`92.09% <92.09%> (ø)`
core/feed/annotator/loan_and_hold.py	`92.15% <92.15%> (ø)`
core/feed/serializer/opds2.py	`93.66% <93.66%> (ø)`
core/feed/acquisition.py	`93.76% <93.76%> (ø)`
core/feed/annotator/verbose.py	`96.82% <96.82%> (ø)`
... and 14 more

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jonathangreen · 2023-08-24T18:44:01Z

I'm just starting the code review on this, but one thing that I see immediately is it would be really nice if we had some type hints on all this new code. I know a lot of the logic is recycled from the existing classes, but since we're moving it around and cleaning it up, having type hints on it would make things a lot easier to understand.

Ideally I'd like to see core.feed_protocol.* added to the strict section of the tool.mypy.overrides config, so that we enforce type checking there.

diff --git a/pyproject.toml b/pyproject.toml
index 6e503ae69..9e9d59a81 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -74,6 +74,7 @@ module = [
     "api.adobe_vendor_id",
     "api.circulation",
     "api.integration.*",
+    "core.feed_protocol.*",
     "core.integration.*",
     "core.model.announcements",
     "core.model.hassessioncache",

jonathangreen · 2023-08-24T19:15:38Z

I might just be misunderstanding the design here, but why use Pydantic over a normal Python dataclasses for the classes in types.py? Pydantic is great for validating data, but I don't see where we are taking advantage of that validation functionality here.

RishiDiwanTT · 2023-08-28T05:47:49Z

I might just be misunderstanding the design here, but why use Pydantic over a normal Python dataclasses for the classes in types.py? Pydantic is great for validating data, but I don't see where we are taking advantage of that validation functionality here.

One reason was to have validations, which ended up not being required (yet).
Another reason was to allow arbitrary values into the init method as is used with the FeedEntryType.
This class can be used to add arbitrary data to any part of the feed/entry.
Eg. The subject classifications are FeedEntryTypes with additional keys 'scheme', 'label' and 'term'.
If we use dataclasses, we would need to create special data types for all the different kinds of tags/keys possible, which isn't terrible but just adds to the coding overhead.

RishiDiwanTT · 2023-08-29T09:40:08Z

I'm just starting the code review on this, but one thing that I see immediately is it would be really nice if we had some type hints on all this new code. I know a lot of the logic is recycled from the existing classes, but since we're moving it around and cleaning it up, having type hints on it would make things a lot easier to understand.

Ideally I'd like to see core.feed_protocol.* added to the strict section of the tool.mypy.overrides config, so that we enforce type checking there.
diff --git a/pyproject.toml b/pyproject.toml
index 6e503ae69..9e9d59a81 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -74,6 +74,7 @@ module = [
     "api.adobe_vendor_id",
     "api.circulation",
     "api.integration.*",
+    "core.feed_protocol.*",
     "core.integration.*",
     "core.model.announcements",
     "core.model.hassessioncache",

@jonathangreen This has been implemented.

jonathangreen · 2023-08-29T09:55:32Z

Thanks @RishiDiwanTT. I'll code review the rest of this today.

jonathangreen

This looks good Rishi, it is really great work to get this code untangled between feeds, annotators and serializers. It looks like it was a huge amount of work (almost 9000 lines!!) ⛰️.

I added some comments to the code as I went though it, most of them are just minor things that would be nice to get cleaned up before merging.

Before this is merged I'd like to figure out what we do with the old annotator and feed code. I'd prefer that we remove the old feed and annotator classes (including updating the database to remove the unused cache fields) as part of this PR, think they confuse things, now that we have the new classes that are part of this PR. However I'm open to handling this however you would like, if you want to remove them in a future ticket instead.

core/feed_protocol/opds.py

core/feed_protocol/base.py

core/feed_protocol/acquisition.py

core/feed_protocol/types.py

core/feed_protocol/opds.py

RishiDiwanTT · 2023-08-30T12:19:40Z

@jonathangreen I think I got everything, and rebased #1340 as well to allow simultaneous code reviews.

jonathangreen · 2023-08-30T19:17:32Z

@RishiDiwanTT I went through and resolved all the comments that were addressed. Looks like there are a couple more with some work left to do on them.

The other thing is this comment:

Before this is merged I'd like to figure out what we do with the old annotator and feed code. I'd prefer that we remove the old feed and annotator classes (including updating the database to remove the unused cache fields) as part of this PR, think they confuse things, now that we have the new classes that are part of this PR. However I'm open to handling this however you would like, if you want to remove them in a future ticket instead.

How would you like to handle this?

RishiDiwanTT · 2023-09-01T04:47:16Z

@RishiDiwanTT I went through and resolved all the comments that were addressed. Looks like there are a couple more with some work left to do on them.

The other thing is this comment:

Before this is merged I'd like to figure out what we do with the old annotator and feed code. I'd prefer that we remove the old feed and annotator classes (including updating the database to remove the unused cache fields) as part of this PR, think they confuse things, now that we have the new classes that are part of this PR. However I'm open to handling this however you would like, if you want to remove them in a future ticket instead.

How would you like to handle this?

@jonathangreen Considering both PRs are still in-progress I would like to take this up as an additional ticket, since this may need some effort to ensure we're not breaking some dependant functionality still in use.

jonathangreen

There are a couple todo comments in the new code, that I'd like to make sure we shouldn't sort out before merging this one.

core/feed_protocol/acquisition.py

core/feed_protocol/annotator/circulation.py

jonathangreen · 2023-09-01T11:54:33Z

core/feed_protocol/annotator/circulation.py

+        identifier = entry.identifier or work.presentation_edition.primary_identifier
+
+        permalink_uri, permalink_type = self.permalink_for(identifier)
+        # TODO: Do not force OPDS types


Until we have all endpoints under OPDS 2.0 maybe we should keep the APIs as OPDS1.x and keep the TODO as a reminder.

I see now what the todo is mentioning, it wasn't clear to me before. Can you maybe elaborate on the comment in the todo a bit more, so our future selves know what we are talking about here?

core/feed_protocol/serializer/opds2.py

jonathangreen · 2023-09-05T13:45:06Z

Okay looks like all the comments are resolved here. Since we are removing the old classes in a followup, can you make a jira ticket for that @RishiDiwanTT?

jonathangreen · 2023-09-12T13:42:02Z

Once #1340 gets merged into this, I think there at two things that we need to do before this one can get merged:

Rename feed_protocol directory
Update to main and remove the couple references to the self_hosted variable that was removed in Remove locally hosted LCP Collections 🔥 (PP-393) #1344.

With an acquisition page feed This aims to remove all XML directives from within the feed and annotator workflows And also to simplify the workflow making it more linear and less circular Only a Serializer should write XML to the feed A hardcoded example has been set up in the /feed route

Along with LibraryAnnotator tests

… APIs and scripts Now the new feed classes control the XML generation

Implemented some pydantic dict() and iteration functionality for a smooth refactor

* Added acquisition link serialization for OPDS2 * OPDS2 serialization with acuqisition links * Loan and hold specific statuses for acquisition links * Content types come from the serializer now OPDS2 publication feed is now the same as the OPDS1 page feed, with a different serializer * Ensure test_url_for is only used during testing * Mimetypes are iterated in priority order before serialization Created a Serializer base for better typing and OOP

Allow /groups to also be OPDS2 if requested

RishiDiwanTT self-assigned this Aug 4, 2023

RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from 8d74097 to 108c105 Compare August 15, 2023 11:25

RishiDiwanTT marked this pull request as ready for review August 17, 2023 11:19

RishiDiwanTT requested a review from a team August 17, 2023 11:31

RishiDiwanTT added incompatible changes Changes that require a new major version and removed incompatible changes Changes that require a new major version labels Aug 17, 2023

RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from c0b54eb to 2e2da92 Compare August 21, 2023 09:31

RishiDiwanTT mentioned this pull request Aug 23, 2023

PP-129 OPDS 2 with acquisition links #1340

Merged

2 tasks

RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from b50811b to f70ec00 Compare August 24, 2023 06:10

RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from c4b041b to f1f1b6b Compare August 29, 2023 07:41

jonathangreen reviewed Aug 29, 2023

View reviewed changes

RishiDiwanTT requested a review from jonathangreen August 30, 2023 12:19

RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from 0574016 to 45a2f09 Compare September 1, 2023 08:13

jonathangreen reviewed Sep 1, 2023

View reviewed changes

jonathangreen approved these changes Sep 5, 2023

View reviewed changes

RishiDiwanTT added 4 commits September 14, 2023 15:08

Feed protocol for Library Annotator and LoanAndHoldAnnotator

8e307a6

Along with LibraryAnnotator tests

Loan and hold annotator tests copied over and passing

3ddf7ae

Mypy fixes

56b3798

RishiDiwanTT and others added 24 commits September 14, 2023 15:08

Minor Refactor

52919b6

Replaced the AcquisitionFeed with the new OPDSAcquisitionFeeds in all…

8bd42b7

… APIs and scripts Now the new feed classes control the XML generation

Test fix

4d1c7e1

Mypy fixes

53d1f1b

Tests for the OPDS1 serializer

7a76fa2

Added tests for coverage

395d398

Mypy fixes

1dbbd84

Removed unusued method

dd0e5cc

Gave acquisitions their own pydantic model

dfde988

Facets should be part of the FeedData.facet_links

e912d84

Fixed open access links to be the right model

2dba265

Rebase test fixes

c2350f5

Mypy fix

6706b99

Strict mypy implementation for feed_protocols

35731c7

Type hint fixes for feed_protocol tests as well

2547c60

Test fix

e1ecd0e

Review changes

88e85c4

Moved from pydantic classes to python dataclasses

a42c3a4

Implemented some pydantic dict() and iteration functionality for a smooth refactor

Mypy fixes

6a15415

Review Changes

6cf9507

Removed references to "self_hosted" from refactor code

c3a7372

Renamed feed_protocol/ to feed/

e3ecb1c

Ensured */* Accept headers always match for OPDS1

bef06fe

Allow /groups to also be OPDS2 if requested

RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from 38f4da5 to bef06fe Compare September 14, 2023 10:52

RishiDiwanTT and others added 2 commits September 14, 2023 16:39

Mime types test case fix

886380f

Merge branch 'main' into feature/refactor-opds-feeds

7ada55c

RishiDiwanTT merged commit 202a1ab into main Sep 15, 2023
30 checks passed

RishiDiwanTT deleted the feature/refactor-opds-feeds branch September 15, 2023 06:25

jonathangreen mentioned this pull request Sep 15, 2023

Remove collection content mirroring (PP-415) #1358

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PP-149 Refactor Acquisition feeds and Annotators #1308

PP-149 Refactor Acquisition feeds and Annotators #1308

RishiDiwanTT commented Aug 4, 2023 •

edited

Loading

codecov bot commented Aug 17, 2023 •

edited

Loading

jonathangreen commented Aug 24, 2023 •

edited

Loading

jonathangreen commented Aug 24, 2023

RishiDiwanTT commented Aug 28, 2023

RishiDiwanTT commented Aug 29, 2023

jonathangreen commented Aug 29, 2023

jonathangreen left a comment •

edited

Loading

RishiDiwanTT commented Aug 30, 2023

jonathangreen commented Aug 30, 2023

RishiDiwanTT commented Sep 1, 2023

jonathangreen left a comment

jonathangreen Sep 1, 2023

RishiDiwanTT Sep 1, 2023

jonathangreen Sep 5, 2023 •

edited

Loading

jonathangreen commented Sep 5, 2023

jonathangreen commented Sep 12, 2023

PP-149 Refactor Acquisition feeds and Annotators #1308

PP-149 Refactor Acquisition feeds and Annotators #1308

Conversation

RishiDiwanTT commented Aug 4, 2023 • edited Loading

Description

Feed Class

Annotator Class

Serializer Class

Pydantic Models

⚠️ XML cache fields ⚠️

Motivation and Context

How Has This Been Tested?

Checklist

codecov bot commented Aug 17, 2023 • edited Loading

Codecov Report

jonathangreen commented Aug 24, 2023 • edited Loading

jonathangreen commented Aug 24, 2023

RishiDiwanTT commented Aug 28, 2023

RishiDiwanTT commented Aug 29, 2023

jonathangreen commented Aug 29, 2023

jonathangreen left a comment • edited Loading

Choose a reason for hiding this comment

RishiDiwanTT commented Aug 30, 2023

jonathangreen commented Aug 30, 2023

RishiDiwanTT commented Sep 1, 2023

jonathangreen left a comment

Choose a reason for hiding this comment

jonathangreen Sep 1, 2023

Choose a reason for hiding this comment

RishiDiwanTT Sep 1, 2023

Choose a reason for hiding this comment

jonathangreen Sep 5, 2023 • edited Loading

Choose a reason for hiding this comment

jonathangreen commented Sep 5, 2023

jonathangreen commented Sep 12, 2023

RishiDiwanTT commented Aug 4, 2023 •

edited

Loading

codecov bot commented Aug 17, 2023 •

edited

Loading

jonathangreen commented Aug 24, 2023 •

edited

Loading

jonathangreen left a comment •

edited

Loading

jonathangreen Sep 5, 2023 •

edited

Loading