Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PP-149 Refactor Acquisition feeds and Annotators #1308

Merged
merged 43 commits into from
Sep 15, 2023

Conversation

RishiDiwanTT
Copy link
Contributor

@RishiDiwanTT RishiDiwanTT commented Aug 4, 2023

Currently, please ignore all OPDS2 classes, they are just POCs.

Description

This aims to remove all XML directives from within the feed and annotator workflows And also to simplify the workflow making it more linear and less circular Only a Serializer should write XML to the feed.
The feed workflow should always be AcquisitionFeed -> Annotator -> Serializer, as opposed to the back and forth that was occurring earlier.
About 70% of the time that went into this PR refactoring tests and understanding the different paradigms used in the existing code.
This is not a complete rewrite, all the logic has been left as-is without any changes.

Feed Class

Only bothers with the outer shell of the feed, which is everything except the publications in OPDS2 and <entry>'s in OPDS1.
Eg. The facet links and other metadata.

Annotator Class

Populates the data required for each work entry and may also potentially add some additional metadata into the outer feed.

Serializer Class

Accepts pydantic models and outputs the OPDS feed in whichever format it is responsible for.

Pydantic Models

We use pydantic models (FeedEntryType's) to store the annotated information for the feed as well as for each entry.
The FeedEntryType model is the base model for any nestable attribute of the OPDS feeds.
Eg. If we have an acquisition link, we know this is nestable with indirectAcquisitions, so all links are inherited from the FeedEntryType models so they can have arbitrary information added into them.

⚠️ XML cache fields ⚠️

The model fields Work.simple_opds_entry and Work.verbose_opds_entry are not used in this implementation because the implementation is specifically not targetting XML.
This means we no longer have an internal partial cache of XML entries per work record.

Motivation and Context

In order to allow all API feeds to be served in both OPDS1 and OPDS2 protocols all the XML generation had to be extricated from the Feed and Annotator classes, and pushed into a separate serializer class.
JIRA

How Has This Been Tested?

All the Feed and Annotator tests were copied and refactored to work with the new style classes.
All the Feeds and Annotators have unit tests that test the equivalence between the old and new style classes, thereby ensuring we are creating exactly the same feeds.
Manually tested whichever endpoints came to mind.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@RishiDiwanTT RishiDiwanTT self-assigned this Aug 4, 2023
@RishiDiwanTT RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from 8d74097 to 108c105 Compare August 15, 2023 11:25
@RishiDiwanTT RishiDiwanTT marked this pull request as ready for review August 17, 2023 11:19
@RishiDiwanTT RishiDiwanTT requested a review from a team August 17, 2023 11:31
@RishiDiwanTT RishiDiwanTT added incompatible changes Changes that require a new major version and removed incompatible changes Changes that require a new major version labels Aug 17, 2023
@codecov
Copy link

codecov bot commented Aug 17, 2023

Codecov Report

Patch coverage: 92.09% and project coverage change: +0.13% 🎉

Comparison is base (4cfb3be) 90.02% compared to head (7ada55c) 90.15%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1308      +/-   ##
==========================================
+ Coverage   90.02%   90.15%   +0.13%     
==========================================
  Files         204      219      +15     
  Lines       28051    30067    +2016     
  Branches     6458     6959     +501     
==========================================
+ Hits        25252    27107    +1855     
- Misses       1821     1893      +72     
- Partials      978     1067      +89     
Files Changed Coverage Δ
core/opds_schema.py 76.78% <0.00%> (-2.85%) ⬇️
core/feed/annotator/circulation.py 89.30% <89.30%> (ø)
core/feed/annotator/admin.py 89.47% <89.47%> (ø)
core/feed/serializer/opds.py 89.75% <89.75%> (ø)
core/feed/types.py 91.41% <91.41%> (ø)
core/feed/annotator/base.py 92.09% <92.09%> (ø)
core/feed/annotator/loan_and_hold.py 92.15% <92.15%> (ø)
core/feed/serializer/opds2.py 93.66% <93.66%> (ø)
core/feed/acquisition.py 93.76% <93.76%> (ø)
core/feed/annotator/verbose.py 96.82% <96.82%> (ø)
... and 14 more

... and 2 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@RishiDiwanTT RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from c0b54eb to 2e2da92 Compare August 21, 2023 09:31
@RishiDiwanTT RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from b50811b to f70ec00 Compare August 24, 2023 06:10
@jonathangreen
Copy link
Member

jonathangreen commented Aug 24, 2023

I'm just starting the code review on this, but one thing that I see immediately is it would be really nice if we had some type hints on all this new code. I know a lot of the logic is recycled from the existing classes, but since we're moving it around and cleaning it up, having type hints on it would make things a lot easier to understand.

Ideally I'd like to see core.feed_protocol.* added to the strict section of the tool.mypy.overrides config, so that we enforce type checking there.

diff --git a/pyproject.toml b/pyproject.toml
index 6e503ae69..9e9d59a81 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -74,6 +74,7 @@ module = [
     "api.adobe_vendor_id",
     "api.circulation",
     "api.integration.*",
+    "core.feed_protocol.*",
     "core.integration.*",
     "core.model.announcements",
     "core.model.hassessioncache",

@jonathangreen
Copy link
Member

I might just be misunderstanding the design here, but why use Pydantic over a normal Python dataclasses for the classes in types.py? Pydantic is great for validating data, but I don't see where we are taking advantage of that validation functionality here.

@RishiDiwanTT
Copy link
Contributor Author

I might just be misunderstanding the design here, but why use Pydantic over a normal Python dataclasses for the classes in types.py? Pydantic is great for validating data, but I don't see where we are taking advantage of that validation functionality here.

One reason was to have validations, which ended up not being required (yet).
Another reason was to allow arbitrary values into the init method as is used with the FeedEntryType.
This class can be used to add arbitrary data to any part of the feed/entry.
Eg. The subject classifications are FeedEntryTypes with additional keys 'scheme', 'label' and 'term'.
If we use dataclasses, we would need to create special data types for all the different kinds of tags/keys possible, which isn't terrible but just adds to the coding overhead.

@RishiDiwanTT RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from c4b041b to f1f1b6b Compare August 29, 2023 07:41
@RishiDiwanTT
Copy link
Contributor Author

I'm just starting the code review on this, but one thing that I see immediately is it would be really nice if we had some type hints on all this new code. I know a lot of the logic is recycled from the existing classes, but since we're moving it around and cleaning it up, having type hints on it would make things a lot easier to understand.

Ideally I'd like to see core.feed_protocol.* added to the strict section of the tool.mypy.overrides config, so that we enforce type checking there.

diff --git a/pyproject.toml b/pyproject.toml
index 6e503ae69..9e9d59a81 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -74,6 +74,7 @@ module = [
     "api.adobe_vendor_id",
     "api.circulation",
     "api.integration.*",
+    "core.feed_protocol.*",
     "core.integration.*",
     "core.model.announcements",
     "core.model.hassessioncache",

@jonathangreen This has been implemented.

@jonathangreen
Copy link
Member

Thanks @RishiDiwanTT. I'll code review the rest of this today.

Copy link
Member

@jonathangreen jonathangreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good Rishi, it is really great work to get this code untangled between feeds, annotators and serializers. It looks like it was a huge amount of work (almost 9000 lines!!) ⛰️.

I added some comments to the code as I went though it, most of them are just minor things that would be nice to get cleaned up before merging.

Before this is merged I'd like to figure out what we do with the old annotator and feed code. I'd prefer that we remove the old feed and annotator classes (including updating the database to remove the unused cache fields) as part of this PR, think they confuse things, now that we have the new classes that are part of this PR. However I'm open to handling this however you would like, if you want to remove them in a future ticket instead.

core/feed_protocol/opds.py Outdated Show resolved Hide resolved
core/feed_protocol/base.py Outdated Show resolved Hide resolved
core/feed_protocol/acquisition.py Outdated Show resolved Hide resolved
core/feed_protocol/acquisition.py Outdated Show resolved Hide resolved
core/feed_protocol/acquisition.py Outdated Show resolved Hide resolved
core/feed_protocol/types.py Outdated Show resolved Hide resolved
core/feed_protocol/types.py Outdated Show resolved Hide resolved
core/feed_protocol/types.py Outdated Show resolved Hide resolved
core/feed_protocol/types.py Outdated Show resolved Hide resolved
core/feed_protocol/opds.py Outdated Show resolved Hide resolved
@RishiDiwanTT
Copy link
Contributor Author

@jonathangreen I think I got everything, and rebased #1340 as well to allow simultaneous code reviews.

@jonathangreen
Copy link
Member

@RishiDiwanTT I went through and resolved all the comments that were addressed. Looks like there are a couple more with some work left to do on them.

The other thing is this comment:

Before this is merged I'd like to figure out what we do with the old annotator and feed code. I'd prefer that we remove the old feed and annotator classes (including updating the database to remove the unused cache fields) as part of this PR, think they confuse things, now that we have the new classes that are part of this PR. However I'm open to handling this however you would like, if you want to remove them in a future ticket instead.

How would you like to handle this?

@RishiDiwanTT
Copy link
Contributor Author

@RishiDiwanTT I went through and resolved all the comments that were addressed. Looks like there are a couple more with some work left to do on them.

The other thing is this comment:

Before this is merged I'd like to figure out what we do with the old annotator and feed code. I'd prefer that we remove the old feed and annotator classes (including updating the database to remove the unused cache fields) as part of this PR, think they confuse things, now that we have the new classes that are part of this PR. However I'm open to handling this however you would like, if you want to remove them in a future ticket instead.

How would you like to handle this?

@jonathangreen Considering both PRs are still in-progress I would like to take this up as an additional ticket, since this may need some effort to ensure we're not breaking some dependant functionality still in use.

@RishiDiwanTT RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from 0574016 to 45a2f09 Compare September 1, 2023 08:13
Copy link
Member

@jonathangreen jonathangreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple todo comments in the new code, that I'd like to make sure we shouldn't sort out before merging this one.

core/feed_protocol/acquisition.py Outdated Show resolved Hide resolved
core/feed_protocol/annotator/circulation.py Outdated Show resolved Hide resolved
identifier = entry.identifier or work.presentation_edition.primary_identifier

permalink_uri, permalink_type = self.permalink_for(identifier)
# TODO: Do not force OPDS types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until we have all endpoints under OPDS 2.0 maybe we should keep the APIs as OPDS1.x and keep the TODO as a reminder.

Copy link
Member

@jonathangreen jonathangreen Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now what the todo is mentioning, it wasn't clear to me before. Can you maybe elaborate on the comment in the todo a bit more, so our future selves know what we are talking about here?

core/feed_protocol/serializer/opds2.py Outdated Show resolved Hide resolved
core/feed_protocol/serializer/opds2.py Outdated Show resolved Hide resolved
@jonathangreen
Copy link
Member

Okay looks like all the comments are resolved here. Since we are removing the old classes in a followup, can you make a jira ticket for that @RishiDiwanTT?

@jonathangreen
Copy link
Member

Once #1340 gets merged into this, I think there at two things that we need to do before this one can get merged:

  1. Rename feed_protocol directory
  2. Update to main and remove the couple references to the self_hosted variable that was removed in Remove locally hosted LCP Collections 🔥 (PP-393) #1344.

With an acquisition page feed
This aims to remove all XML directives from within the feed and annotator workflows
And also to simplify the workflow making it more linear and less circular
Only a Serializer should write XML to the feed

A hardcoded example has been set up in the /feed route
RishiDiwanTT and others added 24 commits September 14, 2023 15:08
… APIs and scripts

Now the new feed classes control the XML generation
Implemented some pydantic dict() and iteration functionality for a smooth refactor
* Added acquisition link serialization for OPDS2

* OPDS2 serialization with acuqisition links

* Loan and hold specific statuses for acquisition links

* Content types come from the serializer now

OPDS2 publication feed is now the same as the OPDS1 page feed, with a different serializer

* Ensure test_url_for is only used during testing

* Mimetypes are iterated in priority order before serialization

Created a Serializer base for better typing and OOP
Allow /groups to also be OPDS2 if requested
@RishiDiwanTT RishiDiwanTT force-pushed the feature/refactor-opds-feeds branch from 38f4da5 to bef06fe Compare September 14, 2023 10:52
@RishiDiwanTT RishiDiwanTT merged commit 202a1ab into main Sep 15, 2023
30 checks passed
@RishiDiwanTT RishiDiwanTT deleted the feature/refactor-opds-feeds branch September 15, 2023 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants