Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embed APIv3: initial implementation #8319

Merged
merged 49 commits into from
Sep 21, 2021
Merged

Embed APIv3: initial implementation #8319

merged 49 commits into from
Sep 21, 2021

Conversation

humitos
Copy link
Member

@humitos humitos commented Jul 6, 2021

Implementation

  • support external sites for allowed domains (via setting)
  • given a URL and a fragment, it returns the exact HTML portion if found
  • cache external request for some time
  • rate-limit external requests to 50/m per domain
  • uses requests.get for external sites (only public documents)
  • by default requests.get follows up to 30 redirects
  • force requests.get a timeout of 1 second
  • uses Django storage for locally hosted sites (allows only PUBLIC versions)
  • reuses clean_links util function (depends on PyQuery) from EmbedAPI v2
  • add extra fields in response if locally hosted documents (e.g. project, language, version and path)
  • handle Sphinx special cases for dt

ToDo

  • implement doctool and doctoolversion
  • handle Sphinx specific cases (e.g. dt)
  • add tests for all the cases
    • locally hosted documents
    • external pages
    • not allow private version
    • Sphinx specific cases
  • decide what's a good file/dir structure to put this code (embed/v3, api/v3/embed, other?)
  • write documentation

Reference: #8222
Design document: https://docs.readthedocs.io/en/latest/development/design/embed-api.html

humitos added 3 commits July 5, 2021 16:15
This function will be shared between APIv2 and APIv3.
Making `clean_links` to response HTML in raw allows us to reuse this code
in APIv2 and APIv3.
- support external sites for allowed domains (via setting)
- given a URL and a fragment, it returns the exact HTML portion if found
- cache external request for some time
- rate-limit external requests to 50/m per domain
- uses `requests.get` for external sites (only public documents)
- uses Django storage for local hosted sites (allows only PUBLIC versions)
- reuses `clean_links` util function (depends on PyQuery) from EmbedAPI v2
@humitos humitos requested a review from a team July 6, 2021 11:38
@humitos
Copy link
Member Author

humitos commented Jul 7, 2021

I played a little more with nparagraphs= but I wasn't able to implement a generic solution that always works. The problems start when there are not enough p tags in the main content and we need to move to the next section (usually a div with h2)

Also, even when there are enough p tags in the main content, if there are .. note directives: these include 1 p for the title of it and we could return a content that ends with the title's note but without the content of it.

On the other hand, depending on the 3rd party HTML parsing library used (e.g. selectolax, beautifulsoup or pyquery) we may not have the parsing power we need. For example, I wasn't able to get all the .parents() for a particular node using selectolax, but I could with beautifulsoup. So, the "algorithm" designed for this may differ depending on the library chosen.

I'll come back to this after we have the first implementation for this API working. We can probably add it as a feature/improvement later.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start. I wonder if we can simplify things to start, and perhaps just treat internal & external sites the same? That might make the code way simpler and less prone to bugs. I'm not 100% sure that makes sense, but I think reducing special cases as much as possible will be a benefit in terms of maintaining and consuming the API.

readthedocs/embed/v3/views.py Show resolved Hide resolved
readthedocs/embed/v3/views.py Outdated Show resolved Hide resolved
readthedocs/embed/v3/views.py Outdated Show resolved Hide resolved
readthedocs/embed/v3/views.py Show resolved Hide resolved
readthedocs/embed/v3/views.py Outdated Show resolved Hide resolved
readthedocs/embed/v3/views.py Outdated Show resolved Hide resolved
readthedocs/embed/v3/views.py Outdated Show resolved Hide resolved
readthedocs/embed/v3/views.py Outdated Show resolved Hide resolved

return self._parse_based_on_doctool(page_content, fragment, doctool, doctoolversion)

def _find_main_node(self, html):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is reimplementing

def _get_main_node(self, html):

I'd like to find a way to have both of these codebases using similar abstractions, instead of reinventing the same logic over again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I mainly copied&pasted this from there. We can probably move this to a utils function somewhere and use the same in both places.

readthedocs/settings/base.py Show resolved Hide resolved

return node.html

def get(self, request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we not doing any auth here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now. Do we need auth at all? We are giving users to access only public internal projects (from inside _get_page_content_from_storage method) or public webpages on the internet.

I remember thinking about this and not finding a good and safe way to support private projects with session auth. Do we want other type of auth for this? If this is the case, we will need to consider auth only for internal projects.

I thought it was safer, for now, to start only with public documents and grow from there if we require auth/private documents in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we need to support this on .com, so we definitely need to figure out auth I think. I'd imagine it would be with session auth for the same domain, and explicit API key auth cross-domain, as we discussed previously?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be implemented in corporate repository by overriding the class and the permission_classes field with a custom class that make these checks. Makes sense to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, seems reasonable.

@humitos
Copy link
Member Author

humitos commented Jul 12, 2021

I copied these 3 comments from the review since they are all related.

  • I wonder if we can simplify things to start, and perhaps just treat internal & external sites the same? That might make the code way simpler and less prone to bugs
  • For .com and private projects, we have to access it via storage. We also need auth checks in that case though.
  • Are we not doing any auth here?

I started using just requests.get for all the pages. Then I realized that I needed auth for private projects, but then I remembered that private projects were complicated and added the PUBLIC filter to the requests done in the internal storage.

In any case, I'm 👍🏼 on simplifying this code to use the exact same method for internal/external documents if possible. If we plan to query private projects somehow, we could keep this code that's already written. Otherwise, we can delete it and use requests.get directly.

@ericholscher
Copy link
Member

In any case, I'm 👍🏼 on simplifying this code to use the exact same method for internal/external documents if possible. If we plan to query private projects somehow, we could keep this code that's already written. Otherwise, we can delete it and use requests.get directly.

Yea -- let's keep the code around for now. I do think we'll want to support private projects, at least in the simple case on .com (eg. hoverxref). We need to implement better support for private queries though, so I'm also OK with waiting a bit longer to implement that so we can do something bigger across all the proxied API's.

@humitos
Copy link
Member Author

humitos commented Aug 18, 2021

I'm marking this PR as ready for review because all the logic and tests are ready. I'll continue writing the documentation in the meanwhile.

@humitos humitos marked this pull request as ready for review August 18, 2021 16:00
@humitos humitos requested a review from ericholscher August 18, 2021 16:08
@humitos
Copy link
Member Author

humitos commented Aug 24, 2021

Use /api/v3/embed/ and send a dated version from the client (e.g. X-API-Version: 2021-08-18. GitHub uses Accept: header: docs.github.com/en/rest/overview/media-types#request-specific-version)

I'm leaning in this direction more.

I think it makes sense to use the /v3/ in the URL's path together with a version sent in the client. However, I think we should default to a fixed version (the current one --that we can define as today: 2021-08-18) instead of latest. This allows us to break requests only when we deprecate the default (current) version.

  • Backward-incompatible changes will generate a new dated version that needs to be sent in the client.
  • Backward-compatible changes won't generate a new dated release

The version in the URL shouldn't change unless we break the API completely with a different structure (REST to GrapQL), authentication (Session or Token to OAuth) or something big enough. I think we are on time to make this happen without breaking what we currently have. Then, if a user realizes that the API is not returning what they are expecting, they will be forced to send the version from the client to receive the response they are expecting.

(copied from a message that I sent privately to Eric)

@agjohnson @stsewd do you have opinions here?

@ericholscher
Copy link
Member

I just worry we're going to end up with a situation like with the docker images, where we can never change the client version, because it will break too many people. Our API isn't super heavily used I guess, but I guess having a way to opt into newer features while not breaking people is OK, but definitely will become a frustration and complication in documentation :/

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready to soft launch in prod. Y'all can discuss the API versioning perhaps while I'm on vacation, but I don't want this blocked on my review :)

if Version(docutils.__version__) >= Version('0.17'):
content = '<div class="body" role="main">\n \n <section id="title">\n<h1>Title<a class="headerlink" href="https://docs.project.com#title" title="Permalink to this headline">¶</a></h1>\n<p>This is an example page used to test EmbedAPI parsing features.</p>\n<section id="sub-title">\n<h2>Sub-title<a class="headerlink" href="https://docs.project.com#sub-title" title="Permalink to this headline">¶</a></h2>\n<p>This is a reference to <a class="reference internal" href="https://docs.project.com#sub-title"><span class="std std-ref">Sub-title</span></a>.</p>\n</section>\n</section>\n\n\n </div>'
else:
content = '<div class="body" role="main">\n \n <div class="section" id="title">\n<h1>Title<a class="headerlink" href="https://docs.project.com#title" title="Permalink to this headline">¶</a></h1>\n<p>This is an example page used to test EmbedAPI parsing features.</p>\n<div class="section" id="sub-title">\n<h2>Sub-title<a class="headerlink" href="https://docs.project.com#sub-title" title="Permalink to this headline">¶</a></h2>\n<p>This is a reference to <a class="reference internal" href="https://docs.project.com#sub-title"><span class="std std-ref">Sub-title</span></a>.</p>\n</div>\n</div>\n\n\n </div>'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels quite brittle in terms of checking whitespace, etc. I do worry it might break easily, but I don't have a strong feeling here about how to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think they can be in an external file or use a multiline string withtextwrap.dedent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, I think we will find out a better way to do these checks in a more maintainable way. I'm doing something similar in sphinx-hoverxref and I haven't had too much problems with this in particular.

However, the output changes on version upgrade, in ways that don't affect the result (e.g. adds an extra CSS class) but I don't think there is too much we can do in that case if we want to be sure in our checks.

@stsewd
Copy link
Member

stsewd commented Sep 8, 2021

Backward-incompatible changes will generate a new dated version that needs to be sent in the client.

I feel like that's going to be a lot of code to maintain based on dates, and don't see much changing, as we should return the html as close as possible to the original, additional metadata is still backwards compatible.

@agjohnson
Copy link
Contributor

I think it makes sense to use the /v3/ in the URL's path together with a version sent in the client.

Have you thought about what this pattern will look like in code? If we version the API at the client, that is a different contract with the user than just having a /api/v2 vs /api/v3. Allowing a version specified by the client would make us have to maintain several, mostly immutable API versions. We'd probably have to document all of the versions we support as well.

Perhaps it's not a huge issue and we wouldn't use this in many places anyways, but I'm picturing our APIs becoming a nest of conditional logic around API client versions.

I am a fan of this pattern however, it is nice UX from the client perspective to not have to worry about client/server mismatches. It does take API deprecation off the table though.


response = {
'url': url,
'fragment': fragment if fragment else None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are calling id to the fragment in the search API.

# ...
# </dl>

# TODO: figure it out if it's needed to remove the siblings here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we going to resolve these todos in this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a blocker to move forward.

@humitos
Copy link
Member Author

humitos commented Sep 9, 2021

@agjohnson

Have you thought about what this pattern will look like in code?

Yes, but I haven't done a POC, tho. I imagine that we can have two different approaches at least:

  1. nested ifs for small things in the middle of the logic: simpler to start and harder to maintain in the long term
  2. modularized logic and override by using one class per API version: harder to start and easier to maintain in the long term

We could probably start with the nested ifs when this situation needs to be handled to understand how much work requires and how "dirty" the code looks like and decide if it is worth the effort to migrate to the overriding pattern that seems cleaner at first sight.

If we version the API at the client, that is a different contract with the user than just having a /api/v2 vs /api/v3. Allowing a version specified by the client would make us have to maintain several, mostly immutable API versions. We'd probably have to document all of the versions we support as well.

We have been maintaining immutable API already (APIv1 and APIv2 are good examples) and the downside we have on them currently is that they avoided us to move forward and we had to implement v3.

This is definitely more work, yes. But I think I'm fine paying for this work if it unblocks us to make changes in the API that change behavior in a particular case without breaking the rest of the API. Example: "when a glossary is detected in Sphinx we remove the siblings" (https://github.com/readthedocs/readthedocs.org/pull/8319/files#diff-93aac61b8b6d747d415ef7412a0f6aa10410dd7b1aa229ea075026177915724fR166)

Honestly, at first sight, I don't expect to have many of these changes, but I'd like to have a defined pattern to follow in case we need to do it.

I am a fan of this pattern however, it is nice UX from the client perspective to not have to worry about client/server mismatches. It does take API deprecation off the table though.

Why do you think that "It does take API deprecation off the table"?

@humitos
Copy link
Member Author

humitos commented Sep 9, 2021

@stsewd

Backward-incompatible changes will generate a new dated version that needs to be sent in the client.

I feel like that's going to be a lot of code to maintain based on dates, and don't see much changing, as we should return the html as close as possible to the original, additional metadata is still backwards compatible.

Yeah, additional metadata is not a backward-incompatible change. However, making a change in the behavior is what I'm referring to here. Example: "a query to a term returns only the title of it" changes to "a query to a term returns the title and the content"

@agjohnson
Copy link
Contributor

modularized logic and override by using one class per API version

I was picturing this as well. Seems it should work pretty well. I also agree we probably won't be forking API endpoint functionality very often.

We have been maintaining immutable API already

Well, I would say otherwise -- we've straight up made API endpoints stop working without even communicating the deprecation :)

We don't support old iterations of endpoint (for example, our project listing) is my point though. We do support major versions of our API, but effectively supporting a v3.1, 3.2, etc will be new territory for us.

The point that stands out for me here is: if we already have immutable API versions, what benefit does a secondary versioning mechanism have over adding a URL path for /api/v4? While I do prefer the client header option, and I don't think users should need to know about /api/v2/ vs /api/v3/, we are already using a versioned pattern for the API URL. The secondary version specifier seems like odd design for an API in this case.

At the code level, would extending APIv3 in an APIv3.1 implementation be cleaner or messier than splitting up our API with conditional logic against the API client header?

If there was a clean way to use a X-API: 2021-08-01 header to route to the correct API version, using a top level /api/project/foo (without the v2/v3/v4), this seems like really nice design and maybe a separate project.

Why do you think that "It does take API deprecation off the table"?

Deprecating a version will never be graceful for the user, and so (similar to how APIv2 keeps avoiding our deprecation axe) we'll end up supporting the endpoint version for longer than we want.


So, to bring together all of these points:

  • We don't expect many forks of the API
  • We already have a pattern for maintaining immutable API versions
  • Multiple version numbers are confusing

I'm not a huge fan of /api/v3.1/ or /api/v4/, but signs are pointing this direction for the reasons above.

@humitos
Copy link
Member Author

humitos commented Sep 15, 2021

The point that stands out for me here is: if we already have immutable API versions, what benefit does a secondary versioning mechanism have over adding a URL path for /api/v4?

I think this is the key question, IIRC this is why we started this conversation with @ericholscher (goes back to #8319 (comment)). The reason is that it's super confusing to have feature APIs (e.g. Embed, Search) with a different version than resource APIs (e.g. data models). For example, if we need to make a breaking change in EmbedAPI we will need /api/v4/embed/ but all the resource APIs will still be under /api/v3/(projects/versions/etc); and maybe SearchAPI will be /api/v7/search/.

If there was a clean way to use a X-API: 2021-08-01 header to route to the correct API version, using a top level /api/project/foo (without the v2/v3/v4), this seems like really nice design and maybe a separate project.

I agree that this is probably the best pattern, but I think we can't remove the /vX from the URL at this point. However, adding the version in the client seems like a middle ground solution that we can implement to get some of its benefits without breaking existing things.

@agjohnson
Copy link
Contributor

I agree that this is probably the best pattern, but I think we can't remove the /vX from the URL at this point.

Yeah, though we might not need to remove these endpoints either way. For example, the /api/v2 and /api/v3 endpoints could live on, but instead we'd promote accessing the APIs by using a header and /api/embed/ instead. Adding a /api/embed doesn't even have to happen now, but maybe is just part of our long term goal here.

So, perhaps in that case, we are starting with your suggestion of implementing this on /api/v3 with a header, with plans to have this pattern also on a top level /api/ as part of a later, separate project. The immediate case is workable, albeit awkward, but eventually everything will make more sense.

This seems like a good middle ground and a good step towards a nicer API design.

@humitos
Copy link
Member Author

humitos commented Sep 21, 2021

I'm merging this PR and continue with the work required in sphinx-hoverxref to use this new APIv3. We can adapt things that may not be working as we expect while working on the extension.

@humitos humitos merged commit 7c4ccbf into master Sep 21, 2021
@humitos humitos deleted the humitos/embed-api-v3 branch September 21, 2021 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants