Embed APIv3: initial implementation #8319

humitos · 2021-07-06T11:38:27Z

Implementation

support external sites for allowed domains (via setting)
given a URL and a fragment, it returns the exact HTML portion if found
cache external request for some time
rate-limit external requests to 50/m per domain
uses requests.get for external sites (only public documents)
by default requests.get follows up to 30 redirects
force requests.get a timeout of 1 second
uses Django storage for locally hosted sites (allows only PUBLIC versions)
reuses clean_links util function (depends on PyQuery) from EmbedAPI v2
~~add extra fields in response if locally hosted documents (e.g. project, language, version and path)~~
handle Sphinx special cases for dt

ToDo

implement doctool and doctoolversion
handle Sphinx specific cases (e.g. dt)
add tests for all the cases
- locally hosted documents
- external pages
- not allow private version
- Sphinx specific cases
decide what's a good file/dir structure to put this code (embed/v3, api/v3/embed, other?)
write documentation

Reference: #8222
Design document: https://docs.readthedocs.io/en/latest/development/design/embed-api.html

This function will be shared between APIv2 and APIv3.

Making `clean_links` to response HTML in raw allows us to reuse this code in APIv2 and APIv3.

- support external sites for allowed domains (via setting) - given a URL and a fragment, it returns the exact HTML portion if found - cache external request for some time - rate-limit external requests to 50/m per domain - uses `requests.get` for external sites (only public documents) - uses Django storage for local hosted sites (allows only PUBLIC versions) - reuses `clean_links` util function (depends on PyQuery) from EmbedAPI v2

This project has already tried our extension but since they are not hosted on Read the Docs it didn't work for them. See sympy/sympy#17456

humitos · 2021-07-07T15:38:07Z

I played a little more with nparagraphs= but I wasn't able to implement a generic solution that always works. The problems start when there are not enough p tags in the main content and we need to move to the next section (usually a div with h2)

Also, even when there are enough p tags in the main content, if there are .. note directives: these include 1 p for the title of it and we could return a content that ends with the title's note but without the content of it.

On the other hand, depending on the 3rd party HTML parsing library used (e.g. selectolax, beautifulsoup or pyquery) we may not have the parsing power we need. For example, I wasn't able to get all the .parents() for a particular node using selectolax, but I could with beautifulsoup. So, the "algorithm" designed for this may differ depending on the library chosen.

I'll come back to this after we have the first implementation for this API working. We can probably add it as a feature/improvement later.

ericholscher

This is a great start. I wonder if we can simplify things to start, and perhaps just treat internal & external sites the same? That might make the code way simpler and less prone to bugs. I'm not 100% sure that makes sense, but I think reducing special cases as much as possible will be a benefit in terms of maintaining and consuming the API.

readthedocs/embed/v3/views.py

ericholscher · 2021-07-08T22:59:57Z

readthedocs/embed/v3/views.py

+
+        return self._parse_based_on_doctool(page_content, fragment, doctool, doctoolversion)
+
+    def _find_main_node(self, html):


Seems like this is reimplementing

readthedocs.org/readthedocs/search/parsers.py

Line 64 in b3baa7a

def _get_main_node(self, html):

I'd like to find a way to have both of these codebases using similar abstractions, instead of reinventing the same logic over again.

Yeah. I mainly copied&pasted this from there. We can probably move this to a utils function somewhere and use the same in both places.

readthedocs/settings/base.py

ericholscher · 2021-07-08T23:07:58Z

readthedocs/embed/v3/views.py

+
+        return node.html
+
+    def get(self, request):


Are we not doing any auth here?

Not for now. Do we need auth at all? We are giving users to access only public internal projects (from inside _get_page_content_from_storage method) or public webpages on the internet.

I remember thinking about this and not finding a good and safe way to support private projects with session auth. Do we want other type of auth for this? If this is the case, we will need to consider auth only for internal projects.

I thought it was safer, for now, to start only with public documents and grow from there if we require auth/private documents in the future.

Well, we need to support this on .com, so we definitely need to figure out auth I think. I'd imagine it would be with session auth for the same domain, and explicit API key auth cross-domain, as we discussed previously?

I think this can be implemented in corporate repository by overriding the class and the permission_classes field with a custom class that make these checks. Makes sense to you?

Yea, seems reasonable.

Remove `query` and `fragment` from it before requesting it.

Returning this metadata makes the internal/external document requests being different. Let's avoid this for now since we are not using this metadata yet. We can add it when we needed in the future.

humitos · 2021-07-12T15:20:04Z

I copied these 3 comments from the review since they are all related.

I wonder if we can simplify things to start, and perhaps just treat internal & external sites the same? That might make the code way simpler and less prone to bugs

For .com and private projects, we have to access it via storage. We also need auth checks in that case though.

Are we not doing any auth here?

I started using just requests.get for all the pages. Then I realized that I needed auth for private projects, but then I remembered that private projects were complicated and added the PUBLIC filter to the requests done in the internal storage.

In any case, I'm 👍🏼 on simplifying this code to use the exact same method for internal/external documents if possible. If we plan to query private projects somehow, we could keep this code that's already written. Otherwise, we can delete it and use requests.get directly.

ericholscher · 2021-07-12T18:41:04Z

In any case, I'm 👍🏼 on simplifying this code to use the exact same method for internal/external documents if possible. If we plan to query private projects somehow, we could keep this code that's already written. Otherwise, we can delete it and use requests.get directly.

Yea -- let's keep the code around for now. I do think we'll want to support private projects, at least in the simple case on .com (eg. hoverxref). We need to implement better support for private queries though, so I'm also OK with waiting a bit longer to implement that so we can do something bigger across all the proxied API's.

Co-authored-by: Eric Holscher <[email protected]>

- too many redirects - error reading the external URL requested

humitos · 2021-08-18T16:00:40Z

I'm marking this PR as ready for review because all the logic and tests are ready. I'll continue writing the documentation in the meanwhile.

humitos · 2021-08-24T13:11:11Z

Use /api/v3/embed/ and send a dated version from the client (e.g. X-API-Version: 2021-08-18. GitHub uses Accept: header: docs.github.com/en/rest/overview/media-types#request-specific-version)

I'm leaning in this direction more.

I think it makes sense to use the /v3/ in the URL's path together with a version sent in the client. However, I think we should default to a fixed version (the current one --that we can define as today: 2021-08-18) instead of latest. This allows us to break requests only when we deprecate the default (current) version.

Backward-incompatible changes will generate a new dated version that needs to be sent in the client.
Backward-compatible changes won't generate a new dated release

The version in the URL shouldn't change unless we break the API completely with a different structure (REST to GrapQL), authentication (Session or Token to OAuth) or something big enough. I think we are on time to make this happen without breaking what we currently have. Then, if a user realizes that the API is not returning what they are expecting, they will be forced to send the version from the client to receive the response they are expecting.

(copied from a message that I sent privately to Eric)

@agjohnson @stsewd do you have opinions here?

ericholscher · 2021-08-24T15:43:52Z

I just worry we're going to end up with a situation like with the docker images, where we can never change the client version, because it will break too many people. Our API isn't super heavily used I guess, but I guess having a way to opt into newer features while not breaking people is OK, but definitely will become a frustration and complication in documentation :/

ericholscher

I think this is ready to soft launch in prod. Y'all can discuss the API versioning perhaps while I'm on vacation, but I don't want this blocked on my review :)

readthedocs/embed/v3/tests/test_external_pages.py

ericholscher · 2021-08-17T16:35:51Z

readthedocs/embed/v3/tests/test_external_pages.py

+        if Version(docutils.__version__) >= Version('0.17'):
+            content = '<div class="body" role="main">\n            \n  <section id="title">\n<h1>Title<a class="headerlink" href="https://docs.project.com#title" title="Permalink to this headline">¶</a></h1>\n<p>This is an example page used to test EmbedAPI parsing features.</p>\n<section id="sub-title">\n<h2>Sub-title<a class="headerlink" href="https://docs.project.com#sub-title" title="Permalink to this headline">¶</a></h2>\n<p>This is a reference to <a class="reference internal" href="https://docs.project.com#sub-title"><span class="std std-ref">Sub-title</span></a>.</p>\n</section>\n</section>\n\n\n          </div>'
+        else:
+            content = '<div class="body" role="main">\n            \n  <div class="section" id="title">\n<h1>Title<a class="headerlink" href="https://docs.project.com#title" title="Permalink to this headline">¶</a></h1>\n<p>This is an example page used to test EmbedAPI parsing features.</p>\n<div class="section" id="sub-title">\n<h2>Sub-title<a class="headerlink" href="https://docs.project.com#sub-title" title="Permalink to this headline">¶</a></h2>\n<p>This is a reference to <a class="reference internal" href="https://docs.project.com#sub-title"><span class="std std-ref">Sub-title</span></a>.</p>\n</div>\n</div>\n\n\n          </div>'


This feels quite brittle in terms of checking whitespace, etc. I do worry it might break easily, but I don't have a strong feeling here about how to fix it.

think they can be in an external file or use a multiline string withtextwrap.dedent

Yeap, I think we will find out a better way to do these checks in a more maintainable way. I'm doing something similar in sphinx-hoverxref and I haven't had too much problems with this in particular.

However, the output changes on version upgrade, in ways that don't affect the result (e.g. adds an extra CSS class) but I don't think there is too much we can do in that case if we want to be sure in our checks.

stsewd · 2021-09-08T21:48:53Z

Backward-incompatible changes will generate a new dated version that needs to be sent in the client.

I feel like that's going to be a lot of code to maintain based on dates, and don't see much changing, as we should return the html as close as possible to the original, additional metadata is still backwards compatible.

agjohnson · 2021-09-08T21:52:15Z

I think it makes sense to use the /v3/ in the URL's path together with a version sent in the client.

Have you thought about what this pattern will look like in code? If we version the API at the client, that is a different contract with the user than just having a /api/v2 vs /api/v3. Allowing a version specified by the client would make us have to maintain several, mostly immutable API versions. We'd probably have to document all of the versions we support as well.

Perhaps it's not a huge issue and we wouldn't use this in many places anyways, but I'm picturing our APIs becoming a nest of conditional logic around API client versions.

I am a fan of this pattern however, it is nice UX from the client perspective to not have to worry about client/server mismatches. It does take API deprecation off the table though.

stsewd · 2021-09-08T22:26:43Z

readthedocs/embed/v3/views.py

+
+        response = {
+            'url': url,
+            'fragment': fragment if fragment else None,


we are calling id to the fragment in the search API.

stsewd · 2021-09-08T22:30:39Z

readthedocs/embed/v3/views.py

+                    # ...
+                    # </dl>
+
+                    # TODO: figure it out if it's needed to remove the siblings here


are we going to resolve these todos in this PR?

It's not a blocker to move forward.

Co-authored-by: Eric Holscher <[email protected]>

humitos · 2021-09-09T10:02:27Z

@agjohnson

Have you thought about what this pattern will look like in code?

Yes, but I haven't done a POC, tho. I imagine that we can have two different approaches at least:

nested ifs for small things in the middle of the logic: simpler to start and harder to maintain in the long term
modularized logic and override by using one class per API version: harder to start and easier to maintain in the long term

We could probably start with the nested ifs when this situation needs to be handled to understand how much work requires and how "dirty" the code looks like and decide if it is worth the effort to migrate to the overriding pattern that seems cleaner at first sight.

If we version the API at the client, that is a different contract with the user than just having a /api/v2 vs /api/v3. Allowing a version specified by the client would make us have to maintain several, mostly immutable API versions. We'd probably have to document all of the versions we support as well.

We have been maintaining immutable API already (APIv1 and APIv2 are good examples) and the downside we have on them currently is that they avoided us to move forward and we had to implement v3.

This is definitely more work, yes. But I think I'm fine paying for this work if it unblocks us to make changes in the API that change behavior in a particular case without breaking the rest of the API. Example: "when a glossary is detected in Sphinx we remove the siblings" (https://github.com/readthedocs/readthedocs.org/pull/8319/files#diff-93aac61b8b6d747d415ef7412a0f6aa10410dd7b1aa229ea075026177915724fR166)

Honestly, at first sight, I don't expect to have many of these changes, but I'd like to have a defined pattern to follow in case we need to do it.

I am a fan of this pattern however, it is nice UX from the client perspective to not have to worry about client/server mismatches. It does take API deprecation off the table though.

Why do you think that "It does take API deprecation off the table"?

humitos · 2021-09-09T10:05:54Z

@stsewd

Backward-incompatible changes will generate a new dated version that needs to be sent in the client.

I feel like that's going to be a lot of code to maintain based on dates, and don't see much changing, as we should return the html as close as possible to the original, additional metadata is still backwards compatible.

Yeah, additional metadata is not a backward-incompatible change. However, making a change in the behavior is what I'm referring to here. Example: "a query to a term returns only the title of it" changes to "a query to a term returns the title and the content"

agjohnson · 2021-09-14T18:02:49Z

modularized logic and override by using one class per API version

I was picturing this as well. Seems it should work pretty well. I also agree we probably won't be forking API endpoint functionality very often.

We have been maintaining immutable API already

Well, I would say otherwise -- we've straight up made API endpoints stop working without even communicating the deprecation :)

We don't support old iterations of endpoint (for example, our project listing) is my point though. We do support major versions of our API, but effectively supporting a v3.1, 3.2, etc will be new territory for us.

The point that stands out for me here is: if we already have immutable API versions, what benefit does a secondary versioning mechanism have over adding a URL path for /api/v4? While I do prefer the client header option, and I don't think users should need to know about /api/v2/ vs /api/v3/, we are already using a versioned pattern for the API URL. The secondary version specifier seems like odd design for an API in this case.

At the code level, would extending APIv3 in an APIv3.1 implementation be cleaner or messier than splitting up our API with conditional logic against the API client header?

If there was a clean way to use a X-API: 2021-08-01 header to route to the correct API version, using a top level /api/project/foo (without the v2/v3/v4), this seems like really nice design and maybe a separate project.

Why do you think that "It does take API deprecation off the table"?

Deprecating a version will never be graceful for the user, and so (similar to how APIv2 keeps avoiding our deprecation axe) we'll end up supporting the endpoint version for longer than we want.

So, to bring together all of these points:

We don't expect many forks of the API
We already have a pattern for maintaining immutable API versions
Multiple version numbers are confusing

I'm not a huge fan of /api/v3.1/ or /api/v4/, but signs are pointing this direction for the reasons above.

humitos · 2021-09-15T11:04:25Z

The point that stands out for me here is: if we already have immutable API versions, what benefit does a secondary versioning mechanism have over adding a URL path for /api/v4?

I think this is the key question, IIRC this is why we started this conversation with @ericholscher (goes back to #8319 (comment)). The reason is that it's super confusing to have feature APIs (e.g. Embed, Search) with a different version than resource APIs (e.g. data models). For example, if we need to make a breaking change in EmbedAPI we will need /api/v4/embed/ but all the resource APIs will still be under /api/v3/(projects/versions/etc); and maybe SearchAPI will be /api/v7/search/.

If there was a clean way to use a X-API: 2021-08-01 header to route to the correct API version, using a top level /api/project/foo (without the v2/v3/v4), this seems like really nice design and maybe a separate project.

I agree that this is probably the best pattern, but I think we can't remove the /vX from the URL at this point. However, adding the version in the client seems like a middle ground solution that we can implement to get some of its benefits without breaking existing things.

agjohnson · 2021-09-15T20:52:02Z

I agree that this is probably the best pattern, but I think we can't remove the /vX from the URL at this point.

Yeah, though we might not need to remove these endpoints either way. For example, the /api/v2 and /api/v3 endpoints could live on, but instead we'd promote accessing the APIs by using a header and /api/embed/ instead. Adding a /api/embed doesn't even have to happen now, but maybe is just part of our long term goal here.

So, perhaps in that case, we are starting with your suggestion of implementing this on /api/v3 with a header, with plans to have this pattern also on a top level /api/ as part of a later, separate project. The immediate case is workable, albeit awkward, but eventually everything will make more sense.

This seems like a good middle ground and a good step towards a nicer API design.

…humitos/embed-api-v3

humitos · 2021-09-21T08:56:56Z

I'm merging this PR and continue with the work required in sphinx-hoverxref to use this new APIv3. We can adapt things that may not be working as we expect while working on the extension.

humitos added 3 commits July 5, 2021 16:15

Move clean_links to embed/utils.py

61554ac

This function will be shared between APIv2 and APIv3.

clean_links response HTML in raw instead of a PyQuery object

311ada4

Making `clean_links` to response HTML in raw allows us to reuse this code in APIv2 and APIv3.

humitos requested a review from a team July 6, 2021 11:38

humitos added 8 commits July 6, 2021 13:56

Add docs.sympy.org to the allowed domains

e841b98

This project has already tried our extension but since they are not hosted on Read the Docs it didn't work for them. See sympy/sympy#17456

Implement doctool= usage for Sphinx dt known cases

78f6656

Use timeout= when requesting external pages

fe3a507

Handle requests TooManyRedirects and other errors

28e4e83

Use cache-keys and djangos settings for timeouts

71b9e2d

More logs and comments

1aa70e3

Remove one unneeded conditional

5bab780

Return fragment=null if there is no fragment

59fe0f1

ericholscher reviewed Jul 8, 2021

View reviewed changes

humitos added 8 commits July 12, 2021 16:10

Use setting for request.get timeout= argument

517d3e8

Log exception to track wrong URLs in Sentry

562c260

Sanitize the URL inside _download_page_content

73e2ee3

Remove `query` and `fragment` from it before requesting it.

Handle malformed URLs (not netloc or scheme)

854a44b

Use for/else syntax sugar instead of a temporary variable

49e02d5

Call clean_links before creating the response

be32bd3

Do not depend on impicit state: pass the required arguments

cc2227c

Don't return metadata (project, version, language, path)

1a72c07

Returning this metadata makes the internal/external document requests being different. Let's avoid this for now since we are not using this metadata yet. We can add it when we needed in the future.

humitos and others added 5 commits July 15, 2021 10:29

Update readthedocs/embed/v3/views.py

7b6b493

Co-authored-by: Eric Holscher <[email protected]>

Improve the response http status codes

418d9ac

- too many redirects - error reading the external URL requested

Sanitize URL before passing it to clean_liniks

8a77043

Comment to sanitize cache_key by URL

d9ca50e

Update import for clean_links in tests

3be9b8e

humitos added 9 commits August 18, 2021 15:16

Make test for sphinxcontrib-bibtex to pass

c4751c7

Auto-delete _build directory after test run

7dd4b4f

Checks that depend on Sphinx version (3.5)

d134ab9

Don't make doctoolversion= attribute mandatory when passing doctool=

4b5f2d6

Sphinx 3.5 seems to be different on its HTML

0056e90

Lint

cf3da8a

Fragment case changed on 3.0.0

aeaad76

Don't run EmbedAPIv3 tests by default

ef5a14c

Sphinx 3.5 adds an <span> on <dl>

160e51b

humitos marked this pull request as ready for review August 18, 2021 16:00

humitos requested a review from ericholscher August 18, 2021 16:08

ericholscher approved these changes Sep 2, 2021

View reviewed changes

stsewd reviewed Sep 8, 2021

View reviewed changes

Update readthedocs/embed/v3/tests/test_external_pages.py

cc76893

Co-authored-by: Eric Holscher <[email protected]>

humitos added 2 commits September 21, 2021 10:48

Log url= together with fragment=

5e25c47

Merge branch 'master' of github.com:readthedocs/readthedocs.org into …

7edb272

…humitos/embed-api-v3

humitos merged commit 7c4ccbf into master Sep 21, 2021

humitos deleted the humitos/embed-api-v3 branch September 21, 2021 09:13

humitos mentioned this pull request Nov 25, 2021

Add support for :doc: role readthedocs/sphinx-hoverxref#18

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Embed APIv3: initial implementation #8319

Embed APIv3: initial implementation #8319

humitos commented Jul 6, 2021 •

edited

Loading

humitos commented Jul 7, 2021

ericholscher left a comment

ericholscher Jul 8, 2021

humitos Jul 12, 2021

ericholscher Jul 8, 2021

humitos Jul 12, 2021

ericholscher Jul 12, 2021

humitos Aug 16, 2021

ericholscher Aug 16, 2021

humitos commented Jul 12, 2021

ericholscher commented Jul 12, 2021

humitos commented Aug 18, 2021

humitos commented Aug 24, 2021

ericholscher commented Aug 24, 2021

ericholscher left a comment

ericholscher Aug 17, 2021

stsewd Sep 8, 2021

humitos Sep 9, 2021

stsewd commented Sep 8, 2021

agjohnson commented Sep 8, 2021

stsewd Sep 8, 2021

stsewd Sep 8, 2021

humitos Sep 9, 2021

humitos commented Sep 9, 2021

humitos commented Sep 9, 2021

agjohnson commented Sep 14, 2021

humitos commented Sep 15, 2021 •

edited

Loading

agjohnson commented Sep 15, 2021

humitos commented Sep 21, 2021


		return self._parse_based_on_doctool(page_content, fragment, doctool, doctoolversion)

		def _find_main_node(self, html):

Embed APIv3: initial implementation #8319

Embed APIv3: initial implementation #8319

Conversation

humitos commented Jul 6, 2021 • edited Loading

Implementation

ToDo

humitos commented Jul 7, 2021

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humitos commented Jul 12, 2021

ericholscher commented Jul 12, 2021

humitos commented Aug 18, 2021

humitos commented Aug 24, 2021

ericholscher commented Aug 24, 2021

ericholscher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stsewd commented Sep 8, 2021

agjohnson commented Sep 8, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humitos commented Sep 9, 2021

humitos commented Sep 9, 2021

agjohnson commented Sep 14, 2021

humitos commented Sep 15, 2021 • edited Loading

agjohnson commented Sep 15, 2021

humitos commented Sep 21, 2021

humitos commented Jul 6, 2021 •

edited

Loading

humitos commented Sep 15, 2021 •

edited

Loading