Embed APIv3: initial implementation #8319
Conversation
This function will be shared between APIv2 and APIv3.
Making `clean_links` operate on the raw response HTML allows us to reuse this code in APIv2 and APIv3.
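As a rough sketch of what a shared link-cleaning helper does (the real `clean_links` uses PyQuery; this stdlib-only version is only illustrative), relative `href`/`src` attributes in the raw HTML are rewritten to absolute URLs against the document's own URL:

```python
import re
from urllib.parse import urljoin

def clean_links(html, base_url):
    """Rewrite relative href/src attributes to absolute URLs.

    Illustrative stdlib-only sketch; the actual implementation
    parses the HTML with PyQuery instead of a regex.
    """
    def _absolutize(match):
        attr, quote, url = match.groups()
        return f'{attr}={quote}{urljoin(base_url, url)}{quote}'

    return re.sub(r'\b(href|src)=(["\'])(.*?)\2', _absolutize, html)
```

With a base URL of `https://docs.project.com/page.html`, a fragment link like `href="#title"` becomes an absolute link into that page, which is what lets the same output be embedded on any site.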
- supports external sites for allowed domains (via setting)
- given a URL and a fragment, it returns the exact HTML portion if found
- caches external requests for some time
- rate-limits external requests to 50/m per domain
- uses `requests.get` for external sites (only public documents)
- uses Django storage for locally hosted sites (allows only PUBLIC versions)
- reuses `clean_links` util function (depends on PyQuery) from EmbedAPI v2
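The 50/m per-domain rate limit could be sketched with a sliding window like the following (function name and in-memory storage are hypothetical; the actual implementation presumably uses Django's cache backend):

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

EXTERNAL_RATE_LIMIT = 50      # max requests...
EXTERNAL_RATE_WINDOW = 60.0   # ...per domain, per 60-second window

_recent_requests = defaultdict(deque)  # domain -> timestamps of recent requests

def allow_external_request(url, now=None):
    """Sliding-window rate limit: at most 50 requests/minute per domain."""
    now = time.monotonic() if now is None else now
    domain = urlsplit(url).netloc
    window = _recent_requests[domain]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] >= EXTERNAL_RATE_WINDOW:
        window.popleft()
    if len(window) >= EXTERNAL_RATE_LIMIT:
        return False
    window.append(now)
    return True
```

Each domain gets its own window, so one heavily-queried external site can't starve requests to the others.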
This project has already tried our extension but since they are not hosted on Read the Docs it didn't work for them. See sympy/sympy#17456
I played a little more with […]. Also, even when there are enough […]. On the other hand, depending on the 3rd party HTML parsing library used (e.g. […]). I'll come back to this after we have the first implementation for this API working. We can probably add it as a feature/improvement later.
This is a great start. I wonder if we can simplify things to start, and perhaps just treat internal & external sites the same? That might make the code way simpler and less prone to bugs. I'm not 100% sure that makes sense, but I think reducing special cases as much as possible will be a benefit in terms of maintaining and consuming the API.
```python
return self._parse_based_on_doctool(page_content, fragment, doctool, doctoolversion)
```

```python
def _find_main_node(self, html):
```
Seems like this is reimplementing:

```python
def _get_main_node(self, html):
```
I'd like to find a way to have both of these codebases using similar abstractions, instead of reinventing the same logic over again.
Yeah. I mainly copied & pasted this from there. We can probably move this to a utils function somewhere and use the same one in both places.
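A shared util could be sketched with the stdlib alone (class name hypothetical; both views actually use PyQuery). The idea is to collect the raw HTML of the node whose `id` matches the requested fragment:

```python
from html.parser import HTMLParser

class FragmentExtractor(HTMLParser):
    """Collect the raw HTML of the element whose id matches `fragment`.

    Simplified sketch: void tags without explicit closing (e.g. <br>)
    would unbalance the depth counter; a real parser handles those.
    """
    def __init__(self, fragment):
        super().__init__(convert_charrefs=False)
        self.fragment = fragment
        self.depth = 0       # nesting depth inside the matched node
        self.chunks = []     # raw HTML pieces of the matched node

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get('id') == self.fragment:
            self.depth = 1
        if self.depth:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.chunks.append(f'</{tag}>')
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)
```

Feeding a page to the parser and joining `chunks` yields the HTML of just the fragment, which is the behavior both `_find_main_node` and `_get_main_node` ultimately need.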
readthedocs/embed/v3/views.py

```python
return node.html
```

```python
def get(self, request):
```
Are we not doing any auth here?
Not for now. Do we need auth at all? We are giving users access only to public internal projects (from inside the `_get_page_content_from_storage` method) or public webpages on the internet.

I remember thinking about this and not finding a good and safe way to support private projects with session auth. Do we want another type of auth for this? If so, we will need to consider auth only for internal projects.

I thought it was safer, for now, to start only with public documents and grow from there if we require auth/private documents in the future.
Well, we need to support this on .com, so we definitely need to figure out auth I think. I'd imagine it would be with session auth for the same domain, and explicit API key auth cross-domain, as we discussed previously?
I think this can be implemented in the corporate repository by overriding the class and the `permission_classes` field with a custom class that makes these checks. Does that make sense to you?
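For illustration only (the class name, the allow-list, and its source are hypothetical; a real DRF version would subclass `rest_framework.permissions.BasePermission` and read Django settings), such an override could look like:

```python
from urllib.parse import urlsplit

class AllowedDomainPermission:
    """Only allow embed queries whose URL domain is in an allow-list.

    Standalone sketch: DRF permissions are duck-typed on
    `has_permission(request, view)`, so the logic is shown without
    the rest_framework dependency.
    """
    # Stand-in for a value that would come from a Django setting.
    allowed_domains = {'docs.project.com'}

    def has_permission(self, request, view):
        url = request.GET.get('url', '')
        return urlsplit(url).netloc in self.allowed_domains
```

The corporate repository could then replace `allowed_domains` (or the whole check) with organization-aware logic while the open-source view stays unchanged.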
Yea, seems reasonable.
Remove `query` and `fragment` from it before requesting it.
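That normalization is a small `urllib.parse` helper (function name is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query_and_fragment(url):
    """Drop the query string and fragment before requesting the page."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, '', ''))
```

This keeps the cache key and the outgoing request stable regardless of which fragment or query string the client passed in.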
Returning this metadata makes the internal/external document requests different. Let's avoid this for now since we are not using this metadata yet. We can add it when we need it in the future.
I copied these 3 comments from the review since they are all related.
I started using just […]. In any case, I'm 👍🏼 on simplifying this code to use the exact same method for internal/external documents if possible. If we plan to query private projects somehow, we could keep this code that's already written. Otherwise, we can delete it and use […].
Yea -- let's keep the code around for now. I do think we'll want to support private projects, at least in the simple case on .com (e.g. hoverxref). We need to implement better support for private queries though, so I'm also OK with waiting a bit longer to implement that so we can do something bigger across all the proxied APIs.
Co-authored-by: Eric Holscher <[email protected]>
- too many redirects
- error reading the external URL requested
I'm marking this PR as ready for review because all the logic and tests are ready. I'll continue writing the documentation in the meantime.
I'm leaning in this direction more. I think it makes sense to use the […]
The version in the URL shouldn't change unless we break the API completely with a different structure (REST to GraphQL), authentication (Session or Token to OAuth) or something similarly big. I think we are still in time to make this happen without breaking what we currently have. Then, if a user realizes that the API is not returning what they are expecting, they will be forced to send the version from the client to receive the response they expect. (copied from a message that I sent privately to Eric)

@agjohnson @stsewd do you have opinions here?
I just worry we're going to end up in a situation like with the Docker images, where we can never change the client version because it will break too many people. Our API isn't super heavily used, I guess, and having a way to opt into newer features while not breaking people is OK, but it will definitely become a frustration and complication in documentation :/
I think this is ready to soft launch in prod. Y'all can discuss the API versioning perhaps while I'm on vacation, but I don't want this blocked on my review :)
```python
if Version(docutils.__version__) >= Version('0.17'):
    content = '<div class="body" role="main">\n \n <section id="title">\n<h1>Title<a class="headerlink" href="https://docs.project.com#title" title="Permalink to this headline">¶</a></h1>\n<p>This is an example page used to test EmbedAPI parsing features.</p>\n<section id="sub-title">\n<h2>Sub-title<a class="headerlink" href="https://docs.project.com#sub-title" title="Permalink to this headline">¶</a></h2>\n<p>This is a reference to <a class="reference internal" href="https://docs.project.com#sub-title"><span class="std std-ref">Sub-title</span></a>.</p>\n</section>\n</section>\n\n\n </div>'
else:
    content = '<div class="body" role="main">\n \n <div class="section" id="title">\n<h1>Title<a class="headerlink" href="https://docs.project.com#title" title="Permalink to this headline">¶</a></h1>\n<p>This is an example page used to test EmbedAPI parsing features.</p>\n<div class="section" id="sub-title">\n<h2>Sub-title<a class="headerlink" href="https://docs.project.com#sub-title" title="Permalink to this headline">¶</a></h2>\n<p>This is a reference to <a class="reference internal" href="https://docs.project.com#sub-title"><span class="std std-ref">Sub-title</span></a>.</p>\n</div>\n</div>\n\n\n </div>'
```
This feels quite brittle in terms of checking whitespace, etc. I do worry it might break easily, but I don't have a strong feeling here about how to fix it.
I think they can be in an external file, or use a multiline string with `textwrap.dedent`.
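A sketch of the `textwrap.dedent` idea (helper name and trimmed HTML are illustrative, not the actual test fixtures) that also factors out the docutils 0.17 `<section>` vs `<div class="section">` difference:

```python
import textwrap

def expected_fragment(new_docutils):
    """Build the expected HTML for either docutils HTML writer (sketch)."""
    open_tag = '<section' if new_docutils else '<div class="section"'
    close_tag = '</section>' if new_docutils else '</div>'
    return textwrap.dedent(f'''\
        {open_tag} id="title">
        <h1>Title</h1>
        {close_tag}''')
```

The multiline string stays readable in the test module while `dedent` strips the indentation, instead of packing the whole expected document into one `\n`-laden literal.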
Yeap, I think we will find a better, more maintainable way to do these checks. I'm doing something similar in sphinx-hoverxref and I haven't had too many problems with this in particular.

However, the output does change on version upgrades, in ways that don't affect the result (e.g. an extra CSS class gets added), and I don't think there is much we can do about that if we want our checks to be strict.
I feel like that's going to be a lot of code to maintain based on dates, and I don't see much changing, as we should return the HTML as close as possible to the original; additional metadata is still backwards compatible.
Have you thought about what this pattern will look like in code? If we version the API at the client, that is a different contract with the user than just having a […]. Perhaps it's not a huge issue and we wouldn't use this in many places anyways, but I'm picturing our APIs becoming a nest of conditional logic around API client versions. I am a fan of this pattern however; it is nice UX from the client perspective to not have to worry about client/server mismatches. It does take API deprecation off the table though.
```python
response = {
    'url': url,
    'fragment': fragment if fragment else None,
```
We are calling the fragment `id` in the search API.
```python
# ...
# </dl>

# TODO: figure it out if it's needed to remove the siblings here
```
Are we going to resolve these TODOs in this PR?
It's not a blocker to move forward.
Co-authored-by: Eric Holscher <[email protected]>
Yes, but I haven't done a POC, tho. I imagine that we can have at least two different approaches: […]
We could probably start with the nested […]
We have been maintaining immutable APIs already (APIv1 and APIv2 are good examples) and the downside we currently have with them is that they prevented us from moving forward, so we had to implement v3. This is definitely more work, yes. But I'm fine paying for this work if it unblocks us to make changes in the API that alter behavior in a particular case without breaking the rest of the API. Example: "when a glossary is detected in Sphinx we remove the siblings" (https://github.com/readthedocs/readthedocs.org/pull/8319/files#diff-93aac61b8b6d747d415ef7412a0f6aa10410dd7b1aa229ea075026177915724fR166). Honestly, at first sight, I don't expect to have many of these changes, but I'd like to have a defined pattern to follow in case we need to do it.
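For illustration of this pattern (the header scheme, version numbers, and function names are hypothetical, not part of the PR), a behavior change like the glossary-siblings example could be gated on a client-supplied version:

```python
def parse_client_version(header_value, default=(1, 0)):
    """Parse a 'major.minor' client-version header into a comparable tuple.

    Hypothetical scheme: clients that send nothing (or garbage) get the
    oldest behavior, so existing integrations keep working.
    """
    try:
        major, minor = header_value.split('.')
        return (int(major), int(minor))
    except (AttributeError, ValueError):
        return default

def should_remove_glossary_siblings(client_version):
    """Behavior switch introduced in a hypothetical client version 1.1."""
    return client_version >= (1, 1)
```

The endpoint stays at one URL; only clients that explicitly opt into a newer version see the changed output, which is the "forced to send the version from the client" flow discussed above.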
Why do you think that it "takes API deprecation off the table"?
Yeah, additional metadata is not a backward-incompatible change. However, a change in behavior is what I'm referring to here. Example: "a query to a term returns only the title" changes to "a query to a term returns the title and the content".
I was picturing this as well. Seems it should work pretty well. I also agree we probably won't be forking API endpoint functionality very often.
Well, I would say otherwise -- we've straight up made API endpoints stop working without even communicating the deprecation :) My point is that we don't support old iterations of an endpoint (for example, our project listing). We do support major versions of our API, but effectively supporting a v3.1, v3.2, etc. will be new territory for us. The point that stands out for me here is: if we already have immutable API versions, what benefit does a secondary versioning mechanism have over adding a URL path for […]? At the code level, would extending APIv3 in an APIv3.1 implementation be cleaner or messier than splitting up our API with conditional logic against the API client header? If there was a clean way to use a […]
Deprecating a version will never be graceful for the user, and so (similar to how APIv2 keeps avoiding our deprecation axe) we'll end up supporting the endpoint version for longer than we want. So, to bring together all of these points: […]
I'm not a huge fan of […]
I think this is the key question; IIRC this is why we started this conversation with @ericholscher (goes back to #8319 (comment)). The reason is that it's super confusing to have feature APIs (e.g. Embed, Search) with a different version than resource APIs (e.g. data models). For example, if we need to make a breaking change in EmbedAPI we will need […]
I agree that this is probably the best pattern, but I think we can't remove the […]
Yeah, though we might not need to remove these endpoints either way. For example, the […]. So, perhaps in that case, we are starting with your suggestion of implementing this on […]. This seems like a good middle ground and a good step towards a nicer API design.
I'm merging this PR and will continue with the work required in […]
Implementation

- uses `requests.get` for external sites (only public documents)
- `requests.get` follows up to 30 redirects
- `requests.get` uses a timeout of 1 second
- reuses `clean_links` util function (depends on PyQuery) from EmbedAPI v2
- adds extra fields in the response for locally hosted documents (e.g. `project`, `language`, `version` and `path`)
- […] `dt` […]

ToDo

- […] `doctool` and `doctoolversion` […]
- […] `dt` […]
- […] (`embed/v3`, `api/v3/embed`, other?)

Reference: #8222
Design document: https://docs.readthedocs.io/en/latest/development/design/embed-api.html