-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Embed: design doc for new embed API #8039
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are good ideas here.
I don't think we should have 2 separate endpoints for this and we should only rely on ?url=
attribute unless there is a good reason to not to.
Also, we should support nparagraph=
or nwords=
or both to be able to chunk the content of the section returned (see https://github.com/readthedocs/readthedocs-ext/pull/304)
- There are three ways of querying the API, and some rely on Sphinx's concepts like ``doc``. | ||
- Doesn't cache responses or doesn't purge the cache on build. | ||
- Doesn't support MkDocs. | ||
- It returns all sections from the current page. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can send section=
to get only the section you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean here is that it always returns all the sections (title and id), not the section content.
These improvements aren't breaking changes, so we can implement them in the old and new API. | ||
|
||
- Support for MkDocs. | ||
- Always return a valid/well formed HTML block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this refers to the dd
/dt
issue, I'm not sure if this will be easily doable for all the cases. We may end up re-parsing the dd
to convert that HTML tag to another one, but that will break the style based on dd
tags. So, this will be tricky to do it right and make it work for all the themes --which is also related with the new extras
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just need to extract only that element. For example
<dl class="foo">
.. more..
<dt>foo</dt> <dd>bar</dd>
.. more...
</dl>
becomes
<dl class="foo">
<dt>foo</dt> <dd>bar</dd>
</dl>
Extracting content isn't hard to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm... I think that has the same problem.
When parsing citation
/glossary
we don't need to return the title/definition (dt
), we only want the description (dd
). Example in code
readthedocs.org/readthedocs/embed/views.py
Lines 329 to 335 in eb682d8
# Structure: | |
# <dl class="glossary docutils"> | |
# <dt id="term-definition">definition</dt> | |
# <dd>Text definition for the term</dd> | |
# ... | |
# </dl> | |
query_result = query_result.next() |
Following, your example, to make it valid it should be:
<dl class="foo">
<dt></dt>
<dd>bar</dd>
</dl>
However, this won't probably render correctly inside the tooltip, since in that context is not a description list --it's not a list at all-- but just the definition of one of the term from that list.
In those cases, citation
/glossary
will be rendered in the same than any other definition list and I'm not sure we want that. We need to think how we will support this. Example:
Glossary (only includes the definition)
Config value (includes the definition list completely repeating the term hovered)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should repeat the title, it doesn't necessary match the text where the tooltip is present. Removing the title if it's the same as the text can be done in the client side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glossary and Citation special cases can be managed from the frontend/client since we have all the data we need (note that we return the .next()
HTML element). We will just need to parse it remove the content we don't want to display.
However, for the dt
case (definition of a config, Python method/function, etc) it won't work because we need the .parent()
of HTML element picked by ID. Take a look at this example:
<dl class="std confval">
<dt id="confval-hoverxref_role_types">
<code class="sig-name descname">hoverxref_role_types</code>
<a class="headerlink" href="#confval-hoverxref_role_types" title="Permalink to this definition">¶</a>
</dt>
<dd>
<p>Description: Style to use by default when hover each type of reference (role).</p>
<!-- ... snip ... -->
</dd>
</dl>
(example from https://sphinx-hoverxref.readthedocs.io/en/latest/configuration.html#confval-hoverxref_role_types)
With the id confval-hoverxref_role_types
we are currently returning the whole dl
because we access to the .parent()
element. If we return just the HTML tag matched, we will be returning only the dt
which does not include the description nor any content.
Without handling this special case properly, we will show tooltips like:
We need to find a general solution for these special cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@humitos not sure to understand the problem there? that's the same example we have in the document. The id is on the dt element, so we fetch the dt and the next dd enclosed in the dl tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may not have expressed myself properly.
This is a special case because we are manipulating the HTML to manage a particular case on Sphinx: return the parent of a particular ID, instead of itself. Since we are going to support external sites that will not be generated with Sphinx (*), we can't assume that we always have to do this special HTML manipulation. That's why I'm saying that we need a general solution to differentiate these special cases from regular (non-Sphinx) pages.
I'm thinking that we will need to communicate this to the backend somehow. Maybe a request GET argument: ?doctool=sphinx&version=3.5.1
(*) or even with a different version of Sphinx that may adds the id=
in the dl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's part of how we expect the html to be, following semantic html and using correct tags, if a site has malformed tags or if the id is in a different element we don't try to fix it, we just return it as is.
|
||
Allow us to query information about a page, like its list of sections. | ||
|
||
.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we will have only one way of querying the API, we should probably use ?url=
instead of sending multiple attributes. The implementation of the extensions is easier (and there are things not possible to implement without it --e.g. intersphinx) and the usage of the endpoint will be simplified too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a URL makes it depend on the domain, if the domain changes everything is broken. I'm fine supporting both to support intersphinx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html | |
.. http:get:: /_/api/v3/embed/page/?project=docs&version=latest&path=install.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure if we want to make this a v3 API, given that it works totally differently than the others :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not sure if we want to make this a v3 API, given that it works totally differently than the others :/
That's why my note in #8039 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is super bad. The other endpoints are for resources and this one is not.
However, I'm starting to think that this could be a different service as well? But maybe that's too much? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a URL makes it depend on the domain, if the domain changes everything is broken
Yes and no. Maybe (?), I guess. If they use https://readthedocs.org/api/v3/embed/?url=
endpoint and it shouldn't break if they don't remove the Domain for that Project from our platform. Even if the DNS for that domain is dead.
"sections": [ | ||
{ | ||
"title": "Installation", | ||
"id": "installation" | ||
}, | ||
{ | ||
"title": "Examples", | ||
"id": "examples" | ||
} | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of listing all the sections. I'd also like to see a way to explore/surf the API just by clicking these links, similar as we do with the _links
field in other endpoints. This way, you can discover the API by clicking on those, but also expose these links to developers.
API Client | ||
---------- | ||
|
||
Do we really need a JS client? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. IIRC @agjohnson idea here was to include the popup/modal form of this into the JS client. Besides, we could eventually be able to use this JS client from our extension itself, instead of maintaining the front-end in our own extension (or adding it as another popup extension together with the ones we currently support)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, the JS client is going to be doing a lot of work in addition to just doing the GET. We definitely want nice default widgets, along with helper methods that make this a much nicer UX for users.
Feels like that's a client thing, to be able to cut the content exactly where they want and show a "See more" link. And maybe can be done with css, without messing around the content itselff. |
I don't think we want to return/download a whole page just to show the first paragraph of the first section. That's my point. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good start, but I think we need a lot more thought and explanation behind some of the design decisions. I'd love to see more structure for a design doc here, explicitly stating the goals in a more detailed way. There is a ton of implied context in each "list of 10 items" that we should be explaining more fully.
Most importantly, we should have a sections of goals, and why they are valuable. For example, why is "Always return a valid/well formed HTML block." important? How are we going to ensure that at the code level?
|
||
Allow us to query information about a page, like its list of sections. | ||
|
||
.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html | |
.. http:get:: /_/api/v3/embed/page/?project=docs&version=latest&path=install.html |
|
||
Allow us to query the content of the section, with all links re-written as absolute. | ||
|
||
.. http:get:: /_/api/v3/embed/sections?project=docs&version=latest&path=install.html#examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. http:get:: /_/api/v3/embed/sections?project=docs&version=latest&path=install.html#examples | |
.. http:get:: /_/api/v3/embed/section/?project=docs&version=latest&path=install.html#examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the naming conventions here. We are naming a resource, so it should be plural, but we only have a single resource to get the data from.
API Client | ||
---------- | ||
|
||
Do we really need a JS client? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, the JS client is going to be doing a lot of work in addition to just doing the GET. We definitely want nice default widgets, along with helper methods that make this a much nicer UX for users.
but them querying a section would require to query a page to get the extra js/css files. | ||
- We could not return the ``title`` of the page/section as it would require more parsing to do | ||
(but we can re-use the code from search). | ||
Titles can be useful to build an UI like https://readthedocs.org/projects/docs/tools/embed/. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would a user build something like this without the ability to also get all the pages in a project? We aren't exposing an API for that in this design doc. Should we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point, we could have an endpoint for that, but now that you mentioned we need to support external sites, I don't think it would be possible. Or it would be rtd only feature.
|
||
Note that we only try to keep the current structure, | ||
if the page contains malformed HTML, we don't try to *fix it*. | ||
This improvement can be shared with the current API (v2). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't back-port this since it's a breaking change from a user's perspective: the render of the content in the tooltip will be completely different (see #8039 (comment))
Considerations | ||
`````````````` | ||
|
||
If a project changes its custom domain, current usage of the API would break. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could check Referer
/Origin
header as a fallback to try to find the project in case they changed the domain (sent via ?url=
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also save all the historic domains for a project. Instead of delete them, we can mark them as inactive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems very difficult to scale. What if another project reuses the domain?
The API could be missused. | ||
This is already true if we don't support external sites, | ||
since we host arbitrary HTML already. | ||
But it can be abussed to crawl external sites without the consent. | ||
We can integrate support for external sites in a later stage, | ||
or have a list of allowed sites. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand what's the problem or abuse we can face here.
I think we definitely don't have a list of allowed sites, here. We want this to be general/global.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would be exposed to SSRF, basically an attacker can use our servers to make get requests to other insecure sites or DDOS another site.
"sections": [ | ||
{ | ||
"title": "Installation", | ||
"id": "installation" | ||
}, | ||
{ | ||
"title": "Examples", | ||
"id": "examples" | ||
} | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use case for this field? Are we returning nested sections here or only the first level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All sections, nested as well, so the client knows what sections it can query, this is only from h elements, but I guess it could be expanded for more tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, why only show h
tags when the backend can reply to any anchor link? If we want expose all the possible queries we should return all the valid anchors.
In any case, I think we should add a request GET argument to expand this and avoid returning a lot of data when it's not going to be used. Example: ?expand=sections
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in another endpoint, so I don't think we need that feature here.
Mostly based on my notes from #7963.