Embed: design doc for new embed API #8039

Closed
wants to merge 10 commits into from
154 changes: 154 additions & 0 deletions docs/development/design/embed-api.rst
@@ -0,0 +1,154 @@
Embed API
=========

The embed API allows embedding content from docs pages in other sites.
For a while it has been an *experimental* feature without public documentation or real applications,
but recently it has become widely used (mainly because we created a Sphinx extension).

Because of this, we need an API that is friendlier to use,
and general and stable enough to be supported for a long time.

.. contents::
   :local:
   :depth: 3

Current implementation
----------------------

The current implementation of the API is partially documented in :doc:`/guides/embedding-content`.
Some characteristics/problems are:

- There are three ways of querying the API, and some rely on Sphinx's concepts like ``doc``.
- Doesn't cache responses or doesn't purge the cache on build.
- Doesn't support MkDocs.
- It returns all sections from the current page.
Member

You can send section= to get only the section you want.

Member Author

What I mean here is that it always returns all the sections (title and id), not the section content.

- Lookups are slow (~500 ms).
- IDs returned aren't well formed (like empty IDs ``#``).
- The content is always an array of one element.
- The section can be given as an identifier, as any of four other variants, or as the title of the section.
- It doesn't return valid HTML for definition lists (``dd`` tags without a ``dt`` tag).
- The client doesn't know if the page requires extra JS or CSS in order to make it work or look nice.

Improvements
------------

These improvements aren't breaking changes, so we can implement them in the old and new API.

- Support for MkDocs.
- Always return a valid/well formed HTML block.
Member

If this refers to the dd/dt issue, I'm not sure if this will be easily doable for all the cases. We may end up re-parsing the dd to convert that HTML tag to another one, but that will break the style based on dd tags. So, this will be tricky to do it right and make it work for all the themes --which is also related with the new extras field.

Member Author

We just need to extract only that element. For example

<dl class="foo">
 .. more..
 <dt>foo</dt> <dd>bar</dd>
.. more...
</dl>

becomes

<dl class="foo">
 <dt>foo</dt> <dd>bar</dd>
</dl>

Extracting content isn't hard to do here.
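
A minimal sketch of the extraction described in this comment, assuming the ``dl`` fragment is well formed enough to be parsed as XML and that the term carries the ``id`` (the ``id`` attributes and the function name ``extract_definition`` are illustrative, not part of the proposal):

```python
import xml.etree.ElementTree as ET


def extract_definition(dl_html: str, target_id: str) -> str:
    """Return a <dl> holding only the <dt> with ``target_id`` and its <dd>."""
    dl = ET.fromstring(dl_html)
    children = list(dl)
    for index, child in enumerate(children):
        if child.tag == "dt" and child.get("id") == target_id:
            # Keep the attributes of the original list (e.g. its classes).
            trimmed = ET.Element("dl", dl.attrib)
            trimmed.append(child)
            # The description is the <dd> right after the term, if there is one.
            if index + 1 < len(children) and children[index + 1].tag == "dd":
                trimmed.append(children[index + 1])
            return ET.tostring(trimmed, encoding="unicode")
    raise LookupError(f"no <dt> with id {target_id!r}")


SAMPLE = (
    '<dl class="foo">'
    '<dt id="a">foo</dt><dd>bar</dd>'
    '<dt id="b">baz</dt><dd>qux</dd>'
    "</dl>"
)
trimmed_html = extract_definition(SAMPLE, "a")
```

Real pages are rarely valid XML, so an actual implementation would more likely use an HTML parser; the sketch only shows the shape of the transformation.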

Member

Hrm... I think that has the same problem.

When parsing citation/glossary we don't need to return the title/definition (dt), we only want the description (dd). Example in code

# Structure:
# <dl class="glossary docutils">
# <dt id="term-definition">definition</dt>
# <dd>Text definition for the term</dd>
# ...
# </dl>
query_result = query_result.next()

Following, your example, to make it valid it should be:

<dl class="foo">
 <dt></dt>
 <dd>bar</dd>
</dl>

However, this probably won't render correctly inside the tooltip, since in that context it is not a description list --it's not a list at all-- but just the definition of one of the terms from that list.

In those cases, citation/glossary entries will be rendered in the same way as any other definition list and I'm not sure we want that. We need to think about how we will support this. Example:

Glossary (only includes the definition)

Screenshot_2021-03-30_11-26-04

Config value (includes the definition list completely repeating the term hovered)

Screenshot_2021-03-30_11-23-06

Member Author

I think we should repeat the title, since it doesn't necessarily match the text where the tooltip is present. Removing the title when it's the same as the text can be done on the client side.

Member (@humitos, May 27, 2021)

Glossary and Citation special cases can be managed from the frontend/client since we have all the data we need (note that we return the .next() HTML element). We will just need to parse it and remove the content we don't want to display.

However, for the dt case (definition of a config, Python method/function, etc) it won't work because we need the .parent() of HTML element picked by ID. Take a look at this example:

<dl class="std confval">
  <dt id="confval-hoverxref_role_types">
    <code class="sig-name descname">hoverxref_role_types</code>
    <a class="headerlink" href="#confval-hoverxref_role_types" title="Permalink to this definition"></a>
  </dt>
  <dd>
    <p>Description: Style to use by default when hover each type of reference (role).</p>
    <!-- ... snip ... -->
  </dd>
</dl>

(example from https://sphinx-hoverxref.readthedocs.io/en/latest/configuration.html#confval-hoverxref_role_types)

With the id confval-hoverxref_role_types we are currently returning the whole dl because we access the .parent() element. If we return just the HTML tag matched, we will be returning only the dt, which includes neither the description nor any other content.

Without handling this special case properly, we will show tooltips like:

Screenshot_2021-05-27_16-16-41

We need to find a general solution for these special cases.

Member Author

@humitos not sure to understand the problem there? that's the same example we have in the document. The id is on the dt element, so we fetch the dt and the next dd enclosed in the dl tag.

Member

I may not have expressed myself properly.

This is a special case because we are manipulating the HTML to manage a particular case on Sphinx: return the parent of a particular ID, instead of itself. Since we are going to support external sites that will not be generated with Sphinx (*), we can't assume that we always have to do this special HTML manipulation. That's why I'm saying that we need a general solution to differentiate these special cases from regular (non-Sphinx) pages.

I'm thinking that we will need to communicate this to the backend somehow. Maybe a request GET argument: ?doctool=sphinx&version=3.5.1

(*) or even with a different version of Sphinx that may add the id= to the dl.

Member Author

that's part of how we expect the html to be, following semantic html and using correct tags, if a site has malformed tags or if the id is in a different element we don't try to fix it, we just return it as is.


New API
-------

The API would be split into two endpoints, and only have one way of querying the API.

Get page
--------

Allows us to query information about a page, like its list of sections.

.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html
Member

If we will have only one way of querying the API, we should probably use ?url= instead of sending multiple attributes. The implementation of the extensions is easier (and there are things not possible to implement without it --e.g. intersphinx) and the usage of the endpoint will be simplified too.

Member Author

Using a URL makes it depend on the domain, if the domain changes everything is broken. I'm fine supporting both to support intersphinx.

Member

Suggested change
- .. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html
+ .. http:get:: /_/api/v3/embed/page/?project=docs&version=latest&path=install.html

Member

I'm also not sure if we want to make this a v3 API, given that it works totally differently than the others :/

Member Author

> I'm also not sure if we want to make this a v3 API, given that it works totally differently than the others :/

That's why my note in #8039 (comment)

Member

I don't think this is super bad. The other endpoints are for resources and this one is not.

However, I'm starting to think that this could be a different service as well? But maybe that's too much? 🤔

Member

> Using a URL makes it depend on the domain, if the domain changes everything is broken

Yes and no. Maybe (?), I guess. If they use the https://readthedocs.org/api/v3/embed/?url= endpoint, it shouldn't break as long as they don't remove the Domain for that Project from our platform, even if the DNS for that domain is dead.
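
To make the trade-off in this thread concrete, here is a rough sketch of how a ``url`` argument could be reduced to the same ``project``/``version``/``path`` triple, so that both query styles share one cache key. It assumes the ``<project>.readthedocs.io/<lang>/<version>/<path>`` layout seen in the example response; custom domains would need a lookup on the platform side instead, and the function name is hypothetical:

```python
from urllib.parse import urlsplit


def params_from_url(url: str) -> dict:
    """Map a docs URL to project/version/path (assumed subdomain layout only)."""
    parts = urlsplit(url)
    # Assumption: the project slug is the first label of the hostname.
    project = parts.hostname.split(".")[0]
    # Assumption: the URL path is /<lang>/<version>/<path inside the docs>.
    _lang, version, *rest = parts.path.strip("/").split("/")
    path = "/".join(rest)
    if parts.fragment:
        path += "#" + parts.fragment
    return {"project": project, "version": version, "path": path}


def cache_key(params: dict) -> tuple:
    """Build one canonical key regardless of how the request was expressed."""
    return (params["project"], params["version"], params["path"])


from_url = params_from_url("https://docs.readthedocs.io/en/latest/install.html#examples")
from_args = {"project": "docs", "version": "latest", "path": "install.html#examples"}
```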


   :query project: (required)
   :query version: (required)
   :query path: (required)

.. sourcecode:: json

   {
       "project": "docs",
       "version": "latest",
       "path": "install.html",
       "title": "Installation Guide",
       "url": "https://docs.readthedocs.io/en/latest/install.html",
       "sections": [
           {
               "title": "Installation",
               "id": "installation"
           },
           {
               "title": "Examples",
               "id": "examples"
           }
       ],
Member

I like the idea of listing all the sections. I'd also like to see a way to explore/surf the API just by clicking these links, similar as we do with the _links field in other endpoints. This way, you can discover the API by clicking on those, but also expose these links to developers.

       "extras": {
           "js": ["https://docs.readthedocs.io/en/latest/index.js"],
           "css": ["https://docs.readthedocs.io/en/latest/index.css"]
       }
   }
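
As an illustration of where the ``sections`` list in this response could come from, the sketch below collects every heading whose nearest preceding element carries a non-empty ``id``. It uses only the standard library; the class name ``SectionCollector`` and the sample markup are assumptions, not the parsing code the project would actually use:

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Collect {"title", "id"} pairs for headings that follow an element with an id."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = []
        self._pending_id = None
        self._in_heading = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:  # empty ids are ignored
            self._pending_id = element_id
        if tag in self.HEADINGS and self._pending_id:
            self._in_heading = True
            self._title_parts = []

    def handle_data(self, data):
        if self._in_heading:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            title = "".join(self._title_parts).strip()
            self.sections.append({"title": title, "id": self._pending_id})
            self._pending_id = None
            self._in_heading = False


PAGE = (
    '<div class="section" id="installation">'
    '<h1>Installation<a class="headerlink" href="#installation"></a></h1>'
    "<p>Some text</p>"
    '<div class="section" id="examples"><h2>Examples</h2></div>'
    "</div>"
)
collector = SectionCollector()
collector.feed(PAGE)
sections = collector.sections
```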

Get section
-----------

Allows us to query the content of a section, with all links re-written as absolute.

.. http:get:: /_/api/v3/embed/sections?project=docs&version=latest&path=install.html#examples
Member

Suggested change
- .. http:get:: /_/api/v3/embed/sections?project=docs&version=latest&path=install.html#examples
+ .. http:get:: /_/api/v3/embed/section/?project=docs&version=latest&path=install.html#examples

Member Author

I'm not sure about the naming conventions here. We are naming a resource, so it should be plural, but we only have a single resource to get the data from.


   :query project: (required)
   :query version: (required)
   :query path: Path with or without fragment (required)

.. sourcecode:: json

   {
       "project": "docs",
       "version": "latest",
       "path": "install.html",
       "url": "https://docs.readthedocs.io/en/latest/install.html#examples",
       "id": "examples",
       "title": "Examples",
       "content": "<div>I'm a html block!</div>",
       "extras": {
           "js": ["https://docs.readthedocs.io/en/latest/index.js"],
           "css": ["https://docs.readthedocs.io/en/latest/index.css"]
       }
   }
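
One detail worth noting when ``path`` may carry a fragment: a literal ``#`` in a request URL starts the URL fragment, which clients do not send to the server, so the value has to be percent-encoded. The sketch below shows the encoding on the client side and the split on the server side; the helper names are hypothetical:

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_section_url(project: str, version: str, path: str) -> str:
    """Build the request path, percent-encoding the '#' inside ``path``."""
    query = urlencode({"project": project, "version": version, "path": path})
    return f"/_/api/v3/embed/sections?{query}"


def split_path(path: str) -> tuple:
    """Split 'install.html#examples' into ('install.html', 'examples')."""
    page, _, fragment = path.partition("#")
    return page, fragment or None


request_url = build_section_url("docs", "latest", "install.html#examples")
# What the server would read back from the query string.
received_path = parse_qs(urlsplit(request_url).query)["path"][0]

# Without encoding, the fragment is not part of the query at all.
unencoded = urlsplit(
    "/_/api/v3/embed/sections?project=docs&version=latest&path=install.html#examples"
)
```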

Notes
-----

- If a section or page doesn't exist, we return 404.
- All links are re-written to be absolute (this is already done).
- All sections listed come from HTML tags that are linkable, that is, tags that have an ``id``
  (we don't rely on the toctree from the fjson file anymore).
- The IDs returned don't contain the redundant ``#`` symbol.
- The content is a string with a well-formed HTML block.
- We could also support only ``url`` as argument for ``/sections`` and ``/pages``,
  but this introduces another way of querying the API.
  Having two ways of querying the API makes it *less-cacheable*.
- Returning the extra JS and CSS requires parsing the HTML page itself,
  rather than only the content extracted from the fjson files (this applies to Sphinx).
  We can use both the HTML file and the JSON file, but we could also just start parsing the full HTML page
  (we can re-use code from the search parsing to detect the main content).
- ``extras`` could be returned only on ``/pages``, or only on ``/sections``.
  It makes more sense to return it only on ``/pages``,
  but then querying a section would also require querying a page to get the extra JS/CSS files.
- We could omit the ``title`` of the page/section, as returning it requires more parsing
  (but we can re-use the code from search).
  Titles can be useful to build a UI like https://readthedocs.org/projects/docs/tools/embed/.
Member

How would a user build something like this without the ability to also get all the pages in a project? We aren't exposing an API for that in this design doc. Should we?

Member Author

Hmm, good point, we could have an endpoint for that, but now that you mentioned we need to support external sites, I don't think it would be possible. Or it would be rtd only feature.

- MkDocs support can be added easily as we make our parsing code more general.
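
Two of the notes above (absolute links, and IDs without the redundant ``#``) can be sketched with the standard library. This is only an illustration under the assumption that links are resolved against the page URL; the function names are hypothetical:

```python
from typing import Optional
from urllib.parse import urljoin

PAGE_URL = "https://docs.readthedocs.io/en/latest/install.html"


def absolutize(page_url: str, href: str) -> str:
    """Resolve a link found in the page against the page's own URL."""
    return urljoin(page_url, href)


def normalize_id(raw_id: str) -> Optional[str]:
    """Drop the leading '#' and treat an empty identifier as missing."""
    cleaned = raw_id.lstrip("#")
    return cleaned or None


same_page = absolutize(PAGE_URL, "#examples")
sibling = absolutize(PAGE_URL, "config.html")
external = absolutize(PAGE_URL, "https://example.com/other.html")
```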

.. note::

   We should probably make a distinction between our general API that handles Read the Docs resources,
   vs our APIs that expose features (like server side search, footer, and embed, all of them proxied).
   This way we can version each endpoint separately.

Deprecation
-----------

We should document the embed API in a dedicated section of our docs instead of in a guide.
There we can list v2 as deprecated.
We would need to migrate our extension as well.
Most of the parsing code could be shared between the two APIs, so it shouldn't be a burden to maintain.

API Client
----------

Do we really need a JS client?
Member

I think so. IIRC @agjohnson idea here was to include the popup/modal form of this into the JS client. Besides, we could eventually be able to use this JS client from our extension itself, instead of maintaining the front-end in our own extension (or adding it as another popup extension together with the ones we currently support)

Member

Yea, the JS client is going to be doing a lot of work in addition to just doing the GET. We definitely want nice default widgets, along with helper methods that make this a much nicer UX for users.

The API client is a JS script that allows users to use our API on any page.
Using the fetch and DOM APIs should be easy enough to make this work.
Having a guide on how to use it would be better than having to maintain and publish a JS package.

Most users would use the embed API in their docs in the form of an extension (like sphinx-hoverxref).
Users using the API on other pages would probably have sufficient knowledge to use the fetch and DOM APIs.