Merge pull request #8222 from readthedocs/humitos/design-doc-embed-api
humitos authored Jun 23, 2021
2 parents dacfcf7 + b34fce0 commit c86f18a
Showing 1 changed file with 290 additions and 0 deletions.
290 changes: 290 additions & 0 deletions docs/development/design/embed-api.rst
Embed APIv3
===========

The Embed API allows users to embed content from documentation pages in other sites.
It has been treated as an *experimental* feature without public documentation or real applications,
but recently it started to be used widely (mainly because we created the ``hoverxref`` Sphinx extension).

The main goal of this document is to design a new version of the Embed API to be more user friendly,
make it more stable over time, support embedding content from pages not hosted at Read the Docs,
and remove some quirkiness that makes it hard to maintain and difficult to use.

.. note::

   This work is part of the `CZI grant`_ that Read the Docs received.

.. _CZI grant: https://blog.readthedocs.com/czi-grant-announcement/

.. contents::
   :local:
   :depth: 2


Current implementation
----------------------

The current implementation of the API is partially documented in :doc:`/guides/embedding-content`.
It has some known problems:

* There are different ways of querying the API: ``?url=`` (generic) and ``?doc=`` (relies on a Sphinx-specific concept)
* It doesn't support MkDocs
* Lookups are slow (~500 ms)
* The IDs returned aren't well formed (like empty IDs ``"headers": [{"title": "#"}]``)
* The content is always an array of one element
* It tries different variations of the original ID
* It doesn't return valid HTML for definition lists (``dd`` tags without a ``dt`` tag)


Goals
-----

We plan to add new features and define a contract that works the same for all HTML.
This project has the following goals:

* Support embedding content from pages hosted outside Read the Docs
* Do not depend on Sphinx ``.fjson`` files
* Query and parse the ``.html`` file directly (from our storage or from an external request)
* Rewrite all links returned in the content to make them absolute
* Require a valid HTML ``id`` selector
* Accept only the ``?url=`` GET argument to query the endpoint
* Support ``?nwords=`` and ``?nparagraphs=`` to return chunked content
* Handle special cases for particular doctools (e.g. Sphinx requires returning the ``.parent()`` element for ``dl``)
* Make it explicit that the client is asking to handle the special cases (e.g. send ``?doctool=sphinx&version=4.0.1&writer=html4``)
* Delete HTML tags from the original document (for well-defined special cases)
* Add HTTP cache headers to cache responses
* Allow :abbr:`CORS` from everywhere *only* for public projects


The contract
------------

Return the HTML tag (and its children) with the ``id`` selector requested
and rewrite all relative links in its content as absolute links.

.. note::

   Any other case outside this contract will be considered *special* and will be implemented
   only under the ``?doctool=``, ``?version=`` and ``?writer=`` arguments.

If no ``id`` selector is sent in the request, the content of the first meaningful HTML tag found
(``<main>``, ``<div role="main">`` or other well-defined standard tags) is returned.
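
The contract above can be illustrated with a minimal sketch.
It is not the proposed implementation: ``embed_content`` and ``URL_ATTRIBUTES`` are hypothetical names,
and the standard library XML parser stands in for a real HTML parser,
so it only works on well-formed markup.

.. code:: python

   import xml.etree.ElementTree as ET
   from urllib.parse import urljoin

   # Attributes that can carry a URL; this short list is an assumption.
   URL_ATTRIBUTES = ("href", "src")


   def embed_content(html, page_url, fragment):
       """Return the element whose ``id`` is ``fragment``, with absolute links.

       Hypothetical helper: assumes ``html`` is well-formed enough for an XML parser.
       """
       root = ET.fromstring(html)
       for element in root.iter():
           if element.get("id") == fragment:
               break
       else:
           # No element with the requested ``id`` was found.
           return None

       # Rewrite relative links in the element and all of its children.
       for node in element.iter():
           for attribute in URL_ATTRIBUTES:
               value = node.get(attribute)
               if value is not None:
                   node.set(attribute, urljoin(page_url, value))

       element.tail = None  # do not serialize text that follows the element
       return ET.tostring(element, encoding="unicode", method="html")

Resolving links against the URL of the page (not the site root) keeps links such as ``../api.html`` pointing to the right place.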


Embed endpoints
---------------

This is the list of endpoints to be implemented in APIv3:

.. http:get:: /api/v3/embed/

   Returns the exact HTML content for a specific identifier (``id``).
   If no anchor identifier is specified, the content of the first meaningful HTML tag is returned.

   **Example request**:

   .. tabs::

      $ curl https://readthedocs.org/api/v3/embed/?url=https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment

   **Example response**:

   .. sourcecode:: json

      {
        "project": "docs",
        "version": "latest",
        "language": "en",
        "path": "development/install.html",
        "title": "Development Installation",
        "url": "https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment",
        "id": "set-up-your-environment",
        "content": "<div class=\"section\" id=\"development-installation\">\n<h1>Development Installation<a class=\"headerlink\" href=\"https://docs.readthedocs.io/en/stable/development/install.html#development-installation\" title=\"Permalink to this headline\">¶</a></h1>\n ..."
      }

   :query url (required): Full URL for the documentation page with optional anchor identifier.
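
One detail worth noting for clients of this endpoint: HTTP clients do not send the ``#fragment`` part of a URL to the server,
so the value of ``?url=`` probably needs to be percent-encoded for the anchor identifier to reach the API.
The sketch below shows one way to build the request URL with the standard library;
``build_embed_request`` is a hypothetical helper, not part of any existing client.

.. code:: python

   from urllib.parse import urlencode

   # Endpoint proposed in this document.
   EMBED_ENDPOINT = "https://readthedocs.org/api/v3/embed/"


   def build_embed_request(page_url):
       """Return the Embed API URL for ``page_url``, keeping its anchor identifier."""
       # urlencode() percent-encodes ``#`` as ``%23`` so it stays in the query string.
       return EMBED_ENDPOINT + "?" + urlencode({"url": page_url})


   request_url = build_embed_request(
       "https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment"
   )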


.. http:get:: /api/v3/embed/metadata/

   Returns all the available metadata for a specific page.

   .. note::

      As it's not trivial to get the ``title`` associated with a particular ``id`` and it's not easy to get a nested list of identifiers,
      we may not implement this endpoint in the initial version.

      The endpoint as-is is mainly useful to explore and discover which identifiers are available for a particular page,
      which is handy when developing a new tool that consumes the API.
      Because of this, there is little incentive to add it in the initial version.

   **Example request**:

   .. tabs::

      $ curl https://readthedocs.org/api/v3/embed/metadata/?url=https://docs.readthedocs.io/en/latest/development/install.html

   **Example response**:

   .. sourcecode:: json

      {
        "identifiers": [
          {
            "id": "set-up-your-environment",
            "url": "https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment",
            "_links": {
              "embed": "https://docs.readthedocs.io/_/api/v3/embed/?url=https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment"
            }
          },
          {
            "id": "check-that-everything-works",
            "url": "https://docs.readthedocs.io/en/latest/development/install.html#check-that-everything-works",
            "_links": {
              "embed": "https://docs.readthedocs.io/_/api/v3/embed/?url=https://docs.readthedocs.io/en/latest/development/install.html#check-that-everything-works"
            }
          }
        ]
      }

   :query url (required): Full URL for the documentation page


Handle specific Sphinx cases
----------------------------

.. https://github.com/readthedocs/readthedocs.org/pull/8039#discussion_r640670085

We are currently handling some special cases for Sphinx due to how it structures its HTML output.
In some cases, we look for the HTML tag with the identifier requested but we return
the ``.next()`` HTML tag or the ``.parent()`` tag instead of the *requested one*.

Currently, we have identified that this happens for definition tags (``dl``, ``dt``, ``dd``),
but there may be other cases we don't know about yet.
Sphinx adds the ``id=`` attribute to the ``dt`` tag, which contains only the title of the definition,
but users expect the description as well.

In the following example we will return the whole ``dl`` HTML tag instead of
the HTML tag with the identifier ``id="term-name"`` as requested by the client,
because otherwise the "Term definition for Term Name" content won't be included and the response would be useless.

.. code:: html

   <dl class="glossary docutils">
     <dt id="term-name">Term Name</dt>
     <dd>Term definition for Term Name</dd>
   </dl>

If the definition list (``dl``) has more than *one definition*, it will return **only the term requested**.
Consider the following example, with the request ``?url=glossary.html#term-name``:

.. code:: html

   <dl class="glossary docutils">
     ...

     <dt id="term-name">Term Name</dt>
     <dd>Term definition for Term Name</dd>

     <dt id="term-unknown">Term Unknown</dt>
     <dd>Term definition for Term Unknown </dd>

     ...
   </dl>


It will return the whole ``dl`` with only the ``dt`` and ``dd`` for the requested ``id``:

.. code:: html

   <dl class="glossary docutils">
     <dt id="term-name">Term Name</dt>
     <dd>Term definition for Term Name</dd>
   </dl>


However, these assumptions may not apply to documentation pages built with a doctool other than Sphinx.
For this reason, we need to communicate to the API that we want to handle these special cases in the backend.
This will be done by appending GET arguments to the Embed API endpoint: ``?doctool=sphinx&version=4.0.1&writer=html4``.
In this case, the backend will know that it has to deal with these special cases.
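
The definition-list handling described in this section can be sketched as follows.
This is only an illustration under simplifying assumptions: ``extract_definition`` is a hypothetical helper,
it expects a single well-formed ``dl`` element, and it would only run when the client sends ``?doctool=sphinx``.

.. code:: python

   import xml.etree.ElementTree as ET


   def extract_definition(dl_html, term_id):
       """Return a ``dl`` holding only the ``dt`` with ``term_id`` and its ``dd`` tags.

       Hypothetical helper; assumes ``dl_html`` is a single well-formed ``dl`` element.
       """
       source = ET.fromstring(dl_html)
       # Keep the attributes of the original ``dl`` (e.g. its CSS classes).
       result = ET.Element("dl", dict(source.attrib))

       collecting = False
       for child in source:
           if child.tag == "dt":
               # A new term starts: collect it only if it is the requested one.
               collecting = child.get("id") == term_id
           if collecting:
               child.tail = None  # drop whitespace between tags
               result.append(child)

       if len(result) == 0:
           return None
       return ET.tostring(result, encoding="unicode", method="html")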

.. note::

   This leaves the door open to support more special cases (e.g. for other doctools) without breaking the current behavior.


Support for external documents
------------------------------

When the ``?url=`` argument passed belongs to a documentation page not hosted on Read the Docs,
the endpoint will do an external request to download the HTML file,
parse it and return the content for the identifier requested.

The whole logic should be the same; the only difference is where the source HTML comes from.

.. warning::

   We should be careful with the URL received from the user because it may be an internal URL and we could leak some data.
   Example: ``?url=http://localhost/some-weird-endpoint`` or ``?url=http://169.254.169.254/latest/meta-data/``
   (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html).

   This is related to SSRF (https://en.wikipedia.org/wiki/Server-side_request_forgery).
   It doesn't seem to be a huge problem, but it is something to consider.

   Also, the endpoint may need to limit the number of requests per external domain to avoid our servers being used to take down another site.

.. note::

   Due to the potential security issues mentioned, we will start with an allowed list of domains for common Sphinx docs projects,
   such as Django and Python, which ``sphinx-hoverxref`` users might commonly want to embed from.
   We aren't planning to allow arbitrary HTML from any website.
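
A minimal sketch of such a check is shown below.
The domains in ``ALLOWED_EXTERNAL_DOMAINS`` and the ``is_allowed_external_url`` helper are assumptions made for illustration;
the real list is not defined in this document.

.. code:: python

   import ipaddress
   from urllib.parse import urlsplit

   # Hypothetical allowed list; the real list of domains is not defined in this document.
   ALLOWED_EXTERNAL_DOMAINS = {"docs.python.org", "docs.djangoproject.com"}


   def is_allowed_external_url(url):
       """Return True only for http(s) URLs whose host is in the allowed list."""
       parts = urlsplit(url)
       if parts.scheme not in ("http", "https"):
           return False

       host = parts.hostname  # lowercased by urlsplit, None if missing
       if host is None:
           return False

       try:
           ipaddress.ip_address(host)
       except ValueError:
           pass  # not a literal IP address, continue with the domain check
       else:
           # Literal IP addresses (e.g. 169.254.169.254) are rejected outright.
           return False

       # Exact match only, so ``docs.python.org.evil.example`` is rejected.
       return host in ALLOWED_EXTERNAL_DOMAINS

A check like this only validates the URL received; redirects returned by the external site would need the same validation before being followed.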


Handle project's domain changes
-------------------------------

The proposed Embed APIv3 implementation only accepts the ``?url=`` argument to embed content from that page.
That URL can be:

* a URL for a project hosted under ``<project-slug>.readthedocs.io``
* a URL for a project with a custom domain

In the first case, we can easily get the project's slug directly from the URL.
However, in the second case we get the project's slug by querying our database for a ``Domain`` object
with the full domain from the URL.

Now, consider that all the links in the documentation page that uses Embed APIv3 are pointing to
``docs.example.com`` and the author decides to change the domain to be ``docs.newdomain.com``.
At this point there are different possible scenarios:

* The user creates a new ``Domain`` object with ``docs.newdomain.com`` as the domain's name.
  In this case, old links will keep working because we still have the old ``Domain`` object in our database
  and we can use it to get the project's slug.
* The user *deletes* the old ``Domain`` in addition to creating the new one.
  In this scenario, our database query for a ``Domain`` named ``docs.example.com`` will fail.
  We will need to make a request to ``docs.example.com`` and check for a 3xx response status code. In that case,
  we can read the ``Location:`` HTTP header to find the new domain's name for the documentation.
  Once we have the new domain from the redirect response, we can query our database again to find out the project's slug.

.. note::

   We will follow up to 5 redirects to find out the project's domain.
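
The lookup described above can be sketched as a small loop.
Everything here is hypothetical: ``find_slug`` stands in for the ``Domain`` query (or for reading the slug from a ``readthedocs.io`` URL)
and ``fetch_location`` stands in for the HTTP request that reads the ``Location:`` header.

.. code:: python

   from urllib.parse import urljoin, urlsplit

   MAX_REDIRECTS = 5


   def resolve_project_slug(url, find_slug, fetch_location):
       """Return the project slug for ``url``, following up to MAX_REDIRECTS redirects.

       ``find_slug(domain)`` returns a slug or None.
       ``fetch_location(url)`` returns the ``Location`` header of a 3xx response,
       or None when the response is not a redirect. Both are hypothetical callables.
       """
       redirects = 0
       while True:
           slug = find_slug(urlsplit(url).hostname)
           if slug is not None:
               return slug
           if redirects == MAX_REDIRECTS:
               return None  # give up instead of following redirects forever

           location = fetch_location(url)
           if location is None:
               return None  # unknown domain that does not redirect anywhere
           # ``Location`` may be relative, so resolve it against the current URL.
           url = urljoin(url, location)
           redirects += 1

Passing the two lookups as arguments keeps the loop testable without a database or network access.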


Embed APIv2 deprecation
-----------------------

APIv2 is currently widely used by projects using the ``sphinx-hoverxref`` extension.
Because of that, we need to keep supporting it as-is for a long time.

The next steps in this direction should be:

* Add a note in the documentation mentioning this endpoint is deprecated
* Promote the usage of the new Embed APIv3
* Migrate the ``sphinx-hoverxref`` extension to use the new endpoint

Once these steps are done, we could check our NGINX logs to find out whether people are still using APIv2,
contact them, and let them know that they have some months to migrate because the endpoint is deprecated and will be removed.


Unanswered questions
--------------------

* How do we distinguish our APIv3 for resources (models in the database) from these "feature API endpoints"?
