Design doc: Embed APIv3 #8222

Merged on Jun 23, 2021 (19 commits).
290 changes: 290 additions & 0 deletions docs/development/design/embed-api.rst
Embed APIv3
===========

The Embed API allows users to embed content from documentation pages in other sites.
It has been treated as an *experimental* feature without public documentation or real applications,
but recently it started to be used widely (mainly because we created the ``hoverxref`` Sphinx extension).

The main goal of this document is to design a new version of the Embed API to be more user friendly,
make it more stable over time, support embedding content from pages not hosted at Read the Docs,
and remove some quirkiness that makes it hard to maintain and difficult to use.

.. note::

   This work is part of the `CZI grant`_ that Read the Docs received.

.. _CZI grant: https://blog.readthedocs.com/czi-grant-announcement/

.. contents::
   :local:
   :depth: 2


Current implementation
----------------------

The current implementation of the API is partially documented in :doc:`/guides/embedding-content`.
It has some known problems:

* There are different ways of querying the API: ``?url=`` (generic) and ``?doc=`` (relies on a Sphinx-specific concept)
* Doesn't support MkDocs
* Lookups are slow (~500 ms)
* IDs returned aren't well formed (like empty IDs ``"headers": [{"title": "#"}]``)
* The content is always an array of one element
* It tries different variations of the original ID
* It doesn't return valid HTML for definition lists (``dd`` tags without a ``dt`` tag)


Goals
-----

We plan to add new features and define a contract that works the same for all HTML.
This project has the following goals:

* Support embedding content from pages hosted outside Read the Docs
* Do not depend on Sphinx ``.fjson`` files
* Query and parse the ``.html`` file directly (from our storage or from an external request)
* Rewrite all links returned in the content to make them absolute
* Require a valid HTML ``id`` selector
* Accept only ``?url=`` request GET argument to query the endpoint
* Support ``?nwords=`` and ``?nparagraphs=`` to return chunked content
* Handle special cases for particular doctools (e.g. Sphinx requires returning the ``.parent()`` element for ``dl``)
* Make it explicit that the client is asking to handle the special cases (e.g. send ``?doctool=sphinx&version=4.0.1&writer=html4``)
* Delete HTML tags from the original document (for well-defined special cases)
* Add HTTP cache headers to cache responses
* Allow :abbr:`CORS` from everywhere *only* for public projects


The contract
------------

Return the HTML tag (and its children) with the ``id`` selector requested
and replace all the relative links from its content making them absolute.

.. note::

   Any other case outside this contract will be considered *special* and will be implemented
   only under ``?doctool=``, ``?version=`` and ``?writer=`` arguments.

If no ``id`` selector is sent in the request, the content of the first meaningful HTML tag found
(``<main>``, ``<div role="main">`` or another well-defined standard tag) is returned.
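
The contract can be illustrated with a minimal sketch that uses only the Python standard library.
It is not the Read the Docs implementation: the names ``FragmentExtractor`` and ``extract_fragment`` are hypothetical,
the input is assumed to be well-formed HTML, and only ``href`` and ``src`` attributes are treated as links.

.. code:: python

   from html import escape
   from html.parser import HTMLParser
   from urllib.parse import urljoin

   # Tags that never have a closing tag in HTML.
   VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                "input", "link", "meta", "source", "track", "wbr"}
   # Assumption: these are the only attributes that carry links.
   URL_ATTRS = {"href", "src"}


   class FragmentExtractor(HTMLParser):
       """Re-serialize the element with the requested ``id`` and its children."""

       def __init__(self, fragment_id, base_url):
           super().__init__(convert_charrefs=True)
           self.fragment_id = fragment_id
           self.base_url = base_url
           self.depth = 0     # > 0 while inside the requested element
           self.done = False  # True once the requested element is complete
           self.parts = []

       def handle_starttag(self, tag, attrs):
           if self.done:
               return
           if self.depth == 0 and dict(attrs).get("id") != self.fragment_id:
               return
           rendered = []
           for name, value in attrs:
               if value is None:
                   rendered.append(f" {name}")
                   continue
               if name in URL_ATTRS:
                   # Make relative links absolute against the page URL.
                   value = urljoin(self.base_url, value)
               rendered.append(f' {name}="{escape(value, quote=True)}"')
           self.parts.append(f"<{tag}{''.join(rendered)}>")
           if tag in VOID_TAGS:
               if self.depth == 0:
                   self.done = True  # the requested element itself is void
           else:
               self.depth += 1

       def handle_endtag(self, tag):
           if self.depth == 0 or tag in VOID_TAGS:
               return
           self.parts.append(f"</{tag}>")
           self.depth -= 1
           if self.depth == 0:
               self.done = True

       def handle_data(self, data):
           if self.depth > 0:
               self.parts.append(escape(data, quote=False))


   def extract_fragment(html, fragment_id, base_url):
       """Return the HTML for ``fragment_id`` (a string), or None if not found."""
       parser = FragmentExtractor(fragment_id, base_url)
       parser.feed(html)
       parser.close()
       return "".join(parser.parts) or None

For example, ``extract_fragment(page_html, "set-up-your-environment", page_url)`` would return the matching tag
with every relative ``href`` and ``src`` resolved against ``page_url``.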


Embed endpoints
---------------

This is the list of endpoints to be implemented in APIv3:

.. http:get:: /api/v3/embed/

   Returns the exact HTML content for a specific identifier (``id``).
   If no anchor identifier is specified, the content of the first meaningful HTML tag found is returned.

   **Example request**:

   .. tabs::

      .. code-tab:: bash

         $ curl https://readthedocs.org/api/v3/embed/?url=https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment

   **Example response**:

   .. sourcecode:: json

      {
          "project": "docs",
          "version": "latest",
          "language": "en",
          "path": "development/install.html",
          "title": "Development Installation",
          "url": "https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment",
          "id": "set-up-your-environment",
          "content": "<div class=\"section\" id=\"development-installation\">\n<h1>Development Installation<a class=\"headerlink\" href=\"https://docs.readthedocs.io/en/stable/development/install.html#development-installation\" title=\"Permalink to this headline\">¶</a></h1>\n ..."
      }

   :query url (required): Full URL for the documentation page with optional anchor identifier.
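
One detail a client has to handle: the example request above passes the anchor as a literal ``#``.
HTTP clients do not send the fragment of a URL to the server, so the anchor would be dropped unless the ``url`` value is percent-encoded.
A minimal sketch of building the request URL follows; the helper name ``build_embed_url`` is hypothetical,
and whether the final API expects exactly this encoding is an assumption.

.. code:: python

   from urllib.parse import urlencode

   # Endpoint taken from the example request in this document.
   EMBED_ENDPOINT = "https://readthedocs.org/api/v3/embed/"


   def build_embed_url(page_url, endpoint=EMBED_ENDPOINT):
       """Return the Embed API URL for ``page_url``, with the anchor percent-encoded."""
       # urlencode() escapes "#" as "%23", so the anchor travels inside the query string.
       return f"{endpoint}?{urlencode({'url': page_url})}"

The resulting URL can be passed to ``curl`` or any HTTP client without the anchor being stripped.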


.. http:get:: /api/v3/embed/metadata/

   Returns all the available metadata for a specific page.

   .. note::

      As it's not trivial to get the ``title`` associated with a particular ``id`` and it's not easy to get a nested list of identifiers,
      we may not implement this endpoint in the initial version.

      As designed, the endpoint is mainly useful to explore and discover which identifiers are available for a particular page,
      which is handy when developing a new tool that consumes the API.
      Because of this, we don't have a strong reason to add it in the initial version.

   **Example request**:

   .. tabs::

      .. code-tab:: bash

         $ curl https://readthedocs.org/api/v3/embed/metadata/?url=https://docs.readthedocs.io/en/latest/development/install.html

   **Example response**:

   .. sourcecode:: json

      {
          "identifiers": [
              {
                  "id": "set-up-your-environment",
                  "url": "https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment",
                  "_links": {
                      "embed": "https://docs.readthedocs.io/_/api/v3/embed/?url=https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment"
                  }
              },
              {
                  "id": "check-that-everything-works",
                  "url": "https://docs.readthedocs.io/en/latest/development/install.html#check-that-everything-works",
                  "_links": {
                      "embed": "https://docs.readthedocs.io/_/api/v3/embed/?url=https://docs.readthedocs.io/en/latest/development/install.html#check-that-everything-works"
                  }
              }
          ]
      }

   :query url (required): Full URL for the documentation page
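
As a rough sketch of what this endpoint could compute, the snippet below collects every ``id`` attribute of a page in document order
and builds a flat list using the keys from the example response.
It deliberately skips the hard parts mentioned in the note (titles and nesting).
The names ``IdCollector`` and ``build_metadata`` are hypothetical, and percent-encoding the ``embed`` link is an assumption.

.. code:: python

   from html.parser import HTMLParser
   from urllib.parse import quote, urldefrag


   class IdCollector(HTMLParser):
       """Collect the value of every non-empty ``id`` attribute, in document order."""

       def __init__(self):
           super().__init__()
           self.ids = []

       def handle_starttag(self, tag, attrs):
           for name, value in attrs:
               if name == "id" and value:
                   self.ids.append(value)


   def build_metadata(html, page_url,
                      embed_endpoint="https://docs.readthedocs.io/_/api/v3/embed/"):
       """Return a flat ``identifiers`` list for the page at ``page_url``."""
       page_url, _ = urldefrag(page_url)  # drop any anchor from the page URL
       collector = IdCollector()
       collector.feed(html)
       collector.close()
       identifiers = []
       for identifier in collector.ids:
           url = f"{page_url}#{identifier}"
           identifiers.append({
               "id": identifier,
               "url": url,
               # Percent-encode the target so its "#" survives as part of ``?url=``.
               "_links": {"embed": f"{embed_endpoint}?url={quote(url, safe='')}"},
           })
       return {"identifiers": identifiers}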


Handle specific Sphinx cases
----------------------------

.. https://github.com/readthedocs/readthedocs.org/pull/8039#discussion_r640670085

We currently handle some special cases for Sphinx due to how it structures its HTML output.
In some cases, we look for the HTML tag with the requested identifier but return
the ``.next()`` HTML tag or the ``.parent()`` tag instead of the *requested one*.

So far, we have identified that this happens for definition tags (``dl``, ``dt``, ``dd``),
but there may be other cases we don't know about yet.
Sphinx adds the ``id=`` attribute to the ``dt`` tag, which contains only the title of the definition,
but users expect to get its description as well.

In the following example we will return the whole ``dl`` HTML tag instead of
the HTML tag with the identifier ``id="term-name"`` as requested by the client,
because otherwise the "Term definition for Term Name" content won't be included and the response would be useless.

.. code:: html

   <dl class="glossary docutils">
     <dt id="term-name">Term Name</dt>
     <dd>Term definition for Term Name</dd>
   </dl>

If the definition list (``dl``) has more than *one definition*, it will return **only the term requested**.
Consider the following example, with the request ``?url=glossary.html#term-name``:

.. code:: html

   <dl class="glossary docutils">
     ...

     <dt id="term-name">Term Name</dt>
     <dd>Term definition for Term Name</dd>

     <dt id="term-unknown">Term Unknown</dt>
     <dd>Term definition for Term Unknown</dd>

     ...
   </dl>


It will return the whole ``dl`` with only the ``dt`` and ``dd`` for the requested ``id``:

.. code:: html

   <dl class="glossary docutils">
     <dt id="term-name">Term Name</dt>
     <dd>Term definition for Term Name</dd>
   </dl>
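
The filtering described above can be sketched as follows.
This is only an illustration: ``extract_definition`` is a hypothetical name, the snippet is parsed with ``xml.etree.ElementTree``
(so it assumes the ``dl`` markup is well-formed XML, which real pages may not be),
and it only covers the simple case of one ``dt`` followed by its ``dd`` tags.

.. code:: python

   import xml.etree.ElementTree as ET


   def extract_definition(dl_html, term_id):
       """Return the ``dl`` with only the ``dt``/``dd`` for ``term_id``, or None."""
       dl = ET.fromstring(dl_html)
       # New wrapper that keeps the original tag name and attributes.
       result = ET.Element(dl.tag, dl.attrib)
       collecting = False
       for child in dl:
           if child.tag == "dt":
               # Start collecting at the requested term, stop at the next term.
               collecting = child.get("id") == term_id
           if collecting:
               result.append(child)
       if len(result) == 0:
           return None
       return ET.tostring(result, encoding="unicode", method="html")

A production implementation would more likely operate on the tree produced by an HTML parser and walk up from the ``dt`` to its parent ``dl``.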


However, these assumptions may not apply to documentation pages built with a doctool other than Sphinx.
For this reason, we need to tell the API that we want the backend to handle these special cases.
This will be done by appending request GET arguments to the Embed API endpoint: ``?doctool=sphinx&version=4.0.1&writer=html4``.
In this case, the backend will know that it has to deal with these special cases.

.. note::

   This leaves the door open to supporting more special cases (e.g. for other doctools) without breaking the current behavior.

Review comment (Member): This seems like a better solution than doing it for all requests, and gives us a way to deprecate things 👍



Support for external documents
------------------------------

When the ``?url=`` argument passed belongs to a documentation page not hosted on Read the Docs,
the endpoint will do an external request to download the HTML file,
parse it and return the content for the identifier requested.

The whole logic should be the same, the only difference would be where the source HTML comes from.

.. warning::

   We should be careful with the URL received from the user because it may be an internal URL and we could leak data.
   Example: ``?url=http://localhost/some-weird-endpoint`` or ``?url=http://169.254.169.254/latest/meta-data/``
   (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html).

   This is related to SSRF (https://en.wikipedia.org/wiki/Server-side_request_forgery).
   It doesn't seem to be a huge problem, but it is something to consider.

   Also, the endpoint may need to limit the requests per external domain to avoid our servers being used to take down another site.

.. note::

   Due to the potential security issues mentioned, we will start with an allowed list of domains for common Sphinx docs projects,
   like Django and Python, that ``sphinx-hoverxref`` users might commonly want to embed from.
   We aren't planning to allow arbitrary HTML from any website.
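
A minimal sketch of the allowed-list check could look like this.
The domain names in ``ALLOWED_EXTERNAL_DOMAINS`` and the function name ``is_allowed_external_url`` are illustrative assumptions, not a list decided in this document.
Exact hostname matching rejects ``localhost`` and IP literals such as ``169.254.169.254`` without any extra checks.

.. code:: python

   from urllib.parse import urlsplit

   # Hypothetical allowed list; the real one is not defined in this document.
   ALLOWED_EXTERNAL_DOMAINS = frozenset({"docs.python.org", "docs.djangoproject.com"})


   def is_allowed_external_url(url, allowed_domains=ALLOWED_EXTERNAL_DOMAINS):
       """Return True only for http(s) URLs whose host is exactly in the allowed list."""
       try:
           parts = urlsplit(url)
           host = parts.hostname
       except ValueError:
           # Malformed URL (e.g. a broken IPv6 literal).
           return False
       if parts.scheme not in ("http", "https"):
           return False
       # Exact match: subdomains and look-alike hosts are rejected.
       return host is not None and host.lower() in allowed_domains

If the external request follows redirects, each redirect target would need to pass the same check,
otherwise an allowed domain could be used to bounce the request to an internal address.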


Handle project's domain changes
-------------------------------

The proposed Embed APIv3 implementation only allows the ``?url=`` argument to embed content from that page.
That URL can be:

* a URL for a project hosted under ``<project-slug>.readthedocs.io``
* a URL for a project with a custom domain

In the first case, we can easily get the project's slug directly from the URL.
However, in the second case we get the project's slug by querying our database for a ``Domain`` object
with the full domain from the URL.

Now, consider that all the links in the documentation page that uses Embed APIv3 are pointing to
``docs.example.com`` and the author decides to change the domain to be ``docs.newdomain.com``.
At this point there are different possible scenarios:

* The user creates a new ``Domain`` object with ``docs.newdomain.com`` as the domain's name.
  In this case, old links will keep working because we still have the old ``Domain`` object in our database
  and we can use it to get the project's slug.
* The user *deletes* the old ``Domain`` in addition to creating the new one.
  In this scenario, our database query for a ``Domain`` with name ``docs.example.com`` will fail.
  We will need to make a request to ``docs.example.com`` and check for a 3xx response status code. In that case,
  we can read the ``Location:`` HTTP header to find the new domain name for the documentation.
  Once we have the new domain from the redirect response, we can query our database again to find out the project's slug.

.. note::

   We will follow up to 5 redirects to find out the project's domain.
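
The lookup-then-redirect flow can be sketched as below.
The database query and the HTTP request are passed in as callables so the logic stays testable.
``resolve_project_slug``, ``find_slug_by_domain`` and ``fetch_redirect_location`` are hypothetical names, not existing Read the Docs functions.

.. code:: python

   from urllib.parse import urljoin, urlsplit

   MAX_REDIRECTS = 5


   def resolve_project_slug(url, find_slug_by_domain, fetch_redirect_location,
                            max_redirects=MAX_REDIRECTS):
       """Return the project slug for ``url``, following redirects if needed.

       ``find_slug_by_domain(domain)`` returns a slug or None (e.g. a ``Domain`` query).
       ``fetch_redirect_location(url)`` returns the ``Location`` header of a 3xx
       response, or None when the response is not a redirect.
       """
       redirects_followed = 0
       while True:
           slug = find_slug_by_domain(urlsplit(url).hostname)
           if slug is not None:
               return slug
           if redirects_followed == max_redirects:
               return None  # give up after the configured number of redirects
           location = fetch_redirect_location(url)
           if location is None:
               return None  # unknown domain and no redirect to follow
           # ``Location`` may be relative, so resolve it against the current URL.
           url = urljoin(url, location)
           redirects_followed += 1

With a fake ``Domain`` table containing only ``docs.newdomain.com`` and a redirect from ``docs.example.com`` to it,
the function returns the slug of the renamed project.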


Embed APIv2 deprecation
-----------------------

APIv2 is currently widely used by projects using the ``sphinx-hoverxref`` extension.
Because of that, we need to keep supporting it as-is for a long time.

The next steps in this direction should be:

* Add a note in the documentation mentioning this endpoint is deprecated
* Promote the usage of the new Embed APIv3
* Migrate the ``sphinx-hoverxref`` extension to use the new endpoint

Review comment (Contributor): If we are specifying deprecation in this document, then the use case for embedding Sphinx refs should also be mentioned here. Deprecation should depend on surfacing an API to expose documentation refs. We need to have a discussion about this end point and how it is implemented/etc next, but that can be separate.


Once these steps are done, we could check our NGINX logs to find out whether people are still using APIv2,
contact them and let them know that they have some months to migrate, since the endpoint is deprecated and will be removed.


Unanswered questions
--------------------

* How do we distinguish our APIv3 for resources (models in the database) from these "feature API endpoints"?

Review comment (Member): Definitely a good question.

Review comment (Member Author): Does it make sense to switch the version and API name so we have the version after its name? That way we can release a new version of a particular API without touching the others.

  • /api/search/v3/
  • /api/footer/v3/
  • /api/resources/v2/ and /api/resources/v3/
  • /api/embed/v3/
  • etc

cc @stsewd @ericholscher

Review comment (Member): That naming feels weird to me, I think we just need to document which endpoints are part of what.

Review comment (Member Author): @stsewd can you expand how that would be and give some examples?

Review comment (Member): This is already done in part for search (see https://docs.readthedocs.io/en/stable/server-side-search.html#api). We just need to do the same with the Embed API and the footer.

Review comment (Member Author): @stsewd I don't follow you. That search endpoint is under /api/v2/, so we are not differentiating the "Feature APIs" from the "Resources API" there.

print(stsewd.dump("brain").verbose())

Review comment (Member): We want to make this distinction in our docs, not in the URLs.