Design doc: Embed APIv3 #8222

Merged
merged 19 commits into from
Jun 23, 2021
211 changes: 211 additions & 0 deletions docs/development/design/embed-api.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,211 @@
Embed APIv3
===========

The Embed API allows embedding content from documentation pages in other sites.
It has been treated as an *experimental* feature without public documentation or real applications,
but recently it started to be used widely (mainly because we created a Sphinx extension).

The main goal of this document is to design a new version of the Embed API to be more user friendly,
make it more stable over time, support documentation pages not hosted at Read the Docs,
and remove some quirkiness that makes it hard to maintain and difficult to use.

.. note::

This work is part of the `CZI grant`_ that Read the Docs received.

.. _CZI grant: https://blog.readthedocs.com/czi-grant-announcement/

.. contents::
:local:
:depth: 2


Current implementation
----------------------

The current implementation of the API is partially documented in :doc:`/guides/embedding-content`.
It has some known problems:

* There are different ways of querying the API: ``?url=`` (generic) and ``?doc=`` (relies on a Sphinx-specific concept)
* Doesn't support MkDocs
* Lookups are slow (~500 ms)
* IDs returned aren't well formed (like empty IDs ``"headers": [{"title": "#"}]``)
* The content is always an array of one element
* It tries different variations of the original ID
* It doesn't return valid HTML for definition lists (``dd`` tags without a ``dt`` tag)


Goals
-----

Considering the problems mentioned in the previous section,
the inclusion of new features and the definition of a contract that works the same for all,
this document sets the following goals for the new version of this endpoint:

* Support external documents hosted outside Read the Docs
* Do not depend on Sphinx ``.fjson`` files
* Query and parse the ``.html`` file directly (from our storage or from an external request)
* Rewrite all links returned in the content to make them absolute
* Always return valid HTML structure
* Delete HTML tags from the original document if needed
* Support ``?nwords=`` and ``?nparagraphs=`` to return chunked content
Member

I think we were still discussing this. I don't think we can support it without having to handle specific cases: not all returned HTML will have p tags or the word will end in a valid tag. We also need to skip tags from the length.

<p>Cut <strong>here?</strong></p>

Having the client use CSS to "hide" long texts is easier to handle.

Member Author

@humitos, Jun 1, 2021

The example you shared would return:

  1. with ?nwords=1: <p>Cut</p>
  2. with ?nwords=2: <p>Cut <strong>here?</strong></p>
  3. with ?nparagraphs=1: <p>Cut <strong>here?</strong></p>

Member

Yeah, but the problem is calculating the words and taking the tags into account: you'll be removing tags from the original HTML, possibly breaking some styles, like with tables or definition lists. Also, how would you know if there is a paragraph when you have content surrounded by other tags like lists or divs?

Member Author

We have been playing with this already in https://github.com/readthedocs/readthedocs-ext/pull/304/. How to do it is an implementation detail and I'm sure there are going to be some problems we will need to solve. However, I don't think it's impossible.

I've been playing a little with BeautifulSoup already and this seems to work close enough:

# nparagraphs.py
from bs4 import BeautifulSoup

# Parse the page downloaded with wget (see the commands below).
with open('install.html', encoding='utf-8') as fh:
    soup = BeautifulSoup(fh, 'html.parser')

nparagraphs = 3
section = soup.find('div', attrs={'id': 'development-installation'})
for element in section.find_all():
    if nparagraphs == 0:
        # Paragraph budget exhausted: drop every remaining element.
        element.extract()
    elif element.name == 'p':
        nparagraphs -= 1

print(section.prettify())
$ wget https://docs.readthedocs.io/en/stable/development/install.html
$ python nparagraphs.py
Click to see the output
<div class="section" id="development-installation">
 <h1>
  Development Installation
  <a class="headerlink" href="#development-installation" title="Permalink to this headline"></a>
 </h1>
 <p>
  These are development setup and
  <a class="hoverxref tooltip reference internal" data-doc="development/install" data-docpath="/development/install.html" data-project="docs" data-section="core-team-standards" data-version="stable" href="#core-team-standards">
   <span class="std std-ref">
    standards
   </span>
  </a>
  that are adhered to by the core development team while
developing Read the Docs and related services. If you are a contributor to Read the Docs,
it might a be a good idea to follow these guidelines as well.
 </p>
 <div class="admonition note">
  <p class="admonition-title">
   Note
  </p>
  <p>
   We do not recommend to follow this guide to deploy an instance of Read the Docs for production usage.
Take into account that this setup is only useful for developing purposes.
  </p>
 </div>
</div>

* Require a valid HTML ``id`` selector
* Handle special cases for particular doctools (e.g. Sphinx requires returning the ``.parent()`` element for ``dl``)
* Make it explicit that the client is asking to handle the special cases (e.g. send ``?doctool=sphinx&version=4.0.1``)
* Accept only the ``?url=`` GET argument to query the endpoint
* Add HTTP cache headers to cache responses
* Allow :abbr:`CORS` from everywhere
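
As an illustration of the "rewrite all links to make them absolute" goal above, here is a minimal sketch that uses only the Python standard library. It is not the backend implementation: the names ``AbsoluteLinkRewriter`` and ``make_links_absolute`` are hypothetical, only ``href`` and ``src`` attributes are considered, and comments and doctype declarations are dropped for brevity.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbsoluteLinkRewriter(HTMLParser):
    """Re-emit HTML, resolving href/src values against a base URL (hypothetical helper)."""

    URL_ATTRS = {"href", "src"}

    def __init__(self, base_url):
        # Keep character references untouched so the text is re-emitted as written.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _render(self, tag, attrs, closing=""):
        parts = [tag]
        for name, value in attrs:
            if value is None:
                # Attribute without a value, e.g. <input disabled>.
                parts.append(name)
                continue
            if name in self.URL_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def make_links_absolute(html, base_url):
    """Return ``html`` with relative links resolved against ``base_url``."""
    rewriter = AbsoluteLinkRewriter(base_url)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

With the page URL used as the base, an in-page link such as ``href="#core-team-standards"`` resolves to the full page URL plus that fragment, so the embedded content keeps working when shown on another site.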


Embed endpoint
--------------

Returns the exact HTML content for a specific identifier.
If no anchor identifier is specified, the content of the whole page is returned.

.. http:get:: /api/v3/embed/?url=https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment

:query url (required): Full URL for the documentation page with optional anchor identifier.
:query expand (optional): Returns extra data about the page. Currently, only ``?expand=identifiers`` is supported,
which returns all the identifiers that the page accepts.

.. sourcecode:: json

   {
     "project": "docs",
     "version": "latest",
     "language": "en",
     "path": "development/install.html",
     "title": "Development Installation",
     "url": "https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment",
     "id": "set-up-your-environment",
     "content": "<div class=\"section\" id=\"development-installation\">\n<h1>Development Installation<a class=\"headerlink\" href=\"https://docs.readthedocs.io/en/stable/development/install.html#development-installation\" title=\"Permalink to this headline\">¶</a></h1>\n ..."
   }
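
A minimal client-side sketch of how a request to this endpoint could be built, and how the ``url`` argument splits into a page URL and an optional identifier. The host in ``API_HOST`` and the helper names ``build_embed_request`` and ``split_identifier`` are assumptions made for illustration; only the ``/api/v3/embed/`` path and the ``url`` and ``expand`` arguments come from this document.

```python
from urllib.parse import urlencode, urlsplit

# Assumption: the API is served from this host; the document does not say so.
API_HOST = "https://readthedocs.org"


def build_embed_request(page_url, expand=None):
    """Build the GET URL for the embed endpoint, percent-encoding ``page_url``."""
    params = {"url": page_url}
    if expand:
        params["expand"] = expand
    return f"{API_HOST}/api/v3/embed/?{urlencode(params)}"


def split_identifier(page_url):
    """Split a documentation URL into (page URL, identifier or None)."""
    parts = urlsplit(page_url)
    page = parts._replace(fragment="").geturl()
    return page, parts.fragment or None
```

Encoding the ``url`` value matters because the ``#`` of the anchor would otherwise be treated as the fragment of the API request itself and never reach the server.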


When used together with ``?expand=identifiers`` the following field is also returned:
Member

I'm not sure about returning this information from the same endpoint. It is confusing to return the content of a section plus the list of sections; we should have a different endpoint to return data about the page itself.

Member Author

Can you expand here why it may cause confusion and how having a different endpoint to only show the available id= for a page is better? I think I'm not against a different endpoint, but I want to understand your position better.

Member

Because you are requesting information for a section, but getting information about the page too, you don't want to have to query a random section just to get information about the page. This also allows us to return more information about the page itself, like the title.

Member Author

@humitos, Jun 3, 2021

Pinging @agjohnson here since he wrote:

More metadata around headers, such as heading level. I'd like to display the topics in a nested menu, as they don't make as much sense in sequential order

in #7117

@agjohnson can you take a look at this proposal and suggest what the ideal response would be for you and how you would like to use it?

Member Author

@agjohnson returning headers may not be useful if those headers don't have an id= that we can find. Also, I realized that it's not trivial to "find the title for a particular id=" (we have this hack currently implemented that only works over a tags). The id= could be in the div surrounding the h1 tag or the dl tag or any other.

<div id="configuration">
  <h1>Configuration<a class="headerlink" href="#configuration" title="Permalink to this headline"></a></h1>
  <p>...</p>
</div>

For this case, we could get the .next() from the id=configuration. However, we can't guarantee that this will always work. Even if we get it, we will want to remove the trailing ¶ from its name as well in this case (but it could be a different char).

<dt id="confval-hoverxref_modal_class">
  <code class="sig-name descname">hoverxref_modal_class</code>
  <a class="headerlink" href="#confval-hoverxref_modal_class" title="Permalink to this definition"></a>
</dt>

This one is similar, but the .next() element is exactly the title we are looking for. It does not contain the ¶ char.

However, if the title we want is exactly on the h1 or dl (instead of their child), we will fail to detect it and we would return something invalid.


So, I think it's still useful to return all the available id=s from the page so developers can explore them, but it's not easy (maybe impossible) to know the exact title for that particular id= and the exact hierarchy as well.

Member Author

I moved this to a particular endpoint /api/v3/embed/identifiers/ and defined the initial response: "return all the available identifiers for a specific page".

We can expand it later if we need something else and there is a good way to do it.

Contributor

We talked more here, and the use case or feature that I'm describing is currently in a strange place.

More explicitly, I require a list of headings -- basically the toctree on a document -- with heading text, URL to link to the heading, and the heading nesting level or heading nesting as a data structure. This gives me what I need to basically expose the toctree in our application. This is the feature that I had wanted to expose for commercial use -- embedding documentation metadata/headings/etc in customer applications. A hoverxref type extension might be useful in addition here, but is separate. Customers would still need to get metadata out of a particular document in order to inject a toctree into an application view.

So, where I'm at is that I feel like this feature is more of a relic of where the embed API started and it is dragging the direction of the embed API down -- however this might just be my interpretation. The embed API is more focused on having generic support, and so therefore parsing HTML, and what I want is basically exposing the contents of objects.inv. I could be talking about a separate feature, or we could still be talking about keeping this as a separate API endpoint.

I am going to research this a bit more, I may be talking about a separate feature entirely at this point.

Member

objects.inv is already indexed & exposed w/ the SphinxDomain modeling, isn't it?

Contributor

Yup, so we talked about exposing a separate API for that instead.

We need to decide if this feature is going to be a separate product and how users will interact with it. I think we'd be maintaining 1 client JS library that does both embed of hover cards and injection of document header/refs from Sphinx, as the use case is similar.

@humitos Did we talk about what mechanism we'd use if this function was a subpath on the embed API? Are we using Sphinx refs via the SphinxDomain modeling?

Member Author

@humitos Did we talk about what mechanism we'd use if this function was a subpath on the embed API? Are we using Sphinx refs via the SphinxDomain modeling?

Nope, we didn't talk about this and I haven't thought too much either. I usually forget that we have SphinxDomain model --and I haven't used it yet.

Something like /api/v3/embed/headers/ may work for this use case. However, we have to keep in mind that:

  • it will only work with Sphinx if we use the SphinxDomain model
  • making it generic by parsing the HTML is probably impossible
  • we can use the HTTP request arguments for other doctools, as ?doctool=mkdocs&version=1.0.1 and parse a known HTML structure


.. sourcecode:: json

   {
     "identifiers": [
       {
         "title": "Set up your environment",
         "id": "set-up-your-environment",
         "url": "https://docs.readthedocs.io/en/latest/development/install.html#set-up-your-environment"
       },
       {
         "title": "Check that everything works",
         "id": "check-that-everything-works",
         "url": "https://docs.readthedocs.io/en/latest/development/install.html#check-that-everything-works"
       },
       ...
     ]
   }
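
A minimal sketch of how the ``id`` and ``url`` fields of such a response could be collected from a page, using the standard library parser. ``IdentifierCollector`` and ``list_identifiers`` are hypothetical names, and the ``title`` field is left out on purpose: as discussed in the review comments, there is no reliable generic way to find the title for a given ``id=``.

```python
from html.parser import HTMLParser


class IdentifierCollector(HTMLParser):
    """Collect every ``id`` attribute found in a page, in document order."""

    def __init__(self):
        super().__init__()
        self.identifiers = []

    def handle_starttag(self, tag, attrs):
        identifier = dict(attrs).get("id")
        if identifier:
            self.identifiers.append(identifier)


def list_identifiers(html, page_url):
    """Return one ``{"id", "url"}`` entry per identifier found in ``html``."""
    collector = IdentifierCollector()
    collector.feed(html)
    collector.close()
    return [
        {"id": identifier, "url": f"{page_url}#{identifier}"}
        for identifier in collector.identifiers
    ]
```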


Handle specific Sphinx cases
----------------------------

.. https://github.com/readthedocs/readthedocs.org/pull/8039#discussion_r640670085

We are currently handling some special cases for Sphinx due to how it writes the HTML output structure.
In some cases, we look for the HTML tag with the identifier requested but we return
the ``.next()`` HTML tag or the ``.parent()`` tag instead of the *requested one*.

Currently, we have identified that this happens for definition tags (``dl``, ``dt``, ``dd``),
but there may be other cases we don't know about yet.
Sphinx adds the ``id=`` attribute to the ``dt`` tag, which contains only the title of the definition,
but users expect to get its description as well.

In the following example we will return the whole ``dl`` HTML tag instead of
the HTML tag with the identifier ``id="term-name"`` as requested by the client,
because otherwise the "Term definition for Term Name" content won't be included and the response would be useless.

.. code:: html

<dl class="glossary docutils">
<dt id="term-name">Term Name</dt>
<dd>Term definition for Term Name</dd>
</dl>

If the definition list (``dl``) has more than *one definition*, it will return **only the term requested**.
Consider the following example, with the request ``?url=glossary.html#term-name``:

.. code:: html

<dl class="glossary docutils">
...

<dt id="term-name">Term Name</dt>
<dd>Term definition for Term Name</dd>

<dt id="term-unknown">Term Unknown</dt>
<dd>Term definition for Term Unknown </dd>

...
</dl>


It will return the whole ``dl`` with only the ``dt`` and ``dd`` for the ``id`` requested:

.. code:: html

<dl class="glossary docutils">
<dt id="term-name">Term Name</dt>
<dd>Term definition for Term Name</dd>
</dl>
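
The behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the backend code: it uses the standard library ``html.parser`` instead of the parser the backend uses, the names ``DefinitionExtractor`` and ``extract_definition`` are hypothetical, and it assumes the ``dt`` and ``dd`` tags are explicitly closed.

```python
from html.parser import HTMLParser


class DefinitionExtractor(HTMLParser):
    """Capture the requested <dt> and its <dd>, wrapped in the parent <dl>."""

    def __init__(self, identifier):
        super().__init__(convert_charrefs=False)
        self.identifier = identifier
        self.open_dls = []      # raw start tags of the <dl> elements currently open
        self.chunks = []        # raw markup captured for the response
        self.capturing = False
        self.dd_depth = 0       # nesting depth inside the captured <dd>

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if self.capturing:
            self.chunks.append(raw)
            if tag == "dd":
                self.dd_depth += 1
        elif tag == "dl":
            self.open_dls.append(raw)
        elif tag == "dt" and self.open_dls and not self.chunks:
            if dict(attrs).get("id") == self.identifier:
                # Start from the enclosing <dl> so the result is valid HTML.
                self.chunks = [self.open_dls[-1], raw]
                self.capturing = True

    def handle_startendtag(self, tag, attrs):
        if self.capturing:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.capturing:
            self.chunks.append(f"</{tag}>")
            if tag == "dd":
                self.dd_depth -= 1
                if self.dd_depth == 0:
                    # The definition is complete: close the <dl> and stop.
                    self.chunks.append("</dl>")
                    self.capturing = False
        elif tag == "dl" and self.open_dls:
            self.open_dls.pop()

    def handle_data(self, data):
        if self.capturing:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.capturing:
            self.chunks.append(f"&{name};")

    def handle_charref(self, name):
        if self.capturing:
            self.chunks.append(f"&#{name};")


def extract_definition(html, identifier):
    """Return the reduced <dl> for ``identifier``, or None if no <dt> matches."""
    parser = DefinitionExtractor(identifier)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) or None
```

Calling ``extract_definition(page_html, "term-name")`` on the glossary above yields a ``dl`` that keeps the original ``class`` attribute and contains only the ``term-name`` entry.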


However, these assumptions may not apply to documentation pages built with a doctool other than Sphinx.
For this reason, we need to tell the API that we want it to handle these special cases in the backend.
This will be done by appending GET arguments to the Embed API endpoint request: ``?doctool=sphinx&version=4.0.1``.
In this case, the backend will know that it has to deal with these special cases.

.. note::

This leaves the door open to supporting more special cases (e.g. for other doctools) without breaking the current behavior.
Member

This seems like a better solution than doing it for all requests, and gives us a way to deprecate things 👍



Support for external documents
------------------------------

When the ``?url=`` argument passed belongs to a documentation page not hosted on Read the Docs,
the endpoint will make an external request to download the HTML file,
parse it and return the content for the identifier requested.

The whole logic should be the same; the only difference is where the source HTML comes from.

.. warning::

We should be careful with the URL received from the user because it may be an internal URL and we could leak some data.
Example: ``?url=http://localhost/some-weird-endpoint`` or ``?url=http://169.254.169.254/latest/meta-data/``
(see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html).

This is related to SSRF (https://en.wikipedia.org/wiki/Server-side_request_forgery).
It doesn't seem to be a huge problem, but something to consider.

Also, the endpoint may need to limit the requests per-external domain to avoid using our servers to take down another site.
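
One possible mitigation for the SSRF concern above is to resolve the host of the requested URL and refuse anything that is not a public address before making the external request. The following is a minimal sketch with hypothetical helper names (``is_public_address``, ``is_safe_external_url``), not a complete defense: among other things, the address has to be checked again (or pinned) when the request is actually made, since DNS answers can change between the check and the request.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_public_address(ip):
    """Return True if ``ip`` is not a private, loopback, link-local or otherwise special address."""
    addr = ipaddress.ip_address(ip)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def is_safe_external_url(url):
    """Return True only if every address the host resolves to is public."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = socket.getaddrinfo(parts.hostname, port)
    except (ValueError, socket.gaierror):
        # Invalid port or unresolvable host: refuse the request.
        return False
    # sockaddr[0] is the address; drop any IPv6 scope id such as "%eth0".
    return all(is_public_address(info[4][0].split("%")[0]) for info in infos)
```

Both example URLs from the warning are rejected by this check: ``169.254.169.254`` is a link-local address and ``localhost`` resolves to a loopback address.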


Embed APIv2 deprecation
-----------------------

APIv2 is currently widely used by projects using the ``sphinx-hoverxref`` extension.
Because of that, we need to keep supporting it as-is for a long time.

Next steps in this direction should be:

* Add a note in the documentation mentioning this endpoint is deprecated
* Promote the usage of the new Embed APIv3
* Migrate the ``sphinx-hoverxref`` extension to use the new endpoint
Contributor

If we are specifying deprecation in this document, then the use case for embedding Sphinx refs should also be mentioned here. Deprecation should depend on surfacing an API to expose documentation refs. We need to have a discussion about this end point and how it is implemented/etc next, but that can be separate.


Once we have done them, we could check our NGINX logs to find out if there are people still using APIv2,
contact them and let them know that they have some months to migrate since the endpoint is deprecated and will be removed.


Unanswered questions
--------------------

* How do we distinguish our APIv3 for resources (models in the database) from these "feature API endpoints"?
Member

Definitely a good question.

Member Author

Does it make sense to switch the version and API name so we have the version after its name? That way we can release a new version of a particular API without touching the others.

  • /api/search/v3/
  • /api/footer/v3/
  • /api/resources/v2/ and /api/resources/v3/
  • /api/embed/v3/
  • etc

cc @stsewd @ericholscher

Member

That naming feels weird to me, I think we just need to document which endpoints are part of what.

Member Author

@stsewd can you expand how that would be and give some examples?

Member

this is already done in part for search for example https://docs.readthedocs.io/en/stable/server-side-search.html#api we just need to do the same with the embed api and footer

Member Author

@stsewd I don't follow you. That search endpoint is under /api/v2/ --so we are not differentiating the "Feature APIs" from the "Resources API" there.

print(stsewd.dump("brain").verbose())

Member

We want to make this distinction in our docs, not in the URLs

* What happens if a project changes its custom domain? Do we support redirects in this case?