Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design doc: Embed APIv3 #8222

Merged
merged 19 commits into from
Jun 23, 2021
Merged

Design doc: Embed APIv3 #8222

merged 19 commits into from
Jun 23, 2021

Conversation

humitos
Copy link
Member

@humitos humitos commented May 31, 2021

I made too many changes to the original document that I'm creating a new PR based on Santos' PR: #8039. It basically extends and replaces it.

📝 read the rendered version at: https://docs--8222.org.readthedocs.build/en/8222/development/design/embed-api.html

Closes #7963
Closes #8039

@humitos humitos changed the title Embed: design doc for new embed API Design doc: Embed APIv3 May 31, 2021
* Rewrite all links returned in the content to make them absolute
* Always return valid HTML structure
* Delete HTML tags from the original document if needed
* Support ``?nwords=`` and ``?nparagraphs=`` to return chunked content
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still were discussing this. I don't think we can support it without having to handle specific cases, not all returned html will have p tags or the word will end in a valid tag, we also need to skip tags from the lenght

<p>Cut <strong>here?</strong></p>

Having the client using css to "hide" long texts is more easy to handle.

Copy link
Member Author

@humitos humitos Jun 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example you shared would return:

  1. with ?nwords=1: <p>Cut</p>
  2. with ?nwords=2: <p>Cut <strong>here?</strong></p>
  3. with ?nparagraphs=1: <p>Cut <strong>here?</strong></p>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but the problem is calculating the words and taking the tags into account, you'll be removing tags from the original html, possible breaking some styles like with tables/definition lists. And also how would you know if there is a paragraph when you have content surrounded in other tags like lists or divs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have been playing with this already in https://github.com/readthedocs/readthedocs-ext/pull/304/. How to do it is an implementation detail and I'm sure there is going to be some problems we will need to solve. However, I don't think it's impossible.

I've been playing a little with BeautifulSoup already and this seems to work close enough:

# nparagraphs.py
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('install.html'))
nparagraphs = 3
for element in soup.find('div', attrs={'id': 'development-installation'}).findAll():
    if nparagraphs == 0:
        element.replaceWith('')

    if element.name == 'p' and nparagraphs > 0:
        nparagraphs -= 1
$ wget https://docs.readthedocs.io/en/stable/development/install.html
$ python nparagraph.py
Click to see the output
<div class="section" id="development-installation">
 <h1>
  Development Installation
  <a class="headerlink" href="#development-installation" title="Permalink to this headline"></a>
 </h1>
 <p>
  These are development setup and
  <a class="hoverxref tooltip reference internal" data-doc="development/install" data-docpath="/development/install.html" data-project="docs" data-section="core-team-standards" data-version="stable" href="#core-team-standards">
   <span class="std std-ref">
    standards
   </span>
  </a>
  that are adhered to by the core development team while
developing Read the Docs and related services. If you are a contributor to Read the Docs,
it might a be a good idea to follow these guidelines as well.
 </p>
 <div class="admonition note">
  <p class="admonition-title">
   Note
  </p>
  <p>
   We do not recommend to follow this guide to deploy an instance of Read the Docs for production usage.
Take into account that this setup is only useful for developing purposes.
  </p>
 </div>
</div>

}


When used together with ``?expand=identifiers`` the follwing field is also returned:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about returning this information on the same endpoint, is confusing returning the content of a section plus the list of sections, we should have a different endpoint to return data about the page itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand here why it may cause confusion and how having a different endpoint to only show the available id= for a page is better? I think I'm not against a different endpoint, but I want to understand your position better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because you are requesting information for a section, but getting information about the page too, you don't want to have to query a random section just to get information about the page. This also allow us to return more information about the page itself like the title.

Copy link
Member Author

@humitos humitos Jun 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinging @agjohnson here since he wrote:

More metadata around headers, such as heading level. I'd like to display the topics in a nested menu, as they don't make as much sense in sequential order

in #7117

@agjohnson can you take a look at this proposal and suggest that would be the ideal response to you and how you would like to use it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agjohnson returning headers may not be useful if those headers don't have an id= that we can find. Also, I realized that it's not trivial to "find the title for a particular id=" (we have this hack currently implemented that only works over a tags. The id= could be in the div surrounding the h1 tag or the dl tag or any other.

<div id="configuration">
  <h1>Configuration<a class="headerlink" href="#configuration" title="Permalink to this headline"></a></h1>
  <p>...</p>
</div>

For this case, we could get the .next() from the id=configuration. However, we can't guarantee that this will always work. Even if we get it, we will want to remove the trailing from its name as well in this case (but it could be a different char)

<dt id="confval-hoverxref_modal_class">
  <code class="sig-name descname">hoverxref_modal_class</code>
  <a class="headerlink" href="#confval-hoverxref_modal_class" title="Permalink to this definition"></a>
</dt>

This one is similar, but the .next() element is exactly the title we are looking for. It does not contain the char.

However, if the title we want is exactly on the h1 or dl (instead of their child), we will fail to detect it and we would return something invalid.


So, I think it's still useful to return all the available id=s from the page so developers can explore them, but it's not easy (maybe impossible) to know the exact title for that particular id= and the exact hierarchy as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this to a particular endpoint /api/v3/embed/identifiers/ and define the initial response: "return all the available identifiers for a specific page".

We can expand it later if we need something else and there is a good way to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked more here, and the use case or feature that I'm describing is currently in a strange place.

More explicitly, I require a list of headings -- basically the toctree on a document -- with heading text, URL to link to the heading, and the heading nesting level or heading nesting as a data structure. This gives me what I need to basically expose the toctree in our application. This is the feature that I had wanted to expose for commercial use -- embedding documentation metadata/headings/etc in customer applications. A hoverxref type extension might be useful in addition here, but is separate. Customers would still need to get metadata out of a particular document in order to inject a toctree into an application view.

So, where I'm at is that I feel like this feature is more of a relic of where the embed API started and it is dragging the direction of the embed API down -- however this might just be my interpretation. The embed API is more focused on having generic support, and so therefore parsing HTML, and what I want is basically exposing the contents of objects.inv. I could be talking about a separate feature, or we could still be talking about keeping this as a separate API endpoint.

I am going to research this a bit more, I may be talking about a separate feature entirely at this point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

objects.inv is already indexed & exposed w/ the SphinxDomain modeling, isn't it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, so we talked about exposing a separate API for that instead.

We need to decide if this feature is going to be a separate product and how users will interact with it. I think we'd be maintaining 1 client JS library that does both embed of hover cards and injection of document header/refs from Sphinx, as the use case is similar.

@humitos Did we talk about what mechanism we'd use if this function was a subpath on the embed API? Are we using Sphinx refs via the SphinxDomain modeling?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@humitos Did we talk about what mechanism we'd use if this function was a subpath on the embed API? Are we using Sphinx refs via the SphinxDomain modeling?

Nope, we didn't talk about this and I haven't thought too much either. I usually forget that we have SphinxDomain model --and I haven't used it yet.

Something like /api/v3/embed/headers/ may work for this use case. However, we have to keep in mind that:

  • it will only work with Sphinx if we use the SphinxDomain model
  • making it generic by parsing the HTML is probably impossible
  • we can use the HTTP request arguments for other doctools, as ?doctool=mkdocs&version=1.0.1 and parse a known HTML structure

@humitos humitos requested a review from a team June 1, 2021 16:27
@humitos
Copy link
Member Author

humitos commented Jun 3, 2021

Notes to myself: it seems I missed some points from #7117 that @agjohnson wants, like language as argument and/or HTTP header. So, I'll come back to this document and update it with those points.

@humitos humitos requested a review from a team June 7, 2021 13:45
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Show resolved Hide resolved
}


When used together with ``?expand=identifiers`` the follwing field is also returned:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using the embed API list of sections to inject a (currently flat, not nested) list of headings for help topics. Displaying heading listing would be an important feature of an embed client, for providing customers with a way to inject topics to their application.

docs/development/design/embed-api.rst Show resolved Hide resolved
humitos added 2 commits June 9, 2021 15:26
- re-order goals
- allow CORS only for public projects
- new section with the definition of the contract
- define `/api/v3/embed/identifiers/` endpoint
  - remove `title` field from it because it's not easy to get it
  - return only available identifiers
  - add `_links` to make the API browseable
- handle project's domain changes querying for 3xx status codes
@humitos
Copy link
Member Author

humitos commented Jun 9, 2021

@ericholscher @stsewd @agjohnson I updated the document with the latest conversation we had. I think I covered all the concerns risen here. Let's see if you agree with what I wrote 😄

Copy link

@nguyenminhhoang888 nguyenminhhoang888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of small language nitpicks. I think in general this is heading in a good direction. I like having a way to know when to apply sphinx magic as needed.

docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Show resolved Hide resolved
only under ``?doctool=`` and ``?version=`` arguments.

If no ``id`` selector is sent to the request, the content of the first meaningfull HTML tag
(``<main>``, ``<div role="main">``, etc) identifier found is returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this useful? Seems like a lot of content to return -- is the use case here for small pages or browsing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it because it's how it works currently. I don't have a strong opinion and I'm happy to remove it if there is no well-defined use case yet. In fact, I didn't add this originally but then I came back and added this paragraph because I realized that I was removing a feature that we currently have.

Copy link
Member

@ericholscher ericholscher Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I'm not sure. I use this functionality a lot when tested (eg. I just throw a URL at the API and it returns the whole page content) -- but I'm not sure if this is generally useful. It definitely makes the API easier to browse and understand, so I think that's valuable. A lot of users will probably do this instead of any metadata endpoint to get the page sections.

I think this functionality probably makes sense, but we should be very clear about what it does (no etc :D)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I'm not sure if this is generally useful

I'm not sure we are already using the feature as is right now. However, I opened a PR in sphinx-hoverxref to support :doc: that would require this: readthedocs/sphinx-hoverxref#68

but we should be very clear about what it does (no etc :D)

He he, I left the "etc" here because I didn't do the research required to grab the most used ones. The two that I mentioned here are HTML5 and Sphinx default's one, but it may be others. This will be clear in the documentation of the endpoint itself without magic/guessing involved, tho.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this feature combined with ?nparagraphs= makes more sense.


.. note::

This leaves the door open to be able to support more special cases (e.g. for other doctools) without breaking the actual behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a better solution than doing it for all requests, and gives us a way to deprecate things 👍

docs/development/design/embed-api.rst Show resolved Hide resolved
docs/development/design/embed-api.rst Show resolved Hide resolved
Unanswered questions
--------------------

* How do we distinguish between our APIv3 for resources (models in the database) from these "feature API endpoints"?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a good question.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to switch the version and API name so we have the version after its name? That way we can release a new version of a particular API without touching the others.

  • /api/search/v3/
  • /api/footer/v3/
  • /api/resources/v2/ and /api/resources/v3/
  • /api/embed/v3/
  • etc

cc @stsewd @ericholscher

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That naming feels weird to me, I think we just need to document which endpoints are part of what.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stsewd can you expand how that would be and give some examples?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already done in part for search for example https://docs.readthedocs.io/en/stable/server-side-search.html#api we just need to do the same with the embed api and footer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stsewd I don't follow you. That search endpoint is under /api/v2/ --so we are not differentiating the "Feature APIs" from the "Resources API" there.

print(stsewd.dump("brain").verbose())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to make this distinction in our docs, not in the URLs

Copy link
Contributor

@agjohnson agjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting a few questions left over after talking through this with @humitos

docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
@humitos
Copy link
Member Author

humitos commented Jun 14, 2021

I think we are ready for a final review (cc @stsewd, @ericholscher, @agjohnson) and to start with its implementation. We had discussed a lot of things and it seems that we may add extra features in the future when we have more clear use-cases, but I think that's fine. I'm happy that we are covering the general case in a simple way and the special cases with an explicit argument.

@ericholscher
Copy link
Member

I think this looks good to get started on. I still have a couple implementation questions, and think we should focus on the core implementation first and then worry about Sphinx edge cases later.

Copy link
Contributor

@agjohnson agjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are to separate out the conversation around embedding Sphinx refs/headings, then I think this particular document is very close and we can continue. We need to come back to a separate conversation on a replacement API next.


* Add a note in the documentation mentioning this endpoint is deprecated
* Promote the usage of the new Embed APIv3
* Migrate the ``sphinx-hoverxref`` extension to use the new endpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are specifying deprecation in this document, then the use case for embedding Sphinx refs should also be mentioned here. Deprecation should depend on surfacing an API to expose documentation refs. We need to have a discussion about this end point and how it is implemented/etc next, but that can be separate.

@humitos humitos merged commit c86f18a into master Jun 23, 2021
@humitos humitos deleted the humitos/design-doc-embed-api branch June 23, 2021 09:43
@humitos humitos mentioned this pull request Jul 6, 2021
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improvements for the embed app
5 participants