Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embed: design doc for new embed API #8039

Closed
wants to merge 10 commits into from
Closed

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Mar 22, 2021

Mostly based on my notes from #7963.

@stsewd stsewd requested a review from a team March 22, 2021 21:17
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are good ideas here.

I don't think we should have 2 separate endpoints for this and we should only rely on ?url= attribute unless there is a good reason to not to.

Also, we should support nparagraph= or nwords= or both to be able to chunk the content of the section returned (see https://github.com/readthedocs/readthedocs-ext/pull/304)

- There are three ways of querying the API, and some rely on Sphinx's concepts like ``doc``.
- Doesn't cache responses or doesn't purge the cache on build.
- Doesn't support MkDocs.
- It returns all sections from the current page.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can send section= to get only the section you want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean here is that it always returns all the sections (title and id), not the section content.

These improvements aren't breaking changes, so we can implement them in the old and new API.

- Support for MkDocs.
- Always return a valid/well formed HTML block.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this refers to the dd/dt issue, I'm not sure if this will be easily doable for all the cases. We may end up re-parsing the dd to convert that HTML tag to another one, but that will break the style based on dd tags. So, this will be tricky to do it right and make it work for all the themes --which is also related with the new extras field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just need to extract only that element. For example

<dl class="foo">
 .. more..
 <dt>foo</dt> <dd>bar</dd>
.. more...
</dl>

becomes

<dl class="foo">
 <dt>foo</dt> <dd>bar</dd>
</dl>

Extracting content isn't hard to do here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm... I think that has the same problem.

When parsing citation/glossary we don't need to return the title/definition (dt), we only want the description (dd). Example in code

# Structure:
# <dl class="glossary docutils">
# <dt id="term-definition">definition</dt>
# <dd>Text definition for the term</dd>
# ...
# </dl>
query_result = query_result.next()

Following, your example, to make it valid it should be:

<dl class="foo">
 <dt></dt>
 <dd>bar</dd>
</dl>

However, this won't probably render correctly inside the tooltip, since in that context is not a description list --it's not a list at all-- but just the definition of one of the term from that list.

In those cases, citation/glossary will be rendered in the same than any other definition list and I'm not sure we want that. We need to think how we will support this. Example:

Glossary (only includes the definition)

Screenshot_2021-03-30_11-26-04

Config value (includes the definition list completely repeating the term hovered)

Screenshot_2021-03-30_11-23-06

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should repeat the title, it doesn't necessary match the text where the tooltip is present. Removing the title if it's the same as the text can be done in the client side.

Copy link
Member

@humitos humitos May 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glossary and Citation special cases can be managed from the frontend/client since we have all the data we need (note that we return the .next() HTML element). We will just need to parse it remove the content we don't want to display.

However, for the dt case (definition of a config, Python method/function, etc) it won't work because we need the .parent() of HTML element picked by ID. Take a look at this example:

<dl class="std confval">
  <dt id="confval-hoverxref_role_types">
    <code class="sig-name descname">hoverxref_role_types</code>
    <a class="headerlink" href="#confval-hoverxref_role_types" title="Permalink to this definition"></a>
  </dt>
  <dd>
    <p>Description: Style to use by default when hover each type of reference (role).</p>
    <!-- ... snip ... -->
  </dd>
</dl>

(example from https://sphinx-hoverxref.readthedocs.io/en/latest/configuration.html#confval-hoverxref_role_types)

With the id confval-hoverxref_role_types we are currently returning the whole dl because we access to the .parent() element. If we return just the HTML tag matched, we will be returning only the dt which does not include the description nor any content.

Without handling this special case properly, we will show tooltips like:

Screenshot_2021-05-27_16-16-41

We need to find a general solution for these special cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@humitos not sure to understand the problem there? that's the same example we have in the document. The id is on the dt element, so we fetch the dt and the next dd enclosed in the dl tag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may not have expressed myself properly.

This is a special case because we are manipulating the HTML to manage a particular case on Sphinx: return the parent of a particular ID, instead of itself. Since we are going to support external sites that will not be generated with Sphinx (*), we can't assume that we always have to do this special HTML manipulation. That's why I'm saying that we need a general solution to differentiate these special cases from regular (non-Sphinx) pages.

I'm thinking that we will need to communicate this to the backend somehow. Maybe a request GET argument: ?doctool=sphinx&version=3.5.1

(*) or even with a different version of Sphinx that may adds the id= in the dl.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's part of how we expect the html to be, following semantic html and using correct tags, if a site has malformed tags or if the id is in a different element we don't try to fix it, we just return it as is.


Allow us to query information about a page, like its list of sections.

.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we will have only one way of querying the API, we should probably use ?url= instead of sending multiple attributes. The implementation of the extensions is easier (and there are things not possible to implement without it --e.g. intersphinx) and the usage of the endpoint will be simplified too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a URL makes it depend on the domain, if the domain changes everything is broken. I'm fine supporting both to support intersphinx.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html
.. http:get:: /_/api/v3/embed/page/?project=docs&version=latest&path=install.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure if we want to make this a v3 API, given that it works totally differently than the others :/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure if we want to make this a v3 API, given that it works totally differently than the others :/

That's why my note in #8039 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is super bad. The other endpoints are for resources and this one is not.

However, I'm starting to think that this could be a different service as well? But maybe that's too much? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a URL makes it depend on the domain, if the domain changes everything is broken

Yes and no. Maybe (?), I guess. If they use https://readthedocs.org/api/v3/embed/?url= endpoint and it shouldn't break if they don't remove the Domain for that Project from our platform. Even if the DNS for that domain is dead.

Comment on lines 64 to 73
"sections": [
{
"title": "Installation",
"id": "installation"
},
{
"title": "Examples",
"id": "examples"
}
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of listing all the sections. I'd also like to see a way to explore/surf the API just by clicking these links, similar as we do with the _links field in other endpoints. This way, you can discover the API by clicking on those, but also expose these links to developers.

API Client
----------

Do we really need a JS client?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. IIRC @agjohnson idea here was to include the popup/modal form of this into the JS client. Besides, we could eventually be able to use this JS client from our extension itself, instead of maintaining the front-end in our own extension (or adding it as another popup extension together with the ones we currently support)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the JS client is going to be doing a lot of work in addition to just doing the GET. We definitely want nice default widgets, along with helper methods that make this a much nicer UX for users.

@stsewd
Copy link
Member Author

stsewd commented Mar 30, 2021

Also, we should support nparagraph= or nwords= or both to be able to chunk the content of the section returned

Feels like that's a client thing, to be able to cut the content exactly where they want and show a "See more" link. And maybe can be done with css, without messing around the content itselff.

@humitos
Copy link
Member

humitos commented Mar 30, 2021

Also, we should support nparagraph= or nwords= or both to be able to chunk the content of the section returned

Feels like that's a client thing, to be able to cut the content exactly where they want and show a "See more" link

I don't think we want to return/download a whole page just to show the first paragraph of the first section. That's my point.

Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good start, but I think we need a lot more thought and explanation behind some of the design decisions. I'd love to see more structure for a design doc here, explicitly stating the goals in a more detailed way. There is a ton of implied context in each "list of 10 items" that we should be explaining more fully.

Most importantly, we should have a sections of goals, and why they are valuable. For example, why is "Always return a valid/well formed HTML block." important? How are we going to ensure that at the code level?

docs/development/design/embed-api.rst Outdated Show resolved Hide resolved

Allow us to query information about a page, like its list of sections.

.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. http:get:: /_/api/v3/embed/pages?project=docs&version=latest&path=install.html
.. http:get:: /_/api/v3/embed/page/?project=docs&version=latest&path=install.html


Allow us to query the content of the section, with all links re-written as absolute.

.. http:get:: /_/api/v3/embed/sections?project=docs&version=latest&path=install.html#examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. http:get:: /_/api/v3/embed/sections?project=docs&version=latest&path=install.html#examples
.. http:get:: /_/api/v3/embed/section/?project=docs&version=latest&path=install.html#examples

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the naming conventions here. We are naming a resource, so it should be plural, but we only have a single resource to get the data from.

docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
docs/development/design/embed-api.rst Show resolved Hide resolved
API Client
----------

Do we really need a JS client?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the JS client is going to be doing a lot of work in addition to just doing the GET. We definitely want nice default widgets, along with helper methods that make this a much nicer UX for users.

but them querying a section would require to query a page to get the extra js/css files.
- We could not return the ``title`` of the page/section as it would require more parsing to do
(but we can re-use the code from search).
Titles can be useful to build an UI like https://readthedocs.org/projects/docs/tools/embed/.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would a user build something like this without the ability to also get all the pages in a project? We aren't exposing an API for that in this design doc. Should we?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point, we could have an endpoint for that, but now that you mentioned we need to support external sites, I don't think it would be possible. Or it would be rtd only feature.

docs/development/design/embed-api.rst Outdated Show resolved Hide resolved

Note that we only try to keep the current structure,
if the page contains malformed HTML, we don't try to *fix it*.
This improvement can be shared with the current API (v2).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't back-port this since it's a breaking change from a user's perspective: the render of the content in the tooltip will be completely different (see #8039 (comment))

docs/development/design/embed-api.rst Outdated Show resolved Hide resolved
Considerations
``````````````

If a project changes its custom domain, current usage of the API would break.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could check Referer/Origin header as a fallback to try to find the project in case they changed the domain (sent via ?url=).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also save all the historic domains for a project. Instead of delete them, we can mark them as inactive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems very difficult to scale. What if another project reuses the domain?

Comment on lines 197 to 202
The API could be missused.
This is already true if we don't support external sites,
since we host arbitrary HTML already.
But it can be abussed to crawl external sites without the consent.
We can integrate support for external sites in a later stage,
or have a list of allowed sites.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand what's the problem or abuse we can face here.

I think we definitely don't have a list of allowed sites, here. We want this to be general/global.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would be exposed to SSRF, basically an attacker can use our servers to make get requests to other insecure sites or DDOS another site.

Comment on lines +58 to +67
"sections": [
{
"title": "Installation",
"id": "installation"
},
{
"title": "Examples",
"id": "examples"
}
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for this field? Are we returning nested sections here or only the first level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All sections, nested as well, so the client knows what sections it can query, this is only from h elements, but I guess it could be expanded for more tags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, why only show h tags when the backend can reply to any anchor link? If we want expose all the possible queries we should return all the valid anchors.

In any case, I think we should add a request GET argument to expand this and avoid returning a lot of data when it's not going to be used. Example: ?expand=sections

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in another endpoint, so I don't think we need that feature here.

@humitos humitos mentioned this pull request May 31, 2021
@stsewd stsewd deleted the design-doc-embed-api branch October 11, 2023 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants