Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embed APIv3: initial implementation #8319

Merged
merged 49 commits into from
Sep 21, 2021
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
61554ac
Move `clean_links` to embed/utils.py
humitos Jul 5, 2021
311ada4
`clean_links` response HTML in raw instead of a PyQuery object
humitos Jul 6, 2021
76d0bf5
Implement the minimal version of the contract
humitos Jul 6, 2021
e841b98
Add docs.sympy.org to the allowed domains
humitos Jul 6, 2021
78f6656
Implement doctool= usage for Sphinx `dt` known cases
humitos Jul 6, 2021
fe3a507
Use timeout= when requesting external pages
humitos Jul 6, 2021
28e4e83
Handle requests TooManyRedirects and other errors
humitos Jul 6, 2021
71b9e2d
Use cache-keys and djangos settings for timeouts
humitos Jul 7, 2021
1aa70e3
More logs and comments
humitos Jul 7, 2021
5bab780
Remove one unneeded conditional
humitos Jul 7, 2021
59fe0f1
Return fragment=null if there is no fragment
humitos Jul 7, 2021
517d3e8
Use setting for request.get `timeout=` argument
humitos Jul 12, 2021
562c260
Log exception to track wrong URLs in Sentry
humitos Jul 12, 2021
73e2ee3
Sanitize the URL inside `_download_page_content`
humitos Jul 12, 2021
854a44b
Handle malformed URLs (not netloc or scheme)
humitos Jul 12, 2021
49e02d5
Use for/else syntax sugar instead of a temporary variable
humitos Jul 12, 2021
be32bd3
Call `clean_links` before creating the response
humitos Jul 12, 2021
cc2227c
Do not depend on impicit state: pass the required arguments
humitos Jul 12, 2021
1a72c07
Don't return metadata (project, version, language, path)
humitos Jul 12, 2021
7b6b493
Update readthedocs/embed/v3/views.py
humitos Jul 15, 2021
418d9ac
Improve the response http status codes
humitos Jul 19, 2021
8a77043
Sanitize URL before passing it to `clean_liniks`
humitos Jul 19, 2021
d9ca50e
Comment to sanitize `cache_key` by URL
humitos Jul 19, 2021
3be9b8e
Update import for `clean_links` in tests
humitos Jul 19, 2021
a27122a
Do not call selectolax if there is no content
humitos Jul 19, 2021
84c3d18
Check if the domain is valid before calling `unresolver`
humitos Jul 19, 2021
7adccb2
Remove tedius warnings from pytest
humitos Jul 19, 2021
a1cf45a
Initial test suite for EmbedAPI v3
humitos Jul 19, 2021
ec2fd5f
Add `doctoolwriter` to allow `html4` and `html5` on Sphinx
humitos Aug 16, 2021
f05da3f
Run EmbedAPIv3 test on a different tox environment
humitos Aug 16, 2021
7f3e3a9
Fix tests with proper error message
humitos Aug 16, 2021
bcfba37
Run tests-embedapi in CircleCI
humitos Aug 16, 2021
905cbcf
Consider docutils 0.16 and 0.17 when checking HTML output
humitos Aug 16, 2021
b0dc81f
Revert "Fix tests with proper error message"
humitos Aug 16, 2021
9810f89
Revert "Add `doctoolwriter` to allow `html4` and `html5` on Sphinx"
humitos Aug 16, 2021
05936d3
Lint
humitos Aug 16, 2021
333c892
Disable unused-argument for now
humitos Aug 16, 2021
c4751c7
Make test for sphinxcontrib-bibtex to pass
humitos Aug 18, 2021
7dd4b4f
Auto-delete _build directory after test run
humitos Aug 18, 2021
d134ab9
Checks that depend on Sphinx version (3.5)
humitos Aug 18, 2021
4b5f2d6
Don't make doctoolversion= attribute mandatory when passing doctool=
humitos Aug 18, 2021
0056e90
Sphinx 3.5 seems to be different on its HTML
humitos Aug 18, 2021
cf3da8a
Lint
humitos Aug 18, 2021
aeaad76
Fragment case changed on 3.0.0
humitos Aug 18, 2021
ef5a14c
Don't run EmbedAPIv3 tests by default
humitos Aug 18, 2021
160e51b
Sphinx 3.5 adds an <span> on <dl>
humitos Aug 18, 2021
cc76893
Update readthedocs/embed/v3/tests/test_external_pages.py
humitos Sep 9, 2021
5e25c47
Log url= together with fragment=
humitos Sep 21, 2021
7edb272
Merge branch 'master' of github.com:readthedocs/readthedocs.org into …
humitos Sep 21, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
55 changes: 54 additions & 1 deletion readthedocs/embed/utils.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
"""Embed utils."""

from urllib.parse import urlparse
from pyquery import PyQuery as PQ # noqa

def recurse_while_none(element):
"""Recursively find the leaf node with the ``href`` attribute."""
Expand All @@ -10,3 +11,55 @@ def recurse_while_none(element):
if not href:
href = element.attrib.get('id')
return {element.text: href}


def clean_links(obj, url, html_raw_response=False):
"""
Rewrite (internal) links to make them absolute.

1. external links are not changed
2. prepend URL to links that are just fragments (e.g. #section)
3. prepend URL (without filename) to internal relative links
"""

# TODO: do not depend on PyQuery
obj = PQ(obj)

if url is None:
return obj

for link in obj.find('a'):
base_url = urlparse(url)
# We need to make all internal links, to be absolute
href = link.attrib['href']
parsed_href = urlparse(href)
if parsed_href.scheme or parsed_href.path.startswith('/'):
# don't change external links
continue

if not parsed_href.path and parsed_href.fragment:
# href="#section-link"
new_href = base_url.geturl() + href
link.attrib['href'] = new_href
continue

if not base_url.path.endswith('/'):
# internal relative link
# href="../../another.html" and ``base_url`` is not HTMLDir
# (e.g. /en/latest/deep/internal/section/page.html)
# we want to remove the trailing filename (page.html) and use the rest as base URL
# The resulting absolute link should be
# https://slug.readthedocs.io/en/latest/deep/internal/section/../../another.html

# remove the filename (page.html) from the original document URL (base_url) and,
path, _ = base_url.path.rsplit('/', 1)
# append the value of href (../../another.html) to the base URL.
base_url = base_url._replace(path=path + '/')

new_href = base_url.geturl() + href
link.attrib['href'] = new_href

if html_raw_response:
return obj.outerHtml()

return obj
Empty file.
8 changes: 8 additions & 0 deletions readthedocs/embed/v3/urls.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
from django.conf.urls import url

from .views import EmbedAPI


urlpatterns = [
url(r'', EmbedAPI.as_view(), name='embed_api_v3'),
]
318 changes: 318 additions & 0 deletions readthedocs/embed/v3/views.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,318 @@
"""Views for the EmbedAPI v3 app."""

import logging
import re
from urllib.parse import urlparse
import requests

from selectolax.parser import HTMLParser
from pyquery import PyQuery as PQ # noqa

from django.conf import settings
from django.core.cache import cache
from django.shortcuts import get_object_or_404
from django.utils.functional import cached_property
from rest_framework import status
from rest_framework.permissions import AllowAny
from rest_framework.renderers import BrowsableAPIRenderer, JSONRenderer
from rest_framework.response import Response
from rest_framework.views import APIView

from readthedocs.api.v2.mixins import CachedResponseMixin
from readthedocs.core.unresolver import unresolve
from readthedocs.core.utils.extend import SettingsOverrideObject
from readthedocs.embed.utils import clean_links
from readthedocs.projects.constants import PUBLIC
from readthedocs.storage import build_media_storage

log = logging.getLogger(__name__)



class EmbedAPIBase(CachedResponseMixin, APIView):

# pylint: disable=line-too-long

"""
Embed a section of content from any Read the Docs page.

### Arguments

* url (with fragment) (required)
* doctool
* doctoolversion

### Example

GET https://readthedocs.org/api/v3/embed/?url=https://docs.readthedocs.io/en/latest/features.html%23#full-text-search
humitos marked this conversation as resolved.
Show resolved Hide resolved

""" # noqa

permission_classes = [AllowAny]
renderer_classes = [JSONRenderer, BrowsableAPIRenderer]

@cached_property
def unresolved_url(self):
url = self.request.GET.get('url')
if not url:
return None
return unresolve(url)

def _download_page_content(self, url):
humitos marked this conversation as resolved.
Show resolved Hide resolved
cache_key = f'embed-api-{url}'
cached_response = cache.get(cache_key)
if cached_response:
log.debug('Cached response. url=%s', url)
return cached_response

try:
response = requests.get(url, timeout=1)
humitos marked this conversation as resolved.
Show resolved Hide resolved
except requests.exceptions.TooManyRedirects:
log.warning('Too many redirects. url=%s', url)
return
except Exception: # noqa
log.warning('There was an error reading the URL requested. url=%s', url)
humitos marked this conversation as resolved.
Show resolved Hide resolved
return

if response.ok:
cache.set(
cache_key,
response.text,
timeout=settings.RTD_EMBED_API_PAGE_CACHE_TIMEOUT,
)
return response.text

def _get_page_content_from_storage(self):
humitos marked this conversation as resolved.
Show resolved Hide resolved
project = self.unresolved_url.project
version = get_object_or_404(
project.versions,
slug=self.unresolved_url.version_slug,
# Only allow PUBLIC versions when getting the content from our
# storage for privacy/security reasons
privacy_level=PUBLIC,
)
storage_path = project.get_storage_path(
'html',
version_slug=version.slug,
include_file=False,
version_type=version.type,
)
file_path = build_media_storage.join(
storage_path,
self.unresolved_url.filename,
)
try:
with build_media_storage.open(file_path) as fd:
return fd.read()
except Exception: # noqa
log.warning('Unable to read file. file_path=%s', file_path)

return None

def _get_content_by_fragment(self, url, fragment, external, doctool, doctoolversion):
if external:
url_without_fragment = urlparse(url)._replace(fragment='').geturl()
page_content = self._download_page_content(url_without_fragment)
else:
page_content = self._get_page_content_from_storage()
humitos marked this conversation as resolved.
Show resolved Hide resolved

return self._parse_based_on_doctool(page_content, fragment, doctool, doctoolversion)

def _find_main_node(self, html):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is reimplementing

def _get_main_node(self, html):

I'd like to find a way to have both of these codebases using similar abstractions, instead of reinventing the same logic over again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I mainly copied&pasted this from there. We can probably move this to a utils function somewhere and use the same in both places.

main_node = html.css_first('[role=main]')
if main_node:
log.info('Main node found. selector=[role=main]')
return main_node

main_node = html.css_first('main')
if main_node:
log.info('Main node found. selector=main')
return main_node

first_header = html.body.css_first('h1')
if first_header:
log.info('Main node found. selector=h1')
return first_header.parent

def _parse_based_on_doctool(self, page_content, fragment, doctool, doctoolversion):
node = None
if fragment:
selector = f'#{fragment}'
node = HTMLParser(page_content).css_first(selector)
else:
html = HTMLParser(page_content)
node = self._find_main_node(html)

if not node:
return

if doctool == 'sphinx':
# Handle ``dt`` special cases
if node.tag == 'dt':
if 'glossary' in node.parent.attributes.get('class'):
# Sphinx HTML structure for term glossary puts the ``id`` in the
# ``dt`` element with the title of the term. In this case, we
# return the parent node which contains the definition list
# and remove all ``dt/dd`` that are not the requested one

# Structure:
# <dl class="glossary docutils">
# <dt id="term-definition">definition</dt>
# <dd>Text definition for the term</dd>
# ...
# </dl>
parent = node.parent
for n in parent.traverse():
if n not in (node, node.next):
n.remove()
node = node.parent

elif 'citation' in node.parent.attributes.get('class'):
# Sphinx HTML structure for sphinxcontrib-bibtex puts the ``id`` in the
# ``dt`` element with the title of the cite. In this case, we
# return the parent node which contains the definition list
# and remove all ``dt/dd`` that are not the requested one

# Structure:
# <dl class="citation">
# <dt id="cite-id"><span><a>Title of the cite</a></span></dt>
# <dd>Content of the cite</dd>
# ...
# </dl>
parent = node.parent
for n in parent.traverse():
if n not in (node, node.next):
n.remove()
node = node.parent

else:
# Sphinx HTML structure for definition list puts the ``id``
# the ``dt`` element, instead of the ``dl``. This makes
# the backend to return just the title of the definition. If we
# detect this case, we return the parent with the whole ``dl`` tag

# Structure:
# <dl class="confval">
# <dt id="confval-config">
# <code class="descname">config</code>
# <a class="headerlink" href="#confval-config">¶</a></dt>
# <dd><p>Text with a description</p></dd>
# </dl>
node = node.parent

return node.html

def get(self, request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we not doing any auth here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now. Do we need auth at all? We are giving users to access only public internal projects (from inside _get_page_content_from_storage method) or public webpages on the internet.

I remember thinking about this and not finding a good and safe way to support private projects with session auth. Do we want other type of auth for this? If this is the case, we will need to consider auth only for internal projects.

I thought it was safer, for now, to start only with public documents and grow from there if we require auth/private documents in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we need to support this on .com, so we definitely need to figure out auth I think. I'd imagine it would be with session auth for the same domain, and explicit API key auth cross-domain, as we discussed previously?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be implemented in corporate repository by overriding the class and the permission_classes field with a custom class that make these checks. Makes sense to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, seems reasonable.

url = request.GET.get('url')
doctool = request.GET.get('doctool')
doctoolversion = request.GET.get('doctoolversion')
if not url:
return Response(
{
'error': (
'Invalid arguments. '
'Please provide "url".'
)
},
status=status.HTTP_400_BAD_REQUEST
)

if not all([doctool, doctoolversion]) and any([doctool, doctoolversion]):
return Response(
{
'error': (
'Invalid arguments. '
'Please provide "doctool" and "doctoolversion" or none of them.'
)
},
status=status.HTTP_400_BAD_REQUEST
)

# NOTE: ``readthedocs.core.unresolver.unresolve`` returns ``None`` when
# it can find the project in our database
humitos marked this conversation as resolved.
Show resolved Hide resolved
external = self.unresolved_url is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird logic. It seems to depend on an implicit request.GET? I feel like we should make this a direct function call, or something else more explicit.

Copy link
Member Author

@humitos humitos Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is weird. It does not depend on an implicit request.GET attribute, tho. It's using a "hidden feature" of the unresolver:

if not project_slug:
return None

that returns None when there is no project found in our database with that domain.


parsed_url = urlparse(url)
external_domain = parsed_url.netloc
if external and external_domain:
humitos marked this conversation as resolved.
Show resolved Hide resolved
allowed_domain = False
for domain in settings.RTD_EMBED_API_EXTERNAL_DOMAINS:
if re.match(domain, external_domain):
allowed_domain = True
break

if not allowed_domain:
humitos marked this conversation as resolved.
Show resolved Hide resolved
log.info('Domain not allowed. domain=%s url=%s', external_domain, url)
return Response(
{
'error': (
'External domain not allowed. '
f'domain={external_domain}'
)
},
status=status.HTTP_400_BAD_REQUEST,
)

# Check rate-limit for this particular domain
cache_key = f'embed-api-{external_domain}'
cache.get_or_set(cache_key, 0, timeout=settings.RTD_EMBED_API_DOMAIN_RATE_LIMIT_TIMEOUT)
cache.incr(cache_key)
if cache.get(cache_key) > settings.RTD_EMBED_API_DOMAIN_RATE_LIMIT:
log.warning('Too many requests for this domain. domain=%s', external_domain)
return Response(
{
'error': (
'Too many requests for this domain. '
f'domain={external_domain}'
)
},
status=status.HTTP_429_TOO_MANY_REQUESTS,
)

# NOTE: we could validate the fragment if we want. It must contain at
# least one character, cannot start with a number, and must not contain
# whitespaces (spaces, tabs, etc.).
fragment = parsed_url.fragment

content_requested = self._get_content_by_fragment(
url,
fragment,
external,
doctool,
doctoolversion,
)
if not content_requested:
log.warning('Identifier not found. url=%s', url)
return Response(
{
'error': (
"Can't find content for section: "
f"url={url} fragment={fragment}"
)
},
status=status.HTTP_404_NOT_FOUND
)

response = {
'url': url,
'fragment': fragment,
'content': clean_links(
content_requested,
url,
html_raw_response=True,
),
humitos marked this conversation as resolved.
Show resolved Hide resolved
'external': external,
}

if not external:
response.update({
'project': self.unresolved_url.project.slug,
'version': self.unresolved_url.version_slug,
'language': self.unresolved_url.language_slug,
'path': self.unresolved_url.filename,
humitos marked this conversation as resolved.
Show resolved Hide resolved
})
return Response(response)


class EmbedAPI(SettingsOverrideObject):
_default_class = EmbedAPIBase
Loading