views: FAIR signposting level 1 support (HTTP Link headers) #2938

Draft: wants to merge 4 commits into base: master

167 changes: 159 additions & 8 deletions invenio_app_rdm/records_ui/views/decorators.py
Member Author

This is quite a lot of methods added to decorators.py, should it be moved to a signposting-specific file?

Member

👍 Agree, I thought there was already some signposting-related directory.

Member Author (@ptamarit, Dec 12, 2024)

  • [ ] Move the code to a signposting-related file or directory.

Member Author

There is much less code in decorators.py now that I rely on invenio_rdm_records/resources/serializers/signposting/schema.py.

@@ -10,19 +10,21 @@
"""Routes for record-related pages provided by Invenio-App-RDM."""

from functools import wraps
from itertools import islice

from flask import g, make_response, redirect, request, session, url_for
from flask import current_app, g, make_response, redirect, request, session, url_for
from flask_login import login_required
from invenio_communities.communities.resources.serializer import (
UICommunityJSONSerializer,
)
from invenio_communities.proxies import current_communities
from invenio_pidstore.errors import PIDDoesNotExistError
from invenio_rdm_records.proxies import current_rdm_records
from invenio_rdm_records.resources.serializers.utils import get_vocabulary_props
from invenio_records_resources.services.errors import PermissionDeniedError
from sqlalchemy.orm.exc import NoResultFound

from invenio_app_rdm.urls import record_url_for
from invenio_app_rdm.urls import download_url_for, export_url_for, record_url_for


def service():
@@ -365,20 +367,169 @@ def view(**kwargs):
return view


def add_signposting(f):
"""Add signposting link to view's response headers."""
def _get_header(rel, value, link_type=None):
header = f'<{value}> ; rel="{rel}"'
if link_type:
header += f' ; type="{link_type}"'
return header


def _get_signposting_cite_as(record):
"""Release self url points to RDM record.

It points to DataCite URL if the integration is enabled, otherwise it points to the HTML URL.
"""
doi_url = record["links"].get("doi")
html_url = record["links"]["self_html"]
return _get_header("cite-as", doi_url or html_url)


def _get_signposting_types(record):
resource_type = record["metadata"]["resource_type"]
props = get_vocabulary_props(
"resourcetypes",
[
"props.schema.org",
],
resource_type["id"],
)
url_schema_org = props.get("schema.org")
Member Author

Not sure if there's a better way to do this lookup.
I followed what's done in invenio_rdm_records/resources/serializers/signposting/schema.py.

Member

Perhaps just check that it is cached so we don't query db on every landing page request

Member

From what I see these are indeed cached here, which is also mentioned in get_vocabulary_props.

return [
_get_header("type", url_schema_org),
_get_header("type", "https://schema.org/AboutPage"),
]


def _get_signposting_authors(record):
authors = []
# Limit authors to the first 10.
for creator in islice(record["metadata"]["creators"], 0, 10):
for identifier in creator["person_or_org"].get("identifiers", []):
if identifier["scheme"] == "orcid":
authors.append(
_get_header(
"author", "https://orcid.org/" + identifier["identifier"]
)
)
Member Author

Should we support other schemes like ROR, etc?

Member

It might be that the safer option would be to use something like idutils.to_url(identifier, scheme) which will consistently produce a link.
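
A rough sketch of what this suggestion could look like. It deliberately does not import idutils: `to_url_sketch` and `URL_PREFIXES` are hypothetical stand-ins for the scheme-to-URL mapping that `idutils.to_url(identifier, scheme)` would provide, and the two prefixes listed are assumptions chosen for illustration.

```python
# Hypothetical stand-in for idutils.to_url: maps a scheme to a URL prefix.
# The real library supports many more schemes; this table is an assumption.
URL_PREFIXES = {
    "orcid": "https://orcid.org/",
    "ror": "https://ror.org/",
}


def to_url_sketch(identifier, scheme):
    """Return a URL for the identifier, or an empty string if not linkable."""
    prefix = URL_PREFIXES.get(scheme)
    return prefix + identifier if prefix else ""


def first_author_url(creator):
    """Return the first identifier of a creator that resolves to a URL."""
    for entry in creator["person_or_org"].get("identifiers", []):
        url = to_url_sketch(entry["identifier"], entry["scheme"])
        if url:
            return url
    return None
```

With this shape, a creator without any linkable identifier simply yields no `author` link instead of raising.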

Member Author (@ptamarit, Dec 12, 2024)

  • [ ] Use idutils.to_url for authors.

Member Author

I'm now relying on invenio_rdm_records/resources/serializers/signposting/schema.py's serialize_author which picks the first linkable ID.

return authors
Member Author

Lars suggested that we might choose to not include authors at all since the list might be long and the full list can be found in the linkset.

Member

Maybe we could apply some sensible limit? E.g. if less than 50 authors, include, otherwise don't include at all and basically have people rely on the explicit authors linkset?

Member Author (@ptamarit, Dec 12, 2024)

  • [ ] Include authors if up to 50, otherwise do not include.
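
The all-or-nothing limit discussed in this thread could be sketched as follows. `author_links` and `MAX_AUTHOR_LINKS` are hypothetical names, and the header format only mirrors the `_get_header` helper from this diff; treat it as an illustration, not the code that was eventually merged.

```python
MAX_AUTHOR_LINKS = 50  # threshold proposed in the review thread (assumption)


def author_links(author_urls, limit=MAX_AUTHOR_LINKS):
    """Return author Link entries, or none at all when over the limit.

    Above the limit, clients are expected to fetch the full author list
    from the linkset instead of the HTTP header.
    """
    if len(author_urls) > limit:
        return []
    return [f'<{url}> ; rel="author"' for url in author_urls]
```

Dropping the list entirely, rather than truncating it, avoids presenting a partial author list as if it were complete.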

Member Author

I'm now relying on invenio_rdm_records/resources/serializers/signposting/schema.py's serialize_author which serializes all the authors.



def _get_signposting_describedbys(pid_value):
describedbys = []
for export_format, val in current_app.config.get(
"APP_RDM_RECORD_EXPORTERS", {}
).items():
url = export_url_for(pid_value=pid_value, export_format=export_format)
content_type = val["content-type"]
describedbys.append(_get_header("describedby", url, content_type))
return describedbys


def _get_signposting_licenses(record):
licenses = []
for right in record["metadata"].get("rights", []):
# First try to get `props.url` from the standard licenses,
# then try to get the optional `link` from the custom license.
url = right.get("props", {}).get("url") or right.get("link")
if url:
licenses.append(_get_header("license", url))
Member Author

The FAIR Signposting docs recommend using an SPDX license identifier (e.g. https://spdx.org/licenses/CC0-1.0).
However, in Zenodo we store URLs like https://creativecommons.org/publicdomain/zero/1.0/legalcode and not spdx.org URLs.

Member

If props["scheme"] == "spdx" I think we can safely generate the URL like https://spdx.org/licenses/{right["id"]}. We might have licenses (or even non-SPDX licenses), in which case just using url like here would be ok.

Member Author

Unfortunately our IDs are lower-cased (e.g. antlr-pd-fallback) while the SPDX URLs are mixed-case and case-sensitive (e.g. https://spdx.org/licenses/ANTLR-PD-fallback.html).

Member

Ouch, I tried in the browser and copy-pasting URLs for some reason kept the original case... Ok, this is a bummer, I think we'll have to add the original spdx ID with the exact case as a props.spdx_id field or similar...

I think it would be fine to shelve this and just use the url, depends on whether we want to spend more time to re-import SPDX and update the existing license vocabulary (funnily, the dump we have is from more than a year ago).
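
If the vocabulary were later extended as suggested, the lookup might look like this sketch. The `props.spdx_id` field is hypothetical (it is only proposed in this thread and does not exist in the current license vocabulary), and `license_url` is an illustrative name.

```python
def license_url(right):
    """Prefer an SPDX URL when an exact-case SPDX ID is available.

    Falls back to the stored `props.url`, then to the optional `link`
    of a custom license, matching the behaviour in this diff.
    """
    props = right.get("props", {})
    spdx_id = props.get("spdx_id")  # hypothetical field, proposed in review
    if props.get("scheme") == "spdx" and spdx_id:
        return f"https://spdx.org/licenses/{spdx_id}"
    return props.get("url") or right.get("link")
```

Until such a field exists, every license takes the fallback branch, which is the "just use the url" option from the comment above.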

return licenses


def _get_signposting_items(files, pid_value):
items = []
# Checking if the user has access to the files.
if files:
# Limiting the iteration to 100 files maximum.
for file in islice(files.to_dict()["entries"], 0, 100):
url = download_url_for(pid_value=pid_value, filename=file["key"])
items.append(_get_header("item", url, file["mimetype"]))
return items


def _get_signposting_collection(pid_value):
ui_url = record_url_for(pid_value=pid_value)
return _get_header("collection", ui_url, "text/html")


def _get_signposting_describes(pid_value):
ui_url = record_url_for(pid_value=pid_value)
return _get_header("describes", ui_url, "text/html")


def _get_signposting_linkset(pid_value):
api_url = record_url_for(_app="api", pid_value=pid_value)
return _get_header("linkset", api_url, "application/linkset+json")
Member Author

Note: this is required for level 2 support and was already added in a previous pull request.
Here we only include a link of type "application/linkset+json", but the docs also require a link of type "application/linkset".
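
A minimal sketch of emitting both linkset variants. `get_header` copies the `_get_header` helper from this diff; `linkset_headers` and its two URL parameters are hypothetical, since the PR does not say at which URL the plain "application/linkset" serialization would be served.

```python
def get_header(rel, value, link_type=None):
    """Build one Link header entry (same format as _get_header above)."""
    header = f'<{value}> ; rel="{rel}"'
    if link_type:
        header += f' ; type="{link_type}"'
    return header


def linkset_headers(json_url, text_url):
    """Return linkset entries for both media types.

    Both URLs are assumptions: they may or may not be the same endpoint
    using content negotiation.
    """
    return [
        get_header("linkset", json_url, "application/linkset+json"),
        get_header("linkset", text_url, "application/linkset"),
    ]
```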



def add_signposting_landing_page(f):
"""Add signposting links to the landing page view's response headers."""

@wraps(f)
def view(*args, **kwargs):
response = make_response(f(*args, **kwargs))

# Relies on other decorators having operated before it
pid_value = kwargs["pid_value"]
signposting_link = record_url_for(_app="api", pid_value=pid_value)
record = kwargs["record"]
files = kwargs["files"]

signposting_headers = [
_get_signposting_cite_as(record),
*_get_signposting_types(record),
*_get_signposting_authors(record),
*_get_signposting_describedbys(pid_value),
*_get_signposting_licenses(record),
*_get_signposting_items(files, pid_value),
_get_signposting_linkset(pid_value),
]

response.headers["Link"] = " , ".join(signposting_headers)

return response

return view


def add_signposting_content_resources(f):
"""Add signposting links to the content resources view's response headers."""

@wraps(f)
def view(*args, **kwargs):
response = make_response(f(*args, **kwargs))

# Relies on other decorators having operated before it
pid_value = kwargs["pid_value"]

signposting_headers = [
_get_signposting_collection(pid_value),
_get_signposting_linkset(pid_value),
]

response.headers["Link"] = " , ".join(signposting_headers)

return response

return view


def add_signposting_metadata_resources(f):
"""Add signposting links to the metadata resources view's response headers."""

@wraps(f)
def view(*args, **kwargs):
response = make_response(f(*args, **kwargs))

# Relies on other decorators having operated before it
pid_value = kwargs["pid_value"]

signposting_headers = [
_get_signposting_describes(pid_value),
_get_signposting_linkset(pid_value),
]

response.headers["Link"] = " , ".join(signposting_headers)

response.headers["Link"] = (
f'<{signposting_link}> ; rel="linkset" ; type="application/linkset+json"' # fmt: skip
)
return response

return view
9 changes: 6 additions & 3 deletions invenio_app_rdm/records_ui/views/records.py
@@ -39,7 +39,9 @@

from ..utils import get_external_resources
from .decorators import (
add_signposting,
add_signposting_content_resources,
add_signposting_landing_page,
add_signposting_metadata_resources,
pass_file_item,
pass_file_metadata,
pass_include_deleted,
@@ -141,7 +143,7 @@ def open(self):
@pass_record_or_draft(expand=True)
@pass_record_files
@pass_record_media_files
@add_signposting
@add_signposting_landing_page
def record_detail(
pid_value, record, files, media_files, is_preview=False, include_deleted=False
):
@@ -247,6 +249,7 @@ def record_detail(

@pass_is_preview
@pass_record_or_draft(expand=False)
@add_signposting_metadata_resources
def record_export(
pid_value, record, export_format=None, permissions=None, is_preview=False
):
@@ -309,7 +312,7 @@ def record_file_preview(

@pass_is_preview
@pass_file_item(is_media=False)
@add_signposting
@add_signposting_content_resources
def record_file_download(pid_value, file_item=None, is_preview=False, **kwargs):
"""Download a file from a record."""
download = bool(request.args.get("download"))
18 changes: 18 additions & 0 deletions invenio_app_rdm/urls.py
@@ -60,3 +60,21 @@ def download_url_for(pid_value="", filename=""):
)

return "/".join(p.strip("/") for p in [url_prefix, url_path])


def export_url_for(pid_value="", export_format=""):
"""Return url for export route."""
url_prefix = current_app.config.get(f"SITE_UI_URL", "")

# We use [] so that this fails and brings to attention the configuration
# problem if APP_RDM_ROUTES.record_export is missing
# url_path = current_app.config["APP_RDM_ROUTES"]["record_export"].replace(
# "<pid_value>", pid_value
# )
url_path = (
current_app.config["APP_RDM_ROUTES"]["record_export"]
.replace("<pid_value>", pid_value)
.replace("<export_format>", export_format)
)

return "/".join(p.strip("/") for p in [url_prefix, url_path])
55 changes: 45 additions & 10 deletions tests/ui/test_signposting_ui.py
@@ -11,23 +11,58 @@
"""


def test_link_in_landing_page_response_headers(running_app, client, record):
res = client.head(f"/records/{record.id}")
def test_link_in_landing_page_response_headers(running_app, client, record_with_file):
ui_url = f"https://127.0.0.1:5000/records/{record_with_file.id}"
api_url = f"https://127.0.0.1:5000/api/records/{record_with_file.id}"
filename = "article.txt"

res = client.head(f"/records/{record_with_file.id}")
Member

question/comment: I think the HEAD implementation for Flask/Invenio is that we just treat it as a GET request and skip the body of the response. In that case, we're not saving anything in terms of computation/performance (if that was the original goal of just testing the HEAD response only).

IMHO, it's ok to keep as is, since none of the logic done for generating the header links is that much more complex or adds that big of an overhead compared to the rest of the GET response.

Member Author (@ptamarit, Dec 12, 2024)

The Link header should be included in both GET and HEAD, as stated in the FAIR Signposting docs:

In addition to being available via HTTP GET requests, the HTTP header that contains Link is accessible via the HTTP HEAD request, which only returns transaction metadata not a resource representation. As such machine agents can obtain a map for their journey by issuing a HTTP HEAD even against resources that have access restrictions. All the while saving bandwidth and hence energy.

  • Modify the tests to assert not only HEAD, but also GET.
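
To illustrate the idea of checking both methods without the Invenio fixtures, here is a self-contained sketch using a toy WSGI app and the standard library only. The `app`, `call` and `LINK` names are hypothetical; in the real test suite the same loop would presumably go through the Flask `client` fixture (`client.head(...)` and `client.get(...)`).

```python
from wsgiref.util import setup_testing_defaults

LINK = '<https://example.org/api/records/1> ; rel="linkset" ; type="application/linkset+json"'


def app(environ, start_response):
    """Toy landing page: same headers for GET and HEAD, body only for GET."""
    start_response("200 OK", [("Content-Type", "text/html"), ("Link", LINK)])
    if environ["REQUEST_METHOD"] == "HEAD":
        return []
    return [b"<html>landing page</html>"]


def call(method, path="/records/1"):
    """Invoke the WSGI app directly and return (status, headers, body)."""
    environ = {"REQUEST_METHOD": method, "PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


# The Link header must be identical regardless of the method used.
for method in ("HEAD", "GET"):
    status, headers, _body = call(method)
    assert status == "200 OK"
    assert headers["Link"] == LINK
```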


assert (
res.headers["Link"]
== f'<https://127.0.0.1:5000/api/records/{record.id}> ; rel="linkset" ; type="application/linkset+json"' # noqa
)
assert res.headers["Link"].split(" , ") == [
f'<{ui_url}> ; rel="cite-as"',
'<https://schema.org/Photograph> ; rel="type"',
'<https://schema.org/AboutPage> ; rel="type"',
# The test record does not have an author with an identifier.
f'<{ui_url}/export/json> ; rel="describedby" ; type="application/json"',
f'<{ui_url}/export/json-ld> ; rel="describedby" ; type="application/ld+json"',
f'<{ui_url}/export/csl> ; rel="describedby" ; type="application/vnd.citationstyles.csl+json"',
f'<{ui_url}/export/datacite-json> ; rel="describedby" ; type="application/vnd.datacite.datacite+json"',
f'<{ui_url}/export/datacite-xml> ; rel="describedby" ; type="application/vnd.datacite.datacite+xml"',
f'<{ui_url}/export/dublincore> ; rel="describedby" ; type="application/x-dc+xml"',
f'<{ui_url}/export/marcxml> ; rel="describedby" ; type="application/marcxml+xml"',
f'<{ui_url}/export/bibtex> ; rel="describedby" ; type="application/x-bibtex"',
f'<{ui_url}/export/geojson> ; rel="describedby" ; type="application/vnd.geo+json"',
f'<{ui_url}/export/dcat-ap> ; rel="describedby" ; type="application/dcat+xml"',
f'<{ui_url}/export/codemeta> ; rel="describedby" ; type="application/ld+json"',
f'<{ui_url}/export/cff> ; rel="describedby" ; type="application/x-yaml"',
# The test record does not have a license.
f'<{ui_url}/files/{filename}> ; rel="item" ; type="text/plain"',
f'<{api_url}> ; rel="linkset" ; type="application/linkset+json"',
Member Author

The logic for the landing page is implemented in FAIRSignpostingProfileLvl1Serializer in invenio-rdm-records and is already tested there (see inveniosoftware/invenio-rdm-records#1908).
It still makes sense to at least issue the HTTP call to the endpoint here, to make sure that the decorator is working properly, but maybe the assertion should be less detailed to avoid having to adapt this test every time we modify the other module?
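
One way to loosen the assertion would be to check only which relation types are present. The sketch below uses a hypothetical `link_rels` helper with a deliberately simple regular expression; it is not a full RFC 8288 parser and assumes `rel` values are always double-quoted, as they are in this diff.

```python
import re


def link_rels(link_header):
    """Return the set of rel values found in a Link header value."""
    return set(re.findall(r'rel="([^"]+)"', link_header))
```

A test could then assert something like `{"cite-as", "describedby", "linkset"} <= link_rels(res.headers["Link"])`, leaving the exact URLs and media types to the invenio-rdm-records test suite.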

]


def test_link_in_content_resource_response_headers(
running_app, client, record_with_file
):
ui_url = f"https://127.0.0.1:5000/records/{record_with_file.id}"
api_url = f"https://127.0.0.1:5000/api/records/{record_with_file.id}"
filename = "article.txt"

res = client.head(f"/records/{record_with_file.id}/files/{filename}")

assert (
res.headers["Link"]
== f'<https://127.0.0.1:5000/api/records/{record_with_file.id}> ; rel="linkset" ; type="application/linkset+json"' # noqa
)
assert res.headers["Link"].split(" , ") == [
f'<{ui_url}> ; rel="collection" ; type="text/html"',
f'<{api_url}> ; rel="linkset" ; type="application/linkset+json"',
]


def test_link_in_metadata_resource_response_headers(running_app, client, record):
ui_url = f"https://127.0.0.1:5000/records/{record.id}"
api_url = f"https://127.0.0.1:5000/api/records/{record.id}"

res = client.head(f"/records/{record.id}/export/bibtex")

assert res.headers["Link"].split(" , ") == [
f'<{ui_url}> ; rel="describes" ; type="text/html"',
f'<{api_url}> ; rel="linkset" ; type="application/linkset+json"',
]