Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use canonical urls for entity references when serializing jsonld #887

Closed
dannylamb opened this issue Jul 30, 2018 · 22 comments
Closed

Use canonical urls for entity references when serializing jsonld #887

dannylamb opened this issue Jul 30, 2018 · 22 comments
Assignees
Milestone

Comments

@dannylamb
Copy link
Contributor

dannylamb commented Jul 30, 2018

Right now, entity references have URLs pointing to their jsonld counterparts, but we should probably be writing out the canonical URL. The jsonld is an alternate representation of the node/media/term, but the triples are describing the canonical resource, independent of any other serialization.

In other words, with triples edited for brevity,

{
   "@graph":[
      {
         "@id":"http://localhost:8000/node/1?_format=jsonld",
         "@type":[
            "http://pcdm.org/models#Object"
         ],
         ...
         "http://schema.org/author":[
            {
               "@id":"http://localhost:8000/user/1?_format=jsonld"
            }
         ],
         "http://schema.org/additionalType":[
            {
               "@id":"http://localhost:8000/taxonomy/term/8?_format=jsonld"
            },
            {
               "@id":"http://localhost:8000/taxonomy/term/1?_format=jsonld"
            }
         ]
         ...
      },
   ]
}

should become

{
   "@graph":[
      {
         "@id":"http://localhost:8000/node/1",
         "@type":[
            "http://pcdm.org/models#Object"
         ],
         ...
         "http://schema.org/author":[
            {
               "@id":"http://localhost:8000/user/1"
            }
         ],
         "http://schema.org/additionalType":[
            {
               "@id":"http://localhost:8000/taxonomy/term/8"
            },
            {
               "@id":"http://localhost:8000/taxonomy/term/1"
            }
         ]
         ...
      },
   ]
}

The change is easy enough, just delete the setRouteParameter bit in https://github.com/Islandora-CLAW/jsonld/blob/c23337950a0d310f6ae032c27557e04c171a8daf/src/Normalizer/ContentEntityNormalizer.php#L258. I'm proposing we do it there because I think this should apply to everything, and is really independent of the whole "Islandora flushing to Fedora" process.

There are downstream ramifications, though. We'll need to update Milliner to be aware of the difference. It looks for the _format=jsonld url in a handful of places, and it also uses the _format=jsonld url when indexing the resource in Gemini.

@ajs6f
Copy link

ajs6f commented Jul 30, 2018 via email

@DiegoPino
Copy link
Contributor

@dannylamb maybe missing a piece there so have a few questions:

So currently
$url = $entity->toUrl('canonical', ['absolute' => TRUE]);
gives you the canonical URL PHP Object for any entity that has one(not always the case).

The following line, the one your propose to remove, only adds the json_ld argument to the URL because it is supposed to reference the entity "as seen" in JSON-LD format too, not the webpage (with formatting, theming, etc) of the entity, it is like a wikipage v/s wikiitem thing. Not the same even if speaking about the same thing. It feels to me, semantically it is correct to use the JSON-LD representation's URL when dealing with triples between JSON-LD documents.
So what is the need or why is this particular change needed?

If the difference is just the argument after the base URL, also, would it not easier, if you really need to remove it, to do that as a post process when parsing it on milliner?
I can only see good reasons on keeping it as it right now, but i can be mistaken. Maybe i just need to be convinced.

@dannylamb
Copy link
Contributor Author

@DiegoPino At a practical level, I don't think the data should be indexed in Gemini, the triplestore, and Fedora with ?_format=jsonld at the end of it. That can certainly be done as postprocessing, either in Milliner and the triplestore indexer. Or even as an alter.

But for me, semantically, my thinking is that the jsonld representation contains triples about the web resource, not triples about the jsonld representation of the web resource. Am I range 14'ing it here a bit?

@DiegoPino
Copy link
Contributor

@dannylamb not sure what you mean with at a practical level, but interpret this/ guess the GET argument interferes with some preconceptions of what a valid URL should look like? Or with some processing needs? We can discuss it the URL is opaque or even semantically significant as an URL (true!) but at a w3c, rdf specs level with or without ?_format=jsonld both are totally valid IRI/URIs.

The issue i see with removing the ?_format=jsonld piece is that, not only web-semantically but also human-semantically, a published web resource, a.k.a an HTML rendered node can contain way more stuff that the entity/node described by the original json-ld. A Node can contain embedded (on display) the related media,viewers, tables of contents of other stuff, forms, comments, images, some blocks, etc.

So the real issue here is the lack of content negotiation that does not allow the same URL to output different serialization v/s display eyecandy depending on how you request the URL without having to use the argument. And that leads for me to a community question that we need to resolve: WE need a canonical web resource URL, not the one drupal provides but our own, unique, stable,consistent one.

Happens that incremental node ids (node/1) are an issue for any repository. Drupal 8 right now does not expose existing entity UUIDs as a autoload argument on its routing system (core), but i remember we got this working in CLAW on one of the first iterations some 2 years ago, so it is possible!

Since content negotiation is not build into by default but any EventSubscriberInterface derived class that is subscribed to Symfony HTTP Kernel events (like onRequest) can have access and set to the HEADER_CONTENT_TYPE and set a new response, it can (fun and easy!) generate a RedirectResponse() with a new destination URL!

So if we provide a real canonical in terms of reliable, consistent URI for our islandora resources (if there is such thing actually) then we could use content negotiation to redirect internally and generated the correct output solving this issue.

Can we open this discussion to other folks? I kinda find it fundamental to define what our canonical will be. Thanks!

@seth-shaw-unlv
Copy link
Contributor

The way I conceive it is that the node URL is the 'thing' and should be mapped in Gemini and the triplestore; it 'happens' to return a default serialization of HTML (cause that is what people/browsers expects).

Yeah, the _format query parameter is a poor substitution for content negotiation (side-long glance at Drupal and Symphony). In addition to @DiegoPino's suggestion of generating a RedirectResponse, supposedly we could use an Apache rewrite condition to translate accept headers into the _format query parameter to plug this hole although I haven't tested it. There is a small chance Drupal will eventually support content negotiation again. Since there are a number of options that can be built on top of the base URL I wouldn't want to rely on the _format parameter for our URLs.

@whikloj
Copy link
Member

whikloj commented Aug 7, 2018

Jumping in here to agree with both sides of the discussion, but to note that Drupal's use of _format is waaay hard coded in or at least a couple Drupal versions ago when I tried to work around it I had no fun and found no way to avoid it.

But Drupal != semantic datastore so perhaps we need to accept this is a lighter-weight repository and if you want a full (LDP|graph|whatever) implementation you could expose yours to the web?

@dannylamb
Copy link
Contributor Author

FWIW, our REST api allows an informed client to at least find the _format routes from the canonical route. I'm not suggesting it as a general replacement for, or touting that it's better than, actual conneg, but all the available formats are advertised as "alternate" link headers on all GET and HEAD responses.

vagrant@claw:~$ curl -I http://localhost:8000/node/5
HTTP/1.1 200 OK
Date: Tue, 07 Aug 2018 14:47:22 GMT
Server: Apache/2.4.18 (Ubuntu)
X-Powered-By: PHP/7.0.30-0ubuntu0.16.04.1
Cache-Control: must-revalidate, no-cache, private
Link: <http://purl.org/coar/resource_type/c_c513>; rel="tag"; title="Image"
Link: <http://localhost:8000/media/13>; rel="related"; title="Preservation Master"
Link: <http://localhost:8000/media/14>; rel="related"; title="Service File"
Link: <http://localhost:8000/media/15>; rel="related"; title="Thumbnail"
Link: <http://localhost:8000/node/5?_format=jsonld>; rel="alternate"; type="application/ld+json"
Link: <http://localhost:8000/node/5?_format=json>; rel="alternate"; type="application/json"
Link: <http://localhost:8000/node/5>; rel="alternate"; hreflang="en"
Link: </node/5>; rel="canonical"
Link: </node/5>; rel="shortlink"
Link: </node/5>; rel="revision"
Link: </node?node=5>; rel="create"
X-Drupal-Dynamic-Cache: MISS
X-UA-Compatible: IE=edge
Content-language: en
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: 
X-Generator: Drupal 8 (https://www.drupal.org)
X-Drupal-Cache: MISS
Content-Type: text/html; charset=UTF-8

And I'll throw another option onto the table. We can embed the JSONLD in the html, so that they are one in the same and don't need separate urls. Then Google would actually be reading your metadata at least.

@DiegoPino
Copy link
Contributor

DiegoPino commented Aug 7, 2018 via email

@whikloj whikloj added this to the 1.x milestone Apr 11, 2019
@dannylamb dannylamb modified the milestones: 1.x, 1.0.0 May 9, 2019
@rosiel
Copy link
Member

rosiel commented May 10, 2019

A reminder that this will need to be reflected in documentation of the object model/data flow.

@dannylamb
Copy link
Contributor Author

dannylamb commented May 14, 2019

Testing Instructions

Setup

Confirm Prior Behaviour

  • Create a Repository Item (say http://localhost:8000/node/1) and test by
    • Viewing the Json-ld ('http://localhost:8000/node/1?_format=jsonld`)
      • The @ids should have ?_format=jsonld on the end
    • Viewing the Fedora representation, it's triples should have ?_format=jsonld at the end
    • Querying the triplestore for the node, its triples should have ?_format=jsonld
    • Looking at Gemini, its db record should have ?_format=jsonld
  • Add a media to the node, and test it by:
    • Viewing the Json-ld ('http://localhost:8000/media/1?_format=jsonld`)
      • The @ids should have ?_format=jsonld on the end
    • Viewing the Fedora representation, it's triples should have ?_format=jsonld at the end
    • Querying the triplestore for the media, its triples should have ?_format=jsonld

Configure new behaviour

Confirm new behaviour

  • Create a Repository Item (say http://localhost:8000/node/2) and test by
    • Viewing the Json-ld ('http://localhost:8000/node/2?_format=jsonld`)
      • The @ids should not have ?_format=jsonld on the end
    • Viewing the Fedora representation, it's triples should not have ?_format=jsonld at the end
    • Querying the triplestore for the node, its triples should not have ?_format=jsonld
    • Looking at Gemini, its db record should not have ?_format=jsonld
  • Add a media to the node, and test it by:
    • Viewing the Json-ld ('http://localhost:8000/media/4?_format=jsonld`)
      • The @ids should not have ?_format=jsonld on the end
    • Viewing the Fedora representation, it's triples should not have ?_format=jsonld at the end
    • Querying the triplestore for the media, its triples should not have ?_format=jsonld

@dannylamb
Copy link
Contributor Author

dannylamb commented May 14, 2019

We'll follow all of this up with a claw playbook PR to make this configuration default. That way, folks can just get the new code and not be affected and configure it if they want to. But for new users, claw-playbook is just going to do this for them.

But these PRs can get merged in without the claw-playbook PR after we confirm both before and after behaviour.

Documentation PR pending as well...

@dannylamb
Copy link
Contributor Author

@rosiel doc PR is up, too.

Just waiting on a successful vagrant build to make the claw-playbook PR...

@dannylamb
Copy link
Contributor Author

Islandora-Devops/islandora-playbook#106 is up and ready to test. I had to move a bunch of branches over to the foundation repos so that I could get at them with ansible. I've also moved all the PRs over to point at those branches so we're merging what we're testing in case I push up more changes.

New Testing Instructions (only confirms new behaviour, not old)

After installing from Islandora-Devops/islandora-playbook#106:

  • Check out Gemini first
    • mysql -u root -pislandora gemini
    • select drupal_uri from Gemini;
    • You should see all the taxonomy terms, and the drupal uris should not have ?_format=jsonld
  • Create a Repository Item (say http://localhost:8000/node/2) and test by
    • Viewing the Json-ld ('http://localhost:8000/node/2?_format=jsonld`)
      • The @ids should not have ?_format=jsonld on the end
    • Viewing the Fedora representation, it's triples should not have ?_format=jsonld at the end
    • Querying the triplestore for the node, its triples should not have ?_format=jsonld
    • Looking at Gemini, its db record should not have ?_format=jsonld
  • Add a media to the node, and test it by:
    • Viewing the Json-ld ('http://localhost:8000/media/4?_format=jsonld`)
      • The @ids should not have ?_format=jsonld on the end
    • Viewing the Fedora representation, it's triples should not have ?_format=jsonld at the end
    • Querying the triplestore for the media, its triples should not have ?_format=jsonld

@dannylamb
Copy link
Contributor Author

OK @Islandora-CLAW/committers

Islandora-Devops/islandora-playbook#106 is ready to go. I've successfully installed using it and have a box with no more ?_format=jsonld.

@whikloj
Copy link
Member

whikloj commented May 18, 2019

I have reviewed the Crayfish, Islandora, claw-playbook and ansible-role-crayfish PRs. Someone else needs to review the JsonLD one.

@seth-shaw-unlv
Copy link
Contributor

I just approved the JSON-LD one. Are these getting merged in any particular order?

@whikloj
Copy link
Member

whikloj commented May 21, 2019

@seth-shaw-unlv JSON-LD needs to happen first, then a version of it should be tagged, then the Islandora PR needs to be updated to reference that new version of JSON-LD. Crayfish is in, so that PR needs to be updated to reference the new version I tagged. (0.2.0 I think) then it can go in, then a new version of ansible-role-crayfish needs to be tagged and that needs to be referenced in claw-playbook.

@seth-shaw-unlv
Copy link
Contributor

Alrighty then, I give you JSON-LD 0.2.0.

@dannylamb
Copy link
Contributor Author

@seth-shaw-unlv @whikloj I'll update the islandora pull to reference it.

@dannylamb
Copy link
Contributor Author

well... and all the other stuff too.

@dannylamb
Copy link
Contributor Author

islandora-deprecated/ansible-role-crayfish#23 and #134 have been updated to use 0.2.0 for both jsonld and crayfish.

When those are in, I'll update Islandora-Devops/islandora-playbook#106

@dannylamb
Copy link
Contributor Author

Resolved via Islandora-Devops/islandora-playbook@0036464

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants