Clarify attribute properties added to objects in rel-urls. #32

Zegnat · 2018-03-22T12:14:20Z

Current spec text:

add keys to the hash of the key with name url for each of these attributes (if present) and key not already set:

"hreflang": the value of the "hreflang" attribute

"media": the value of the "media" attribute

"title": the value of the "title" attribute

"type": the value of the "type" attribute

"text": the text content of the element if any

I think this should clarify 1) what to do with empty attribute values, and 2) what exactly “if any” means.

Currently parsers check the existence of an attribute, but never check their value. Thus empty attributes (e.g. hreflang="") will lead to empty strings being added to the object in rel-urls. This means the objects in rel-urls will always match with the authored HTML.

The drawback of this is that a link parsed later in the source document may not overwrite the empty value any more because the key is already present, even if they add information. Example:
```
<a href="#a" rel="a" hreflang=""></a>
<a href="#a" rel="a" hreflang="en"></a>
```
Will lead to (edited to only show the affected property):
```
{
  "rel-urls": {
    "#a": {
      "hreflang": ""
     }
  }
}
```
Clearly the empty string adds no information about the URL #a, but because it came first in source order that’s the one we keep.
Currently the text property should only be added if there is “any” text content. But nowhere does it define what text content means or when we consider there to be none.

This has already lead to a difference in implementations. The Python parser will add an empty text property, the PHP parser will not. Example:
```
<a href="#a" rel="a"></a>
```
In PHP:
```
{
  "rel-urls": {
    "#a": {
      "rels": [ "a" ]
    }
  }
}
```
In Python:
```
{
  "rel-urls": {
    "#a": {
      "rels": [ "a" ],
      "text": ""
    }
  }
}
```

I believe that, in both cases, storing an empty string value for any property is useless and adds no data. With the overwriting logic part of the parser, we may even miss out on information. Therefore I would propose rewriting the spec as follows:

add the following keys to the hash of the key with name url, unless the key is already set:

hreflang: the value of the element’s hreflang attribute, if defined and not an empty string

media: the value of the element’s media attribute, if defined and not an empty string

title: the value of the element’s title attribute, if defined and not an empty string

type: the value of the element’s type attribute, if defined and not an empty string

text: the textContent of the element, if not an empty string

The text was updated successfully, but these errors were encountered:

sknebel · 2018-03-22T12:43:29Z

Seems sensible to me, especially with the overwriting aspect. I can't think of a case where the ability to intentionally set one of these fields to an empty string would be important.

kartikprabhu · 2018-03-22T14:57:20Z

text should also specify the following

Is empty string checked before/after stripping leading and trailing spaces i.e. is text: " " considered valid?
Should child <style> and <script> elements be dropped before?
Should child <img> elements be replaced with their src or alt?

related: #20

Zegnat · 2018-03-22T15:05:44Z

While #15 is still waiting on resolving, I do think the text property should follow #17 and include the style, script, and img language.

How about this?

add the following keys to the hash of the key with name url, unless the key is already set:

hreflang: the value of the element’s hreflang attribute, if defined and not an empty string, after removing all leading/trailing whitespace

media: the value of the element’s media attribute, if defined and not an empty string, after removing all leading/trailing whitespace

title: the value of the element’s title attribute, if defined and not an empty string, after removing all leading/trailing whitespace

type: the value of the element’s type attribute, if defined and not an empty string, after removing all leading/trailing whitespace

text: the textContent of the element, if not an empty string, after:

dropping any nested <script> & <style> elements;

replacing any nested <img> elements with their alt attribute, if present; otherwise their src attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative

removing all leading/trailing whitespace.

gRegorLove · 2018-03-22T21:29:25Z

I'm +1 on most of this, but not sure about the "and not an empty string" part. I'm not sure it would cause any problems, but it feels a bit weird to me -- like the parser is trying to fix a publisher mistake. I'll defer to others on that.

kartikprabhu · 2018-03-22T21:48:08Z

discussion ref: https://chat.indieweb.org/microformats/2018-03-22#t1521753519281700

Zegnat · 2018-03-23T08:44:31Z

I’m not sure it would cause any problems, but it feels a bit weird to me — like the parser is trying to fix a publisher mistake.

It was my understanding that the entire rel-urls object existed only to assist consumers, it was drafted specifically for @kevinmarks’ universal feed parser according to the brainstorm. That might have been why I was overzealous “to fix a publisher mistake”, I thought the whole point was to collect the most accurate information available for the specified URL.

In fact, accepting these attributes with an empty string value ("") isn’t even valid HTML. hreflang “must be a valid BCP 47 language tag” and type “must be a valid MIME type string”. Neither of those accept empty strings. I am not saying microformats parsers should validate those values, but discarding invalid empty values seemed like a low hanging fruit. Apparently I was wrong.

I will concede to not having encountered it in the wild, with the footnote that I have never used the rel-urls object for anything. Now thinking about running an in browser experiment to see if I can find examples where skipping empty string values will net different results.

That said: point two of this issue still stands.

We need to resolve that “if any” means for text. Do we skip empty values there? Before or after stripping white space? I still propose we change “if any” to “if not an empty string” unless someone interprets it differently.

Apart from that, do we want to strip leading/trailing whitespace from the attributes, or do we want to keep all whitespace as authored?

New proposal, no skipping of empty values, but does strip surrounding whitespace:

add the following keys to the hash of the key with name url, unless the key is already set:

hreflang: the value of the element’s hreflang attribute, if defined, after removing all leading/trailing whitespace

media: the value of the element’s media attribute, if defined, after removing all leading/trailing whitespace

title: the value of the element’s title attribute, if defined, after removing all leading/trailing whitespace

type: the value of the element’s type attribute, if defined, after removing all leading/trailing whitespace

text: the textContent of the element, if not an empty string, after:

dropping any nested <script> & <style> elements;

replacing any nested <img> elements with their alt attribute, if present; otherwise their src attribute, if present, adding a space at the beginning and end, resolving the URL if it’s relative

removing all leading/trailing whitespace.

Zegnat mentioned this issue Mar 22, 2018

Improve rel parsing microformats/php-mf2#162

Merged

jgarber623 mentioned this issue May 12, 2020

rel-urls Parsing Issues #50

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify attribute properties added to objects in rel-urls. #32

Clarify attribute properties added to objects in rel-urls. #32

Zegnat commented Mar 22, 2018

sknebel commented Mar 22, 2018

kartikprabhu commented Mar 22, 2018

Zegnat commented Mar 22, 2018 •

edited

Loading

gRegorLove commented Mar 22, 2018

kartikprabhu commented Mar 22, 2018

Zegnat commented Mar 23, 2018

Clarify attribute properties added to objects in rel-urls. #32

Clarify attribute properties added to objects in rel-urls. #32

Comments

Zegnat commented Mar 22, 2018

sknebel commented Mar 22, 2018

kartikprabhu commented Mar 22, 2018

Zegnat commented Mar 22, 2018 • edited Loading

gRegorLove commented Mar 22, 2018

kartikprabhu commented Mar 22, 2018

Zegnat commented Mar 23, 2018

Zegnat commented Mar 22, 2018 •

edited

Loading