Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sanitizing/correcting of (script) tag(s) #102

Closed
swentel opened this issue Dec 24, 2020 · 7 comments
Closed

Sanitizing/correcting of (script) tag(s) #102

swentel opened this issue Dec 24, 2020 · 7 comments

Comments

@swentel
Copy link
Contributor

swentel commented Dec 24, 2020

It looks like there might be a problem when the 'script' tag is included in a tweet. It's an assumption at the moment, the problem might be at granary too.

This is the tweet: https://twitter.com/simonw/status/1341785945364811776
(got that because adactio retweeted it)

I'm using granary to generate an HTML version of my twitter followers, which is then parsed by XRay. When I tested it with adactio alone (and @self) and pasted the body into https://xray.p3k.app/, the output is pretty weird: every article is included into a single content property.

So, this might be a granary problem, although I think it's fine as far as I can see.

@snarfed
Copy link
Contributor

snarfed commented Jan 9, 2021

Ruh roh, looks like this may be granary's fault after all. I see this in https://granary.io/twitter/@me/@all/@app/1341785945364811776?format=html&... :

<div class="e-content p-name">
  The "Let's use client-side JavaScript rendering for everything! A web page should be a single <script> tag!" crowd have held the developer marketing advantage for far too long
  </div>

Argh. I'll look into it.

@snarfed
Copy link
Contributor

snarfed commented Jan 10, 2021

I deployed a fix to granary and checked XRay, it seems happy now. Feel free to retry!

snarfed added a commit to snarfed/granary that referenced this issue Jan 10, 2021
fixes aaronpk/XRay#102

background discussion: https://chat.indieweb.org/microformats/2021-01-10#t1610238353449500

also fix a bug in test_testdata.py that was preventing a bunch of tests from running, argh. they're now broken. i'll fix them in upcoming commits.
@swentel
Copy link
Contributor Author

swentel commented Jan 10, 2021

@snarfed indeed, that fixes it.

However, the problem remains in Xray too - or maybe it's in mf2 parsing, too early in the morning now to debug :)

Payload underneath to test: I added an article before and after that tweet, with the unescaped characters again; if you paste that, you only get two items.

pinging @aaronpk as well.

<article class="h-entry">
  <span class="p-uid">tag:twitter.com:1341785945364811771</span>
  
  <time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>
  
  <span class="p-author h-card">
    <data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
    <a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
    <span class="p-nickname">simonw</span>
    <img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
  </span>

  <a class="u-url" href="https://twitter.com/simonw/status/1341785945364811771">https://twitter.com/simonw/status/1341785945364811771</a>
  <div class="e-content p-name">
  Some text before
  </div>

  <a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>

<article class="h-entry">
  <span class="p-uid">tag:twitter.com:1341785945364811776</span>
  
  <time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>
  
  <span class="p-author h-card">
    <data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
    <a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
    <span class="p-nickname">simonw</span>
    <img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
  </span>

  <a class="u-url" href="https://twitter.com/simonw/status/1341785945364811776">https://twitter.com/simonw/status/1341785945364811776</a>
  <div class="e-content p-name">
  
  The "Let's use client-side JavaScript rendering for everything! A web page should be a single <script> tag!" crowd have held the developer marketing advantage for far too long
  </div>

  <a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>

<article class="h-entry">
  <span class="p-uid">tag:twitter.com:1341785945364811772</span>
  
  <time class="dt-published" datetime="2020-12-23T16:41:13+00:00">2020-12-23T16:41:13+00:00</time>
  
  <span class="p-author h-card">
    <data class="p-uid" value="tag:twitter.com:simonw"></data>
<data class="p-numeric-id" value="12497"></data>
    <a class="p-name u-url" href="https://twitter.com/simonw">Simon Willison</a>
<a class="u-url" href="https://simonwillison.net/"></a>
    <span class="p-nickname">simonw</span>
    <img class="u-photo" src="https://pbs.twimg.com/profile_images/378800000261649705/be9cc55e64014e6d7663c50d7cb9fc75.jpeg" alt="" />
  </span>

  <a class="u-url" href="https://twitter.com/simonw/status/1341785945364811772">https://twitter.com/simonw/status/1341785945364811772</a>
  <div class="e-content p-name">
  Some text after
  </div>

  <a class="u-in-reply-to" href="https://twitter.com/simonw/status/1341785130730287104"></a>
</article>

@swentel
Copy link
Contributor Author

swentel commented Jan 10, 2021

So it's probably in the mf2 library, pasting that payload at http://pin13.net/mf2/ only returns 2 items.
Looking at the result, I figure there's some sort of html correcting going on (you see a closing script tag), but this mixes things up.

pinging @Zegnat and @gRegorLove

@swentel swentel changed the title Sanitizing of script tag Sanitizing/correcting of (script) tag(s) Jan 11, 2021
@swentel
Copy link
Contributor Author

swentel commented Jan 11, 2021

Updated the title to reflect better what's going on.

@gRegorLove
Copy link

@swentel I'm not certain if the mf2 parser should be correcting HTML (or to what extent). With an unmatched script element it gets particularly tricky since the parsing algorithm involves removing them to parse some values. Happy to discuss further in a php-mf2 issue or #indieweb-dev. I think this XRay issue is complete, though?

@swentel
Copy link
Contributor Author

swentel commented Jan 12, 2021

True, nothing to do anymore here.

@swentel swentel closed this as completed Jan 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants