Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

URL previews includes unicode punycode which causes issues URLs in the text body #14068

Open
HarHarLinks opened this issue Oct 5, 2022 · 5 comments
Labels
A-URL-Preview Issues related to generating server-side previews of remote URLs O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@HarHarLinks
Copy link
Contributor

Description

element-hq/element-web#23432

Synapse injects (5o0a) punycode into og:description of twitter URLs, leading to broken links.

Steps to reproduce

Forwarded Element issue:

Steps to reproduce

  1. Post a twitter URL into a room, URL previews enabled
  2. Twitter post content is put into double quotes ""
  3. in case of media-only posts, the t.co media URL is in quotes
  4. element includes the ending quote in the URL, resulting in a broken link

Outcome

What did you expect?

links work

What happened instead?

image

Demo URLs for reference:
https://twitter.com/FXNetworks/status/1577704289476128771
https://twitter.com/mischiefanimals/status/1576904037449969664

Operating system

arch

Application version

Element Nightly version: 2022100501 Olm version: 3.2.12

How did you install the app?

aur

Homeserver

private

Synapse Version

1.68.0

Installation Method

Docker (matrixdotorg/synapse)

Platform

debian, matrix-docker-ansible-deploy

Relevant log output

n/a

Anything else that would be useful to know?

curl "https://publish.twitter.com/oembed?url=https://twitter.com/mischiefanimals/status/1576904037449969664" | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   853  100   853    0     0   3946      0 --:--:-- --:--:-- --:--:--  4140
{
  "url": "https://twitter.com/mischiefanimals/status/1576904037449969664",
  "author_name": "animals going goblin mode",
  "author_url": "https://twitter.com/mischiefanimals",
  "html": "<blockquote class=\"twitter-tweet\"><p lang=\"zxx\" dir=\"ltr\"><a href=\"https://t.co/fVP8YWHS2j\">pic.twitter.com/fVP8YWHS2j</a></p>&mdash; animals going goblin mode (@mischiefanimals) <a href=\"https://twitter.com/mischiefanimals/status/1576904037449969664?ref_src=twsrc%5Etfw\">October 3, 2022</a></blockquote>\n<script async src=\"https://platform.twitter.com/widgets.js\" charset=\"utf-8\"></script>\n",
  "width": 550,
  "height": null,
  "type": "rich",
  "cache_age": "3153600000",
  "provider_name": "Twitter",
  "provider_url": "https://twitter.com",
  "version": "1.0"
}

https://user-images.githubusercontent.com/2403652/194146661-044758f9-fefd-4744-9ef2-bd3aec094d40.png

@squahtx squahtx added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Occasional Affects or can be seen by some users regularly or most users rarely labels Oct 6, 2022
@squahtx
Copy link
Contributor

squahtx commented Oct 6, 2022

The Unicode quotation marks are being added by Twitter itself, not Synapse. This isn't something that the Synapse team will fix, since we want to avoid adding special handling for certain websites.

$ curl "https://twitter.com/mischiefanimals/status/1576904037449969664" -H "User-Agent: Synapse (bot; +https://github.com/matrix-org/synapse)" | less
...<meta data-rh="true" content="<E2><80><9C>https://t.co/fVP8YWHS2j<E2><80><9D>" property="og:description"/>...

@squahtx squahtx closed this as not planned Won't fix, can't repro, duplicate, stale Oct 6, 2022
@t3chguy
Copy link
Member

t3chguy commented Oct 6, 2022

@squahtx you already have "special" handling for bits of Twitter: https://github.com/matrix-org/synapse/blob/develop/synapse/res/providers.json - in theory adding all of Twitter to that whitelist would resolve this.

@squahtx squahtx reopened this Oct 6, 2022
@clokep
Copy link
Member

clokep commented Oct 6, 2022

@squahtx you already have "special" handling for bits of Twitter: develop/synapse/res/providers.json - in theory adding all of Twitter to that whitelist would resolve this.

That file is essentially unused now; moments don't exist.

We backed out the changes to use oEmbed for all of Twitter in #11985 because it gives significantly worse results.

It seems that Twitter itself is giving a better description in the oEmbed though (the Synapse parsed version would be):

pic.twitter.com/fVP8YWHS2j\n\n— animals going goblin mode (@mischiefanimals)\n\nOctober 3, 2022

vs. what can be pulled from their OpenGraph tags in the HTML, which as @squahtx pulled above is:

<80><9C>https://t.co/fVP8YWHS2j<80><9D>

We should be preferring this, see:

# Compile the Open Graph response by using the scraped
# information from the HTML and overlaying any information
# from the oEmbed response.
og = {**og_from_html, **og_from_oembed}
await self._precache_image_url(user, media_info, og)

@clokep
Copy link
Member

clokep commented Oct 6, 2022

We should be preferring this, see:

Ah, I see what's going on -- Twitter doesn't have oEmbed autodiscovery enabled, so we are only scraping the HTML in this case.

If we were to add the Twitter URLs back to the providers.json than we would only fetch the oEmbed, which would lose the image preview.

We could always scrape the given URL, but also check oEmbed info if available. Would be reasonable in terms of "try to find the most info possible", but would result in duplicate queries in some situations... it would treat autodiscovery of oEmbed more similar to the hard-coded providers list though. 🤷

@clokep clokep added O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Tolerable Minor significance, cosmetic issues, low or no impact to users. and removed O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. labels Oct 6, 2022
@clokep
Copy link
Member

clokep commented Oct 6, 2022

I downgraded to tolerable since you can easily click the link being previewed, and uncommon since I think this is an odd example where a URL was given as the only tweet content (which seems a bit weird -- but do shout if this seems incorrect).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-URL-Preview Issues related to generating server-side previews of remote URLs O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

5 participants