URL unfurling (initial implementation) #3471

Merged
merged 9 commits into from
May 18, 2023

ilmotta
Contributor

@ilmotta ilmotta commented May 10, 2023

Summary

This PR is the initial implementation for the new URL unfurling requirements. The most important one is that only the message sender will pay the privacy cost for unfurling and extracting metadata from websites. Once the message is sent, the unfurled data will be stored at the protocol level and receivers will just profit and happily decode the metadata to render it.

There's a lot to share here because this is the first PR and I accumulated notes throughout development.

Further development of this URL unfurling capability will be mostly guided by issues created on clients. For the moment in status-mobile: https://github.com/status-im/status-mobile/labels/url-preview

Demo

I have used this branch to implement the mobile experience you see in the video below. The mobile PR will be created once I'm confident (based on your reviews) that the solution is good enough, but it will exist behind a dev-only feature flag, so that's why this status-go PR should be 100% backwards compatible. Edit: The feature will be available in mobile.

link-previews.webm

Terminology

In the code, I've tried to stick to the term "unfurl URL" to mean the lower-level process of extracting metadata from a website. I use "link preview" to mean a higher-level structure that is enriched by unfurled data. "Link preview" is also how designers refer to it.

User flows

Please, expand the Details section below to see detailed diagrams about user flows.

👇 See details
  1. Carol needs to see link previews while typing in the chat input field. Notice from the diagram nothing is persisted and that status-go endpoints are essentially stateless.
#+begin_src plantuml :results verbatim
  Client->>Server: Call wakuext_getTextURLs
  Server-->>Client: Normalized URLs
  Client->>Client: Render cached unfurled URLs
  Client->>Server: Unfurl non-cached URLs.\nCall wakuext_unfurlURLs
  Server->>Website: Fetch metadata
  Website-->>Server: Metadata (thumbnail URL, title, etc)
  Server->>Website: Fetch thumbnail
  Server->>Website: Fetch favicon
  Website-->>Server: Favicon bytes
  Website-->>Server: Thumbnail bytes
  Server->>Server: Decode & process images
  Server-->>Client: Unfurled data (thumbnail data URI, etc)
#+end_src
     ,------.                                 ,------.                             ,-------.
     |Client|                                 |Server|                             |Website|
     `--+---'                                 `--+---'                             `---+---'
        |        Call wakuext_getTextURLs        |                                     |
        | --------------------------------------->                                     |
        |                                        |                                     |
        |             Normalized URLs            |                                     |
        | <- - - - - - - - - - - - - - - - - - - -                                     |
        |                                        |                                     |
        |----.                                   |                                     |
        |    | Render cached unfurled URLs       |                                     |
        |<---'                                   |                                     |
        |                                        |                                     |
        |         Unfurl non-cached URLs.        |                                     |
        |         Call wakuext_unfurlURLs        |                                     |
        | --------------------------------------->                                     |
        |                                        |                                     |
        |                                        |            Fetch metadata           |
        |                                        | ------------------------------------>
        |                                        |                                     |
        |                                        | Metadata (thumbnail URL, title, etc)|
        |                                        | <- - - - - - - - - - - - - - - - - -
        |                                        |                                     |
        |                                        |           Fetch thumbnail           |
        |                                        | ------------------------------------>
        |                                        |                                     |
        |                                        |            Fetch favicon            |
        |                                        | ------------------------------------>
        |                                        |                                     |
        |                                        |            Favicon bytes            |
        |                                        | <- - - - - - - - - - - - - - - - - -
        |                                        |                                     |
        |                                        |           Thumbnail bytes           |
        |                                        | <- - - - - - - - - - - - - - - - - -
        |                                        |                                     |
        |                                        |----.                                |
        |                                        |    | Decode & process images        |
        |                                        |<---'                                |
        |                                        |                                     |
        | Unfurled data (thumbnail data URI, etc)|                                     |
        | <- - - - - - - - - - - - - - - - - - - -                                     |
     ,--+---.                                 ,--+---.                             ,---+---.
     |Client|                                 |Server|                             |Website|
     `------'                                 `------'                             `-------'
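The "render cached, unfurl the rest" step of this flow can be sketched as follows. This is only an illustration in Go: the real cache lives in the mobile client, and `Preview` and `splitByCache` are hypothetical names that do not come from this PR.

```go
package main

import "fmt"

// Preview is a hypothetical, trimmed-down stand-in for an unfurled link preview.
type Preview struct {
	URL   string
	Title string
}

// splitByCache separates normalized URLs into previews that are already
// cached and URLs that still need a wakuext_unfurlURLs call.
// Input order is preserved in both results.
func splitByCache(cache map[string]Preview, urls []string) (cached []Preview, missing []string) {
	for _, u := range urls {
		if p, ok := cache[u]; ok {
			cached = append(cached, p)
		} else {
			missing = append(missing, u)
		}
	}
	return cached, missing
}

func main() {
	cache := map[string]Preview{
		"https://status.im": {URL: "https://status.im", Title: "Status"},
	}
	cached, missing := splitByCache(cache, []string{"https://status.im", "https://github.com"})
	fmt.Println("render now:", len(cached), "unfurl via RPC:", missing)
}
```

When `missing` is empty, the client can skip the RPC call entirely.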
  1. Carol sends the text message with link previews in the RPC request wakuext_sendChatMessages. status-go assumes the link previews are good because it can't and shouldn't attempt to re-unfurl them.
#+begin_src plantuml :results verbatim
  Client->>Server: Call wakuext_sendChatMessages
  Server->>Server: Transform link previews to\nbe proto-marshalled
  Server->DB: Write link previews serialized as JSON
  Server-->>Client: Updated message response
#+end_src
     ,------.                       ,------.                                  ,--.
     |Client|                       |Server|                                  |DB|
     `--+---'                       `--+---'                                  `+-'
        | Call wakuext_sendChatMessages|                                       |
        | ----------------------------->                                       |
        |                              |                                       |
        |                              |----.                                  |
        |                              |    | Transform link previews to       |
        |                              |<---' be proto-marshalled              |
        |                              |                                       |
        |                              |                                       |
        |                              | Write link previews serialized as JSON|
        |                              | -------------------------------------->
        |                              |                                       |
        |   Updated message response   |                                       |
        | <- - - - - - - - - - - - - - -                                       |
     ,--+---.                       ,--+---.                                  ,+-.
     |Client|                       |Server|                                  |DB|
     `------'                       `------'                                  `--'
  1. The message was sent over waku and persisted locally on Carol's device. She should now see the link previews in the chat history. Other chat members can share many link previews too, so it is important to serve the assets via the media server. Otherwise the ReactNative bridge would be overloaded with lots of big JSON payloads containing base64-encoded data URIs (maybe this concern is meaningless for desktop). When a client renders messages with link previews, the messages will have the field linkPreviews, and the thumbnail URL will point to the local media server.
 #+begin_src plantuml :results verbatim
   Client->>Server: GET /link-preview/thumbnail (media server)
   Server->>DB: Read from user_messages.unfurled_links
   Server->Server: Unmarshal JSON
   Server-->>Client: HTTP Content-Type: image/jpeg/etc
 #+end_src
     ,------.                                    ,------.                                  ,--.
     |Client|                                    |Server|                                  |DB|
     `--+---'                                    `--+---'                                  `+-'
        | GET /link-preview/thumbnail (media server)|                                       |
        | ------------------------------------------>                                       |
        |                                           |                                       |
        |                                           | Read from user_messages.unfurled_links|
        |                                           | -------------------------------------->
        |                                           |                                       |
        |                                           |----.                                  |
        |                                           |    | Unmarshal JSON                   |
        |                                           |<---'                                  |
        |                                           |                                       |
        |     HTTP Content-Type: image/jpeg/etc     |                                       |
        | <- - - - - - - - - - - - - - - - - - - - -                                        |
     ,--+---.                                    ,--+---.                                  ,+-.
     |Client|                                    |Server|                                  |DB|
     `------'                                    `------'                                  `--'
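To make the media server step more concrete, here is a minimal sketch of a handler that turns a stored base64 data URI back into an image response. Everything beyond the `/link-preview/thumbnail` path is an assumption made for illustration: the `message-id` and `url` query parameters, the `lookup` callback standing in for the database read, and the function names.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// decodeDataURI splits a base64 data URI such as "data:image/png;base64,AAAA"
// into its MIME type and raw bytes.
func decodeDataURI(uri string) (string, []byte, error) {
	if !strings.HasPrefix(uri, "data:") {
		return "", nil, errors.New("not a data URI")
	}
	parts := strings.SplitN(strings.TrimPrefix(uri, "data:"), ",", 2)
	if len(parts) != 2 || !strings.HasSuffix(parts[0], ";base64") {
		return "", nil, errors.New("only base64 data URIs are supported")
	}
	data, err := base64.StdEncoding.DecodeString(parts[1])
	if err != nil {
		return "", nil, err
	}
	return strings.TrimSuffix(parts[0], ";base64"), data, nil
}

// thumbnailHandler serves the thumbnail of one link preview. The lookup
// callback is a hypothetical stand-in for reading the stored previews.
func thumbnailHandler(lookup func(messageID, url string) (string, bool)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		q := r.URL.Query()
		dataURI, ok := lookup(q.Get("message-id"), q.Get("url"))
		if !ok {
			http.NotFound(w, r)
			return
		}
		mime, data, err := decodeDataURI(dataURI)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", mime)
		_, _ = w.Write(data)
	}
}

func main() {
	stored := "data:image/png;base64," + base64.StdEncoding.EncodeToString([]byte("fake-png"))
	h := thumbnailHandler(func(messageID, url string) (string, bool) {
		return stored, messageID == "0x01" && url == "https://status.im"
	})
	req := httptest.NewRequest(http.MethodGet, "/link-preview/thumbnail?message-id=0x01&url=https%3A%2F%2Fstatus.im", nil)
	rec := httptest.NewRecorder()
	h(rec, req)
	fmt.Println(rec.Code, rec.Header().Get("Content-Type"), rec.Body.Len())
}
```

Serving bytes over HTTP this way keeps the large base64 strings out of the message JSON that crosses the bridge.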

Some limitations of the current implementation

The following points will become separate issues in status-go that I'll work on over the next couple weeks. In no order of importance:

  • Improve how multiple links are fetched, add retries on failure, and test how unfurling behaves around the timeout limits (deterministically, not by making real HTTP calls as I did). See: Concurrently unfurl URLs #3498
  • Unfurl favicons and store them in the protobuf too.
  • For this PR, I added unfurling support only for websites with OpenGraph meta tags. Other unfurlers will be implemented on demand. The next one will probably be for oEmbed, the protocol supported by YouTube, for example.
  • Resize and/or compress thumbnails (and favicons). Oftentimes, thumbnails are far larger than link previews need. There is already support for compressing JPEGs in status-go, but I prefer to work with compression in a separate PR because I'd like to also solve the problem for PNGs (probably convert them to JPEGs, plus compress them). This would be a safe choice for thumbnails, but less so for favicons, where transparency is desirable.
  • Editing messages is not yet supported.
  • I haven't coded any artificial limit on the number of previews or on the size of the thumbnail payload. This will be done in a separate issue. I have heard the ideal solution may be to split messages into smaller chunks of ~125 KiB because of libp2p, but that might be too complicated at this stage of the product (?).
  • Link preview deletion.
  • Add support for the sender (only) to build up a dynamic allowlist. Edit: no allowlist will be necessary.
  • For the moment, OpenGraph metadata is extracted by requesting data for the English language (falling back to whatever is available). In the future, we'll want to unfurl by respecting the user's local device language. Some websites, like GoDaddy, are already localized based on the device's IP, but many aren't.
  • The website's description text should be limited by a certain number of characters, especially because it's outside our control. Exactly how much has not been decided yet, so it'll be done separately.
  • URL normalization can be tricky, so I implemented only the basics to help with caching. For example, the URLs https://status.im and HTTPS://status.im are considered identical. Also, a URL is considered valid for unfurling only if its TLD exists according to publicsuffix.EffectiveTLDPlusOne. This was essential, otherwise the default Go url.Parse approach would consider many invalid URLs valid, and thus the server would waste resources trying to unfurl URLs that can never be unfurled.
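As a rough illustration of the normalization point in the last bullet, the sketch below lowercases the scheme and host and rejects hosts whose last label is not a known TLD. It is not the code from this PR: to stay within the standard library it swaps the publicsuffix lookup for a tiny hardcoded `knownTLDs` map, and `normalizeURL` is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// knownTLDs is a hypothetical stand-in for a real public suffix lookup.
var knownTLDs = map[string]bool{"im": true, "com": true, "org": true}

// normalizeURL lowercases the scheme and host so that equivalent URLs share
// one cache key, and reports whether the URL looks worth unfurling.
func normalizeURL(raw string) (string, bool) {
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" {
		return "", false
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)

	// Reject hosts such as "localhost" or "example.notatld".
	host := u.Hostname()
	dot := strings.LastIndex(host, ".")
	if dot < 0 || !knownTLDs[host[dot+1:]] {
		return "", false
	}
	return u.String(), true
}

func main() {
	for _, raw := range []string{"HTTPS://status.im", "https://Status.IM/blog", "https://status.notatld"} {
		normalized, ok := normalizeURL(raw)
		fmt.Printf("%q -> %q (valid: %v)\n", raw, normalized, ok)
	}
}
```

Only the scheme and host are lowercased here, because paths can be case-sensitive.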

Other requirements

  • If the message is edited, the link previews should reflect the edited text, not the original one. This has been aligned with the design team as well.
  • If the website's thumbnail or the favicon can't be fetched, just ignore them. The only mandatory piece of metadata is the website's title and URL.
  • Link previews in clients should be generated in near real-time, that is, as the user types, previews are updated. In mobile this performs very well, and it's what other clients like WhatsApp, Telegram, and Facebook do.

Decisions

Here are the important decisions I have made for this PR.

  • While the user is typing in the input field, the client repeatedly (debounced) asks status-go to parse the text and extract normalized URLs, and then the client checks if they're already in its in-memory cache. If they are, no further RPC call is made to unfurl them. I chose this approach to achieve the best possible performance in mobile and avoid the whole RPC overhead, since the chat experience is already not smooth enough. The mobile client uses URLs as cache keys in a hashmap, i.e. if the key is present, it means the preview is readily available (naive, but good enough for now). This decision also gave me more flexibility to find the best UX at this stage of the feature.

  • Due to the requirement that users should be able to see independent loading indicators for each link preview, when status-go can't unfurl a URL, it doesn't return it in the response.

  • As an initial implementation, I added the BLOB column unfurled_links to the user_messages table. The preview data is then serialized as JSON before being stored in this column. I felt that creating a separate table and the related code for this initial PR would be inconvenient. Is that reasonable to you? Once things stabilize I can create a proper table if we want to avoid this kind of solution with serialized columns.
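The serialized-column decision can be sketched like this. The struct fields and JSON keys below are assumptions for illustration only (the actual `common.LinkPreview` may differ), and `encodePreviews`/`decodePreviews` are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LinkPreview is a hypothetical shape for one stored preview.
type LinkPreview struct {
	URL              string `json:"url"`
	Title            string `json:"title"`
	Description      string `json:"description,omitempty"`
	ThumbnailDataURI string `json:"thumbnailDataUri,omitempty"`
}

// encodePreviews produces the bytes that would be written to the
// unfurled_links BLOB column.
func encodePreviews(previews []LinkPreview) ([]byte, error) {
	return json.Marshal(previews)
}

// decodePreviews reads the column back. An empty blob is treated as a
// message without previews.
func decodePreviews(blob []byte) ([]LinkPreview, error) {
	if len(blob) == 0 {
		return nil, nil
	}
	var previews []LinkPreview
	if err := json.Unmarshal(blob, &previews); err != nil {
		return nil, err
	}
	return previews, nil
}

func main() {
	blob, err := encodePreviews([]LinkPreview{{URL: "https://status.im", Title: "Status"}})
	if err != nil {
		panic(err)
	}
	previews, err := decodePreviews(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(blob), len(previews))
}
```

One trade-off of this approach is that new fields can be added without a schema migration, while changing or removing existing fields is hard to migrate.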

Don't we already have code to unfurl URLs?

Yes, in protocol/urls/, but I found it to be far from what we needed, so I created a new package in protocol/linkpreview. Once the new implementation stabilizes in the upcoming status-go PRs, I'll remove the old one. Nevertheless, the old implementation most certainly helped during development, since there are similarities.

Currently, this PR should have no production impact, because the implementation in status-mobile is behind a feature flag, and status-desktop hasn't started the work yet. So we have time to gradually improve the implementation in follow-up PRs. Edit: the feature will be included in the next mobile release due to priority changes.

@ilmotta ilmotta self-assigned this May 10, 2023
@ghost

ghost commented May 10, 2023

Pull Request Checklist

  • Have you updated the documentation, if impacted (e.g. docs.status.im)?
  • Have you tested changes with mobile?
  • Have you tested changes with desktop?

@status-im-auto
Member

status-im-auto commented May 10, 2023

Jenkins Builds

Click to see older builds (21)
Commit #️⃣ Finished (UTC) Duration Platform Result
c8315fd #1 2023-05-10 12:20:47 ~1 min ios 📄log
c8315fd #1 2023-05-10 12:20:52 ~1 min android 📄log
c8315fd #1 2023-05-10 12:21:54 ~2 min linux 📄log
✖️ c8315fd #1 2023-05-10 12:22:14 ~2 min tests 📄log
✖️ 0491a51 #2 2023-05-12 16:37:26 ~2 min tests 📄log
✔️ 0491a51 #2 2023-05-12 16:37:31 ~2 min linux 📦zip
✔️ 0491a51 #2 2023-05-12 16:38:58 ~4 min ios 📦zip
✔️ 0491a51 #2 2023-05-12 16:39:11 ~4 min android 📦aar
✔️ 1fa073b #3 2023-05-15 17:42:35 ~4 min ios 📦zip
✔️ 1fa073b #3 2023-05-15 17:43:20 ~5 min android 📦aar
✖️ 1fa073b #3 2023-05-15 17:43:34 ~5 min tests 📄log
✔️ 1fa073b #3 2023-05-15 17:44:02 ~6 min linux 📦zip
✖️ 6d0a41b #4 2023-05-15 18:06:53 ~2 min tests 📄log
✔️ 6d0a41b #4 2023-05-15 18:07:06 ~2 min linux 📦zip
✔️ 6d0a41b #4 2023-05-15 18:07:19 ~3 min ios 📦zip
✔️ 6d0a41b #4 2023-05-15 18:10:16 ~6 min android 📦aar
✔️ 5895354 #5 2023-05-15 18:22:39 ~2 min linux 📦zip
✔️ 5895354 #5 2023-05-15 18:23:21 ~3 min ios 📦zip
✔️ 5895354 #5 2023-05-15 18:24:11 ~4 min android 📦aar
✖️ 5895354 #5 2023-05-15 18:42:04 ~21 min tests 📄log
✔️ 5895354 #6 2023-05-16 15:54:54 ~15 min tests 📄log
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ f64e4ec #6 2023-05-17 17:55:13 ~2 min linux 📦zip
✔️ f64e4ec #6 2023-05-17 17:57:01 ~4 min android 📦aar
✔️ f64e4ec #6 2023-05-17 17:57:36 ~4 min ios 📦zip
✔️ f64e4ec #7 2023-05-17 18:07:17 ~14 min tests 📄log
✔️ 54c6f0b #7 2023-05-18 18:24:12 ~2 min linux 📦zip
✔️ 54c6f0b #7 2023-05-18 18:25:52 ~4 min android 📦aar
✔️ 54c6f0b #7 2023-05-18 18:25:59 ~4 min ios 📦zip
✔️ 54c6f0b #8 2023-05-18 18:33:52 ~12 min tests 📄log

@caybro caybro requested a review from alexjba May 10, 2023 12:21
@cammellos
Contributor

@ilmotta amazing work, awesome quality!

As an initial implementation, I added the BLOB column unfurled_links to
the user_messages table. The preview data is then serialized as JSON before
being stored in this column. I felt that creating a separate table and the
related code for this initial PR would be inconvenient. Is that reasonable to
you? Once things stabilize I can create a proper table if we want to avoid
this kind of solution with serialized columns.

Moving to a table would be a breaking change most likely, since we don't want to support both, and migrating would be painful.
I think storing them as a BLOB in this case is probably ok. It can't be migrated easily though, so if you expect changes in the structure, that's going to be annoying. I am a bit ambivalent on this one, so maybe someone else has a stronger opinion? @Samyoul?

Add support for the sender (only) to build up a dynamic allowlist.

I believe that's not necessary, we agreed with design/john that it would only be an on/off toggle

Resize and/or compress thumbnails (and favicons).

This could be important, depending on the size of images, but of course better in a separate PR

I have heard
the ideal solution may be to split messages into smaller chunks of

It could be, but probably is best to limit to a safe number, but only after we implement compression.

Again, amazing work!

@@ -129,7 +129,7 @@ func (s *ChatTestSuite) TestSerializeJSON() {

 message.From = "0x04deaafa03e3a646e54a36ec3f6968c1d3686847d88420f00c0ab6ee517ee1893398fca28aacd2af74f2654738c21d10bad3d88dc64201ebe0de5cf1e313970d3d"
 message.Clock = 1
-message.Text = "`some markdown text`"
+message.Text = "`some markdown text` https://status.im"
Contributor

Is this change relevant? we just want to check it doesn't choke on links?

Contributor Author

Not much, I guess it's one of those changes I did right at the beginning and it survived. Let me remove it.

Contributor Author

Interestingly, I found a bug during development in the markdown parser. I still need to report it.

Extracted from the tests I wrote:

// There is a bug in the code that builds the AST from markdown text,
// because it removes the closing parenthesis, which means it won't be
// possible to unfurl this URL.
{args: "https://en.wikipedia.org/wiki/Status_message_(instant_messaging)", expected: []string{"https://en.wikipedia.org/wiki/Status_message_(instant_messaging"}},

// UnfurlURLs assumes clients pass URLs verbatim that were validated and
// processed by GetURLs.
func UnfurlURLs(urls []string) ([]common.LinkPreview, error) {
logger, err := zap.NewDevelopment()
Contributor

generally is passed around, so it logs in the right place, not a big deal though

Contributor Author

I didn't quite understand @cammellos. In this part I had to instantiate a zap logger and I passed it around. Should I do it differently?

Contributor Author

Leaving the comment for reference, as @cammellos explained to me. It is important to get the logger instance from api.service.messenger, that way, logs are written to geth.log, otherwise they go to stdout only.

Done a4039c6

@ilmotta
Contributor Author

ilmotta commented May 10, 2023

It could be, but probably is best to limit to a safe number, but only after we implement compression.

Good to know, because just limiting is also so much simpler. I talked to John about this, but he held the opinion that it would be better to avoid hardcoding a limit. We eventually agreed on the number 5. How we want to limit is open for discussion. And I agree with you, an artificial limit is a safe choice for now.

generally is passed around, so it logs in the right place, not a big deal though

I didn't quite understand @cammellos. In this part I had to instantiate a zap logger and I passed it around. Should I do it differently?

I believe that's not necessary, we agreed with design/john that it would only be an on/off toggle

Cool, I got that wrong. So much simpler.

Moving to a table would be a breaking change most likely, since we don't want to support both, and migrating would be painful.

Yes, it would be painful. I saw a lot of serialized stuff in the user_messages table and I thought it would be okayish since the implementation wouldn't be used yet (behind a feature flag). The broken window effect... I think the structure can accrete, for example when we need to store favicons. I was thinking I would just throw away the data in the migration since no user would be using the new unfurling yet. But now that you mention it, I don't know, I could use a table for sure to avoid all this.

@ilmotta amazing work, awesome quality!

Thank you Andrea!

@ilmotta
Contributor Author

ilmotta commented May 16, 2023

@alexjba, @Samyoul, @caybro, @siphiuel, the implementation is mostly done and under scrutiny by QAs in status-mobile and soon I'll be able to merge both PRs (client & server).

I would prefer to merge this status-go PR with at least two approvals (but I only got one). Could one of you take a look at it please? Thanks a lot!

Contributor

@vitvly vitvly left a comment

A lesson in PR-making

@ilmotta ilmotta force-pushed the feat/link-previews branch from f64e4ec to 54c6f0b Compare May 18, 2023 18:21
@ilmotta ilmotta merged commit 6fa8c11 into develop May 18, 2023
@ilmotta ilmotta deleted the feat/link-previews branch May 18, 2023 18:43
ilmotta added a commit to status-im/status-mobile that referenced this pull request May 18, 2023
This is the introductory work to support the new requirements for unfurling
URLs (while the message is a draft) and displaying link previews (after the
message is sent). Refer to the related status-go PR for a lot more interesting
details status-im/status-go#3471.

Fixes #15469

### Notes

- The old link preview code will be removed separately, both in status-go and
  status-mobile.
- I did the bulk of the work in status-go
  status-im/status-go#3471. If you want to understand
  how this is all implemented, do check out the status-go PR because I heavily
  documented the solution, rationale, next steps, etc.

### Performance

Does the feature perform well? Yes, there's very little overhead because
unfurling URLs happens in status-go and the event is debounced. I also paid
special attention to using a simple caching mechanism to avoid doing unnecessary
RPC requests to status-go if the URLs are cached in the client.

I have some ideas on how to improve performance further, but not in this PR
which is already screaming for reviews.
@ilmotta ilmotta mentioned this pull request Aug 16, 2023

4 participants