
Format submitted URLs with PreparedRequest.prepare_url instead of requests.utils.requote_uri #3443

Merged: 2 commits into harvard-lil:develop from url-munging on Dec 5, 2023

Conversation

Contributor

@rebeccacremona commented Dec 5, 2023

See ENG-493.

We have long been running submitted URLs through requests.utils.requote_uri before handing them off to the browser for capture.

When we began using Scoop and the Scoop API to produce archives, that added another layer of URL validation: the Scoop API runs its own validation check before accepting the job.

This surfaced some interesting behavior: it turns out we have been passing along URLs with [ and ] left un-percent-encoded, all this while.

That is generally recognized as invalid (curl rejects such URLs, for example), though surely some websites break the rules in querystrings, hashes, and so on. And so it is, quite reasonably, rejected by the validator used by the Scoop API.
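
To illustrate (a minimal demonstration with a recent version of requests; the URL is just a made-up example): requote_uri treats [ and ] as "safe" characters and passes them through untouched.

    from requests.utils import requote_uri

    # requote_uri considers [ and ] "safe" and leaves them unencoded,
    # which is how bare brackets have been slipping through to the browser.
    print(requote_uri("http://example.com/angle-bracket=["))
    # -> http://example.com/angle-bracket=[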

I looked into:
a) getting Perma's validator to reject, for parity
b) getting Scoop's validator to accept, for parity
c) why we are using requote_uri anyway
d) why you can pass URLs with unencoded brackets into python requests and not have a problem

That investigation uncovered a more useful URL-formatting method inside requests, PreparedRequest.prepare_url. It calls requote_uri, but also does a bunch of other stuff, including work to handle unicode/punycode domains.
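
A minimal sketch of the difference, assuming a recent requests (and its urllib3 and idna dependencies); exact output can vary by version, and both URLs are made-up examples:

    from requests.models import PreparedRequest

    req = PreparedRequest()

    # prepare_url percent-encodes the bracket that requote_uri leaves alone...
    req.prepare_url("http://example.com/angle-bracket=[", None)
    print(req.url)  # -> http://example.com/angle-bracket=%5B

    # ...and IDNA-encodes unicode domains.
    req.prepare_url("http://bücher.example/", None)
    print(req.url)  # -> http://xn--bcher-kva.example/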

I compared the output of the two for 4+ million Perma Links, and examined the cases that differed. PreparedRequest.prepare_url seems unquestionably superior.

This PR arranges for Perma to attempt PreparedRequest.prepare_url, and to fall back to requote_uri if that fails for any reason. I decided to wrap the call in a try/except with a fallback to our current behavior because I don't, at this moment, want to change the way URL validation failures look in Perma at all: that feels too invasive.
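
In outline, the approach looks something like this (a sketch of the behavior described above, not Perma's actual code; the helper name is illustrative):

    from requests.models import PreparedRequest
    from requests.utils import requote_uri

    def format_submitted_url(submitted_url):
        """Hypothetical helper: prefer prepare_url, fall back to requote_uri."""
        try:
            prepared = PreparedRequest()
            prepared.prepare_url(submitted_url, None)
            return prepared.url
        except Exception:
            # Fall back to the current behavior on any failure, so URL
            # validation failures surface in Perma exactly as they do today.
            return requote_uri(submitted_url)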

This is unlikely to materially affect any attempted captures:

  • right now, unicode domains are rejected by Django's validator (TBD)
  • URLs that intentionally include brackets, percent-encoded or otherwise, are exceedingly rare... and I haven't been able to find even one where their presence affected the outcome of the capture (e.g., they were buried in a tracking querystring or the like).

But... by capturing my silly test example, http://example.com/angle-bracket=[, I came to realize that we have been setting the primary capture's URL, which is used for playback, to the vanilla submitted_url and not to ascii_safe_url, the actual URL handed to the capture engine and browser.

I believe that to be in error... though I don't believe that any playbacks have been affected up to this point. (There is some Webrecorder/RWP fanciness involved that appears to have been taking care of it for us.) But with this change, we do in fact need to point playbacks at the actual URL passed to Scoop, and so I updated the primary capture to:

                # create primary capture placeholder
                Capture(
                    link=link,
                    role='primary',
                    status='pending',
                    record_type='response',
                    # previously: url=link.submitted_url,
                    url=link.ascii_safe_url,
                ).save()

I tested with a bunch of URLs, including our favorite https://jamalouki.net/موضة/آخر صيحات الموضة/نقشة الورود ستكسو إطلالتكِ بالكامل في الخريف المقبل, and didn't spot any problems.

While this could turn out to be a more extensive change than I'm currently aware of (though, since I ran comparisons against over 4 million links, I really think it isn't)... I think we'll want to find that out by hearing concrete examples that don't work from users, rather than racking our brains trying to think of things to try.

This PR should a) stop the validation errors we are getting from Scoop, and b) possibly allow the successful capture of more URLs that users paste into Perma's input field... even if they are a little messed up.

@rebeccacremona requested a review from a team as a code owner on December 5, 2023 22:47
@rebeccacremona requested review from matteocargnelutti and removed the request for the team on December 5, 2023 22:47
Contributor

@matteocargnelutti left a comment:

YES! 👏


codecov bot commented Dec 5, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7d7cbf8) 70.38% compared to head (065aa8d) 70.39%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3443      +/-   ##
===========================================
+ Coverage    70.38%   70.39%   +0.01%     
===========================================
  Files           52       52              
  Lines         6801     6804       +3     
===========================================
+ Hits          4787     4790       +3     
  Misses        2014     2014              


@rebeccacremona merged commit be832db into harvard-lil:develop on Dec 5, 2023
2 checks passed
@rebeccacremona deleted the url-munging branch on December 7, 2023 18:24