Format submitted URLs with PreparedRequest.prepare_url instead of requests.utils.requote_uri #3443
See ENG-493.
We have long since been running submitted URLs through `requests.utils.requote_uri` before handing them off to the browser for capture. When we began using Scoop and the Scoop API to produce archives, that added another layer of URL validation: the Scoop API runs its own validation check before accepting the job.
This surfaced some interesting behavior: it turns out that we have been passing URLs with `[` and `]` left un(percent)encoded, all this while. That is generally recognized as invalid (rejected by curl, etc.)... though surely some websites break the rules in querystrings, hashes, and so on. And so, it is, quite reasonably, rejected by the validator used by the Scoop API.
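To make the behavior concrete, here's a quick look at what `requote_uri` does (an interpreter snippet of mine, not from the PR):

```python
from requests.utils import requote_uri

# requote_uri percent-encodes characters that are clearly unsafe...
requote_uri("http://example.com/a b")     # -> 'http://example.com/a%20b'

# ...but "[" and "]" are in its set of "safe" characters, so they pass
# through untouched -- exactly the URLs the Scoop API then rejects:
requote_uri("http://example.com/path=[")  # -> 'http://example.com/path=['
```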
I looked into:

a) getting Perma's validator to reject, for parity
b) getting Scoop's validator to accept, for parity
c) why we are using `requote_uri` anyway
d) why you can pass URLs with unencoded brackets into python requests and not have a problem (see the sketch just below)
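On point (d): requests quotes submitted URLs itself while preparing the request, rather than rejecting them up front. A small illustration (mine, not from the PR):

```python
import requests

# requests never chokes on the raw bracket because it requotes the URL
# during preparation; depending on the requests/urllib3 version, the
# bracket may or may not come back percent-encoded in the prepared URL.
prepared = requests.Request("GET", "http://example.com/angle-bracket=[").prepare()
print(prepared.url)
```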
That investigation uncovered a more useful URL-formatting method inside requests, `PreparedRequest.prepare_url`. It calls `requote_uri`, but also does a bunch of other stuff, including work to handle unicode/punycode domains. I compared the output of the two for 4+ million Perma Links, and examined the cases that differed.
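For reference, `prepare_url` can be driven directly on a bare `PreparedRequest`; a minimal sketch (the example domain is mine):

```python
from requests.models import PreparedRequest

def prepare(url):
    req = PreparedRequest()
    req.prepare_url(url, params=None)
    return req.url

# Beyond requoting, prepare_url IDNA-encodes unicode hostnames:
print(prepare("https://bücher.example/path"))
# -> https://xn--bcher-kva.example/path
```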
`PreparedRequest.prepare_url` seems unquestionably superior. This PR arranges for Perma to attempt `PreparedRequest.prepare_url`, and fall back to `requote_uri` if that fails for any reason (see the sketch just below). I decided to try/except with a fallback to our current behavior because I don't, at this moment, want to change the way that URL validation failures look in Perma at all: that feels too invasive. This is unlikely to materially affect any attempted captures.
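A minimal sketch of the attempt-then-fall-back logic (the helper name and structure here are hypothetical, not Perma's actual code):

```python
from requests.models import PreparedRequest
from requests.utils import requote_uri

def format_url_for_capture(url):
    """Prefer prepare_url; fall back to our long-standing requote_uri
    behavior on any failure, so that URL-validation errors surface to
    users exactly as they do today."""
    try:
        prepared = PreparedRequest()
        prepared.prepare_url(url, params=None)
        return prepared.url
    except Exception:
        return requote_uri(url)
```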
But... by capturing my silly test example, `http://example.com/angle-bracket=[`, I came to realize that we have been setting the primary capture's URL, which is used for playback, to the vanilla `submitted_url` and not to `ascii_safe_url`, the actual URL handed to the capture engine and browser. I believe that to be in error... though I don't believe that, up until this point, any playbacks have been affected. (There is some Webrecorder/RWP fanciness involved that appears to have been taking care of it for us.) But, with this change, we do in fact need to be pointing playbacks at the ACTUAL URL passed to Scoop, and so I updated the primary capture to:
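In spirit, the update is along these lines (exact model/field code hypothetical):

```python
# Point the primary capture's playback URL at the URL actually handed
# to Scoop, rather than the raw user submission:
capture.url = link.ascii_safe_url  # previously: link.submitted_url
```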
I tested with a bunch of URLs, including our favorite `https://jamalouki.net/موضة/آخر صيحات الموضة/نقشة الورود ستكسو إطلالتكِ بالكامل في الخريف المقبل`, and didn't spot any problems. While this could turn out to be a more extensive change than I'm currently aware of (though I really think, since I ran comparisons against over 4 million links, it probably isn't)... I think we'll want to find that out by hearing about concrete examples that don't work from users, rather than wracking our brains trying to think of things to try.
This PR should a) stop the validation errors we are getting from Scoop, and b) possibly allow the successful capture of more URLs that users paste into Perma's input field... even if they are a little messed up.