
Format submitted URLs with PreparedRequest.prepare_url instead of requests.utils.requote_uri #3443

Merged: 2 commits into harvard-lil:develop from url-munging on Dec 5, 2023

Conversation

Contributor

@rebeccacremona commented Dec 5, 2023

See ENG-493.

We have long been running submitted URLs through requests.utils.requote_uri before handing them off to the browser for capture.

When we began using Scoop and the Scoop API to produce archives, that added another layer of URL validation: the Scoop API runs its own validation check before accepting the job.

This surfaced some interesting behavior: it turns out we have been passing along URLs with [ and ] left un-percent-encoded, all this while.

That is generally recognized as invalid (curl rejects such URLs, for example), though surely some websites break the rules in querystrings, hashes, and so on. And so it is, quite reasonably, rejected by the validator used by the Scoop API.
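
To illustrate (a minimal demonstration with a recent version of requests; the URL is just a made-up example): requote_uri treats [ and ] as "safe" characters and passes them through untouched.

    from requests.utils import requote_uri

    # requote_uri considers [ and ] "safe" and leaves them unencoded,
    # which is how bare brackets have been slipping through to the browser.
    print(requote_uri("http://example.com/angle-bracket=["))
    # -> http://example.com/angle-bracket=[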

I looked into:
a) getting Perma's validator to reject, for parity
b) getting Scoop's validator to accept, for parity
c) why we are using requote_uri anyway
d) why you can pass URLs with unencoded brackets into python requests and not have a problem

That investigation uncovered a more useful URL-formatting method inside requests, PreparedRequest.prepare_url. It calls requote_uri, but also does a bunch of other stuff, including work to handle unicode/punycode domains.
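
A minimal sketch of the difference, assuming a recent requests (and its urllib3 and idna dependencies); exact output can vary by version, and both URLs are made-up examples:

    from requests.models import PreparedRequest

    req = PreparedRequest()

    # prepare_url percent-encodes the bracket that requote_uri leaves alone...
    req.prepare_url("http://example.com/angle-bracket=[", None)
    print(req.url)  # -> http://example.com/angle-bracket=%5B

    # ...and IDNA-encodes unicode domains.
    req.prepare_url("http://bücher.example/", None)
    print(req.url)  # -> http://xn--bcher-kva.example/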

I compared the output of the two for 4+ million Perma Links, and examined the cases that differed. PreparedRequest.prepare_url seems unquestionably superior.

This PR arranges for Perma to attempt PreparedRequest.prepare_url, and to fall back to requote_uri if that fails for any reason. I decided to wrap the call in a try/except with a fallback to our current behavior because I don't, at this moment, want to change the way URL validation failures look in Perma at all: that feels too invasive.
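
In outline, the approach looks something like this (a sketch of the behavior described above, not Perma's actual code; the helper name is illustrative):

    from requests.models import PreparedRequest
    from requests.utils import requote_uri

    def format_submitted_url(submitted_url):
        """Hypothetical helper: prefer prepare_url, fall back to requote_uri."""
        try:
            prepared = PreparedRequest()
            prepared.prepare_url(submitted_url, None)
            return prepared.url
        except Exception:
            # Fall back to the current behavior on any failure, so URL
            # validation failures surface in Perma exactly as they do today.
            return requote_uri(submitted_url)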

This is unlikely to materially affect any attempted captures:

  • right now, unicode domains are rejected by Django's validator (TBD)
  • URLs that intentionally include brackets, percent-encoded or otherwise, are exceedingly rare... and I haven't been able to find even one where their presence affected the outcome of the capture (e.g., they were buried in a tracking querystring or the like).

But... by capturing my silly test example, http://example.com/angle-bracket=[, I came to realize that we have been setting the primary capture's URL, which is used for playback, to the vanilla submitted_url and not to ascii_safe_url, the actual URL handed to the capture engine and browser.

I believe that to be in error... though I don't believe that any playbacks have been affected up to this point. (There is some Webrecorder/RWP fanciness involved that appears to have been taking care of it for us.) But with this change, we do in fact need to point playbacks at the actual URL passed to Scoop, and so I updated the primary capture to:

                # create primary capture placeholder
                Capture(
                    link=link,
                    role='primary',
                    status='pending',
                    record_type='response',
                    # previously: url=link.submitted_url,
                    url=link.ascii_safe_url,
                ).save()

I tested with a bunch of URLs, including our favorite https://jamalouki.net/موضة/آخر صيحات الموضة/نقشة الورود ستكسو إطلالتكِ بالكامل في الخريف المقبل, and didn't spot any problems.

While this could turn out to be a more extensive change than I'm currently aware of (though, since I ran comparisons against over 4 million links, I really think it isn't)... I think we'll want to find that out by hearing concrete examples that don't work from users, rather than racking our brains trying to think of things to try.

This PR should a) stop the validation errors we are getting from Scoop, and b) possibly allow the successful capture of more URLs that users paste into Perma's input field... even if they are a little messed up.

@rebeccacremona requested a review from a team as a code owner on December 5, 2023 22:47
@rebeccacremona requested review from matteocargnelutti and removed the request for the team on December 5, 2023 22:47
Contributor

@matteocargnelutti left a comment:

YES! 👏


codecov bot commented Dec 5, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7d7cbf8) 70.38% compared to head (065aa8d) 70.39%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3443      +/-   ##
===========================================
+ Coverage    70.38%   70.39%   +0.01%     
===========================================
  Files           52       52              
  Lines         6801     6804       +3     
===========================================
+ Hits          4787     4790       +3     
  Misses        2014     2014              


@rebeccacremona merged commit be832db into harvard-lil:develop on Dec 5, 2023
2 checks passed
@rebeccacremona deleted the url-munging branch on December 7, 2023 18:24