
Compare against canonical URLs (to ignore some query parameters or url shorteners) #35

Open
AliSoftware opened this issue Aug 3, 2019 · 5 comments

@AliSoftware
Contributor

AliSoftware commented Aug 3, 2019

Some query parameters make no difference to the page being linked to, and should be ignored when Camille checks whether a link has already been posted.

Of course, other query parameters could make the link point to a different article, so we should not ignore them all, only a select few.

Typical cases we should ignore:

  • ?s=21 in links to Twitter posts
  • utm_ parameters

Similarly, links through common URL shorteners (like youtu.be/xyz for youtube.com/watch?v=xyz) point to the same video and should be considered the same link.

One way to do that would be to look up the canonical URL of the link, instead of the URL exactly as it was posted, and use that canonical URL as the basis for comparison. (A rough sketch of the hand-maintained special-casing described above follows.)
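
For illustration, here is a minimal normalization sketch in Swift (assuming Camille is Swift-based; the `normalized(_:)` helper and the specific rules are hypothetical, not Camille's actual code):

```swift
import Foundation

// Hypothetical sketch: normalize a URL before comparison by dropping
// tracking-only parameters and expanding one well-known shortener.
func normalized(_ url: URL) -> URL {
    guard var components = URLComponents(url: url, resolvingAgainstBaseURL: false) else {
        return url
    }

    // Expand youtu.be/<id> into youtube.com/watch?v=<id>.
    if components.host == "youtu.be" {
        let videoID = components.path.trimmingCharacters(in: CharacterSet(charactersIn: "/"))
        components.host = "www.youtube.com"
        components.path = "/watch"
        components.queryItems = [URLQueryItem(name: "v", value: videoID)]
    }

    // Drop parameters known to be tracking-only: utm_* everywhere,
    // and ?s= on Twitter links.
    components.queryItems = components.queryItems?.filter { item in
        if item.name.hasPrefix("utm_") { return false }
        if components.host?.hasSuffix("twitter.com") == true && item.name == "s" { return false }
        return true
    }
    if components.queryItems?.isEmpty == true { components.queryItems = nil }

    return components.url ?? url
}

// normalized(URL(string: "https://youtu.be/xyz?utm_source=foo")!)
// -> https://www.youtube.com/watch?v=xyz
```

The obvious downside is that every rule has to be maintained by hand, which is exactly what the canonical-URL approach below would avoid.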

@AliSoftware AliSoftware changed the title Filter some query parameters out when comparing links Compare against canonical URLs (to ignore some query parameters or url shorteners) Aug 9, 2019
@ZevEisenberg
Contributor

Is there a standard/well-known header or something for retrieving the canonical URL? Or will we have to special-case all of the things you've listed?

@AliSoftware
Contributor Author

There is an official way on the web to declare the canonical URL of a page. It consists of a tag of the form <link rel="canonical" href="https://example.com/sample-page/" /> in the HTML content of the page.

This is what search engines use to index pages under a single canonical URL, so it is pretty widespread and supported by most websites.

What I don't know is whether there's some nice service/API we could use (maybe a tool provided publicly by one of the major search engines?) to which we could send an arbitrary URL, and have it fetch the content, extract the tag, and return the canonical URL to us.

Using such a service, if it exists, would be way better than making Camille parse the HTML herself, because requesting the whole HTML content of a URL just to extract the canonical tag would cost a lot of time, bandwidth, and credits for that one piece of info. I'm pretty sure search engines cache that info for all the sites they index… and there's some chance they make it directly available to anyone via some API.

@tal

tal commented Oct 1, 2020

My bet is that any such service call would take longer than just doing this simple fetch ourselves. The HTML isn't that big, and the other service would just be doing the same work anyway, so doing it locally would likely be faster.

The big reason to use services for this kind of thing is that they can normalize edge cases well; this is a very straightforward XPath lookup.
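
As a rough sketch of that XPath lookup (again assuming Swift; XMLDocument's HTML-tidying parser lives in Foundation on macOS and in FoundationXML on Linux, and real-world HTML may still need a dedicated parser):

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML // XMLDocument lives here in swift-corelibs-foundation
#endif

// Hypothetical helper: tidy the fetched HTML, then read
// //link[@rel='canonical']/@href. Returns nil if the page declares no
// canonical URL or the markup can't be parsed.
func canonicalURL(inHTML html: Data) -> URL? {
    guard let document = try? XMLDocument(data: html, options: .documentTidyHTML),
          let node = try? document.nodes(forXPath: "//link[@rel='canonical']/@href").first,
          let href = node.stringValue
    else { return nil }
    return URL(string: href)
}
```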

@AliSoftware
Contributor Author

@tal The main benefit I was thinking of, if we were to use a service provided by a search engine, is that it would hopefully return a cached value, without the service needing to do the request + parsing when we query it: they would already have done that slow step ages ago, when they indexed the page in their own search databases.

Also, in practice there are indeed edge cases. From my quick browsing on the topic, rel='canonical' is the most common and official way to do it, but there are still other mechanisms, like 301 redirections. So implementing it ourselves within Camille might mean reinventing the wheel for all those cases, in addition to not taking advantage of the cached databases of search engines that already did the work while indexing…
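
For the 301 case specifically, here's a minimal sketch (the `finalURL(for:)` helper is hypothetical, and it assumes a blocking call is acceptable, which may not match Camille's actual concurrency model): URLSession follows redirects by default, so the response's final URL could serve as a fallback "canonical" when a page declares none.

```swift
import Foundation

// Hypothetical sketch: issue a HEAD request (headers are enough to discover
// redirects) and return the URL the request ended up at after any 301/302s.
func finalURL(for url: URL) -> URL? {
    var result: URL?
    let semaphore = DispatchSemaphore(value: 0)
    var request = URLRequest(url: url)
    request.httpMethod = "HEAD"
    URLSession.shared.dataTask(with: request) { _, response, _ in
        result = response?.url // the post-redirect URL
        semaphore.signal()
    }.resume()
    semaphore.wait()
    return result
}
```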

@tal

tal commented Oct 1, 2020

I'm wary of relying on those services without much benefit, because you still have to handle all those conditions being returned by the service. You can't assume 100% uptime and good behavior.

But it's up to whoever implements it to decide. Scraping the web is super easy, and a lot faster than I think you're worried it would be.
