Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FlagRemoveQuery (strips entire query part) #16

Closed
wants to merge 1 commit into from

Conversation

bcampbell
Copy link

I won't be at all offended if you think this flag has no place in purell, but it's one I would use almost exclusively :- )

context: I'm dealing with lots of news article URLs, and the unique part of the URL is almost always a slug, and the query used only for tracking (eg ?from=rssfeed ). Stripping the query is always my first step when normalizing URLs for comparison.

eg (apologies for dailymail link):
http://www.dailymail.co.uk/news/article-3656841/It-s-Democrats-end-25-hour-sit-Congress-gun-control-walk-cheering-crowd-without-vote-demanded.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490

Just one of the many frustrations when trying to decide if two URLs are referring to the same article - there are loads more. Newspaper sites can commit some real atrocities when dealing with URLs!

@mna
Copy link
Member

mna commented Jun 24, 2016

Hello Ben,

Thanks for the PR, it looks good! I'm unsure at the moment if it feels right to have this in the package. One thing is the helper flags FlagsAllGreedy and FlagsAllNonGreedy, those were poorly thought-out and kind of prevent adding new flags in a sensible way (either you break the old behaviour of those flags by or-ing the new one, or you break the semantics by leaving it out - it's not "All" anymore).

Also, I would think a common need may be to remove some well-known, referral-type query strings, while keeping others? The new flag wouldn't help in that case. I'm thinking out loud here, but maybe all that's missing is a FlagCustom or a new NormalizeURLFunc(*url.URL, NormalizationFlags, func(*url.URL)) that calls the func after applying the provided normalization flags, so you can do what you want with the *url.URL before returning the normalized string.

I'll think about that a bit more, and feel free let me know what you think about this.

Martin

@bcampbell
Copy link
Author

Instead of custom flags or a NormalizeURLFunc() with a callback, how about a nice simple:

func Normalize(u *url.URL, f NormalizationFlags) *url.URL

Which returns a *url.URL, upon which the caller can perform their own further custom munging (with a nearby note explaining why they should use urlesc.Escape() rather than the net/url equivalent :- )

I know you're keen to keep the API minimal, so I think that'd be the most orthogonal way to handle custom munging...

@bcampbell
Copy link
Author

I get what you mean about FlagsAllGreedy and FlagsAllNonGreedy. Actually, I think that's as good a reason as any to reject incompatible flags. A bit arbitrary, maybe, but hey.

Regarding FlagRemoveQuery vs query filtering blacklist/whitelist... I can only really talk about my own use case, which is dealing with news and blog sites in bulk:
Yes, there are some well-known referral-type strings (like the google "utm_*" query tags), but sooooo many sites define their own ones too, which screw up my url-comparison code.
Maybe 1 in 100 sites have article URLs with significant query data (eg: http://example.com?article_id=12345), but for the the other 99, the query part is always extraneous and needs to be stripped.

So, for my needs, I always start from the assumption that the query part of the URL is just noise.

@mna
Copy link
Member

mna commented Jun 27, 2016

Instead of custom flags or a NormalizeURLFunc() with a callback, how about a nice simple:
func Normalize(u *url.URL, f NormalizationFlags) *url.URL

Very true, in fact the existing NormalizeURL func already works for this, even though it returns the normalized string. The *url.URL value is modified in-place, so the URL passed in is normalized after the call, and you can then remove the query string (u.RawQuery = "") if you want to ignore that part. The urlesc package was temporary and that call is not required since Go1.5 (see #14).

So with that said, given that the proposed flag is a one-liner that can be done quite easily with the current API, I think I'll pass on the PR.

Thanks,
Martin

@mna mna closed this Jun 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants