Bug in the domain-parsing regex #1

Open
GoogleCodeExporter opened this issue Mar 14, 2015 · 6 comments

Comments

@GoogleCodeExporter

'.org.ua' isn't recognized as a proper TLD, so the whole TLD gets blacklisted
as if it were a single domain when a spammer uses 'domain.org.ua'.
The domain extraction regex needs to be updated.

Overall, the exhaustive approach used by the URL domain-parsing regex (used to
strip subdomains while keeping only domains and TLDs from URLs) probably needs
a bit of dusting off, either to make sure the TLD list is up to date or to make
the approach a bit more flexible to new TLDs.
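
For illustration, here is a minimal sketch of how an exhaustive TLD-alternation regex can produce the behaviour described above. It is written in Python for brevity rather than the project's PHP. The project's actual regex is not shown in this issue, so the pattern, the short `KNOWN_TLDS` list, and the function names `build_domain_regex` and `extract_domain` are all assumptions.

```python
import re

# Hypothetical, deliberately short TLD list (assumption: the real list is longer
# but, per this issue, lacks "org.ua").
KNOWN_TLDS = ["co.uk", "com", "org", "ua"]


def build_domain_regex(tlds):
    """Build a regex that captures one label plus a known TLD at the end of a host."""
    # Longer entries first, so multi-label TLDs are tried before shorter ones.
    alternation = "|".join(re.escape(t) for t in sorted(tlds, key=len, reverse=True))
    return re.compile(r"([a-z0-9-]+\.(?:" + alternation + r"))$")


def extract_domain(host, tlds):
    """Return the 'domain + TLD' part of host, or None if no known TLD matches."""
    match = build_domain_regex(tlds).search(host.lower())
    return match.group(1) if match else None


# Without "org.ua" in the list, the label "org" is mistaken for the domain name.
broken = extract_domain("www.domain.org.ua", KNOWN_TLDS)
# Adding the missing entry restores the expected result.
fixed = extract_domain("www.domain.org.ua", KNOWN_TLDS + ["org.ua"])
```

In this sketch `broken` is `org.ua` and `fixed` is `domain.org.ua`, which matches the report: with the incomplete list, every site under '.org.ua' collapses into the same extracted "domain".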

Original issue reported on code.google.com by [email protected] on 16 Jul 2008 at 7:19

@GoogleCodeExporter

Original comment by [email protected] on 16 Jul 2008 at 7:19

  • Added labels: regex

@GoogleCodeExporter

Mozilla maintains a public list of all TLDs.  Should we just check against that?

http://publicsuffix.org/
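
A rough sketch of what checking against that list could look like, in Python rather than the project's PHP. The list file has one rule per line, `//` comment lines, `*.` wildcard rules, and `!` exception rules. The sample text below is hand-written in that format and is not the real file. The names `parse_rules`, `public_suffix`, and `registrable_domain` are hypothetical.

```python
SAMPLE_LIST = """\
// Hand-written sample in the list's format, not the real file.
com
ua
org.ua
uk
co.uk
*.ck
!www.ck
"""


def parse_rules(text):
    """Split list text into exact, wildcard, and exception rule sets."""
    exact, wildcard, exceptions = set(), set(), set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        rule = line.split()[0].lower()  # a rule ends at the first whitespace
        if rule.startswith("!"):
            exceptions.add(rule[1:])
        elif rule.startswith("*."):
            wildcard.add(rule[2:])
        else:
            exact.add(rule)
    return exact, wildcard, exceptions


def public_suffix(host, rules):
    """Return the public suffix of host under the given rules."""
    exact, wildcard, exceptions = rules
    labels = host.lower().strip(".").split(".")
    # Every trailing slice of the host, longest first.
    candidates = [".".join(labels[i:]) for i in range(len(labels))]
    # Exception rules take priority: the suffix is the rule minus its first label.
    for candidate in candidates:
        if candidate in exceptions:
            return candidate.partition(".")[2]
    # Otherwise the matching rule with the most labels wins.
    for candidate in candidates:
        parent = candidate.partition(".")[2]
        if candidate in exact or (parent and parent in wildcard):
            return candidate
    return labels[-1]  # no rule matched: fall back to the last label


def registrable_domain(host, rules):
    """Return the public suffix plus one label, or None if host is itself a suffix."""
    labels = host.lower().strip(".").split(".")
    suffix_len = public_suffix(host, rules).count(".") + 1
    if len(labels) <= suffix_len:
        return None
    return ".".join(labels[-(suffix_len + 1):])


RULES = parse_rules(SAMPLE_LIST)
```

With these sample rules, `registrable_domain("spam.domain.org.ua", RULES)` gives `domain.org.ua`, so only the spammer's own domain would be blacklisted rather than all of '.org.ua'.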

Original comment by [email protected] on 21 Jul 2008 at 3:02

@GoogleCodeExporter

FYI -- I plan on updating this from the Mozilla list, but the page is currently 
down.

Original comment by [email protected] on 5 Jun 2010 at 10:05

@GoogleCodeExporter

Update: This is going to be a bit more complicated than simply updating the
existing list. The current list from publicsuffix.org is over 3,000 entries
long, and that includes some wildcards!  So rather than passing a massive PHP
array, I think we'll have to create & populate a MySQL table and check against
that.  Of course that also means keeping said table updated....
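
As a sketch of the table-backed idea, the snippet below uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table name `public_suffix`, its single column, and the function names are assumptions, and wildcard rules are left out to keep it short.

```python
import sqlite3


def build_suffix_table(conn, suffixes):
    """Create the suffix table if needed and load (or reload) the given suffixes."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS public_suffix (suffix TEXT PRIMARY KEY)"
    )
    # INSERT OR REPLACE is SQLite syntax; MySQL would need REPLACE INTO or similar.
    conn.executemany(
        "INSERT OR REPLACE INTO public_suffix (suffix) VALUES (?)",
        [(s.lower(),) for s in suffixes],
    )
    conn.commit()


def longest_known_suffix(conn, host):
    """Return the longest stored suffix that the host ends with, or None."""
    labels = host.lower().strip(".").split(".")
    # Every trailing slice of the host is a candidate suffix.
    candidates = [".".join(labels[i:]) for i in range(len(labels))]
    placeholders = ",".join("?" for _ in candidates)
    row = conn.execute(
        "SELECT suffix FROM public_suffix WHERE suffix IN (%s) "
        "ORDER BY LENGTH(suffix) DESC LIMIT 1" % placeholders,
        candidates,
    ).fetchone()
    return row[0] if row else None


# In-memory database with a few sample entries (not the full list).
conn = sqlite3.connect(":memory:")
build_suffix_table(conn, ["com", "ua", "org.ua", "uk", "co.uk"])
```

A host has only a handful of labels, so each lookup is one small query however long the list grows. Keeping the table updated would mean calling something like `build_suffix_table` again with a fresh copy of the list.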

Original comment by [email protected] on 30 Jul 2010 at 4:05

@GoogleCodeExporter

Removing myself as Owner for this.  I don't know the proper way to handle the
length of the updated complete TLD list well enough, but I'm pretty sure we
can't pass a 3,000-item array in PHP without breaking something.

This is an important one though, and I would appreciate somebody more skilled 
picking this up.

Keep in mind that in the long run we also need some means of keeping the list 
updated.

(Also changing from priority-medium to priority-high)

Original comment by [email protected] on 11 Jan 2011 at 11:08

  • Added labels: Priority-High
  • Removed labels: Priority-Medium

@GoogleCodeExporter

The Internet landscape is getting more complicated.  With the new wave of 
basically infinite arbitrary TLDs on their way -- e.g. ".media" --  I'm not 
sure if it will be possible to parse this anymore.

Unless... perhaps the new TLDs are all single-dot, in which case we may 
theoretically be able to check against a list of known double-dot TLDs -- e.g. 
".co.uk" -- and just assume that in all other cases, whatever's after that dot 
is the TLD?
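
That fallback idea could be sketched roughly as follows, in Python rather than the project's PHP. It only works if the premise above holds, i.e. that every TLD missing from the multi-label list is a single label. `MULTI_LABEL_SUFFIXES` is a hypothetical and deliberately incomplete list, and `split_domain` is a made-up name.

```python
# Hypothetical, incomplete list of known "double-dot" TLDs.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.ua", "com.au"}


def split_domain(host, multi_label_suffixes=MULTI_LABEL_SUFFIXES):
    """Return (domain, tld), where domain is the TLD plus one label, or None."""
    labels = host.lower().strip(".").split(".")
    # Try trailing slices of two or more labels, longest first.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in multi_label_suffixes:
            tld = candidate
            break
    else:
        # No known multi-label TLD matched: assume the last label is the TLD.
        tld = labels[-1]
    tld_len = tld.count(".") + 1
    if len(labels) <= tld_len:
        return None, tld  # the host is only a TLD, with no domain label
    return ".".join(labels[-(tld_len + 1):]), tld
```

For example, `split_domain("news.example.media")` returns `("example.media", "media")` without '.media' appearing in any list, while `split_domain("shop.example.co.uk")` returns `("example.co.uk", "co.uk")`. The cost is that any multi-label TLD missing from the list falls back to the last-label guess, which is the '.org.ua' problem from this issue.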

Original comment by [email protected] on 22 Nov 2013 at 10:27
