Bug in the domain-parsing regex #1

GoogleCodeExporter · 2015-03-14T17:51:55Z

'.org.ua' isn't recognized as a proper TLD, so the whole TLD gets blacklisted 
as if it were a single domain when a spammer uses 'domain.org.ua'.
The domain extraction regex needs to be updated.

Overall, the exhaustive approach used by the URL domain-parsing regex (which 
strips subdomains from URLs, keeping only the domain and TLD) probably needs a 
bit of dusting off: either make sure the TLD list is up to date, or make the 
approach more flexible so it copes with new TLDs.

Original issue reported on code.google.com by [email protected] on 16 Jul 2008 at 7:19

GoogleCodeExporter · 2015-03-14T17:51:55Z

Original comment by [email protected] on 16 Jul 2008 at 7:19

Added labels: regex

GoogleCodeExporter · 2015-03-14T17:51:56Z

Mozilla maintains a public list of all TLDs.  Should we just check against that?

http://publicsuffix.org/
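
A rough sketch of what checking against such a list could look like, written in Python for brevity (the plugin itself is PHP). The small `SUFFIXES` set is a hand-picked stand-in for the real list, and `registrable_domain` is a hypothetical helper name, not existing code.

```python
# Hand-picked sample; the real list at publicsuffix.org is far longer.
SUFFIXES = {"com", "ua", "org.ua", "uk", "co.uk"}


def registrable_domain(host):
    """Return the longest matching suffix plus one more label, or None."""
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so 'org.ua' is found before 'ua'.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SUFFIXES:
            if i == 0:
                return None  # the host is itself a bare suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


print(registrable_domain("www.domain.org.ua"))  # domain.org.ua
```

With this lookup, 'domain.org.ua' would be blacklisted on its own rather than all of '.org.ua'.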

Original comment by [email protected] on 21 Jul 2008 at 3:02

GoogleCodeExporter · 2015-03-14T17:51:56Z

FYI -- I plan on updating this from the Mozilla list, but the page is currently 
down.

Original comment by [email protected] on 5 Jun 2010 at 10:05

GoogleCodeExporter · 2015-03-14T17:51:56Z

Update: This is going to be a bit more complicated than simply updating the 
existing list. The current list from publicsuffix.org is over 3,000 entries 
long, and that includes some wildcards!  So rather than passing a massive PHP 
array, I think we'll have to create & populate a MySQL table and check against 
that.  Of course that also means keeping said table updated....
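
To make the table idea concrete, here is a minimal sketch in Python, with an in-memory SQLite database standing in for MySQL. The table name `suffix_rules`, the helper names, and the five sample rows are all assumptions for illustration. The rows use the list's rule syntax, where `*.` marks a wildcard and `!` marks an exception to a wildcard.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL table; rows are a tiny sample.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE suffix_rules (rule TEXT PRIMARY KEY)")
db.executemany(
    "INSERT INTO suffix_rules (rule) VALUES (?)",
    [("com",), ("ua",), ("org.ua",), ("*.ck",), ("!www.ck",)],
)


def has_rule(rule):
    """True if the exact rule string is present in the table."""
    row = db.execute(
        "SELECT 1 FROM suffix_rules WHERE rule = ?", (rule,)
    ).fetchone()
    return row is not None


def public_suffix(host):
    """Return the longest public suffix of host per the table, or None."""
    labels = host.lower().strip(".").split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        # An exception rule cancels the wildcard for this exact name,
        # so the suffix is everything after its first label.
        if has_rule("!" + candidate):
            return ".".join(labels[i + 1:])
        if has_rule(candidate):
            return candidate
        # A wildcard rule matches any single label in front of the rest.
        if i + 1 < len(labels) and has_rule("*." + ".".join(labels[i + 1:])):
            return candidate
    return None


print(public_suffix("shop.domain.org.ua"))  # org.ua
print(public_suffix("foo.bar.ck"))          # bar.ck
print(public_suffix("www.ck"))              # ck
```

Each host costs a handful of indexed lookups, one to three per label, so the size of the table matters much less than the length of the hostname.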

Original comment by [email protected] on 30 Jul 2010 at 4:05

GoogleCodeExporter · 2015-03-14T17:51:56Z

Removing myself as Owner for this.  I don't know enough about the proper way 
to handle the length of the complete, updated TLD list, but I'm pretty sure we 
can't pass a 3,000-item array in PHP without breaking something.

This is an important one though, and I would appreciate somebody more skilled 
picking this up.

Keep in mind that in the long run we also need some means of keeping the list 
updated.

(Also changing from priority-medium to priority-high)

Original comment by [email protected] on 11 Jan 2011 at 11:08

Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter · 2015-03-14T17:51:56Z

The Internet landscape is getting more complicated.  With the new wave of 
basically infinite arbitrary TLDs on their way -- e.g. ".media" --  I'm not 
sure if it will be possible to parse this anymore.

Unless... perhaps the new TLDs are all single-dot, in which case we may 
theoretically be able to check against a list of known double-dot TLDs -- e.g. 
".co.uk" -- and just assume that in all other cases, whatever's after that dot 
is the TLD?
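
The heuristic described here could be sketched as follows (Python for brevity; the plugin itself is PHP). `TWO_LABEL_SUFFIXES` is a hypothetical, deliberately short sample, and `guess_domain` is an illustrative name. The sketch assumes, as the comment does, that every suffix not on the list is a single label.

```python
# Hypothetical short list of known "double-dot" suffixes; anything not
# listed here is assumed to end in a single-label TLD.
TWO_LABEL_SUFFIXES = {"co.uk", "org.uk", "org.ua", "com.au"}


def guess_domain(host):
    """Strip subdomains, keeping the domain plus its (guessed) TLD."""
    labels = host.lower().strip(".").split(".")
    if len(labels) < 2:
        return ".".join(labels)
    # Keep three labels when the last two form a known two-label suffix.
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    # Otherwise assume whatever follows the last dot is the TLD.
    return ".".join(labels[-2:])


print(guess_domain("news.example.media"))  # example.media
print(guess_domain("www.shop.co.uk"))      # shop.co.uk
```

The trade-off is that any two-label suffix missing from the list gets treated as a domain, which is the same failure this issue started with for '.org.ua'.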

Original comment by [email protected] on 22 Nov 2013 at 10:27

GoogleCodeExporter added Type-Defect Priority-High regex auto-migrated labels Mar 14, 2015

strider72 added the help wanted label Jun 3, 2015
