fasttld is a high-performance top-level domain (TLD) extraction module based on the compressed trie data structure, implemented with the built-in Python dict().
The goal of fasttld is to extract top-level domains (TLDs) from URLs efficiently. In other words, it extracts com from URLs like www.google.com or https://maps.google.com:8080/a/long/path/?query=42.
Running something like ".".join(domain.split('.')[1::]) is not a viable solution. For example, maps.baidu.com.cn would give the wrong result baidu.com.cn instead of com.cn.
The fasttld module solves this problem by using the regularly-updated Mozilla Public Suffix List and the trie data structure to efficiently extract subdomains, hostnames, and TLDs from URLs.
fasttld also supports extraction of private domains listed in the Mozilla Public Suffix List like 'blogspot.co.uk' and 'sinaapp.com'.
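The difference between the naive split and a suffix-list lookup can be illustrated with a small sketch. This is not fasttld's actual implementation: the tiny hard-coded suffix set, the `_END` marker key, and the helper names `build_trie` and `split_host` are assumptions made for illustration, and wildcard and exception rules from the real Public Suffix List are ignored.

```python
def naive_suffix(host):
    """Drop the first label and treat the rest as the suffix (the flawed approach)."""
    return ".".join(host.split(".")[1:])


def build_trie(suffixes):
    """Build a nested-dict trie keyed by labels in reverse order (TLD first)."""
    trie = {}
    for suffix in suffixes:
        node = trie
        for label in reversed(suffix.split(".")):
            node = node.setdefault(label, {})
        node["_END"] = True  # hypothetical marker: a complete suffix ends here
    return trie


def split_host(host, trie):
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    labels = host.split(".")
    node = trie
    suffix_len = 0
    # Walk the host from right to left, remembering the deepest complete suffix.
    for depth, label in enumerate(reversed(labels), start=1):
        if label not in node:
            break
        node = node[label]
        if "_END" in node:
            suffix_len = depth
    cut = len(labels) - suffix_len
    suffix = ".".join(labels[cut:])
    domain = labels[cut - 1] if cut >= 1 else ""
    subdomain = ".".join(labels[:cut - 1]) if cut >= 1 else ""
    return subdomain, domain, suffix


# A hand-picked sample, not the real Public Suffix List.
SAMPLE_TRIE = build_trie({"com", "cn", "com.cn", "uk", "ac.uk", "co.uk"})

if __name__ == "__main__":
    print(naive_suffix("maps.baidu.com.cn"))
    print(split_host("maps.baidu.com.cn", SAMPLE_TRIE))
```

Storing labels in reverse order means a lookup only walks as many dict levels as the host has matching labels, regardless of how many suffixes the list contains.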
You can install fasttld from PyPI.
pip install fasttld
or build from source
git clone https://github.com/jophy/fasttld.git && cd fasttld
python setup.py install
>>> from fasttld import FastTLDExtract
>>> t = FastTLDExtract()
>>> res = t.extract("https://some-user@a.long.subdomain.ox.ac.uk:5000/a/b/c/d/e/f/g/h/i?id=42")
>>> scheme, userinfo, subdomain, domain, suffix, port, path, domain_name = res
>>> scheme, userinfo, subdomain, domain, suffix, port, path, domain_name
('https://', 'some-user', 'a.long.subdomain', 'ox', 'ac.uk', '5000', 'a/b/c/d/e/f/g/h/i?id=42', 'ox.ac.uk')
extract() returns a tuple (scheme, userinfo, subdomain, domain, suffix, port, path, domain_name).
Whenever fasttld is called, it will automatically update the local copy of the Mozilla Public Suffix List if it is more than 3 days old. You can also run the update process manually via the following commands.
>>> import fasttld
>>> fasttld.update()
or
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract().update()
Automatic updates can be disabled by setting the environment variable FASTTLD_NO_AUTO_UPDATE to 1.
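The staleness rule described above can be sketched as follows. This is only an illustration of the stated behaviour (three-day threshold, opt-out via FASTTLD_NO_AUTO_UPDATE), not fasttld's own code: the helper name `needs_update`, its `now` parameter, and the use of the file's modification time are assumptions.

```python
import os
import time

MAX_AGE_SECONDS = 3 * 24 * 60 * 60  # three days, as stated in the text


def needs_update(psl_path, now=None):
    """Return True if the cached list is missing or older than three days.

    Hypothetical helper: fasttld's real check may differ in detail.
    """
    if os.environ.get("FASTTLD_NO_AUTO_UPDATE") == "1":
        return False  # auto-update disabled by the environment variable
    if not os.path.exists(psl_path):
        return True  # nothing cached yet
    now = time.time() if now is None else now
    return now - os.path.getmtime(psl_path) > MAX_AGE_SECONDS
```

In a shell, the variable would typically be set before starting Python, for example with `export FASTTLD_NO_AUTO_UPDATE=1`.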
You can also specify your own public suffix list file.
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(file_path='/path/to/psl/file').extract('domain', subdomain=False)
If you do not need to extract subdomains, you can disable subdomain output with subdomain=False.
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract().extract('domain', subdomain=False) # set subdomain=False
According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like blogspot.co.uk and sinaapp.com because some registered domain owners wish to delegate subdomains to mutually-untrusting parties, and find that being added to the PSL gives their solution more favourable security properties.
By default, fasttld treats private domains as TLDs (i.e. exclude_private_suffix=False).
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(exclude_private_suffix=False).extract('news.blogspot.co.uk')  # blogspot.co.uk is treated as a TLD
('', '', '', 'news', 'blogspot.co.uk', '', '', 'news.blogspot.co.uk')
>>> FastTLDExtract().extract('news.blogspot.co.uk')  # default behaviour, same output as above
('', '', '', 'news', 'blogspot.co.uk', '', '', 'news.blogspot.co.uk')
You can instruct fasttld to exclude private domains by setting exclude_private_suffix=True.
>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(exclude_private_suffix=True).extract('news.blogspot.co.uk')  # co.uk is now recognised as the TLD instead of blogspot.co.uk
('', '', 'news', 'blogspot', 'co.uk', '', '', 'blogspot.co.uk')
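One way to picture the effect of exclude_private_suffix is to look at how the Public Suffix List separates its ICANN and private sections with marker comments. The sketch below parses a tiny hand-written sample in that format. The helper names `load_suffixes` and `longest_suffix` are hypothetical, the sample is not the real list, and fasttld itself may handle the sections differently.

```python
SAMPLE_PSL = """\
// ===BEGIN ICANN DOMAINS===
uk
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
blogspot.co.uk
// ===END PRIVATE DOMAINS===
"""


def load_suffixes(psl_text, exclude_private_suffix=False):
    """Collect suffix rules, optionally skipping the private section."""
    suffixes = set()
    in_private = False
    for raw in psl_text.splitlines():
        line = raw.strip()
        if line.startswith("// ===BEGIN PRIVATE DOMAINS==="):
            in_private = True
        elif line.startswith("// ===END PRIVATE DOMAINS==="):
            in_private = False
        if not line or line.startswith("//"):
            continue  # skip blank lines and comments
        if in_private and exclude_private_suffix:
            continue  # ignore private rules when asked to
        suffixes.add(line)
    return suffixes


def longest_suffix(host, suffixes):
    """Return the longest suffix of host found in the set, or '' if none."""
    labels = host.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return candidate
    return ""


if __name__ == "__main__":
    host = "news.blogspot.co.uk"
    print(longest_suffix(host, load_suffixes(SAMPLE_PSL)))
    print(longest_suffix(host, load_suffixes(SAMPLE_PSL, exclude_private_suffix=True)))
```

With the private section included, the longest match is blogspot.co.uk; with it excluded, the match falls back to co.uk, which mirrors the two outputs shown above.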
Similar modules include tldextract and tld.
Initialize the module class once, then call its extract function ten million times. Measure the time taken.
Python 3.9.12, AMD Ryzen 7 5800X 3.8 GHz 8 cores 16 threads, 48GB RAM
module\case | jophy.com | www.baidu.com.cn | jo.noexist | https://maps.google.com.ua/a/long/path?query=42 | 1.1.1.1 | https://192.168.55.1 |
---|---|---|---|---|---|---|
fasttld | 7.60s | 9.90s | 5.28s | 5.67s | 5.06s | 5.30s |
tldextract | 22.96s | 29.32s | 25.06s | 31.69s | 33.89s | 35.15s |
tld | 26.75s | 29.00s | 23.01s | 27.55s | 22.79s | 22.55s |
Excluding subdomains (i.e. subdomain=False)
module\case | jophy.com | www.baidu.com.cn | jo.noexist | https://maps.google.com.ua/a/long/path?query=42 | 1.1.1.1 | https://192.168.55.1 |
---|---|---|---|---|---|---|
fasttld | 7.55s | 8.98s | 5.20s | 5.52s | 5.13s | 5.25s |
On average, fasttld is roughly 4 to 5 times faster than the other modules. It retains its performance advantage even when parsing long URLs like https://maps.google.com.ua/a/long/path?query=42.
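The benchmark procedure (initialize once, call extract repeatedly, measure elapsed time) can be sketched with a small timing harness. This is not the script used to produce the tables above: `time_calls` and `stand_in_extract` are hypothetical names, and the stand-in function exists only so the sketch runs without any of the three modules installed. To time a real module, pass its bound extract method instead.

```python
import time


def time_calls(func, arg, repeats):
    """Call func(arg) `repeats` times and return the elapsed wall time in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    return time.perf_counter() - start


def stand_in_extract(url):
    """Placeholder workload: returns the last dot-separated label."""
    return url.rsplit(".", 1)[-1]


if __name__ == "__main__":
    # The tables above use ten million calls; a smaller count keeps this demo quick.
    elapsed = time_calls(stand_in_extract, "www.baidu.com.cn", 100_000)
    print(f"{elapsed:.2f}s for 100,000 calls")
```

Creating the extractor object outside the timed loop, as the text describes, keeps one-off setup costs such as loading the suffix list out of the per-call measurement.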
- Some code borrowed from the tldextract module