Python 2/3 script to download Verisign TLD zone files, then extract, transform, and load the domain data into MongoDB
Verisign's .com, .net, and .name top-level domain information is available in zone file format for download from their trusted FTP server. To gain access to the FTP server(s) and top-level domain zone files, request permission at verisigninc.com.
The provided files are in gzip format. This Python script, compatible with Python versions 2 and 3, will:
- Download gzipped files from Verisign's trusted FTP servers
- Incrementally stream data from the gzipped files into smaller sorted temporary ASCII storage files
- Dedupe domain names
- Load unique domain names into a MongoDB database
- Clean up temporary ASCII storage files
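
The streaming, sorting, and dedupe steps above could be sketched roughly as follows. This is not the code in etl.py; it is a minimal Python 3 illustration of an external sort. It assumes the first whitespace-separated field of each record line is the domain name and that lines starting with ; or $ can be skipped. The function names and the chunk_size parameter are hypothetical.

```python
import gzip
import heapq
import os
import tempfile


def extract_domain(line):
    """Return the lowercased first field of a record line, or None to skip it."""
    line = line.strip()
    # Assumption: blank lines, comments (;) and directives ($ORIGIN, $TTL) are skipped.
    if not line or line[0] in ";$":
        return None
    return line.split()[0].rstrip(".").lower()


def write_sorted_chunks(gz_path, tmp_dir, chunk_size=100000):
    """Stream the gzip file and spill sorted chunks of names to temp files."""
    paths, chunk = [], set()

    def flush():
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".txt")
        with os.fdopen(fd, "w") as out:
            out.writelines(name + "\n" for name in sorted(chunk))
        paths.append(path)
        chunk.clear()

    # Only one chunk of names is held in memory at a time.
    with gzip.open(gz_path, "rt", encoding="ascii") as src:
        for line in src:
            name = extract_domain(line)
            if name:
                chunk.add(name)
                if len(chunk) >= chunk_size:
                    flush()
    if chunk:
        flush()
    return paths


def read_names(handle):
    """Yield the lines of an open chunk file without their newlines."""
    for line in handle:
        yield line.rstrip("\n")


def merge_unique(paths):
    """Merge the sorted chunk files, yield each name once, then delete them."""
    files = [open(p) for p in paths]
    try:
        previous = None
        # heapq.merge reads the sorted inputs lazily, so duplicates are adjacent.
        for name in heapq.merge(*[read_names(f) for f in files]):
            if name != previous:
                yield name
                previous = name
    finally:
        for f in files:
            f.close()
        for p in paths:
            os.remove(p)
```

Because every chunk file is already sorted, the merge only needs to remember the previous name to drop duplicates, so memory use depends on chunk_size and not on the size of the zone file.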
This script was written to allow for the parsing and processing of these large files on small worker server instances without placing heavy load on memory or storage capacity.
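
One way to keep memory bounded during the MongoDB load is to insert names in fixed-size batches. The sketch below is an illustration under assumptions, not the code in etl.py: the helper names, the batch size, and the "domain" field name are hypothetical. It takes the insert function as a parameter, so it runs without a database; the trailing comment shows how it might be wired to pymongo's Collection.insert_many.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def load_domains(names, insert_many, batch_size=1000):
    """Pass names to `insert_many` in batches; return how many were sent."""
    total = 0
    for batch in batched(names, batch_size):
        # Hypothetical document shape: one document per domain name.
        insert_many([{"domain": name} for name in batch])
        total += len(batch)
    return total


# Possible wiring to MongoDB (URI, database, and collection names are placeholders):
#   from pymongo import MongoClient
#   collection = MongoClient("mongodb://localhost:27017")["zones"]["domains"]
#   load_domains(names, collection.insert_many)
```

Here names can be any iterable, including a generator, so only one batch of documents exists in memory at a time.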
Open etl.py in a text editor or IDE and modify the variables at the top of the script.
After you've configured etl.py appropriately, run:
python ./etl.py
*note: This process takes a while. Find popcorn.