This repository has been archived by the owner on Jun 1, 2021. It is now read-only.
Scriptable batch downloading of webpages to generate HTTP Archive (HAR) files, using PhantomJS.

joelpurra/har-heedless

Scriptable batch downloading of webpages to generate HTTP Archive (HAR) files, using PhantomJS. See har-dulcify for aggregate HAR analysis. You might want to use har-portent, which both downloads multiple dataset variations using har-heedless and analyzes them with har-dulcify in a single step.

⚠️ This project has been archived

No future updates are planned. Feel free to continue using it, but expect no support.

  • Downloads the front web page of all domains in a dataset.
    • Input is a text file with one domain name per line.
    • Downloads n domains in parallel.
      • Tested with over 100 parallel requests on a single machine of moderate speed and memory. YMMV.
      • Machine load heavily depends on the complexity and response rate of the average domain in the dataset.
    • Shows progress as well as expected time to finish downloads.
    • Downloads domains with different prefixes as separate dataset variations.
      • Default prefixes:
        • http://
        • https://
        • http://www.
        • https://www.
    • Retries failed domains twice to reduce the effect of intermittent problems.
      • Increases domain timeouts for failed domains.
    • Saves screenshots of all webpages.
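The retry behavior in the list above (two retries, with a longer timeout for domains that failed) can be illustrated roughly as follows. This is a minimal sketch, not the code har-heedless uses: the function name `retry_with_longer_timeout`, the convention of passing the timeout as the last argument, and the doubling factor are all assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of har-heedless.
# Usage: retry_with_longer_timeout <initial-timeout-seconds> <command> [args...]
# Runs the command with the current timeout appended as its last argument.
# On failure it retries up to two more times, doubling the timeout each time
# (the doubling factor is an assumption).
retry_with_longer_timeout() {
  local timeout="$1"
  shift
  local attempt
  for attempt in 1 2 3; do
    if "$@" "$timeout"; then
      return 0
    fi
    echo "attempt ${attempt} failed with timeout ${timeout}s" >&2
    timeout=$((timeout * 2))
  done
  # All three attempts (one initial try plus two retries) failed.
  return 1
}
```

It could be called as `retry_with_longer_timeout 30 my_download_command 'https://www.example.com/'`, where `my_download_command` stands in for any command that accepts a timeout in seconds as its final argument.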

Usage

# Downloads domain front pages in parallel.
# domains | ./src/domain/parallel.sh <prefix> <parallelism> --screenshot <true|false>
<domains.txt ./src/domain/parallel.sh 'https://www.' 10 --screenshot true

# More advanced usage, with pipe-viewer (pv) for speed estimates.
size=$(wc -l domains.txt | awk '{ print $1 }')
pv --line-mode --size "$size" -cN "input" domains.txt | ./src/domain/parallel.sh 'https://www.' 10 --screenshot true | pv --line-mode --size "$size" -cN "output" >> "domains.log"
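To download the same domain list once per default prefix, the parallel script can be wrapped in a loop. The sketch below is only an illustration: the function name `download_variations`, the `HEEDLESS_DOWNLOADER` override variable, and the per-prefix log file names are assumptions, not something har-heedless defines. Whether each variation needs its own working directory is not covered here; har-portent is the tool intended to manage full dataset variations.

```shell
#!/usr/bin/env bash

# Sketch only. The log file names and the HEEDLESS_DOWNLOADER variable are
# assumptions for illustration; har-heedless does not define them.
# Usage: download_variations <domains-file> <parallelism>
download_variations() {
  local domains_file="$1"
  local parallelism="$2"
  local downloader="${HEEDLESS_DOWNLOADER:-./src/domain/parallel.sh}"
  local prefix name

  for prefix in 'http://' 'https://' 'http://www.' 'https://www.'; do
    # Turn the prefix into a file-name-safe label: 'https://www.' becomes 'https-www'.
    name="$(printf '%s' "$prefix" | tr -cs 'a-z' '-')"
    name="${name%-}"

    # Same invocation as in the usage above, with one log file per prefix.
    <"$domains_file" "$downloader" "$prefix" "$parallelism" --screenshot true >>"${name}.log"
  done
}
```

For example, `download_variations domains.txt 10` run from the repository root would append to `http.log`, `https.log`, `http-www.log`, and `https-www.log` in the current directory.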

Other options:

# Download domain front pages in serial. This can be very slow.
# domains | ./src/domain/serial.sh <prefix> --screenshot <true|false>
<domains.txt ./src/domain/serial.sh 'https://www.' --screenshot true

# Download custom URLs in parallel. Note that almost no testing of non-front-page downloading has been done.
# urls | ./src/url/parallel.sh --screenshot <true|false>
<urls.txt ./src/url/parallel.sh --screenshot true

# Download custom URLs in serial. This can be very slow. Note that almost no testing of non-front-page downloading has been done.
# urls | ./src/url/serial.sh --screenshot <true|false>
<urls.txt ./src/url/serial.sh --screenshot true

# Download a single URL. Note that almost no testing of non-front-page downloading has been done.
# ./src/url/single.sh <URL> --screenshot <true|false>
./src/url/single.sh 'https://joelpurra.com/' --screenshot true

# Fetch a single HAR, optionally with an embedded screenshot. Note that almost no testing of non-front-page downloading has been done.
# ./src/get/har.sh <URL> --screenshot <true|false>
./src/get/har.sh 'https://joelpurra.com/' --screenshot true

Original purpose

Photo of Joel Purra presenting his master's thesis, named Swedes Online: You Are More Tracked Than You Think

Built as a component in Joel Purra's master's thesis research, where downloading lots of front pages in the .se top level domain zone was required to analyze their content and use of internal/external resources.

Citations

If you use, like, reference, or base work on the thesis report Swedes Online: You Are More Tracked Than You Think, the IEEE LCN 2016 paper Third-party Tracking on the Web: A Swedish Perspective, open source code, or open data, please add at least one of the following two citations with a link to the project website: https://joelpurra.com/projects/masters-thesis/

Master's thesis citation:

Joel Purra. 2015. Swedes Online: You Are More Tracked Than You Think. Master's thesis. Linköping University (LiU), Linköping, Sweden. https://joelpurra.com/projects/masters-thesis/

IEEE LCN 2016 paper citation:

J. Purra, N. Carlsson, Third-party Tracking on the Web: A Swedish Perspective, Proc. IEEE Conference on Local Computer Networks (LCN), Dubai, UAE, Nov. 2016. https://joelpurra.com/projects/masters-thesis/

Thanks


Copyright (c) 2014, 2015, 2016, 2017 Joel Purra. Released under GNU General Public License version 3.0 (GPL-3.0).
