Support a best practice retry-n-times crawl approach #23

Closed
motin opened this issue Aug 3, 2019 · 8 comments
@motin
Contributor

motin commented Aug 3, 2019

After performing several larger-scale crawls, it is clear that there will always be a percentage of crawl visits that fail due to random circumstances, such as:

  • Sites temporarily not responding / overloaded (503 etc)
  • Containers randomly crash (out-of-memory / process crash / node preempted etc)
  • OpenWPM crashes
  • etc

Even though each of the above failure causes should be remedied individually, it is unlikely that they can be avoided completely. Thus, it seems to me that the best approach to performing crawls is as follows:

  1. Run the crawl starting with ~15 non-preemptible nodes with auto-scaling enabled (so we avoid out-of-memory issues as much as possible)
  2. After the crawl has finished (or stopped prematurely due to some failure), use the contents of the crawl_history table to find out which sites were on the crawl list but did not end up in crawl_history with bool_success == 1
  3. Clear any remaining parts of the crawl-queue and run a new iteration of the crawl with this smaller list
  4. Repeat n times (n could be 3 for instance)

This in effect allows us to retry failed site visits and end up with as many successful sites as reasonably possible (if a site still fails after n retries, the failure should not be due to temporary issues such as out-of-memory or similar). For the analysis, we can disregard failed results and (assuming that a crawl_history entry is only written after a successful crawl visit has completely finished, including all of its data having been uploaded) we will get at most one copy of each successful visit with complete data.

To implement this, we basically only need a Python script for point 2 above, plus an update to the docs describing how to run a new crawl iteration.
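For reference, a rough sketch of what such a script for point 2 could look like, assuming crawl_history is available as parquet and that the visited URL is stored in an arguments column (the column names are assumptions and may differ from the actual schema):

```python
# Hypothetical sketch: build the retry list for the next crawl iteration.
# Assumes crawl_history was saved as parquet and that the visited URL is
# stored in the "arguments" column -- adjust to the actual schema.
import pandas as pd

def build_retry_list(site_list_path, crawl_history_path, output_path):
    # Original crawl list, one URL per line.
    with open(site_list_path) as f:
        all_sites = {line.strip() for line in f if line.strip()}

    history = pd.read_parquet(crawl_history_path)
    successful = set(history.loc[history["bool_success"] == 1, "arguments"])

    # Sites that never produced a successful crawl_history entry.
    retry_sites = sorted(all_sites - successful)
    with open(output_path, "w") as f:
        f.write("\n".join(retry_sites))
    return retry_sites
```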

@motin
Contributor Author

motin commented Aug 3, 2019

assuming that a crawl_history entry is only written after a successful crawl visit has completely finished, including all of its data having been uploaded

@englehardt is this a valid assumption? If not, can we make it true? :)

@englehardt
Contributor

englehardt commented Aug 5, 2019

assuming that a crawl_history entry is only written after a successful crawl visit has completely finished, including all of its data having been uploaded

@englehardt is this a valid assumption? If not, can we make it true? :)

A crawl history event is only written after the full page load has been completed and the data queued for saving. We definitely could make it true, but we'll need to find some way to efficiently store site visit data. Writing individual parquet files for each site visit makes a million site dataset super slow to work with. Maybe just writing it and having one slow repartition step afterwards is acceptable.
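For concreteness, the post-crawl repartition step could look roughly like this (a sketch using Dask; the paths and partition count are placeholders, not the actual pipeline):

```python
# Hypothetical sketch of a one-off repartition step after the crawl,
# collapsing many small per-visit parquet files into fewer large ones.
import dask.dataframe as dd

df = dd.read_parquet("s3://crawl-bucket/raw-visits/")   # many small per-visit files
df = df.repartition(npartitions=256)                    # fewer, larger partitions
df.to_parquet("s3://crawl-bucket/repartitioned/", write_index=False)
```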

@birdsarah
Contributor

I am generally in support of this, but for some things it feels like a workaround for issues that are just bugs within openwpm, e.g. losing hundreds of sites' worth of successfully collected data because of how we've chosen to batch data.

I'm wondering if there are methodological concerns of introducing bias with hitting the same site(s) again. I'm not sure there are, I just want to think it through.

I'd like to make sure we consider whether there's a difference between data not collected due to measurement issues (e.g. a batch failure) and real internet responses, e.g. a site returning a 503.

@motin
Contributor Author

motin commented Aug 6, 2019

A crawl history event is only written after the full page load has been completed and the data queued for saving. We definitely could make it true, but we'll need to find some way to efficiently store site visit data. Writing individual parquet files for each site visit makes a million site dataset super slow to work with. Maybe just writing it and having one slow repartition step afterwards is acceptable.

I believe @birdsarah regularly does this repartition step before analysis of OpenWPM crawl data for performance reasons. It makes sense to optimize against data loss at data collection time and repartition post-crawl. Added a tracking issue for this in openwpm/OpenWPM#443

I'm wondering if there are methodological concerns of introducing bias with hitting the same site(s) again. I'm not sure there are, I just want to think it through.

Could be, but it is worth noting that Scrapy retries failed requests 2 times by default.

I'd like to make sure we consider whether there's a difference between data not collected due to measurement issues (e.g. a batch failure) and real internet responses, e.g. a site returning a 503.

Good perspective. We should at least explicitly differentiate between these types of failures in the crawl_history table to allow for investigation into this. The current implementation does some differentiation based on command success (1), command failure (0) and command timeout (-1), but I am not sure what this translates to. @englehardt can you shine a light on this?
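As a starting point for that investigation, the existing status codes could simply be tallied, e.g. (a sketch; assuming the 1/0/-1 codes live in the bool_success column, and that my reading of the labels is correct):

```python
# Hypothetical sketch: break down crawl_history rows by command status
# to separate ordinary successes, failures and timeouts.
import pandas as pd

history = pd.read_parquet("crawl_history.parquet")
labels = {1: "command success", 0: "command failure", -1: "command timeout"}
print(history["bool_success"].map(labels).value_counts())
```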

@motin
Contributor Author

motin commented Aug 10, 2019

To prevent some of the potential bias, a modification of this approach is, in step 2, to only retry sites that did not end up in crawl_history at all. This only works around data-loss-related issues and does not retry otherwise failed or timed-out visit attempts.
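In code terms (reusing the variables from the earlier sketch, with the same column-name assumptions), the only change would be to retry sites with no crawl_history row at all rather than sites without a successful one:

```python
# Hypothetical variant: only retry sites with no crawl_history entry at all,
# i.e. visits lost to data-loss issues rather than failed/timed-out visits.
visited = set(history["arguments"])
retry_sites = sorted(all_sites - visited)
```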

@motin
Contributor Author

motin commented Aug 16, 2019

Experience from recent crawls favors this retry-n-times approach for improving the completeness of the captured data.

@motin
Contributor Author

motin commented Sep 1, 2019

With #27, #28 and #30 merged, this issue has less of an impact (only around 0.13% of records on a 100k crawl got lost recently).

@vringar
Contributor

vringar commented Aug 6, 2021

Experience from recent crawls favors this retry-n-times approach for improving the completeness of the captured data.

This is currently implemented: the completion callback only removes an element from the redis queue if the visit was successful.
Otherwise the element is simply released and marked as "up for grabs".
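In pattern form, that behavior looks roughly like the following (a minimal sketch of the reliable-queue idea, not the actual openwpm-crawler implementation; the queue names are made up):

```python
# Sketch of the "only remove on success, otherwise up for grabs" pattern.
import redis

r = redis.Redis()

def claim_site(timeout=30):
    # Atomically move a site from the main queue to a processing list so
    # it is not lost if the worker dies mid-visit.
    return r.brpoplpush("crawl-queue", "crawl-queue:processing", timeout=timeout)

def complete_site(site, success):
    # Remove the claimed element from the processing list...
    r.lrem("crawl-queue:processing", 1, site)
    if not success:
        # ...and, on failure, release it back to the main queue ("up for grabs").
        r.lpush("crawl-queue", site)
```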

vringar closed this as completed Aug 6, 2021