Support a best practice retry-n-times crawl approach #23

Closed
motin opened this issue Aug 3, 2019 · 8 comments
@motin
Contributor

motin commented Aug 3, 2019

After performing several larger-scale crawls, it is clear that there will always be a percentage of crawl visits that fail due to random circumstances, such as:

  • Sites temporarily not responding / overloaded (503 etc)
  • Containers randomly crash (out-of-memory / process crash / node preempted etc)
  • OpenWPM crashes
  • etc

Even though each of the above failure causes should be remedied individually, it is unlikely that they can be avoided completely. Thus, it seems to me that the best approach to performing crawls is as follows:

  1. Run the crawl starting with ~15 non-preemptible nodes with auto-scaling enabled (so we avoid out-of-memory issues as much as possible)
  2. After the crawl has finished (or stopped prematurely due to some failure), use the contents of the crawl_history table to find out which sites were on the crawl list but did not end up in crawl_history with bool_success == 1
  3. Clear any remaining parts of the crawl-queue and run a new iteration of the crawl with this smaller list
  4. Repeat n times (n could be 3 for instance)

This in effect allows us to retry failed site visits and end up with as many successful sites as reasonably possible (if a site still fails after n retries, the failure should not be due to temporary issues such as out-of-memory or similar). For the analysis, we can disregard failed results and (assuming that a crawl_history entry is only written after a successful crawl visit has completely finished, including all of its data having been uploaded) we will get at most one copy of each successful visit with complete data.

To implement this, we basically only need a Python script for point 2 above, plus an update to the docs describing how to run a new crawl iteration.
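For reference, a rough sketch of what such a script for point 2 could look like, assuming crawl_history is available as parquet and that the visited URL is stored in an arguments column (the column names are assumptions and may differ from the actual schema):

```python
# Hypothetical sketch: build the retry list for the next crawl iteration.
# Assumes crawl_history was saved as parquet and that the visited URL is
# stored in the "arguments" column -- adjust to the actual schema.
import pandas as pd

def build_retry_list(site_list_path, crawl_history_path, output_path):
    # Original crawl list, one URL per line.
    with open(site_list_path) as f:
        all_sites = {line.strip() for line in f if line.strip()}

    history = pd.read_parquet(crawl_history_path)
    successful = set(history.loc[history["bool_success"] == 1, "arguments"])

    # Sites that never produced a successful crawl_history entry.
    retry_sites = sorted(all_sites - successful)
    with open(output_path, "w") as f:
        f.write("\n".join(retry_sites))
    return retry_sites
```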

@motin
Contributor Author

motin commented Aug 3, 2019

assuming that a crawl_history entry is only written after a successful crawl visit has completely finished, including all of its data having been uploaded

@englehardt is this a valid assumption? If not, can we make it true? :)

@englehardt
Contributor

englehardt commented Aug 5, 2019

assuming that a crawl_history entry is only written after a successful crawl visit has completely finished, including all of its data having been uploaded

@englehardt is this a valid assumption? If not, can we make it true? :)

A crawl history event is only written after the full page load has been completed and the data queued for saving. We definitely could make it true, but we'll need to find some way to efficiently store site visit data. Writing individual parquet files for each site visit makes a million site dataset super slow to work with. Maybe just writing it and having one slow repartition step afterwards is acceptable.
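For concreteness, the post-crawl repartition step could look roughly like this (a sketch using Dask; the paths and partition count are placeholders, not the actual pipeline):

```python
# Hypothetical sketch of a one-off repartition step after the crawl,
# collapsing many small per-visit parquet files into fewer large ones.
import dask.dataframe as dd

df = dd.read_parquet("s3://crawl-bucket/raw-visits/")   # many small per-visit files
df = df.repartition(npartitions=256)                    # fewer, larger partitions
df.to_parquet("s3://crawl-bucket/repartitioned/", write_index=False)
```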

@birdsarah
Contributor

I am generally in support of this, but for some things it feels like a workaround for issues that are just bugs within openwpm, e.g. losing hundreds of sites' worth of successfully collected data because of how we've chosen to batch data.

I'm wondering if there are methodological concerns of introducing bias with hitting the same site(s) again. I'm not sure there are, I just want to think it through.

I'd like to make sure we consider whether there's a difference between data not collected due to measurement issues (e.g. a batch failure) and real internet responses, e.g. a site returning a 503.

@motin
Contributor Author

motin commented Aug 6, 2019

A crawl history event is only written after the full page load has been completed and the data queued for saving. We definitely could make it true, but we'll need to find some way to efficiently store site visit data. Writing individual parquet files for each site visit makes a million site dataset super slow to work with. Maybe just writing it and having one slow repartition step afterwards is acceptable.

I believe @birdsarah regularly does this repartition step before analysis of OpenWPM crawl data for performance reasons. It makes sense to optimize against data loss at data collection time and repartition post-crawl. Added a tracking issue for this in openwpm/OpenWPM#443

I'm wondering if there are methodological concerns of introducing bias with hitting the same site(s) again. I'm not sure there are, I just want to think it through.

Could be, but it is worth noting that Scrapy retries failed requests 2 times by default.

I'd like to make sure we consider whether there's a difference between data not collected due to measurement issues (e.g. a batch failure) and real internet responses, e.g. a site returning a 503.

Good perspective. We should at least explicitly differentiate between these types of failures in the crawl_history table to allow for investigation into this. The current implementation does some differentiation based on command success (1), command failure (0) and command timeout (-1), but I am not sure what this translates to. @englehardt can you shine a light on this?
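As a starting point for that investigation, the existing status codes could simply be tallied, e.g. (a sketch; assuming the 1/0/-1 codes live in the bool_success column, and that my reading of the labels is correct):

```python
# Hypothetical sketch: break down crawl_history rows by command status
# to separate ordinary successes, failures and timeouts.
import pandas as pd

history = pd.read_parquet("crawl_history.parquet")
labels = {1: "command success", 0: "command failure", -1: "command timeout"}
print(history["bool_success"].map(labels).value_counts())
```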

@motin
Contributor Author

motin commented Aug 10, 2019

To prevent some of the potential bias, a modification of this approach is, in step 2, to only retry sites that did not end up in crawl_history at all. This only works around data-loss-related issues and does not retry otherwise failed or timed-out visit attempts.
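In code terms (reusing the variables from the earlier sketch, with the same column-name assumptions), the only change would be to retry sites with no crawl_history row at all rather than sites without a successful one:

```python
# Hypothetical variant: only retry sites with no crawl_history entry at all,
# i.e. visits lost to data-loss issues rather than failed/timed-out visits.
visited = set(history["arguments"])
retry_sites = sorted(all_sites - visited)
```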

@motin
Contributor Author

motin commented Aug 16, 2019

Experience from recent crawls favors this retry-n-times approach for improving the completeness of the captured data.

@motin
Contributor Author

motin commented Sep 1, 2019

With #27, #28 and #30 merged, this issue has less of an impact (only around 0.13% of records on a 100k crawl got lost recently).

@vringar
Contributor

vringar commented Aug 6, 2021

Experience from recent crawls favors this retry-n-times approach for improving the completeness of the captured data.

This is currently implemented: the completion callback only removes an element from the redis queue if the visit was successful.
Otherwise the element is simply released and marked as "up for grabs".
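In pattern form, that behavior looks roughly like the following (a minimal sketch of the reliable-queue idea, not the actual openwpm-crawler implementation; the queue names are made up):

```python
# Sketch of the "only remove on success, otherwise up for grabs" pattern.
import redis

r = redis.Redis()

def claim_site(timeout=30):
    # Atomically move a site from the main queue to a processing list so
    # it is not lost if the worker dies mid-visit.
    return r.brpoplpush("crawl-queue", "crawl-queue:processing", timeout=timeout)

def complete_site(site, success):
    # Remove the claimed element from the processing list...
    r.lrem("crawl-queue:processing", 1, site)
    if not success:
        # ...and, on failure, release it back to the main queue ("up for grabs").
        r.lpush("crawl-queue", site)
```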

vringar closed this as completed Aug 6, 2021