Support a best practice retry-n-times crawl approach #23
Comments
@englehardt is this a valid assumption? If not, can we make it true? :)
A crawl history event is only written after the full page load has been completed and the data queued for saving. We definitely could make it true, but we'll need to find some way to efficiently store site visit data. Writing individual parquet files for each site visit makes a million-site dataset super slow to work with. Maybe just writing it and having one slow repartition step afterwards is acceptable.
I am generally in support of this, but for some things it feels like a workaround for issues that are just bugs within OpenWPM, e.g. losing hundreds of sites' worth of successfully collected data because of how we've chosen to batch data. I'm also wondering whether there are methodological concerns about introducing bias by hitting the same site(s) again. I'm not sure there are, I just want to think it through. Finally, I'd like to make sure we distinguish between data not collected due to measurement issues (e.g. a batch failure) and real internet responses, e.g. a site returning a 503.
I believe @birdsarah regularly does this repartition step before analysis of OpenWPM crawl data for performance reasons. It makes sense to optimize against data loss at data collection time and repartition post-crawl. Added a tracking issue for this in openwpm/OpenWPM#443
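For reference, the post-crawl repartition step could look roughly like the following PySpark sketch. The bucket paths and the target partition count are illustrative assumptions, not values taken from this discussion:

```python
# Rough sketch of a one-off post-crawl repartition job (the paths and
# the partition count of 200 are placeholder assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-crawl-data").getOrCreate()

# Read the many small per-visit parquet files written during the crawl.
df = spark.read.parquet("s3://crawl-bucket/raw/javascript/")

# Rewrite them as a smaller number of larger files so that analysis
# jobs no longer pay the per-file overhead of a million tiny files.
df.repartition(200).write.mode("overwrite").parquet(
    "s3://crawl-bucket/repartitioned/javascript/"
)
```

Paying this cost once after the crawl keeps the per-visit writes simple while still giving analysts a dataset that is fast to scan.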
Could be, but it's worth noting that Scrapy retries 2 times by default.
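For comparison, Scrapy's retry behaviour is driven by its built-in RetryMiddleware and a few settings; the values below reflect its documented defaults (the exact list of retried HTTP codes varies slightly between Scrapy versions):

```python
# Scrapy settings.py excerpt showing the default retry behaviour.
RETRY_ENABLED = True
RETRY_TIMES = 2  # each failed request is retried up to 2 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```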
Good perspective. We should at least explicitly differentiate between these types of failures in the crawl_history table to allow for investigation into this. The current implementation does some differentiation based on command success (1), command failure (0) and command timeout (-1), but I am not sure what this translates to. @englehardt can you shine a light on this?
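As a starting point for that investigation, a quick breakdown of command statuses can be pulled from the SQLite output. This is only a hypothetical sketch: the database filename and the status column name (here bool_success) are assumptions and may differ between OpenWPM versions:

```python
# Hypothetical sketch: count crawl_history commands per status code.
# "crawl-data.sqlite" and the "bool_success" column are assumptions.
import sqlite3

con = sqlite3.connect("crawl-data.sqlite")
rows = con.execute(
    "SELECT bool_success, COUNT(*) FROM crawl_history GROUP BY bool_success"
).fetchall()

# Per the discussion above:
#   1 -> command success, 0 -> command failure, -1 -> command timeout
for status, count in rows:
    print(f"status {status}: {count} commands")
```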
To limit the potential bias, a modification of this approach is to only retry, in step 2, sites that did not end up in crawl_history at all. This only works around data-loss-related issues and does not retry otherwise failed or timed-out visit attempts.
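A minimal sketch of that modification, assuming the visited URL can be recovered from crawl_history (the table/column names and file paths here are assumptions):

```python
# Hypothetical sketch: select only sites with no crawl_history row at
# all, i.e. visits whose data was lost before anything was saved.
import sqlite3

sites = [line.strip() for line in open("site_list.txt") if line.strip()]

con = sqlite3.connect("crawl-data.sqlite")
recorded = {
    row[0] for row in con.execute("SELECT DISTINCT site_url FROM crawl_history")
}

# Only these would be queued again; failed or timed-out visits that did
# produce a crawl_history entry are deliberately left alone.
to_requeue = [site for site in sites if site not in recorded]
print(f"{len(to_requeue)} of {len(sites)} sites have no crawl_history entry")
```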
Experience from recent crawls favors this retry-n-times approach for improving the completeness of the captured data.
This is currently implemented by having the completion callback remove an element from the Redis queue only if the visit was successful.
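A minimal sketch of that mechanism, assuming a plain Redis list as the work queue (the queue names and the callback signature are assumptions, not the actual crawler code):

```python
# Hypothetical sketch: a visit-completion callback that only removes a
# site from the Redis work state when the visit succeeded, so failed
# visits remain available for the next retry iteration.
import redis

r = redis.Redis()
WORK_QUEUE = "crawl-queue"
PROCESSING_QUEUE = "crawl-queue:processing"

def next_site():
    # Atomically move a site into a processing list so a crashing
    # worker cannot silently lose it.
    return r.rpoplpush(WORK_QUEUE, PROCESSING_QUEUE)

def make_callback(site: str):
    def on_complete(success: bool) -> None:
        if success:
            # Done for good: drop the site from the processing list.
            r.lrem(PROCESSING_QUEUE, 0, site)
        else:
            # Put the site back on the work queue to be retried.
            r.lrem(PROCESSING_QUEUE, 0, site)
            r.rpush(WORK_QUEUE, site)
    return on_complete
```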
After performing several larger-scale crawls, it is clear that there will always be a percentage of crawl visits that fail due to random circumstances, such as:
Even though the above crash/failure causes should be remedied individually, it is unlikely that they can be avoided completely. Thus, it seems to me that the best approach to performing crawls is as follows:
This in effect allows us to retry failed site visits and end up with as many successful sites as reasonably possible (if a site still fails after n retries, the failure should no longer be due to temporary issues such as running out of memory or similar). For the analysis, we can disregard failed results and, assuming that the crawl_history entry is only written after a successful crawl visit has completely finished, including all of its data having been uploaded, we will get at most one copy of each successful visit with complete data.
To implement this, we basically only need to implement a Python script for point 2 above and update the docs to describe how to attempt a new crawl iteration. A rough sketch of such a script is included below.
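A hypothetical sketch of what that script could look like (the Redis queue name, database path, site-list file, and crawl_history column names are all assumptions):

```python
# Hypothetical "point 2" script: re-enqueue every site that does not yet
# have a successful crawl_history entry for another crawl iteration.
import sqlite3

import redis

REDIS_QUEUE = "crawl-queue"
DB_PATH = "crawl-data.sqlite"
SITE_LIST = "site_list.txt"

def sites_to_retry() -> list:
    sites = [line.strip() for line in open(SITE_LIST) if line.strip()]
    con = sqlite3.connect(DB_PATH)
    # Treat only status 1 (command success) as done; missing, failed and
    # timed-out visits are all eligible for another attempt.
    done = {
        row[0]
        for row in con.execute(
            "SELECT site_url FROM crawl_history WHERE bool_success = 1"
        )
    }
    return [site for site in sites if site not in done]

def main() -> None:
    retry = sites_to_retry()
    if retry:
        redis.Redis().rpush(REDIS_QUEUE, *retry)
    print(f"re-queued {len(retry)} sites onto '{REDIS_QUEUE}'")

if __name__ == "__main__":
    main()
```

Running this between crawl iterations, up to n times, gives the retry-n-times behaviour described above without touching visits that already completed successfully.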