Dealing with large harvest source with high data volatility #5022
Comments
Just a random question from an outside perspective (mostly related to making the 2.0 harvester better): is the case of iterative harvesting handled any better? For example, if a harvest is 100k records and it fails at 10k, is there a way to keep those 10k and "continue where it left off"?
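To make the "continue where it left off" idea concrete, here is a minimal sketch of a resumable loop. It is not how either harvester works today; the names `harvest`, `load_done`, `save_done`, and the JSON checkpoint file are all hypothetical, and a real implementation would more likely track progress in the harvester's own database.

```python
import json
from pathlib import Path


def load_done(checkpoint):
    """Return the set of record identifiers already processed (empty if no checkpoint)."""
    path = Path(checkpoint)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_done(checkpoint, done):
    """Persist the processed identifiers so a later run can skip them."""
    Path(checkpoint).write_text(json.dumps(sorted(done)))


def harvest(records, process, checkpoint, flush_every=100):
    """Process (identifier, payload) pairs, skipping ones a previous run finished.

    `process` is a hypothetical callback that does the expensive work
    (for example, the sync to the catalog). Returns the number of records
    processed in this run.
    """
    done = load_done(checkpoint)
    processed = 0
    try:
        for identifier, payload in records:
            if identifier in done:
                continue  # finished in an earlier run
            process(identifier, payload)
            done.add(identifier)
            processed += 1
            if processed % flush_every == 0:
                save_done(checkpoint, done)
    finally:
        # Written even if process() raises, so completed work is kept.
        save_done(checkpoint, done)
    return processed
```

With something like this, a job that dies at record 10k would only need to process the remaining records on the next run, provided record identifiers are stable between runs.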
The most recent finished (forced) ioos job showed that ~3% (1k) of the 35k records queued for processing had no change. Although this number isn't alarmingly high, this harvest source has been unreliable in the details we use to determine change (e.g. the WAF file timestamp). Beyond the file timestamp, we use a timestamp within the documents themselves to determine change. If, however, no meaningful change has occurred within the document (e.g. only the timestamp changed and nothing else), then the document shouldn't be harvested, which would reduce load and potentially resolve stuck jobs (see #1537).
We should be able to handle 30k records easily in the new harvester. @rshewitt do you happen to know offhand if the timestamp is part of the sourcehash that we use to determine changes? If so, we should probably make a ticket to examine this.
Comparing against the internal db is cheap, but if all that has changed is a timestamp, I'd like to avoid the expensive sync to CKAN.
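One way to skip timestamp-only changes is to leave volatile fields out of the hash that decides whether a record changed. The sketch below assumes records are available as nested dicts/lists; the field names in `VOLATILE_KEYS` and the functions `content_hash` and `needs_sync` are hypothetical and do not describe how the harvester's sourcehash is actually computed.

```python
import hashlib
import json

# Hypothetical field names; the real documents may keep their timestamp elsewhere.
VOLATILE_KEYS = frozenset({"dateStamp", "modified"})


def strip_volatile(value, volatile=VOLATILE_KEYS):
    """Return a copy of a nested dict/list structure with volatile keys dropped."""
    if isinstance(value, dict):
        return {
            key: strip_volatile(item, volatile)
            for key, item in value.items()
            if key not in volatile
        }
    if isinstance(value, list):
        return [strip_volatile(item, volatile) for item in value]
    return value


def content_hash(record):
    """SHA-256 over a canonical JSON form of the record, ignoring volatile keys."""
    canonical = json.dumps(
        strip_volatile(record), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_sync(record, stored_hash):
    """True only when something other than the volatile fields has changed."""
    return content_hash(record) != stored_hash
```

The cheap comparison against the stored hash then gates the expensive sync: a record whose only difference is its timestamp produces the same hash and is skipped.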
The catalog harvester gets overwhelmed by larger harvest jobs such as NOAA ioos, where a single harvest job might have to harvest 30k records. This kind of job might run for days and create multiple issues, including catalog downtime as observed here. We need to find a way to cope with it to minimize its impact on the catalog app.
We have set its schedule to monthly, but ideally we want it to run on weekends/holidays. I would suggest we set it to manual and set up an O&M schedule to run it every 4 weeks on Friday afternoon.
On the new harvester 2.0, hopefully we won't have to deal with the same issue. If we do, we might have to build better/finer control over how particular harvest sources are scheduled.