
Dealing with large harvest source with high data volatility #5022

Open
FuhuXia opened this issue Dec 26, 2024 · 5 comments
Labels
bug Software defect or bug

Comments

@FuhuXia
Member

FuhuXia commented Dec 26, 2024

The catalog harvester gets overwhelmed by larger harvest jobs such as NOAA IOOS, where one harvest job might have to harvest 30k records. This kind of job can run for days and create multiple issues, including catalog downtime as observed here. We need to find a way to cope with it and minimize its impact on the catalog app.

We have set its schedule to monthly, but ideally we want it to run on weekends/holidays. I would suggest we set it to manual and set an O&M schedule to run it every 4 weeks on Friday afternoons.

On the new harvester 2.0, hopefully we don't have to deal with the same issue. If we do, we might have to build better/finer control on how to schedule particular harvest sources.

@FuhuXia FuhuXia added the bug Software defect or bug label Dec 26, 2024
@nickumia

Just a random question from an outside perspective (mostly related to making the 2.0 harvester better): is the case of iterative harvesting handled any better? For example, if a harvest is 100k records and it fails at 10k, is there a way to keep those 10k and "continue where it left off"?
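One way the 2.0 harvester could support "continue where it left off" is a checkpoint persisted after each batch, so a failed 100k-record job resumes instead of restarting. A minimal sketch under assumed names; `harvest`, `process`, and the checkpoint file are illustrative, not the actual harvester API:

```python
import json
import os

CHECKPOINT_FILE = "harvest_checkpoint.json"  # hypothetical location

def load_checkpoint():
    """Return the offset to resume from, or 0 for a fresh job."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f).get("next_offset", 0)
    return 0

def save_checkpoint(offset):
    """Persist progress so a crash mid-job loses at most one batch."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_offset": offset}, f)

def process(batch):
    """Placeholder for the real per-record harvest work."""
    pass

def harvest(records, batch_size=1000):
    start = load_checkpoint()
    for offset in range(start, len(records), batch_size):
        batch = records[offset:offset + batch_size]
        process(batch)  # may raise; everything before this batch is kept
        save_checkpoint(offset + len(batch))
    os.remove(CHECKPOINT_FILE)  # job finished cleanly; clear the checkpoint
```

If `process` raises, the checkpoint file survives, and the next run picks up at the last saved offset.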

@rshewitt
Contributor

The most recent finished (forced) IOOS job showed that ~3% (1k) of the 35k records processed had no change. Although this number isn't alarmingly high, this harvest source has been unreliable in the details we use to determine change (e.g. the WAF file timestamp). Beyond the file timestamp, we use a timestamp within the documents themselves to determine change. If, however, no meaningful change has occurred within the document (e.g. just the timestamp and nothing else), then the document shouldn't be harvested, thus reducing load and potentially resolving stuck jobs (see #1537).
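The "no meaningful change" check could be implemented by hashing the document with the volatile timestamp element blanked out, so a timestamp-only edit produces the same hash. A minimal sketch, assuming an XML metadata document where the only volatile field is a `dateStamp`-style element; the tag name and the whole approach are assumptions, not the harvester's current behavior:

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed name of the metadata timestamp element; real sources may differ.
VOLATILE_TAGS = {"dateStamp"}

def content_hash(xml_text):
    """SHA-256 of the document with volatile timestamp elements emptied,
    so two copies that differ only in the timestamp hash identically."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Strip any "{namespace}" prefix before comparing the tag name.
        tag = el.tag.rsplit("}", 1)[-1]
        if tag in VOLATILE_TAGS:
            el.text = ""
            for child in list(el):
                el.remove(child)
    return hashlib.sha256(ET.tostring(root)).hexdigest()
```

With this, the harvester could compare `content_hash` values instead of raw file hashes and skip documents whose only change is the timestamp.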

@btylerburton
Contributor

> On the new harvester 2.0, hopefully we don't have to deal with the same issue. If we do, we might have to build better/finer control on how to schedule particular harvest sources.

We should be able to handle 30k records easily in the new harvester. @rshewitt, do you happen to know offhand whether the timestamp is part of the source hash that we use to determine changes? If so, we should probably make a ticket to examine this.

@btylerburton
Contributor

Comparing against the internal DB is cheap, but if all that has changed is a timestamp, I'd like to avoid the expensive sync to CKAN.
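That gating could look like the following sketch: a cheap hash lookup against the internal DB decides whether the expensive CKAN call happens at all. The `stored_hashes` mapping and `ckan_sync` callable are illustrative stand-ins, not the real schema or API:

```python
import hashlib

def sync_if_changed(record_id, document, stored_hashes, ckan_sync):
    """Compare the document's hash against the internal DB copy (cheap);
    only invoke the CKAN sync (expensive) when the content really changed.
    Returns True if a sync was performed."""
    new_hash = hashlib.sha256(document.encode("utf-8")).hexdigest()
    if stored_hashes.get(record_id) == new_hash:
        return False  # no change: skip the CKAN round trip
    ckan_sync(record_id, document)
    stored_hashes[record_id] = new_hash
    return True
```

Combined with a timestamp-insensitive hash, a timestamp-only edit would hit the early return and never touch CKAN.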

@rshewitt
Contributor

  • One thing to mention: the file timestamp in the WAF isn't considered currently, but there's a ticket for it. I'll revisit the AC on it.
  • The timestamp within the document would contribute to the hash. Basically, nothing special is done for WAF files as of right now: we traverse the tree, get all the documents, and continue on.

Projects
Status: 🧊 Icebox