
Dealing with large harvest source with high data volatility #5022

Open
FuhuXia opened this issue Dec 26, 2024 · 5 comments
Labels
bug Software defect or bug

Comments

@FuhuXia
Member

FuhuXia commented Dec 26, 2024

The catalog harvester gets overwhelmed by larger harvest jobs such as NOAA IOOS, where one harvest job might have to harvest 30k records. This kind of job can run for days and create multiple issues, including catalog downtime as observed here. We need to find a way to cope with it and minimize its impact on the catalog app.

We have set its schedule to monthly, but ideally we want it to run on weekends/holidays. I would suggest we set it to manual and set an O&M schedule to run it every 4 weeks on Friday afternoons.

On the new harvester 2.0, hopefully we don't have to deal with the same issue. If we do, we might have to build better/finer control on how to schedule particular harvest sources.

@FuhuXia FuhuXia added the bug Software defect or bug label Dec 26, 2024
@nickumia

Just a random question from an outside perspective (mostly related to making the 2.0 harvester better): is the case of iterative harvesting handled any better? For example, if a harvest is 100k records and it fails at 10k, is there a way to keep those 10k and "continue where it left off"?
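One way the 2.0 harvester could support "continue where it left off" is a checkpoint persisted after each batch, so a failed 100k-record job resumes instead of restarting. A minimal sketch under assumed names; `harvest`, `process`, and the checkpoint file are illustrative, not the actual harvester API:

```python
import json
import os

CHECKPOINT_FILE = "harvest_checkpoint.json"  # hypothetical location

def load_checkpoint():
    """Return the offset to resume from, or 0 for a fresh job."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f).get("next_offset", 0)
    return 0

def save_checkpoint(offset):
    """Persist progress so a crash mid-job loses at most one batch."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_offset": offset}, f)

def process(batch):
    """Placeholder for the real per-record harvest work."""
    pass

def harvest(records, batch_size=1000):
    start = load_checkpoint()
    for offset in range(start, len(records), batch_size):
        batch = records[offset:offset + batch_size]
        process(batch)  # may raise; everything before this batch is kept
        save_checkpoint(offset + len(batch))
    os.remove(CHECKPOINT_FILE)  # job finished cleanly; clear the checkpoint
```

If `process` raises, the checkpoint file survives, and the next run picks up at the last saved offset.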

@rshewitt
Contributor

The most recent finished (forced) IOOS job showed that ~3% (1k) of the 35k records processed had no change. Although this number isn't alarmingly high, this harvest source has been unreliable in the details we use to determine change (e.g. the WAF file timestamp). Beyond the file timestamp, we use a timestamp within the documents themselves to determine change. If, however, no meaningful change has occurred within the document (e.g. just the timestamp and nothing else), then the document shouldn't be harvested, thus reducing load and potentially resolving stuck jobs (see #1537).
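The "no meaningful change" check could be implemented by hashing the document with the volatile timestamp element blanked out, so a timestamp-only edit produces the same hash. A minimal sketch, assuming an XML metadata document where the only volatile field is a `dateStamp`-style element; the tag name and the whole approach are assumptions, not the harvester's current behavior:

```python
import hashlib
import xml.etree.ElementTree as ET

# Assumed name of the metadata timestamp element; real sources may differ.
VOLATILE_TAGS = {"dateStamp"}

def content_hash(xml_text):
    """SHA-256 of the document with volatile timestamp elements emptied,
    so two copies that differ only in the timestamp hash identically."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        # Strip any "{namespace}" prefix before comparing the tag name.
        tag = el.tag.rsplit("}", 1)[-1]
        if tag in VOLATILE_TAGS:
            el.text = ""
            for child in list(el):
                el.remove(child)
    return hashlib.sha256(ET.tostring(root)).hexdigest()
```

With this, the harvester could compare `content_hash` values instead of raw file hashes and skip documents whose only change is the timestamp.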

@btylerburton
Contributor

> On the new harvester 2.0, hopefully we don't have to deal with the same issue. If we do, we might have to build better/finer control on how to schedule particular harvest sources.

We should be able to handle 30k records easily in the new harvester. @rshewitt, do you happen to know offhand whether the timestamp is part of the source hash that we use to determine changes? If so, we should probably make a ticket to examine this.

@btylerburton
Contributor

Comparing against the internal DB is cheap, but if all that has changed is a timestamp, I'd like to avoid the expensive sync to CKAN.
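That gating could look like the following sketch: a cheap hash lookup against the internal DB decides whether the expensive CKAN call happens at all. The `stored_hashes` mapping and `ckan_sync` callable are illustrative stand-ins, not the real schema or API:

```python
import hashlib

def sync_if_changed(record_id, document, stored_hashes, ckan_sync):
    """Compare the document's hash against the internal DB copy (cheap);
    only invoke the CKAN sync (expensive) when the content really changed.
    Returns True if a sync was performed."""
    new_hash = hashlib.sha256(document.encode("utf-8")).hexdigest()
    if stored_hashes.get(record_id) == new_hash:
        return False  # no change: skip the CKAN round trip
    ckan_sync(record_id, document)
    stored_hashes[record_id] = new_hash
    return True
```

Combined with a timestamp-insensitive hash, a timestamp-only edit would hit the early return and never touch CKAN.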

@rshewitt
Contributor

  • One thing to mention: the file timestamp in the WAF isn't considered currently, but there's a ticket for it. I'll revisit the AC on it.
  • The timestamp within the document would contribute to the hash. Basically, nothing special is done for WAF files as of right now: we traverse the tree, get all the documents, and continue on.

Projects
Status: 🧊 Icebox