Non redundant ancestry #100

Merged
merged 15 commits into from
Jan 29, 2024
Conversation


@alexdunnjpl commented Jan 26, 2024

🗒️ Summary

Implements #91

During a processing run, db writes are skipped for bundle/collection documents which have already been processed by an up-to-date version of the ancestry software, as determined by a versioning tag like the one used by repairkit.

All processing (read/compute/write) is skipped for non-aggregate product references belonging to a registry-refs page which has already been processed by an up-to-date version of the software.
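
For illustration, a minimal sketch of that skip check. The field name, version constant, and document shape here are assumptions for the sketch, not the repo's actual identifiers:

```python
# Illustrative only: VERSION_FIELD and SWEEPER_VERSION are hypothetical names.
SWEEPER_VERSION = 2  # assumed: incremented whenever the ancestry logic changes
VERSION_FIELD = "ops:Provenance/ancestry_version"  # hypothetical metadata key

def is_up_to_date(doc: dict) -> bool:
    """True if the document was already processed by the current sweeper version."""
    return doc.get("_source", {}).get(VERSION_FIELD, -1) >= SWEEPER_VERSION

def docs_needing_processing(docs):
    """Yield only documents whose version tag is stale or absent, skipping the
    db write (and, for registry-refs pages, all read/compute work) otherwise."""
    for doc in docs:
        if not is_up_to_date(doc):
            yield doc
```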

Db writes are ordered such that it can be inferred that if an aggregate product has been tagged as up-to-date, all its descendants will also be up-to-date. (This assumes that re-harvesting a bundle or collection overwrites its document, losing any existing ancestry version metadata. @jordanpadams @tloubrieu-jpl @al-niessner is this a safe assumption, or do I need to check harvest's code?)
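
A minimal sketch of that ordering guarantee (generator names are illustrative; the actual implementation may differ):

```python
def updates_in_safe_order(nonagg_updates, refs_page_updates,
                          collection_updates, bundle_updates):
    """Yield db update actions such that no aggregate is tagged up-to-date
    before every write for its descendants has been emitted."""
    yield from nonagg_updates      # non-aggregate products first
    yield from refs_page_updates   # then the registry-refs pages
    yield from collection_updates  # aggregates last: an up-to-date tag on a
    yield from bundle_updates      # collection/bundle certifies everything above
```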

Once processing has completed, any products or registry-refs pages which do not indicate that they are up-to-date are counted and reported in an ERROR log. This indicates that those products were either harvested during sweeper processing or are being missed and will require a (much slower, yet-to-be-implemented) validation sweeper to process correctly.
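
A sketch of what that post-run audit could look like against OpenSearch, reusing the hypothetical VERSION_FIELD/SWEEPER_VERSION names from the first sketch (the PR's actual query and logging may differ):

```python
import logging

from opensearchpy import OpenSearch  # assumed client; any ES-compatible client works

log = logging.getLogger(__name__)

def log_stale_count(client: OpenSearch, index: str) -> int:
    """Count documents lacking an up-to-date ancestry version tag and emit an
    ERROR log if any remain after processing completes."""
    query = {"query": {"bool": {"must_not": [
        {"range": {VERSION_FIELD: {"gte": SWEEPER_VERSION}}}
    ]}}}
    count = client.count(index=index, body=query)["count"]
    if count > 0:
        log.error("%d documents in index %s were not processed this run "
                  "(harvested mid-run, or missed and requiring a validation sweeper)",
                  count, index)
    return count
```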

Execution time for ancestry against sbnpsi is ~35min; with the optimisations it drops to <2sec on subsequent runs. For nodes like psa, which have on the order of a million aggregate products, a speedup of this magnitude should not be expected, but the change should still cut ancestry runtime to 0.1-1% of its previous duration.

The only caveat is that progress is made only if execution completes - if ancestry repeatedly fails mid-execution due to resource issues, it will never make incremental progress toward eventual success. This means it's probably best for me to perform the first run against each node on a local machine with plenty of disk space, to avoid allocating unnecessarily large storage for ECS. This process will need to be repeated if/when the ancestry software version is incremented.

@jordanpadams @tloubrieu-jpl Further optimization to make incremental progress, avoiding this caveat, is possible and may be desirable; it just requires a less-naive approach to ordering/streaming the updates (sketched after the lists below), e.g.

c1p1_nonaggs
c1p2_nonaggs
c1_refs_pages
c2p1_nonaggs
c2p2_nonaggs
c2_refs_pages
...

instead of

nonaggs
refs_pages
collections
bundles
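
For illustration, a sketch of how that per-collection interleaving might be streamed so that a mid-run failure still leaves completed collections tagged up-to-date. The update-producing methods here are hypothetical, not the repo's actual interface:

```python
def updates_with_incremental_progress(collections, bundles):
    """Interleave updates per collection, tagging each collection as soon as
    its own descendants are flushed, so partial runs still make progress."""
    for c in collections:
        yield from c.nonagg_updates()     # cNp1_nonaggs, cNp2_nonaggs, ...
        yield from c.refs_page_updates()  # cN_refs_pages
        yield c.version_tag_update()      # certify this collection immediately
    for b in bundles:
        yield b.version_tag_update()      # bundles last, once all collections done
```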

Please open/triage a ticket for this work if that seems warranted.

⚙️ Test Data and/or Report

Functional tests pass. New changes have been tested manually; a final manual test is in progress.

♻️ Related Issues

fixes #91

@alexdunnjpl

sbnpsi benchmark (1.5M products): execution dropped from 35min to 57sec. This is at least partly because orphaned documents are continually reprocessed, and sbnpsi has ~5k of them remaining after processing due to missing collections.

It's unclear why the run takes 1/35th of the original time despite reprocessing only ~1/300th of the document corpus.
