Non redundant ancestry #100

Merged
merged 15 commits into from
Jan 29, 2024
Conversation


@alexdunnjpl commented Jan 26, 2024

🗒️ Summary

Implements #91

During a processing run, db writes are skipped for bundle/collection documents which have already been processed by an up-to-date version of the ancestry software, as determined by a versioning tag like the one used by repairkit.

All processing (read/compute/write) is skipped for non-aggregate product references belonging to a registry-refs page which has already been processed by an up-to-date version of the software.
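
For illustration, a minimal sketch of that skip check. The field name, version constant, and document shape here are assumptions for the sketch, not the repo's actual identifiers:

```python
# Illustrative only: VERSION_FIELD and SWEEPER_VERSION are hypothetical names.
SWEEPER_VERSION = 2  # assumed: incremented whenever the ancestry logic changes
VERSION_FIELD = "ops:Provenance/ancestry_version"  # hypothetical metadata key

def is_up_to_date(doc: dict) -> bool:
    """True if the document was already processed by the current sweeper version."""
    return doc.get("_source", {}).get(VERSION_FIELD, -1) >= SWEEPER_VERSION

def docs_needing_processing(docs):
    """Yield only documents whose version tag is stale or absent, skipping the
    db write (and, for registry-refs pages, all read/compute work) otherwise."""
    for doc in docs:
        if not is_up_to_date(doc):
            yield doc
```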

Db writes are ordered such that it can be inferred that if an aggregate product has been tagged as up-to-date, all its descendants will also be up-to-date. (This assumes that re-harvesting a bundle or collection overwrites its document, losing any existing ancestry version metadata. @jordanpadams @tloubrieu-jpl @al-niessner is this a safe assumption, or do I need to check harvest's code?)
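
A minimal sketch of that ordering guarantee (generator names are illustrative; the actual implementation may differ):

```python
def updates_in_safe_order(nonagg_updates, refs_page_updates,
                          collection_updates, bundle_updates):
    """Yield db update actions such that no aggregate is tagged up-to-date
    before every write for its descendants has been emitted."""
    yield from nonagg_updates      # non-aggregate products first
    yield from refs_page_updates   # then the registry-refs pages
    yield from collection_updates  # aggregates last: an up-to-date tag on a
    yield from bundle_updates      # collection/bundle certifies everything above
```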

Once processing has completed, any products or registry-refs pages which do not indicate that they are up-to-date are counted and reported in an ERROR log. This indicates that those products were either harvested during sweeper processing or are being missed and will require a (much slower, yet-to-be-implemented) validation sweeper to process correctly.
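
A sketch of what that post-run audit could look like against OpenSearch, reusing the hypothetical VERSION_FIELD/SWEEPER_VERSION names from the first sketch (the PR's actual query and logging may differ):

```python
import logging

from opensearchpy import OpenSearch  # assumed client; any ES-compatible client works

log = logging.getLogger(__name__)

def log_stale_count(client: OpenSearch, index: str) -> int:
    """Count documents lacking an up-to-date ancestry version tag and emit an
    ERROR log if any remain after processing completes."""
    query = {"query": {"bool": {"must_not": [
        {"range": {VERSION_FIELD: {"gte": SWEEPER_VERSION}}}
    ]}}}
    count = client.count(index=index, body=query)["count"]
    if count > 0:
        log.error("%d documents in index %s were not processed this run "
                  "(harvested mid-run, or missed and requiring a validation sweeper)",
                  count, index)
    return count
```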

Execution time for ancestry against sbnpsi is ~35min; with the optimisations it drops to <2sec on subsequent runs. For nodes like psa, which have on the order of a million aggregate products, a speedup of this magnitude should not be expected, but the change should still cut ancestry runtime to 0.1-1% of its previous duration.

The only caveat is that progress is made only if execution completes - if ancestry repeatedly fails mid-execution due to resource issues, it will never make incremental progress toward eventual success. This means it's probably best for me to perform the first run against each node on a local machine with plenty of disk space, to avoid allocating unnecessarily large storage for ECS. This process will need to be repeated if/when the ancestry software version is incremented.

@jordanpadams @tloubrieu-jpl Further optimization to make incremental progress, avoiding this caveat, is possible and may be desirable; it just requires a less-naive approach to ordering/streaming the updates (sketched after the lists below), e.g.

c1p1_nonaggs
c1p2_nonaggs
c1_refs_pages
c2p1_nonaggs
c2p2_nonaggs
c2_refs_pages
...

instead of

nonaggs
refs_pages
collections
bundles
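
For illustration, a sketch of how that per-collection interleaving might be streamed so that a mid-run failure still leaves completed collections tagged up-to-date. The update-producing methods here are hypothetical, not the repo's actual interface:

```python
def updates_with_incremental_progress(collections, bundles):
    """Interleave updates per collection, tagging each collection as soon as
    its own descendants are flushed, so partial runs still make progress."""
    for c in collections:
        yield from c.nonagg_updates()     # cNp1_nonaggs, cNp2_nonaggs, ...
        yield from c.refs_page_updates()  # cN_refs_pages
        yield c.version_tag_update()      # certify this collection immediately
    for b in bundles:
        yield b.version_tag_update()      # bundles last, once all collections done
```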

Please open/triage a ticket for this work if that seems warranted.

⚙️ Test Data and/or Report

Functional tests pass. New changes have been tested manually; a final manual test is in progress.

♻️ Related Issues

fixes #91

@alexdunnjpl

sbnpsi benchmark (1.5M products): execution dropped from 35min to 57sec. This is at least partly because orphaned documents are continually reprocessed, and sbnpsi has ~5k of them remaining after processing due to missing collections.

It's unclear why the run takes 1/35th of the original time despite reprocessing only ~1/300th of the document corpus.
