Investigate/implement non-redundant provenance processing #92
Comments
Good progress on this ticket.
Wrong ticket, that'd be #91; this one hasn't been started yet. @jordanpadams @tloubrieu-jpl can we assume that versions of products will be inserted in chronological order? That is to say (assuming the sweeper never failed), if version V is in the registry at some point in time, are all versions <V guaranteed to also be in the registry? I'm guessing we can't, but it doesn't hurt to ask.
No. There are data products produced by IMG that are versioned based on the Ops pipeline, but only some of the Ops products are actually released. So they can have a product version 30.0, but only have 4 versions in the archive.
@jordanpadams to be more specific, I mean "is it guaranteed that no version <V will be written into the registry at a later date?"
@alexdunnjpl no. Because we are creating these tools after numerous versions already exist for this data, nodes are often just loading the latest. Eventually we will push them to load past versions.
@jordanpadams roger that, thanks! This is fine, it just means that one additional candidate optimisation isn't possible. As it stands, local benchmarks indicate that provenance should take approximately 4 min per 1M archived/certified products when not processing newly-harvested data. I imagine that in ECR it should be a little faster. repairkit and ancestry should complete immediately when not processing newly-harvested data, so that's about as good a job as we can do.
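For a rough sense of scale, the benchmark figure above can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes runtime scales linearly with product count, which the benchmark does not establish, and the function name is made up for illustration.

```python
# Local benchmark figure quoted in the thread: ~4 min per 1M products.
MINUTES_PER_MILLION = 4.0


def estimated_provenance_minutes(product_count, minutes_per_million=MINUTES_PER_MILLION):
    """Estimate provenance runtime in minutes.

    Assumption: runtime grows linearly with the number of
    archived/certified products (not verified by the benchmark).
    """
    return product_count / 1_000_000 * minutes_per_million


if __name__ == "__main__":
    for count in (1_000_000, 2_500_000):
        print(count, "products:", estimated_provenance_minutes(count), "min")
```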
@gxtchen, this can be tested using the docker compose deployment of the full registry. Start it like this:
It runs sweeper once by default. After you update the registry database, you can re-run sweeper in a different terminal with:
@gxtchen, to manually update a document in the registry database, you can use the OpenSearch update-document API: https://opensearch.org/docs/1.0/opensearch/rest-api/document-apis/update-document/#:~:text=If%20you%20need%20to%20update,runs%20to%20update%20the%20document.
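As a hedged sketch of what such a manual update could look like: the update-document API takes a `POST` to `/<index>/_update/<_id>` with the changed fields under a `doc` key. The index name `registry`, the document id, and the field name `harvest_time` below are placeholders for illustration, not confirmed names from the registry schema. The helper only builds the request, so it can be inspected before anything is sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_update_request(base_url, index, doc_id, fields):
    """Build (but do not send) a partial-update request for one document.

    base_url, index, doc_id and the keys of `fields` are all supplied by
    the caller; nothing here is specific to the registry schema.
    """
    # Document ids may contain ':' and '/', so percent-encode them fully.
    url = f"{base_url.rstrip('/')}/{index}/_update/{quote(doc_id, safe='')}"
    body = json.dumps({"doc": fields}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical index, id and field name, for illustration only.
    req = build_update_request(
        "https://localhost:9200",
        "registry",
        "urn:nasa:pds:example::1.0",
        {"harvest_time": "2023-06-01T00:00:00Z"},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending it would be a matter of passing the request to `urllib.request.urlopen`; authentication and TLS settings depend on the deployment and are left out here.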
Checked for duplicates
No - I haven't checked
User Persona(s)
No response
Motivation
...so that ECS costs are significantly reduced
Additional Details
No response
Acceptance Criteria
The provenance sweeper only processes data which is new, modified, or was processed with an out-of-date version of the ancestry sweeper, or which is tightly coupled (definition TBD) to a document which has been modified.
Given: a registry OpenSearch database
When I do: add a new product to the OpenSearch database (with harvest)
I expect: the sweeper processes only the new product
Given: a registry OpenSearch database
When I do: manually update the harvest_time of a product in the OpenSearch database
I expect: the sweeper processes only the updated product
Given: a registry OpenSearch database
When I do: manually update the sweeper version of a product in the OpenSearch database
I expect: the sweeper processes only the updated product
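The three scenarios above could be captured by a single per-document predicate. The sketch below is one possible reading, not the sweeper's actual implementation: the field names `sweeper_version` and `sweeper_time` and the version constant are hypothetical, and the "tightly coupled" criterion is omitted because its definition is still TBD.

```python
from datetime import datetime, timezone

# Hypothetical current sweeper version, for illustration only.
CURRENT_SWEEPER_VERSION = 2


def needs_processing(doc, current_version=CURRENT_SWEEPER_VERSION):
    """Return True if a document should be (re)processed.

    Assumed (hypothetical) fields on each document:
      harvest_time    - datetime of the last harvest/modification
      sweeper_version - version of the sweeper that last processed it
      sweeper_time    - datetime at which it was last processed
    """
    swept_version = doc.get("sweeper_version")
    swept_at = doc.get("sweeper_time")

    # New product: never touched by the sweeper.
    if swept_version is None or swept_at is None:
        return True
    # Processed by an out-of-date sweeper version.
    if swept_version < current_version:
        return True
    # Modified (re-harvested) since it was last processed.
    return doc["harvest_time"] > swept_at


if __name__ == "__main__":
    earlier = datetime(2023, 1, 1, tzinfo=timezone.utc)
    later = datetime(2023, 6, 1, tzinfo=timezone.utc)
    docs = {
        "new": {"harvest_time": earlier},
        "up to date": {"harvest_time": earlier, "sweeper_version": 2, "sweeper_time": later},
        "re-harvested": {"harvest_time": later, "sweeper_version": 2, "sweeper_time": earlier},
        "old version": {"harvest_time": earlier, "sweeper_version": 1, "sweeper_time": later},
    }
    for name, doc in docs.items():
        print(name, needs_processing(doc))
```

In practice the same conditions would more likely be expressed as a database query so that unchanged documents are never fetched at all; the predicate form is just easier to read against the acceptance criteria.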
Engineering Details
No response