Investigate/implement non-redundant provenance processing #92

Closed
alexdunnjpl opened this issue Jan 2, 2024 · 8 comments · Fixed by #101

@alexdunnjpl
Contributor

alexdunnjpl commented Jan 2, 2024

Checked for duplicates

No - I haven't checked

πŸ§‘β€πŸ”¬ User Persona(s)

No response

💪 Motivation

...so that ECS costs are significantly reduced

📖 Additional Details

No response

Acceptance Criteria

Provenance sweeper only processes data which is new, modified, or was processed with an out-of-date version of the ancestry sweeper, or which is tightly-coupled (definition TBD) to a document which has been modified

Given: a registry opensearch database
When I do: add a new product to the opensearch database (with harvest)
I expect: the sweeper to process only the new product

Given: a registry opensearch database
When I do: manually update the harvest_time of a product in the opensearch database
I expect: the sweeper to process only the updated product

Given: a registry opensearch database
When I do: manually update the sweeper version of a product in the opensearch database
I expect: the sweeper to process only the updated product
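
For illustration only, the three scenarios above could be expressed as a per-document selection check along the following lines. This is a minimal sketch, not the sweeper's actual implementation: the `ProductDoc` fields, `CURRENT_SWEEPER_VERSION`, and `needs_processing` are hypothetical names, and the "tightly-coupled" criterion is left out because its definition is still TBD.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical version number of the sweeper code currently deployed.
CURRENT_SWEEPER_VERSION = 2


@dataclass
class ProductDoc:
    """Hypothetical view of the metadata the check needs for one product."""
    lidvid: str
    harvest_time: datetime
    sweeper_version: Optional[int] = None     # None: never processed
    last_swept_at: Optional[datetime] = None  # None: never processed


def needs_processing(doc: ProductDoc,
                     current_version: int = CURRENT_SWEEPER_VERSION) -> bool:
    """Return True if the document is new, modified, or stale."""
    # Scenario 1: a newly harvested product has no sweeper metadata yet.
    if doc.sweeper_version is None or doc.last_swept_at is None:
        return True
    # Scenario 3: processed by an out-of-date sweeper version.
    if doc.sweeper_version < current_version:
        return True
    # Scenario 2: harvested again (harvest_time updated) since the last sweep.
    return doc.harvest_time > doc.last_swept_at
```

Under these assumptions, a document that was swept by the current version after its latest harvest is skipped, which is the behaviour the acceptance criteria ask for.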

βš™οΈ Engineering Details

No response

@alexdunnjpl added the needs:triage and requirement labels Jan 2, 2024
@jordanpadams added the B14.1 and enhancement labels and removed the requirement label Jan 9, 2024
@github-project-automation bot moved this to Release Backlog in B14.1 Jan 9, 2024
@jordanpadams moved this from Release Backlog to 🚀 Sprint Backlog in B14.1 Jan 9, 2024
@tloubrieu-jpl
Member

good progress on this ticket.

@alexdunnjpl
Contributor Author

Wrong ticket, that'd be #91 - this'n hasn't been started yet.

@jordanpadams @tloubrieu-jpl can we make the assumption that versions of products will be inserted in chronological order? That is to say (assuming the sweeper never failed), if version V is in the registry at some point in time, all versions <V are guaranteed to also be in the registry?

I'm guessing we can't, but doesn't hurt to ask.

@github-project-automation bot moved this from Backlog to 🏁 Done in EN Portfolio Backlog Jan 29, 2024
@github-project-automation bot moved this from 🚀 Sprint Backlog to 🏁 Done in B14.1 Jan 29, 2024
@jordanpadams
Member

@alexdunnjpl

> if version V is in the registry at some point in time, all versions <V are guaranteed to also be in the registry?

No. There are data products produced by IMG that version based on the Ops pipeline, but only some of the Ops products are actually released. So they can have a product version 30.0, but only have 4 versions in the archive.

@alexdunnjpl
Contributor Author

@jordanpadams to be more specific, I mean "is it guaranteed that no version <V will be written into the registry at a later date?"

@jordanpadams
Member

@alexdunnjpl no. because we are creating these tools after numerous versions already exist for this data, nodes are often just loading the latest. eventually we will push on them to load past versions.

@alexdunnjpl
Contributor Author

@jordanpadams roger that, thanks!

This is fine, it just means that there's an additional candidate optimisation which isn't possible.
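
To illustrate why out-of-order loading rules that optimisation out, here is a small sketch. It assumes (this is not confirmed anywhere in the thread) that provenance amounts to linking each product version to the next-highest version of the same LID present in the registry. The helper names and the example LIDVIDs are hypothetical.

```python
from collections import defaultdict


def parse_lidvid(lidvid):
    """Split 'lid::major.minor' into (lid, (major, minor))."""
    lid, vid = lidvid.split("::")
    major, minor = vid.split(".")
    return lid, (int(major), int(minor))


def compute_successors(lidvids):
    """Map each LIDVID to the next-highest version of the same LID, or None."""
    by_lid = defaultdict(list)
    for lidvid in lidvids:
        lid, version = parse_lidvid(lidvid)
        by_lid[lid].append((version, lidvid))

    successors = {}
    for versions in by_lid.values():
        versions.sort()  # numeric ordering on (major, minor)
        for (_, current), (_, following) in zip(versions, versions[1:]):
            successors[current] = following
        successors[versions[-1][1]] = None  # latest version has no successor
    return successors


# Only versions 1.0 and 30.0 are loaded at first.
before = compute_successors(["urn:x:y::1.0", "urn:x:y::30.0"])
# Version 4.0 is loaded later, although it is older than 30.0.
after = compute_successors(["urn:x:y::1.0", "urn:x:y::30.0", "urn:x:y::4.0"])
```

In this sketch the successor of `urn:x:y::1.0` changes from `urn:x:y::30.0` to `urn:x:y::4.0` once the older version arrives, so a document that was already processed and never modified can still need updating.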

As it stands, local benchmarks indicate that provenance should take approximately 4min per 1M archived/certified products when not processing newly-harvested data. I imagine that in ECR it should be a little faster.
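
As a rough back-of-envelope aid, the benchmark figure can be turned into an estimate like this. The sketch assumes runtime scales linearly with product count, which the benchmark does not establish; the function name and the 25M example are hypothetical.

```python
# Local benchmark figure quoted above: about 4 minutes per 1M products.
MINUTES_PER_MILLION = 4.0


def estimated_minutes(product_count, minutes_per_million=MINUTES_PER_MILLION):
    """Estimate provenance runtime in minutes, assuming linear scaling."""
    return product_count / 1_000_000 * minutes_per_million


# A hypothetical registry holding 25M archived/certified products.
estimate = estimated_minutes(25_000_000)  # 100.0 minutes under this assumption
```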

repairkit and ancestry should complete immediately when not processing newly-harvested data, so that's about as good a job as we can do.

@tloubrieu-jpl
Member

tloubrieu-jpl commented Mar 11, 2024

@gxtchen, this can be tested by using the docker compose deployment of the full registry.

Start it like this:

docker compose --profile=int-registry-batch-loader up -d

It runs sweeper once by default.

After you update the registry database, you can re-run sweeper in a different terminal by:

  1. adding a "sweepers" profile (via the service's profiles key) to the sweeper service in the docker-compose.yaml file.
  2. launching the command: docker compose --profile=sweepers up
