SOLR follower data corruption recovery automation #3917

Open
1 task
jbrown-xentity opened this issue Aug 8, 2022 · 1 comment
Labels
O&M Operations and maintenance tasks for the Data.gov platform

Comments

@jbrown-xentity
Contributor

User Story

In order to recover from SOLR follower data corruption, data.gov admins want an automated process that gets the affected instance into a recovering/recovered state.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a SOLR follower is not syncing with the leader or is unresponsive
    WHEN an alert/notification reports that the follower is in a bad state
    THEN the follower data is cleared
    AND the service is restarted and the follower is confirmed to be recovering

Background

Related to current instability issues around SOLR

Security Considerations (required)

None

Sketch

Current manual approach:

  • Update the task definition so that, on startup, it removes the corrupted CKAN SOLR core data
  • Update the service to use that task definition
  • Let the service restart the task
  • Update the service to use the previous standard task definition
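
The manual steps above could be scripted roughly as follows. This is a minimal sketch, not the team's tooling: it assumes the follower runs as an AWS ECS service (suggested by the "task definition" and "service" wording, but not stated outright), and that a cleanup task definition already exists. The function name and all parameter names are hypothetical. The client is passed in so that the sequence can be exercised without AWS access.

```python
from typing import Any


def recover_follower(ecs: Any, cluster: str, service: str,
                     cleanup_task_def: str, standard_task_def: str) -> None:
    """Run the manual recovery steps in order against an ECS-style client.

    `ecs` is assumed to behave like boto3's ECS client; every name passed
    in (cluster, service, task definitions) is a placeholder.
    """
    # Steps 1-2: point the service at the task definition whose startup
    # command removes the corrupted core data, forcing a fresh task.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=cleanup_task_def,
        forceNewDeployment=True,
    )

    # Step 3: let the service restart the task and wait until it settles.
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])

    # Step 4: switch back so later restarts do not wipe the core again.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=standard_task_def,
    )
```

With boto3 installed, the client would be `boto3.client("ecs")`. Note that the last step starts another deployment, so confirming that the follower is actually recovering would still need a separate check after it completes.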

We need a way to automate this on a trigger, either from AWS CloudWatch or from some type of SOLR API status request.
There may also be a corruption case where the core stays up but cannot get updates from the leader and goes stale. The recovery process would be the same, only fired by a different trigger.
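
One way to tell the two triggers apart is to compare what the leader and the follower report about their indexes. The sketch below is an illustration under assumptions: it polls the SOLR replication handler with `command=indexversion` and assumes both nodes return a comparable `generation` value, which may not hold on every follower configuration (another status request may be needed there). The function names, the `max_lag` threshold, and the URL layout are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def fetch_generation(base_url: str, core: str, timeout: float = 5.0) -> Optional[int]:
    """Return the index generation a SOLR node reports, or None if it does not answer."""
    url = f"{base_url}/solr/{core}/replication?command=indexversion&wt=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return int(json.load(resp)["generation"])
    except (OSError, ValueError, KeyError, TypeError):
        # Connection failure, timeout, bad JSON, or missing field.
        return None


def classify_follower(leader_gen: Optional[int], follower_gen: Optional[int],
                      max_lag: int = 0) -> str:
    """Map two generation readings to a coarse follower state."""
    if follower_gen is None:
        return "unresponsive"   # first trigger: follower does not respond
    if leader_gen is None:
        return "unknown"        # nothing to compare against
    if leader_gen - follower_gen > max_lag:
        return "stale"          # second trigger: follower is not catching up
    return "healthy"
```

A follower is normally a little behind between polls, so a single "stale" reading should not start recovery on its own. A scheduled check would more plausibly require the same result several times in a row before raising the alert.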

@hkdctol
Contributor

hkdctol commented Aug 18, 2022

Moving to icebox for now as this is less likely.

hkdctol moved this from 📔 Product Backlog to 🧊 Icebox in data.gov team board on May 31, 2023
Projects
Status: 🧊 Icebox