User Story
In order to recover from SOLR follower data corruption, data.gov admins want an automated process to get the instance into a recovering/recovered state.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
GIVEN a SOLR follower is not syncing with the leader or is unresponsive
WHEN an alert/notification reports that the follower is in a bad state
THEN the follower data is cleared
AND the service is restarted and the follower is confirmed to be recovering
Sketch
Current manual approach:
1. Update the task definition so that, on startup, it removes the corrupted CKAN SOLR core data
2. Update the service to use the task definition from step 1
3. Let the service restart the task
4. Update the service to use the previous standard task definition
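The four manual steps above can be sketched as a single function. This is a minimal illustration, assuming the service runs on AWS ECS (suggested by the "task definition" and "service" wording, but not confirmed here) and that `ecs` behaves like `boto3.client("ecs")`. The cluster, service, and task definition names are hypothetical placeholders, and the cleanup task definition from step 1 is assumed to be registered already.

```python
def recover_follower(ecs, cluster, service, cleanup_task_def, standard_task_def):
    """Walk through the manual recovery steps against an ECS-style client.

    `ecs` is assumed to expose update_service() and get_waiter() like a
    boto3 ECS client. All names passed in are hypothetical placeholders.
    """
    # Steps 1-2: point the service at the task definition that removes the
    # corrupted core data on startup (assumed to be registered beforehand).
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=cleanup_task_def,
        forceNewDeployment=True,
    )

    # Step 3: let the service restart the task, then wait until it is stable.
    ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])

    # Step 4: switch the service back to the standard task definition.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=standard_task_def,
    )
```

Passing the client in as an argument keeps the sketch testable without AWS credentials. A real run might look like `recover_follower(boto3.client("ecs"), "my-cluster", "solr-follower", "solr-follower-cleanup", "solr-follower")`, with every one of those names being an assumption.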
We need a way to automate this on a trigger, either from AWS CloudWatch or from some type of SOLR API status request.
There may also be a corruption error where the core is still stable but cannot get updates from the leader and goes stale. This would use the same recovery process, but on a different trigger.
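The two triggers described above (an unresponsive follower, and a follower that is stable but stale) could be distinguished with a small decision function. This is only a sketch under assumptions: the index generation numbers might be read from SOLR's replication handler (for example `/replication?command=details`), but fetching and parsing that response is left out; `FollowerStatus`, `recovery_trigger`, and the 900-second threshold are hypothetical names and values, not part of any existing tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowerStatus:
    reachable: bool             # did the follower answer the status request?
    generation: Optional[int]   # follower index generation, None if unknown
    seconds_behind: float       # how long the follower has trailed the leader


def recovery_trigger(status, leader_generation, max_stale_seconds=900.0):
    """Return "unresponsive", "stale", or None when the follower looks healthy.

    max_stale_seconds is a hypothetical threshold; tune it to the normal
    replication polling interval.
    """
    if not status.reachable:
        return "unresponsive"
    behind = status.generation != leader_generation
    if behind and status.seconds_behind > max_stale_seconds:
        return "stale"
    return None
```

Either non-`None` result would start the same recovery process. Keeping the reason separate makes it possible to alert on the two conditions differently.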
Background
Related to current instability issues around SOLR
Security Considerations (required)
None