SOLR follower data corruption recovery automation #3917

jbrown-xentity · 2022-08-08T18:01:35Z

User Story

In order to recover from SOLR follower data corruption, data.gov admins want an automated process that brings the affected instance into a recovering or recovered state.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

GIVEN a SOLR follower is unresponsive or not syncing with the leader
WHEN an alert/notification reports that the follower is in a bad state
THEN the follower data is cleared
AND the service is restarted and the follower is confirmed to be recovering

Background

Related to current instability issues around SOLR

Security Considerations (required)

None

Sketch

Current manual approach:

Update the task definition so that it removes the corrupted CKAN solr core data on startup
Update the service to use that task definition
Let the service restart the task
Update the service to use the previous standard task definition
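
As a starting point for automation, the manual steps above could be scripted roughly as follows. This is only a sketch: it assumes the cleanup task definition has already been registered, and the names `recover_follower`, `cleanup_task_def` and `standard_task_def` are hypothetical. The `ecs` argument is expected to behave like `boto3.client("ecs")`; only its `update_service` call and `services_stable` waiter are used.

```python
def recover_follower(ecs, cluster, service, cleanup_task_def, standard_task_def):
    """Run the manual recovery steps against an ECS service.

    `ecs` is expected to behave like boto3.client("ecs"). All other
    arguments are hypothetical identifiers supplied by the caller.
    """
    waiter = ecs.get_waiter("services_stable")

    # First three steps: point the service at the task definition whose
    # startup clears the corrupted core data, and let ECS replace the task.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=cleanup_task_def,
        forceNewDeployment=True,
    )
    waiter.wait(cluster=cluster, services=[service])

    # Last step: switch the service back to the standard task definition.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=standard_task_def,
    )
    waiter.wait(cluster=cluster, services=[service])
```

Passing the client in, rather than creating it inside the function, keeps the sequence testable without AWS credentials. A stable service only confirms that the task restarted; confirming that the follower is actually recovering would still need a separate check against SOLR.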

Need to find a way to automate this on a trigger, either from an AWS CloudWatch alarm or from some type of status request to the SOLR API.
There may also be a corruption error where the core is still stable but is unable to get updates from the leader and goes stale. The recovery process would be the same, but the trigger would be different.
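
One possible shape for the SOLR-side trigger is to compare index generations on the leader and the follower. The sketch below assumes legacy leader/follower replication and that the replication handler's `command=indexversion` response contains a `generation` field; the exact fields, and what a follower reports there, should be checked against the deployed SOLR version. `fetch_generation`, `classify_follower`, the base URL and the core name are all hypothetical.

```python
import json
import urllib.request


def fetch_generation(base_url, core, timeout=5):
    """Return the index generation a SOLR node reports, or None if unreachable."""
    url = f"{base_url}/solr/{core}/replication?command=indexversion&wt=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("generation")
    except (OSError, ValueError):
        # Connection errors, timeouts, HTTP errors, or a non-JSON body.
        return None


def classify_follower(leader_generation, follower_generation):
    """Map the two generations onto the bad states described above."""
    if follower_generation is None:
        return "unresponsive"
    if leader_generation is None:
        return "unknown"  # cannot judge staleness without the leader
    if follower_generation < leader_generation:
        return "stale"
    return "in_sync"
```

A follower normally lags the leader until its next poll, so a real trigger would probably require the `stale` result to persist across several checks before starting recovery.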

hkdctol · 2022-08-18T20:32:44Z

Moving to icebox for now as this is less likely.

jbrown-xentity added this to data.gov team board Aug 8, 2022

hkdctol moved this to Product Backlog in data.gov team board Aug 18, 2022

nickumia-reisys mentioned this issue Sep 15, 2022

Dissect Solr Performance through New Relic #3956

Open

5 tasks

hkdctol moved this from 📔 Product Backlog to 🧊 Icebox in data.gov team board May 31, 2023
