Better health checks and monitoring for solr service #930
Labels
component/catalog
Related to catalog component playbooks/roles
component/inventory
Inventory playbooks/roles
User Story
As an operator, I want to know when Solr is returning a high number of errors that might impact the health of Catalog and Inventory, so that I can act on the Solr issue before Catalog or Inventory goes down.
Details
Solr serves both Catalog and Inventory and is a major backing service to CKAN. If Solr is unavailable or returning errors, Catalog and Inventory are effectively down. Currently, we only learn that Solr is misbehaving when we observe a higher number of errors in Catalog or Inventory, which might manifest as an Uptrends "down" alert.
We currently monitor via New Relic that the Solr service is running and the host is up.
Because of how we shard traffic to Solr, it's very possible that a single Solr instance having issues would go unnoticed, or appear only as intermittent errors in Catalog and Inventory.
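One way to catch a single bad instance is to probe each Solr host directly instead of going through the sharded path. The following is a minimal sketch, not a proposed implementation. It assumes Solr's standard ping handler (`/solr/<core>/admin/ping`) is enabled on each instance; the host URLs and the core name `ckan` in the usage note are hypothetical placeholders, and the function names are made up for illustration.

```python
import json
import urllib.request
from typing import Callable, Iterable, List

# Assumption: each instance exposes Solr's ping handler at this path.
PING_PATH = "/solr/{core}/admin/ping?wt=json"


def http_fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def ping_instance(
    base_url: str, core: str, fetch: Callable[[str], str] = http_fetch
) -> bool:
    """Return True only if this one instance answers its ping with status OK."""
    url = base_url.rstrip("/") + PING_PATH.format(core=core)
    try:
        body = json.loads(fetch(url))
    except Exception:
        # Connection failures, HTTP errors, timeouts, and invalid JSON
        # are all treated as "unhealthy" for this instance.
        return False
    return isinstance(body, dict) and body.get("status") == "OK"


def unhealthy_instances(
    base_urls: Iterable[str], core: str, fetch: Callable[[str], str] = http_fetch
) -> List[str]:
    """Check every instance individually and list the ones that failed."""
    return [u for u in base_urls if not ping_instance(u, core, fetch)]
```

For example, `unhealthy_instances(["http://solr1:8983", "http://solr2:8983"], "ckan")` would return the base URLs of any instances that failed the probe, which a scheduled job could then report to whatever alerting is in place. The `fetch` parameter exists only so the check can be exercised without a live Solr.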
Acceptance Criteria