SOLR cloud leader loss recovery #3784
Comments
Preliminary results

Communicate with Zookeeper API

I was trying to find where the files referenced in the link above exist and could not locate them physically on either the Solr pods/PVs or the Zookeeper pods/PVs. I found a reference that talked about utilizing the Zookeeper API to manipulate the data. I was on my way to try to access this API, but was blocked by the fact that Zookeeper is not publicly available, so Python libraries to connect to it won't work until we expose it in some form... Or, now that I think about it, I can run a Python pod to run the script and interact with it internally within the EKS cluster. There's one Python library that can communicate with Zookeeper: kazoo (not to be confused with the zookeeper package, which is actually for ML/Data Science 🙄).

Upgrade Zookeeper for stability upgrades

I also researched our deployment of Zookeeper and reviewed the Release Notes for it.

Additional notes: it seems Zookeeper uses …
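A minimal sketch of the in-cluster idea described above, using kazoo from a pod inside the EKS cluster. The service host below is a placeholder assumption, not a value from this repo; the full walkthrough is in the next comment.

```python
# Sketch only: install kazoo (not the unrelated "zookeeper" package) inside the pod,
# then connect to the cluster-internal Zookeeper endpoint (placeholder host below).
from kazoo.client import KazooClient

zk = KazooClient(hosts="<zookeeper-service-host>:2181")
zk.start()
print(zk.get_children("/collections"))  # should list the ckan collection
zk.stop()
```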
Implementation Details

Communicate with Zookeeper API

I was able to interact with the Zookeeper instance by deploying a basic Python pod to the same EKS cluster and running the kazoo client from it.

To deploy the Python pod:

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zok-client
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: zok-client-app
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zok-client-app
    spec:
      containers:
        - image: python:3.8
          name: zok-client-app
          command: ['sleep']
          args: ['infinity']
          securityContext:
            allowPrivilegeEscalation: false
```

Once deployed, I looked up the Zookeeper service's cluster IP and got into the pod:

```bash
kubectl exec -ti pod/zok-client-6fddb85487-nh4sj -- bash
```

Installed kazoo and started a Python shell:

```python
>>> from kazoo.client import KazooClient
>>> zk = KazooClient(hosts='172.20.15.209:2181')
>>> zk.start()
>>> zk.delete("/collections/ckan/leader_elect/shard1/election/<file>")
>>> zk.get("/collections/ckan/leader_elect/shard1/election/<file>")
>>> zk.create("/collections/ckan/leaders/shard1/leader", b"<data>")
>>> zk.set("/collections/ckan/state.json", b"<data>")
```

I could not get a new leader elected after following the directions above. The best I could do was trick the collection into coming up. The problem was that, because there was no leader, the collection would say, "I don't know what to do without a leader," and then all of the nodes would go down again. When the collection was revived, the data loaded into the Catalog App, which showed that the issue was purely about coordinating the leader/follower paradigm and not Solr outright rejecting data. Whether that means Zookeeper itself had issues, or Zookeeper's integration with Solr in high-load scenarios is the problem, is still unknown.
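For anyone repeating this, a minimal read-only sketch for inspecting the election state before deleting or creating any znodes. It assumes the same kazoo client, znode paths, and cluster-internal Zookeeper address used in the session above, and Solr's usual sequence-suffixed election znode names.

```python
# Diagnostic sketch using kazoo; paths and host mirror the session above (assumptions).
from kazoo.client import KazooClient

ZK_HOST = "172.20.15.209:2181"  # cluster-internal Zookeeper endpoint from above
ELECTION_PATH = "/collections/ckan/leader_elect/shard1/election"
LEADER_PATH = "/collections/ckan/leaders/shard1/leader"

zk = KazooClient(hosts=ZK_HOST)
zk.start()

# Election znodes end in a zero-padded sequence number; the lowest sequence
# is the node expected to step up as leader.
candidates = zk.get_children(ELECTION_PATH)
print("election candidates:")
for name in sorted(candidates, key=lambda n: n.rsplit("_", 1)[-1]):
    print(" ", name)

# Check whether a leader znode currently exists and what it points to.
if zk.exists(LEADER_PATH):
    data, stat = zk.get(LEADER_PATH)
    print("current leader znode:", data)
else:
    print("no leader znode present for shard1")

zk.stop()
```

Comparing the candidate list against the leader znode is a quick way to see whether the election is stuck before resorting to deleting znodes.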
While a completely different direction, this is a good archive thread that might explain why this ticket is necessary in the first place: https://www.mail-archive.com/[email protected]/msg143043.html
Since there haven't been any breakthroughs, these are the results from the initial discovery dive.

ForceLeader curl:

```bash
curl -L 'http://<username>:<password>@<url>:80/solr/admin/collections?action=FORCELEADER&collection=ckan&shard=shard1'
```
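For reference, the same FORCELEADER call can be made programmatically. This is a sketch only; it assumes the requests library and reuses the placeholder host and credentials from the curl command above.

```python
# Sketch of the FORCELEADER call from Python; host, credentials, and the
# 'requests' dependency are assumptions mirroring the curl command above.
import requests

SOLR_URL = "http://<url>:80/solr/admin/collections"  # placeholder host
AUTH = ("<username>", "<password>")                   # placeholder credentials

resp = requests.get(
    SOLR_URL,
    params={"action": "FORCELEADER", "collection": "ckan", "shard": "shard1"},
    auth=AUTH,
    allow_redirects=True,  # mirrors curl -L
)
print(resp.status_code)
print(resp.text)
```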
I believe it's safe to close this issue because we have been running the legacy leader/follower Solr deployment (and not SolrCloud). In case we ever go back to SolrCloud, we can re-open this issue.
User Story
In order to have minimal downtime on SOLR when all leaders are lost, data.gov admins want a recovery path for the SOLR cloud cluster.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN all SOLR instances are down and/or "not leaders"
THEN a documented/automated system is in place to elect a leader
Background
During the last few Solr deployments, this has been a recurring issue. Solr is still operational, but the collection that holds our CKAN data is down. The data is intact, but the collection can't be used because there is no shard/node stepping up to be the leader. There are facilities in place to force a leader in these normally uncommon scenarios.
The official docs don't seem to work out of the box. From research, it seems there are other ways to clean up the collection state so that the collection can recover properly with the ForceLeader API call.
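As a rough illustration of what a documented/automated recovery check could look like (not an existing script: the host, credentials, and response parsing are assumptions based on the standard Solr Collections API CLUSTERSTATUS action):

```python
# Hypothetical leader check against the Solr Collections API (CLUSTERSTATUS);
# host and credentials are placeholders, not values from this repo.
import requests

SOLR_URL = "http://<url>:80/solr/admin/collections"
AUTH = ("<username>", "<password>")

status = requests.get(
    SOLR_URL,
    params={"action": "CLUSTERSTATUS", "collection": "ckan"},
    auth=AUTH,
).json()

shards = status["cluster"]["collections"]["ckan"]["shards"]
for shard_name, shard in shards.items():
    has_leader = any(
        replica.get("leader") == "true" and replica.get("state") == "active"
        for replica in shard["replicas"].values()
    )
    print(shard_name, "has a leader" if has_leader else "HAS NO LEADER")
    # A shard with no active leader is where a FORCELEADER call (as above)
    # or znode cleanup would be triggered.
```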
Security Considerations (required)
None.
Sketch
Exploratory by design.