SOLR cloud leader loss recovery #3784

Closed
jbrown-xentity opened this issue Apr 15, 2022 · 5 comments
Assignees
Labels
bug Software defect or bug component/solr-service Related to Solr-as-a-Service, a brokered Solr offering component/ssb Testing

Comments

@jbrown-xentity
Contributor

jbrown-xentity commented Apr 15, 2022

User Story

In order to minimize downtime on SOLR when all leaders are lost, data.gov admins want a recovery path for the SOLR cloud cluster.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN SOLR has data in it
    WHEN all SOLR instances are down and/or "not leaders"
    THEN a documented/automated system is in place to elect a leader

Background

During the last few solr deployments, this has been a recurring issue. Solr itself stays operational, but the collection that holds our CKAN data goes down. The data is intact, but the collection can't be used because no shard/node steps up to be the leader. There are facilities in place to force a leader in such normally uncommon scenarios.

The official docs don't seem to work out of the box. From research, it seems there are other ways to clean up the collection state so that the collection can recover properly via the ForceLeader API call.

Security Considerations (required)

None.

Sketch

Explorative by design.

@nickumia-reisys nickumia-reisys self-assigned this Apr 18, 2022
@nickumia-reisys
Contributor

Preliminary results

Communicate with Zookeeper API

I was trying to find where the files referenced in the link above actually live, and could not locate them physically on either the solr pods/pvs or the zookeeper pods/pvs. I found a reference that talked about utilizing the Zookeeper API to manipulate the data.

I was on my way to try to access this API, but was blocked by the fact that Zookeeper is not publicly available, so python libraries to connect with it won't work until we expose it in some form... Or, now that I think about it, I can run a python pod inside the EKS cluster and interact with Zookeeper internally from there.

There's one python library that can communicate with Zookeeper: kazoo (Not to be confused with the zookeeper package that is actually for ML/Data Science 🙄)
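A minimal sketch of what that looks like with kazoo, just to confirm the "files" live as znodes inside Zookeeper rather than on disk (the host/port is a placeholder; the /collections paths assume the standard SolrCloud layout):

from kazoo.client import KazooClient

# Zookeeper is only reachable inside the EKS cluster, so this has to run from a pod.
zk = KazooClient(hosts='<zookeeper-client-service-ip>:2181')
zk.start()

# The collection state is stored as znodes, not as files on the pods/pvs.
print(zk.get_children('/collections'))
print(zk.get_children('/collections/ckan'))

zk.stop()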

Upgrade Zookeeper for stability upgrades

I also researched our deployment of zookeeper with zookeeper-operator. It turns out there is an update in the zookeeper-operator from 0.2.12 to 0.2.13 which upgrades zookeeper from 3.6.1 to 3.6.3 with a lot of bug fixes.

Release Notes for zookeeper-operator 0.2.13
Release Notes for zookeeper 3.6.2
Release Notes for zookeeper 3.6.3
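For reference, a sketch of where that bump would land, assuming the cluster is described by zookeeper-operator's ZookeeperCluster custom resource (the metadata name and replica count below are placeholders):

apiVersion: zookeeper.pravega.io/v1beta1
kind: ZookeeperCluster
metadata:
  name: zookeeper            # placeholder
spec:
  replicas: 3                # placeholder
  image:
    repository: pravega/zookeeper
    tag: 0.2.13              # operator 0.2.13 image, which bundles zookeeper 3.6.3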

Additional notes:

It seems zookeeper uses getent instead of nslookup or dig to check cluster communication links. That's an interesting design choice introduced in zookeeper-operator 0.2.13, and it might also relate to some of the logs we are seeing in our solr cluster.
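The same lookup can be reproduced by hand from inside a pod (the DNS name below is hypothetical):

getent hosts zookeeper-0.zookeeper-headless.default.svc.cluster.local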

@nickumia-reisys
Contributor

Implementation Details

Communicate with Zookeeper API

I was able to interact with the Zookeeper instance by deploying a basic python pod to the same EKS cluster and running the kazoo library from there.

To deploy the python pod,

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zok-client
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: zok-client-app
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zok-client-app
    spec:
      containers:
      - image: python:3.8
        name: zok-client-app
        command: ['sleep']
        args: ['infinity']
        securityContext:
          allowPrivilegeEscalation: false
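and apply it (the filename is just whatever the manifest above is saved as):

kubectl apply -f zok-client.yaml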

Once deployed, I looked up the zookeeper-client for a particular solr cluster,

[screenshot: service listing showing the zookeeper-client ClusterIP used below]
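A sketch of that lookup (the exact service name will differ per cluster):

kubectl get svc | grep zookeeper-client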

Got into the pod,

kubectl exec -ti pod/zok-client-6fddb85487-nh4sj -- bash

Installed kazoo and manipulated the data on zookeeper, watching the changes take effect in real time,

>>> from kazoo.client import KazooClient
>>> zk = KazooClient(hosts='172.20.15.209:2181')  # ClusterIP of the zookeeper-client service
>>> zk.start()
>>> zk.delete("/collections/ckan/leader_elect/shard1/election/<file>")  # drop a stale election znode
>>> zk.get("/collections/ckan/leader_elect/shard1/election/<file>")     # read an election znode back
>>> zk.create("/collections/ckan/leaders/shard1/leader", b"<data>")     # write a leader znode by hand
>>> zk.set("/collections/ckan/state.json", b"<data>")                   # overwrite the collection state

I could not get a new leader elected after following the directions above. The best I could do was trick the collection into coming up. The problem was that, because there was no leader, the collection would effectively say, "I don't know what to do without a leader," and then all of the nodes would go down again. When the collection was revived, the data loaded into the Catalog App, which showed that the issue was purely in coordinating the leader/follower paradigm, not Solr outright rejecting the data. Whether the root cause is Zookeeper itself or Zookeeper's integration with Solr under high load is still unknown.
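One way to watch that cycle is to poll the Collections API CLUSTERSTATUS action and check the replica state/leader fields (credentials and host are placeholders, same pattern as the FORCELEADER curl later in this thread):

curl -L 'http://<username>:<password>@<url>:80/solr/admin/collections?action=CLUSTERSTATUS&collection=ckan'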

@nickumia-reisys
Contributor

While this is a completely different direction, here is a good archive thread that might explain why this ticket is necessary in the first place: https://www.mail-archive.com/[email protected]/msg143043.html

@nickumia-reisys nickumia-reisys removed their assignment Apr 20, 2022
@nickumia-reisys
Contributor

nickumia-reisys commented Apr 20, 2022

Since there haven't been any breakthroughs, these are the results from the initial discovery dive:

  • Zookeeper tracks the leader/follower state/management of Solr. However, the physical files are nowhere to be found. The only way to CRUD the data is by using the Zookeeper client API (I used the kazoo python library, but the zkcli should work as well).
  • The nodes supporting our CKAN collection can be tricked into coming up; however, there's no leader and they go back down.
  • There are files in /collections/ckan/leader_elect/shard1/election/ that manage the leader election status of the solr collection. The lowest-numbered filename is supposed to be the leader (see the kazoo sketch at the end of this comment).
    • Upon a ForceLeader API call, all of the files begin to increment as a new leader is trying to be elected.
  • There is a PeerSync operation that is stuck in "Error applying updates", most likely for the reasons discussed in the mail archive link posted above.
    • The solution seems to be:
      • Don't restart Solr nodes during indexing operations (with no clear path on how to satisfy this).
  • The ForceLeader API call won't work if there is already an active leader; Solr will throw an exception:
      "error":{
          "metadata":[
            "error-class","org.apache.solr.common.SolrException",
            "root-error-class","org.apache.solr.common.SolrException"],
          "msg":"The shard already has an active leader. Force leader is not applicable. State: shard1:{\n  \"range\":\"80000000-7fffffff\",\n  \"state\":\"active\",\n  \"replicas\":{\n    \"core_node3\":{\n      \"core\":\"ckan_local_shard1_replica_n1\",\n      \"node_name\":\"default-solr-e424cd22ef1c76ed-solrcloud-0.solrcloud2.ssb.data.gov:80_solr\",\n      \"base_url\":\"http://default-solr-e424cd22ef1c76ed-solrcloud-0.solrcloud2.ssb.data.gov:80/solr\",\n      \"state\":\"active\",\n      \"type\":\"NRT\",\n      \"force_set_state\":\"false\"},\n    \"core_node4\":{\n      \"core\":\"ckan_local_shard1_replica_n2\",\n      \"node_name\":\"default-solr-e424cd22ef1c76ed-solrcloud-3.solrcloud2.ssb.data.gov:80_solr\",\n      \"base_url\":\"http://default-solr-e424cd22ef1c76ed-solrcloud-3.solrcloud2.ssb.data.gov:80/solr\",\n      \"state\":\"active\",\n      \"type\":\"NRT\",\n      \"force_set_state\":\"false\",\n      \"leader\":\"true\"}}}",

ForceLeader curl:

curl -L 'http://<username>:<password>@<url>:80/solr/admin/collections?action=FORCELEADER&collection=ckan&shard=shard1'
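To tie the bullets above together, a minimal kazoo sketch (host and znode naming assumed, as before) for checking which election znode currently holds the lowest sequence number, i.e. which core should step up as leader:

from kazoo.client import KazooClient

zk = KazooClient(hosts='<zookeeper-client-service-ip>:2181')
zk.start()

# Election znode names are assumed to end in a zero-padded sequence number
# (e.g. ...-core_node4-n_0000000010); the lowest sequence should be the leader.
nodes = zk.get_children('/collections/ckan/leader_elect/shard1/election')
for name in sorted(nodes, key=lambda n: int(n.rsplit('_', 1)[-1])):
    print(name)

zk.stop()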

@jbrown-xentity jbrown-xentity changed the title SOLR leader loss recovery SOLR cloud leader loss recovery Apr 21, 2022
@nickumia-reisys
Contributor

I believe it's safe to close this issue because we have been running the legacy leader/follower Solr deployment (and not SolrCloud). In case we ever go back to SolrCloud, we can re-open this issue.

@nickumia-reisys nickumia-reisys self-assigned this Jul 20, 2023
@nickumia-reisys nickumia-reisys added component/solr-service Related to Solr-as-a-Service, a brokered Solr offering component/ssb Testing bug Software defect or bug labels Oct 7, 2023
@nickumia-reisys nickumia-reisys moved this to 🗄 Closed in data.gov team board Oct 7, 2023