SOLR cloud leader loss recovery #3784
Comments
Preliminary results

Communicate with Zookeeper API

I was trying to find where the files referenced in the link above exist and could not locate them physically on either the Solr pods/PVs or the Zookeeper pods/PVs. I found a reference that talked about utilizing the Zookeeper API to manipulate the data. I was on my way to try to access this API, but was blocked by the fact that Zookeeper is not publicly available, so Python libraries to connect to it won't work until we expose it in some form... Or, now that I think about it, I can run a Python pod to run the script and interact with it internally within the EKS cluster. There's one Python library that can communicate with Zookeeper: kazoo (not to be confused with the zookeeper package, which is actually for ML/Data Science 🙄).

Upgrade Zookeeper for stability upgrades

I also researched our deployment of Zookeeper and reviewed the Release Notes for it.

Additional notes: it seems Zookeeper uses …
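A minimal sketch of the in-cluster idea described above, using kazoo from a pod inside the EKS cluster. The service host below is a placeholder assumption, not a value from this repo; the full walkthrough is in the next comment.

```python
# Sketch only: install kazoo (not the unrelated "zookeeper" package) inside the pod,
# then connect to the cluster-internal Zookeeper endpoint (placeholder host below).
from kazoo.client import KazooClient

zk = KazooClient(hosts="<zookeeper-service-host>:2181")
zk.start()
print(zk.get_children("/collections"))  # should list the ckan collection
zk.stop()
```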
Implementation Details

Communicate with Zookeeper API

I was able to interact with the Zookeeper instance by deploying a basic Python pod to the same EKS cluster and running the kazoo client from it.

To deploy the Python pod:

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zok-client
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: zok-client-app
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: zok-client-app
    spec:
      containers:
        - image: python:3.8
          name: zok-client-app
          command: ['sleep']
          args: ['infinity']
          securityContext:
            allowPrivilegeEscalation: false
```

Once deployed, I looked up the Zookeeper service's cluster IP and got into the pod:

```bash
kubectl exec -ti pod/zok-client-6fddb85487-nh4sj -- bash
```

Installed kazoo and started a Python shell:

```python
>>> from kazoo.client import KazooClient
>>> zk = KazooClient(hosts='172.20.15.209:2181')
>>> zk.start()
>>> zk.delete("/collections/ckan/leader_elect/shard1/election/<file>")
>>> zk.get("/collections/ckan/leader_elect/shard1/election/<file>")
>>> zk.create("/collections/ckan/leaders/shard1/leader", b"<data>")
>>> zk.set("/collections/ckan/state.json", b"<data>")
```

I could not get a new leader elected after following the directions above. The best I could do was trick the collection into coming up. The problem was that, because there was no leader, the collection would say, "I don't know what to do without a leader," and then all of the nodes would go down again. When the collection was revived, the data loaded into the Catalog App, which showed that the issue was purely about coordinating the leader/follower paradigm and not Solr outright rejecting data. Whether that means Zookeeper itself had issues, or Zookeeper's integration with Solr in high-load scenarios is the problem, is still unknown.
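For anyone repeating this, a minimal read-only sketch for inspecting the election state before deleting or creating any znodes. It assumes the same kazoo client, znode paths, and cluster-internal Zookeeper address used in the session above, and Solr's usual sequence-suffixed election znode names.

```python
# Diagnostic sketch using kazoo; paths and host mirror the session above (assumptions).
from kazoo.client import KazooClient

ZK_HOST = "172.20.15.209:2181"  # cluster-internal Zookeeper endpoint from above
ELECTION_PATH = "/collections/ckan/leader_elect/shard1/election"
LEADER_PATH = "/collections/ckan/leaders/shard1/leader"

zk = KazooClient(hosts=ZK_HOST)
zk.start()

# Election znodes end in a zero-padded sequence number; the lowest sequence
# is the node expected to step up as leader.
candidates = zk.get_children(ELECTION_PATH)
print("election candidates:")
for name in sorted(candidates, key=lambda n: n.rsplit("_", 1)[-1]):
    print(" ", name)

# Check whether a leader znode currently exists and what it points to.
if zk.exists(LEADER_PATH):
    data, stat = zk.get(LEADER_PATH)
    print("current leader znode:", data)
else:
    print("no leader znode present for shard1")

zk.stop()
```

Comparing the candidate list against the leader znode is a quick way to see whether the election is stuck before resorting to deleting znodes.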
While a completely different direction, this is a good archive thread that might explain why this ticket is necessary in the first place: https://www.mail-archive.com/[email protected]/msg143043.html
Since there haven't been any breakthroughs, these are the results from the initial discovery dive.

ForceLeader curl:

```bash
curl -L 'http://<username>:<password>@<url>:80/solr/admin/collections?action=FORCELEADER&collection=ckan&shard=shard1'
```
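For reference, the same FORCELEADER call can be made programmatically. This is a sketch only; it assumes the requests library and reuses the placeholder host and credentials from the curl command above.

```python
# Sketch of the FORCELEADER call from Python; host, credentials, and the
# 'requests' dependency are assumptions mirroring the curl command above.
import requests

SOLR_URL = "http://<url>:80/solr/admin/collections"  # placeholder host
AUTH = ("<username>", "<password>")                   # placeholder credentials

resp = requests.get(
    SOLR_URL,
    params={"action": "FORCELEADER", "collection": "ckan", "shard": "shard1"},
    auth=AUTH,
    allow_redirects=True,  # mirrors curl -L
)
print(resp.status_code)
print(resp.text)
```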
I believe it's safe to close this issue because we have been running the legacy leader/follower Solr deployment (and not SolrCloud). In case we ever go back to SolrCloud, we can re-open this issue.
User Story
In order to have minimal downtime on SOLR when all leaders are lost, data.gov admins want a recovery path for the SOLR cloud cluster.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
WHEN all SOLR instances are down and/or "not leaders"
THEN a documented/automated system is in place to elect a leader
Background
During the last few Solr deployments, this has been a recurring issue. Solr is still operational, but the collection that holds our CKAN data is down. The data is intact, but the collection can't be used because there is no shard/node stepping up to be the leader. There are facilities in place to force a leader in these normally uncommon scenarios.
The official docs don't seem to work out of the box. From research, it seems there are other ways to clean up the collection state so that the collection can recover properly with the ForceLeader API call.
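As a rough illustration of what a documented/automated recovery check could look like (not an existing script: the host, credentials, and response parsing are assumptions based on the standard Solr Collections API CLUSTERSTATUS action):

```python
# Hypothetical leader check against the Solr Collections API (CLUSTERSTATUS);
# host and credentials are placeholders, not values from this repo.
import requests

SOLR_URL = "http://<url>:80/solr/admin/collections"
AUTH = ("<username>", "<password>")

status = requests.get(
    SOLR_URL,
    params={"action": "CLUSTERSTATUS", "collection": "ckan"},
    auth=AUTH,
).json()

shards = status["cluster"]["collections"]["ckan"]["shards"]
for shard_name, shard in shards.items():
    has_leader = any(
        replica.get("leader") == "true" and replica.get("state") == "active"
        for replica in shard["replicas"].values()
    )
    print(shard_name, "has a leader" if has_leader else "HAS NO LEADER")
    # A shard with no active leader is where a FORCELEADER call (as above)
    # or znode cleanup would be triggered.
```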
Security Considerations (required)
None.
Sketch
Exploratory by design.