Possible Race when using high number of replicas of Elastic and Cassandra #38
Comments
Interesting find @etiennedi, appreciate the detailed write-up, it makes the issue clear. I'm trying to reproduce now. One comment would be that yes, in general, we would want to wait for all clusters (cassandra, elasticsearch) to be up and running before initiating weaviate. In practice this shouldn't be an issue, as it only ever applies when first creating a cluster. With that being said, I wouldn't expect it to break Weaviate. A more interesting test would be scaling up the cassandra cluster, as this would be a more realistic scenario. My test results (I will update this comment as I run more tests): Large Cassandra values (Cluster Size: 20, Seed Size: 3)
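For reference, a minimal sketch of how such a test run could be started with Helm value overrides. The chart path, release name, and the exact value keys (`cassandra.config.cluster_size`, `cassandra.config.seed_size`, `elasticsearch.data.replicas`) are assumptions and would need to match the keys in this chart's values.yaml:

```bash
# Sketch only: install the chart with larger Cassandra/Elasticsearch clusters.
# Chart path and value keys are assumptions; check values.yaml for the real names.
helm install ./weaviate-chart \
  --name weaviate-race-test \
  --set cassandra.config.cluster_size=20 \
  --set cassandra.config.seed_size=3 \
  --set elasticsearch.data.replicas=5

# Watch the pods come up before pointing any import script at weaviate.
kubectl get pods -w
```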
Thanks a lot for starting to investigate. Your first findings make me quite optimistic that the issue might indeed be due to something else.
Update: I ran tests with the following cluster sizes:
Import script with the following args:
All seemed to go well until weaviate crashed (and wouldn't restart, error below)
Attaching logs from the validation script, @etiennedi. Note: the cluster took a long time to get everything up and running (~30 mins).
That's wonderful news. Thank you! Since opening the issue, I've pushed quite a few changes in weaviate, so there is a chance that either this was a one-off occurrence during my testing or it has been fixed in the meantime. I'm somewhat leaning towards the latter, since the exact error message I got was related to issues around key auth, and the "keys" feature has been completely removed since then.
This is expected; I think there's an issue for it (#37). However, this is almost fixed: I have a fix on a branch that I will merge shortly, which will remove the configuration from the local file system and move that config to etcd. I'm a bit surprised that a mere crash was enough to break it; I would have expected that it would take at least a pod recreation. But nevertheless, I'll merge the "horizontal scalability" feature shortly and then all these issues are gone. This does, however, mean we need to add etcd to our Helm chart. I've just opened an issue to include this (#41).
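For what it's worth, a quick way to check whether a restart alone already loses the locally stored config might look like the sketch below; the `app=weaviate` label selector is an assumption about how the chart labels its pods:

```bash
# Sketch: force a weaviate pod restart and see whether the replacement comes up healthy.
# The label selector is an assumption; adjust it to whatever the chart actually uses.
kubectl delete pod -l app=weaviate

# Watch the replacement pod; if the config only lives on the old pod's filesystem,
# the new pod would be expected to fail here.
kubectl get pods -l app=weaviate -w
kubectl logs -l app=weaviate --tail=50
```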
Did you only start running the verification script after the 30 minutes, or did you already start while the cluster was still scaling up?
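To take the timing question out of the equation, one option is to block until everything reports as rolled out before starting the verification script. A sketch, assuming the resource names below (they may well differ in this chart), with a hypothetical verification-script.sh standing in for the actual script:

```bash
# Sketch: wait for the stateful stores to finish rolling out before running
# any import/verification script. Resource names are assumptions.
kubectl rollout status statefulset/cassandra
kubectl rollout status statefulset/elasticsearch-data
kubectl rollout status deployment/weaviate

# Only then start the verification script (name is a placeholder).
./verification-script.sh
```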
Here's an interesting issue I found while playing around with the Helm chart. The default configuration works fine, but if I increase the number of replicas for C* and Elasticsearch, weaviate stops working as more pods are added. (Note: I don't mean increase as in "scaling up after initial deployment", I mean modifying the `values.yml` before I've ever installed the Helm chart!) I haven't dug into this more deeply, but I have a couple of suspicions about what could be going wrong. First of all, this is what I did (i.e. steps to reproduce).
How to reproduce, i.e. what I've done and noticed
Eventually weaviate started responding with a 401, claiming that the used key does not exist. This means it could not retrieve the key from the db.
What we should do to investigate this more
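As one starting point (my suggestion, not an exhaustive list), it might help to correlate the 401 with what the weaviate pod and the databases were doing at that moment. A sketch, with label selectors that are assumptions about this chart:

```bash
# Sketch: gather state around the 401. Label selectors are assumptions.
kubectl get pods -o wide                        # are all cassandra/elasticsearch pods Ready?
kubectl logs -l app=weaviate --tail=200         # look for the key-lookup / auth error
kubectl logs -l app=cassandra --tail=100        # any dropped connections or rebalancing?
kubectl describe pod -l app=weaviate | grep -A5 "Last State"   # restarts / crash loops
```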
Open Questions:
@idcrosby Possibly you know more about these questions of mine
Further actions
@idcrosby Sorry for this long mess of text; if a video would make this easier to understand, please let me know and I'll record one. Nevertheless, some input based on your experience would help tremendously, especially as we would like to demo weaviate's ability to handle large datasets to potential customers shortly.
PS: Great work on the Terraform and Helm setup; I had never used Terraform before, and setting up the cluster with it was super convenient!
Attachment 1: Changes in values.yaml