
Possible Race when using high number of replicas of Elastic and Cassandra #38

Open
etiennedi opened this issue Jan 9, 2019 · 4 comments

etiennedi (Member) commented Jan 9, 2019

Here's an interesting issue I found while playing around with the Helm chart. The default configuration works fine, but if I increase the number of replicas for C* and Elasticsearch, weaviate stops working as more pods are added. (Note: I don't mean increasing as in "scaling up after initial deployment", I mean modifying the values.yaml before I've ever installed the helm chart!)

I haven't dug into this more deeply yet, but I have a couple of suspicions about what could be going wrong. First of all, here is what I did (i.e. steps to reproduce):

How to reproduce, i.e. what I've done and noticed

  1. I randomly increased the replica numbers for Cassandra and Elasticsearch in the values.yaml. Admittedly, the numbers weren't chosen with any particular logic, just random increases, so I would not rule out that my values simply don't make sense (see below).
  2. Install the helm chart as usual (see the sketch after this list).
  3. I ran a simple import script. The script was purpose-built for what I was trying to test, but the regular verification script should work just as well. It's only important that it runs for some time (at least 2-3 minutes, the more the better).
  4. Both Elasticsearch and Cassandra scale up very slowly. One pod is added at a time, only once the previous pod has been marked as ready. This means weaviate will already be ready while the cluster is still being built (this is why I'm suspecting a potential race).
  5. As more and more nodes are added, weaviate suddenly responds with 401, claiming that the used key does not exist. This means it could not retrieve the key from the db.
  6. My suspicion is that as more and more nodes are added, data is not being replicated. Maybe (this is a big maybe) Janus needs something like a QUORUM from Cassandra. This could explain why the default 2-node cluster is fine, but as it grows bigger (with possibly empty nodes) a QUORUM can no longer be reached.
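
For reference, the whole reproduction roughly boils down to the commands below (chart path, release name and the import invocation are placeholders, not the exact ones I used):

# install with the increased replica counts (values_beefed_up.yaml from Attachment 1 below)
helm install --name weaviate -f values_beefed_up.yaml ./weaviate-helm-chart

# watch the Cassandra/Elasticsearch pods come up one at a time
kubectl get pods -w

# start importing while the cluster is still growing; any script that keeps
# reading and writing for a few minutes will do (placeholder invocation)
./import-script -w http://weaviate:80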

What we should do to investigate this more

  1. Not scale up both Elasticsearch and Cassandra at the same time, but rather one at a time (e.g. via overrides as sketched below). This way we can figure out which one is the culprit.
  2. Confirm the suspicions above; with so many variables this could also be due to something else entirely.
  3. Investigate if there actually is a race and find ways to mitigate it.
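
For point 1, a rough sketch of how the two runs could be isolated with value overrides (release names and the chart path are placeholders, and the exact value paths depend on the chart's values.yaml structure, so treat them as assumptions):

# run 1: scale up only Cassandra, leave Elasticsearch at its defaults
helm install --name weaviate-cass-only \
  --set cassandra.config.cluster_size=20 \
  --set cassandra.config.seed_size=5 \
  ./weaviate-helm-chart

# run 2: scale up only Elasticsearch, leave Cassandra at its defaults
helm install --name weaviate-es-only \
  --set elasticsearch.master.replicas=10 \
  --set elasticsearch.client.replicas=10 \
  ./weaviate-helm-chart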

Open Questions:

@idcrosby Possibly you know more about these questions of mine

  1. How does the Cassandra helm chart work? Let's say I request 20 nodes. Because of the gradual startup of the nodes, this takes about 30-60min. While the nodes are still being built, what happens with data? Is there anything running to replicate data between nodes? Or is it possibly even a requirement to not add data until the cluster is considered "stable"? (The nodetool sketch below might help answer this.)
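
One way to watch what Cassandra is actually doing while the nodes come up, assuming the usual StatefulSet pod naming (so "cassandra-0" is an assumption):

# which nodes have joined the ring (UN = up/normal, UJ = up/joining) and how much data each owns
kubectl exec -it cassandra-0 -- nodetool status

# data currently being streamed to a bootstrapping node
kubectl exec -it cassandra-0 -- nodetool netstats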

Further actions

@idcrosby Sorry for this long mess of text; if a video would make this easier to understand, please let me know and I'll record one. Nevertheless, some input based on your experience would help tremendously, especially as we would like to demo weaviate's ability to handle large datasets to potential customers shortly.

PS: Great work on the Terraform and Helm setup, I had never used terraform before and setting up the cluster with it was super convenient!

Attachment 1: Changes in values.yaml

diff values.yaml values_beefed_up.yaml
36,37c36,37
<     cluster_size: 2
<     seed_size: 2
---
>     cluster_size: 20
>     seed_size: 5
50,51c50,51
<   master.replicas: 2
<   client.replicas: 2
---
>   master.replicas: 10
>   client.replicas: 10
53c53
<     MINIMUM_MASTER_NODES: "2"
---
>     MINIMUM_MASTER_NODES: "5"
66,67c66,67
<       memory: 2Gi
<       cpu: 1
---
>       memory: 24Gi
>       cpu: 7
69,70c69,70
<       memory: 2Gi
<       cpu: 1
---
>       memory: 24Gi
>       cpu: 7
etiennedi added the Helm label Jan 9, 2019
idcrosby (Contributor) commented Jan 9, 2019

Interesting find @etiennedi, and I appreciate the detailed write-up, it makes the issue clear. I'm trying to reproduce now. One comment: yes, in general we would want to wait for all clusters (Cassandra, Elasticsearch) to be up and running before initiating weaviate. Practically this shouldn't be an issue, as it only matters when first creating a cluster. That being said, I wouldn't expect it to break Weaviate.

A more interesting test would be scaling up the Cassandra cluster, as this would be a more realistic scenario.
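
For what it's worth, a rough way to make sure the data stores are fully up before kicking off the import would be something along these lines (the label selectors are assumptions, not taken from the chart):

# block until all Cassandra pods report ready
kubectl wait --for=condition=ready pod -l app=cassandra --timeout=30m

# same for the Elasticsearch pods
kubectl wait --for=condition=ready pod -l app=elasticsearch --timeout=30m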


My test results (will update this comment as I run more tests)

Large Cassandra values (Cluster Size - 20, Seed Size - 3)
Ran the test script while Cassandra was scaling up, no issues (all 200s)

etiennedi (Member, Author) commented

Thanks a lot for starting to investigate. Your first findings make me quite optimistic that the issue might indeed be due to something else.

idcrosby (Contributor) commented Jan 31, 2019

Update: I ran tests with the following cluster sizes:

 #Cassandra
cassandra:
  deploy: true
  image:
    tag: 3
  config:
    cluster_size: 20
    seed_size: 5
    start_rpc: true
  resources:
    requests:
      memory: 4Gi
      cpu: 2
    limits:
      memory: 4Gi
      cpu: 2

#Elasticsearch
elasticsearch:
  deploy: true
  master:
    replicas: 10
  client:
    replicas: 10
  cluster:
    env: 
      MINIMUM_MASTER_NODES: "5"

#Janus
janusgraph:
...
  replicaCount: 5
  resources:
    requests:
      memory: 10Gi
      cpu: 3
    limits:
      memory: 10Gi
      cpu: 3

Import script with the following args:

"generate", "-t", "128", "-r", "128", "-a", "128", "-v", "1000000", "-c", "1250", "-w", "http://weaviate:80"]

All seemed to go well until weaviate crashed (and wouldn't restart, error below)

2019/01/31 18:53:16 INFO: Sucessfully pinged Gremlin server.
2019/01/31 18:53:16 DEBUG: Initializeing JanusGraph.
2019/01/31 18:53:46 ERROR: Could not initialize connector: Could not initialize the basic Janus schema..
2019/01/31 18:53:46 ERROR: This error needs to be resolved. For more info, check creativesoftwarefdn.org/weaviate. Exiting now...

Attaching logs from validation script @etiennedi
verification-logs.txt

Note: the cluster took a long time to get everything up and running (~30 mins)

etiennedi (Member, Author) commented

That's wonderful news. Thank you!

Since opening the issue, I've pushed quite a few changes to weaviate, so there is a chance that either this was a one-off occurrence during my testing or it has been fixed in the meantime. I'm somewhat leaning towards the latter, since the exact error message I got was related to issues around the key auth, and the "keys" feature has been completely removed since then.

All seemed to go well until weaviate crashed (and wouldn't restart, error below)

This is expected; I think there's an issue for it (#37). However, this is almost fixed: I have a fix on a branch that I will merge shortly, which will remove the configuration from the local file system and move that config to etcd.

I'm a bit surprised that a mere crash was enough to break it; I would have expected that it would take at least a pod recreation. Nevertheless, I'll merge the "horizontal scalability" feature shortly and then all these issues should be gone. This does, however, mean we need to add etcd to our helm chart. I've just opened an issue to include this (#41).

Note: the cluster took a long time get everything up and running (~30 mins)

Did you only start running the verification script after the ~30 minutes, or did you already start while the cluster was still scaling up?
