
Possible Race when using high number of replicas of Elastic and Cassandra #38

Open
etiennedi opened this issue Jan 9, 2019 · 4 comments

etiennedi (Member) commented Jan 9, 2019

Here's an interesting issue I found while playing around with the Helm chart. The default configuration works fine, but if I increase the number of replicas for C* and Elasticsearch, weaviate stops working as more pods are added. (Note: I don't mean increasing as in "scaling up after initial deployment", I mean modifying the values.yaml before I've ever installed the helm chart!)

I haven't dug into this more deeply yet, but I have a couple of suspicions about what could be going wrong. First of all, here is what I did (i.e. steps to reproduce):

How to reproduce, i.e. what I've done and noticed

  1. I randomly increased the replica numbers for Cassandra and Elasticsearch in the values.yaml. Admittedly, the numbers weren't chosen with any particular logic, just random increases, so I would not rule out that my values simply don't make sense (see below).
  2. Install the helm chart as usual (see the sketch after this list).
  3. I ran a simple import script. The script was purpose-built for what I was trying to test, but the regular verification script should work just as well. It's only important that it runs for some time (at least 2-3 minutes, the more the better).
  4. Both Elasticsearch and Cassandra scale up very slowly. One pod is added at a time, only once the previous pod has been marked as ready. This means weaviate will already be ready while the cluster is still being built (this is why I'm suspecting a potential race).
  5. As more and more nodes are added, weaviate suddenly responds with 401, claiming that the used key does not exist. This means it could not retrieve the key from the db.
  6. My suspicion is that as more and more nodes are added, data is not being replicated. Maybe (this is a big maybe) Janus needs something like a QUORUM from Cassandra. This could explain why the default 2-node cluster is fine, but as it grows bigger (with possibly empty nodes) a QUORUM can no longer be reached.
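
For reference, the whole reproduction roughly boils down to the commands below (chart path, release name and the import invocation are placeholders, not the exact ones I used):

# install with the increased replica counts (values_beefed_up.yaml from Attachment 1 below)
helm install --name weaviate -f values_beefed_up.yaml ./weaviate-helm-chart

# watch the Cassandra/Elasticsearch pods come up one at a time
kubectl get pods -w

# start importing while the cluster is still growing; any script that keeps
# reading and writing for a few minutes will do (placeholder invocation)
./import-script -w http://weaviate:80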

What we should do to investigate this more

  1. Not scale up both Elasticsearch and Cassandra at the same time, but rather one at a time (e.g. via overrides as sketched below). This way we can figure out which one is the culprit.
  2. Confirm the suspicions above; with so many variables this could also be due to something else entirely.
  3. Investigate if there actually is a race and find ways to mitigate it.
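
For point 1, a rough sketch of how the two runs could be isolated with value overrides (release names and the chart path are placeholders, and the exact value paths depend on the chart's values.yaml structure, so treat them as assumptions):

# run 1: scale up only Cassandra, leave Elasticsearch at its defaults
helm install --name weaviate-cass-only \
  --set cassandra.config.cluster_size=20 \
  --set cassandra.config.seed_size=5 \
  ./weaviate-helm-chart

# run 2: scale up only Elasticsearch, leave Cassandra at its defaults
helm install --name weaviate-es-only \
  --set elasticsearch.master.replicas=10 \
  --set elasticsearch.client.replicas=10 \
  ./weaviate-helm-chart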

Open Questions:

@idcrosby Possibly you know more about these questions of mine

  1. How does the Cassandra helm chart work? Let's say I request 20 nodes. Because of the gradual startup of the nodes, this takes about 30-60min. While the nodes are still being built, what happens with data? Is there anything running to replicate data between nodes? Or is it possibly even a requirement to not add data until the cluster is considered "stable"? (The nodetool sketch below might help answer this.)
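
One way to watch what Cassandra is actually doing while the nodes come up, assuming the usual StatefulSet pod naming (so "cassandra-0" is an assumption):

# which nodes have joined the ring (UN = up/normal, UJ = up/joining) and how much data each owns
kubectl exec -it cassandra-0 -- nodetool status

# data currently being streamed to a bootstrapping node
kubectl exec -it cassandra-0 -- nodetool netstats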

Further actions

@idcrosby Sorry for this long mess of text; if a video would make this easier to understand, please let me know and I'll record one. Nevertheless, some input based on your experience would help tremendously, especially as we would like to demo weaviate's ability to handle large datasets to potential customers shortly.

PS: Great work on the Terraform and Helm setup, I had never used terraform before and setting up the cluster with it was super convenient!

Attachment 1: Changes in values.yaml

diff values.yaml values_beefed_up.yaml
36,37c36,37
<     cluster_size: 2
<     seed_size: 2
---
>     cluster_size: 20
>     seed_size: 5
50,51c50,51
<   master.replicas: 2
<   client.replicas: 2
---
>   master.replicas: 10
>   client.replicas: 10
53c53
<     MINIMUM_MASTER_NODES: "2"
---
>     MINIMUM_MASTER_NODES: "5"
66,67c66,67
<       memory: 2Gi
<       cpu: 1
---
>       memory: 24Gi
>       cpu: 7
69,70c69,70
<       memory: 2Gi
<       cpu: 1
---
>       memory: 24Gi
>       cpu: 7
etiennedi added the Helm label Jan 9, 2019
idcrosby (Contributor) commented Jan 9, 2019

Interesting find @etiennedi, and I appreciate the detailed write-up, it makes the issue clear. I'm trying to reproduce now. One comment: yes, in general we would want to wait for all clusters (Cassandra, Elasticsearch) to be up and running before initiating weaviate. Practically this shouldn't be an issue, as it only matters when first creating a cluster. That being said, I wouldn't expect it to break Weaviate.

A more interesting test would be scaling up the Cassandra cluster, as this would be a more realistic scenario.
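
For what it's worth, a rough way to make sure the data stores are fully up before kicking off the import would be something along these lines (the label selectors are assumptions, not taken from the chart):

# block until all Cassandra pods report ready
kubectl wait --for=condition=ready pod -l app=cassandra --timeout=30m

# same for the Elasticsearch pods
kubectl wait --for=condition=ready pod -l app=elasticsearch --timeout=30m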


My test results (will update this comment as I run more tests)

Large Cassandra values (Cluster Size - 20, Seed Size - 3)
Ran the test script while Cassandra was scaling up, no issues (all 200s)

etiennedi (Member, Author) commented

Thanks a lot for starting to investigate. Your first findings make me quite optimistic that the issue might indeed be due to something else.

idcrosby (Contributor) commented Jan 31, 2019

Update: I ran tests with the following cluster sizes:

 #Cassandra
cassandra:
  deploy: true
  image:
    tag: 3
  config:
    cluster_size: 20
    seed_size: 5
    start_rpc: true
  resources:
    requests:
      memory: 4Gi
      cpu: 2
    limits:
      memory: 4Gi
      cpu: 2

#Elasticsearch
elasticsearch:
  deploy: true
  master:
    replicas: 10
  client:
    replicas: 10
  cluster:
    env: 
      MINIMUM_MASTER_NODES: "5"

#Janus
janusgraph:
...
  replicaCount: 5
  resources:
    requests:
      memory: 10Gi
      cpu: 3
    limits:
      memory: 10Gi
      cpu: 3

Import script with the following args:

"generate", "-t", "128", "-r", "128", "-a", "128", "-v", "1000000", "-c", "1250", "-w", "http://weaviate:80"]

All seemed to go well until weaviate crashed (and wouldn't restart, error below)

2019/01/31 18:53:16 INFO: Sucessfully pinged Gremlin server.
2019/01/31 18:53:16 DEBUG: Initializeing JanusGraph.
2019/01/31 18:53:46 ERROR: Could not initialize connector: Could not initialize the basic Janus schema..
2019/01/31 18:53:46 ERROR: This error needs to be resolved. For more info, check creativesoftwarefdn.org/weaviate. Exiting now...

Attaching logs from validation script @etiennedi
verification-logs.txt

Note: the cluster took a long time to get everything up and running (~30 mins)

etiennedi (Member, Author) commented

That's wonderful news. Thank you!

Since opening the issue, I've pushed quite a few changes to weaviate, so there is a chance that either this was a one-off occurrence during my testing or it has been fixed in the meantime. I'm somewhat leaning towards the latter, since the exact error message I got was related to issues around the key auth, and the "keys" feature has been completely removed since then.

All seemed to go well until weaviate crashed (and wouldn't restart, error below)

This is expected; I think there's an issue for it (#37). However, this is almost fixed: I have a fix on a branch that I will merge shortly, which will remove the configuration from the local file system and move that config to etcd.

I'm a bit surprised that a mere crash was enough to break it; I would have expected that it would take at least a pod recreation. Nevertheless, I'll merge the "horizontal scalability" feature shortly and then all these issues should be gone. This does, however, mean we need to add etcd to our helm chart. I've just opened an issue to include this (#41).

Note: the cluster took a long time get everything up and running (~30 mins)

Did you only start running the verification script after the ~30 minutes, or did you already start while the cluster was still scaling up?
