Failed to check key count on partition #130
I hit the same log. In my environment it may be caused by a membership exception: I deployed 2 pods in a k8s cluster, but they did not form a cluster because of a weird exception. The weird part is that the pods' status showed Running, but one pod was actually broken (I could not kubectl exec into it, although docker exec worked). I added a livenessProbe as a temporary workaround.
The same issue from all members:
Hi all, I'm so sorry for my late response. The conn function:

```go
func (c *Client) conn(addr string) (net.Conn, error) {
	p, err := c.pool(addr)
	if err != nil {
		return nil, err
	}

	var (
		ctx    context.Context
		cancel context.CancelFunc
	)
	if c.config.PoolTimeout > 0 {
		ctx, cancel = context.WithTimeout(context.Background(), c.config.PoolTimeout)
		defer cancel()
	} else {
		ctx = context.Background()
	}

	conn, err := p.Get(ctx)
	if err != nil {
		return nil, err
	}

	if c.config.HasTimeout() {
		// Wrap the net.Conn to implement timeout logic
		conn = NewConnWithTimeout(conn, c.config.ReadTimeout, c.config.WriteTimeout)
	}

	return conn, err
}
```

This function tries to acquire a new connection from the pool for the given address, but it cannot do that. In order to inspect the problem, you may want to:
I think the actual problem is much deeper, because I see the following lines in the logs:
and (from ShawnHsiung)
It seems that the network is unstable in these cases. Could you please give more information about your setup? It's Kubernetes, right?
(See olric/internal/transport/client.go, line 124 in b41f763.)
Hey, I double-checked the suspected parts of the code. Everything looks fine to me. I'm going to try to create load locally and inject errors into the cluster. It'll take some time. Could you guys give some info about your setup and configuration? Your help is appreciated.
Another question: does the cluster work seamlessly despite these errors?
I just released v0.4.1, see https://github.com/buraksezer/olric/releases/tag/v0.4.1. Probably there is nothing wrong with the code, but the network is unstable somehow. Because there is no proper and immediate way to close dead TCP sockets in the pool, sometimes Olric may try to use dead TCP sockets to send messages over the wire. This can explain the unexpected errors you are seeing. I need to know the answers to these questions to debug the problem:
I think increasing maxConn and poolTimeout will just delay the problem, not solve it. I can reproduce the problem with a single instance, so I don't think it's a network instability problem. And yes, I'm on k8s. And
Maybe the network was interrupted, but it comes back eventually and Olric cannot recover (I know this because restarting the cluster fixes the problem). And the cluster cannot work properly while these errors are happening.
There is a configuration option for TCP keep-alives. If you use the YAML file to configure the nodes:
For embedded-member mode:
It is
I think the root cause behind this story is buraksezer/connpool. It somehow fails to remove dead connections from the pool. I checked everything, but I cannot find any problem with it. We could also replace it with jackc/puddle. You may try to reduce the TCP connection keep-alive period. It seems that the Kubernetes networking layer drops connections without the termination phase.
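A minimal sketch of the embedded-member variant, assuming the KeepAlivePeriod field on config.Config (the exact field name and its default should be verified against the release you run):

```go
package main

import (
	"log"
	"time"

	"github.com/buraksezer/olric"
	"github.com/buraksezer/olric/config"
)

func main() {
	// Start from the "lan" preset and shorten the TCP keep-alive period so
	// that dead sockets in the pool are detected and dropped sooner.
	// KeepAlivePeriod is an assumed field name; check your olric version.
	c := config.New("lan")
	c.KeepAlivePeriod = 30 * time.Second

	db, err := olric.New(c)
	if err != nil {
		log.Fatal(err)
	}
	_ = db // call db.Start() in a goroutine, as usual
}
```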
Is there any error message in the logs? I need more details to debug that problem.
I saw this function in the Spiracle repository: https://github.com/LilithGames/spiracle/blob/ed2af92da1d13f1afa11b6a354db33048930655b/infra/db/db.go#L25
If you still use the same function or a similar one, could you please try to set the option referenced here:
olric/internal/transport/server.go, line 283 in 23cd56d
This is how Redis handles keepalives: https://redis.io/topics/clients#tcp-keepalive
OK, I will try this. However, I can get this error with a single embedded instance (all reads and writes are from the same process; the port is exposed only for debugging). Does Olric keep some TCP connections to itself?
Hello, I embedded Olric and I have the same issue. At the moment I'm using v0.4.5. This is my log:
I don't understand why there are 270 partitions, because I just use a couple of maps with just a few keys each. Anyway, what happens here is that during this phase any operation seems to hang until the error appears for partition 270; then the pending operations go on and it starts again from the first partition. But why does Olric need to open a TCP connection to itself? This is my initialization code:

```go
c := olric_conf.New("lan")

ctx, cancel := context.WithCancel(context.Background())
c.Started = func() {
	defer cancel()
}

olricSD := &OlricEmbeddedDiscovery{}
olricSD.olricConfig = c
sd := make(map[string]interface{})
sd["plugin"] = olricSD
c.ServiceDiscovery = sd

var err error
olricDb, err = olric.New(c)
if err != nil {
	log.Fatal(err)
}

go func() {
	err = olricDb.Start()
	if err != nil {
		fmt.Printf("********* Error starting Olric: %v\n", err)
		olricDb.Shutdown(ctx)
		cancel()
	}
}()

<-ctx.Done()

// Bootstrapping fs
stats, err := olricDb.Stats()
for id, m := range stats.ClusterMembers {
	fmt.Printf("%d -> %s\n", id, m.Name)
}
clusterUuid = stats.Member.ID

ownMap, err = olricDb.NewDMap(fmt.Sprintf("cluster-%d", stats.Member.ID))
if err != nil {
	return err
}
ownMap.Put("UUID", GetConfig().UUID)
ownMap.Put("node.uuid", GetConfig().UUID)
ownMap.Put("node.name", GetConfig().Name)
```
Hello,
Please consider upgrading to the latest stable version. The connection pooling dependency has received a few bug fixes.
Olric is a partitioned data store: that mechanism distributes all your data among the cluster members. You can increase or decrease the partition count, but prime numbers are good for equal distribution.
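As a hedged sketch (the PartitionCount field name, and the default of 271 partitions indexed 0 through 270, are my assumptions here; verify them against your release), tuning the partition count on an embedded node looks roughly like this:

```go
package main

import (
	"log"

	"github.com/buraksezer/olric"
	"github.com/buraksezer/olric/config"
)

func main() {
	// PartitionCount is an assumed field name. Pick a prime so the hashing
	// distributes keys evenly across partitions.
	c := config.New("lan")
	c.PartitionCount = 271

	db, err := olric.New(c)
	if err != nil {
		log.Fatal(err)
	}
	_ = db // start the node with db.Start() as usual
}
```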
It is not ideal, I know, but it's completely unimportant. The coordinator node fetches all available nodes from the memberlist instance and sends a request to learn some statistical data from the members, including itself. The payload is quite small. I didn't want to add another code path, in order to keep the complexity low. As I mentioned, the payload is only a few bytes long.
Does Olric run in your local environment or in a Docker network? Something should be different in your environment. I need to know that difference to reproduce the problem on my side. Did you check the previous messages?
For example, do you use any firewall software which blocks Olric? Something should be different in your network.
Sorry, maybe I should have mentioned that the software embedding Olric runs on a Raspberry Pi, and I'm testing it with up to three nodes (most of the time just one). All nodes run on my home Wi-Fi network (I suppose that causes most of the issues when multiple nodes are involved). Unfortunately, the issue usually happens after running for several days (as for the issue's creator), so it's very difficult to track down. To make it harder, it happens even if just one node runs all the time, so it doesn't seem to be directly related to network failures... Anyway, I will wait for the issue to appear again and analyze it more carefully, because I'm dealing with other changes at the moment. I will let you know.
Hello, I see. Olric v0.4.7 has this fix: buraksezer/connpool#4. This may fix the problem.
It looks promising! I will test it in the next few days and let you know 😉
Closing this issue due to inactivity.
v0.4.0, after 7 days of running, shows this error.