Failed to check key count on partition #130

Closed · hulucc opened this issue Nov 30, 2021 · 18 comments
Labels: question (Further information is requested)

@hulucc commented Nov 30, 2021

[screenshot attached]
v0.4.0, after 7 days of running, it shows this error.

@buraksezer buraksezer self-assigned this Dec 1, 2021
@buraksezer buraksezer added the question Further information is requested label Dec 1, 2021
@ShawnHsiung (Contributor) commented Dec 5, 2021

Failed to check key count on partition: 254: read tcp 10.233.85.149:48176->10.233.85.149:3320: i/o timeout => distribute.go:65

Failed to check key count on partition: 160: context deadline exceeded => distribute.go:65

I hit the same log; it may be caused by a membership exception.

In my environment, I deployed 2 pods in a k8s cluster, but they did not form a cluster because of a weird exception: the pod status showed Running, but one pod was actually broken (I could not kubectl exec into it, though docker exec worked). I added a livenessProbe as a temporary workaround.

@daniel-yavorovich:

The same issue from all members:

rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:45 [ERROR] Failed to update routing table on 10.31.2.49:3320: EOF => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on 10.31.2.130:3320: context deadline exceeded => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on 10.31.2.20:3320: context deadline exceeded => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on 10.31.2.122:3320: context deadline exceeded => update.go:58
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on cluster: EOF => routingtable.go:238
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:49 [INFO] Routing table has been pushed by 10.31.5.66:3320 => operations.go:87

The problem reproduces only under production traffic, so it is difficult to catch with synthetic tests.

@buraksezer (Owner):
Hi all,

I'm so sorry for my late response. The "context deadline exceeded" error is raised by the following function.

func (c *Client) conn(addr string) (net.Conn, error) {
	p, err := c.pool(addr)
	if err != nil {
		return nil, err
	}

	var (
		ctx    context.Context
		cancel context.CancelFunc
	)
	if c.config.PoolTimeout > 0 {
		ctx, cancel = context.WithTimeout(context.Background(), c.config.PoolTimeout)
		defer cancel()
	} else {
		ctx = context.Background()
	}

	conn, err := p.Get(ctx)
	if err != nil {
		return nil, err
	}

	if c.config.HasTimeout() {
		// Wrap the net.Conn to implement timeout logic
		conn = NewConnWithTimeout(conn, c.config.ReadTimeout, c.config.WriteTimeout)
	}
	return conn, err
}

This function tries to acquire a new connection from the pool for the given address, but it cannot do that. In order to inspect the problem, you may want to:

  • Disable PoolTimeout,
  • Increase the number of allowed connections in the pool (a sketch of both knobs follows below).
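
For illustration, a minimal embedded-member sketch of both knobs, assuming they are exposed as the PoolTimeout and MaxConn fields of config.Client (the Go counterparts of the poolTimeout/maxConn YAML directives); please check the exact field names in config/client.go for your version:

package main

import (
	"fmt"

	"github.com/buraksezer/olric/config"
)

func main() {
	c := config.New("lan")
	if c.Client == nil {
		// Assumed type: config.Client from config/client.go.
		c.Client = &config.Client{}
	}
	c.Client.PoolTimeout = 0 // zero skips the context.WithTimeout branch in conn() above
	c.Client.MaxConn = 200   // illustrative value: allow more pooled connections per address
	fmt.Printf("maxConn=%d poolTimeout=%s\n", c.Client.MaxConn, c.Client.PoolTimeout)
}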

I think the actual problem is much deeper, because I see the following lines in the logs:

rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:45 [ERROR] Failed to update routing table on 10.31.2.49:3320: EOF => update.go:58
...
rpc-proxy-cacher-6c5d77d85-fp5fw rpc-proxy-cacher 2021/12/13 21:15:48 [ERROR] Failed to update routing table on cluster: EOF => routingtable.go:238

and (from ShawnHsiung)

Failed to check key count on partition: 254: read tcp 10.233.85.149:48176->10.233.85.149:3320: i/o timeout => distribute.go:65

It seems that the network is unstable in these cases. Could you please give more information about your setup? It's Kubernetes, right?


@buraksezer (Owner):
Hey,

I double-checked the suspected parts of the code. Everything looks fine to me. I'm going to try to create load locally and inject errors into the cluster. It'll take some time. Could you give some info about your setup and configuration? Could you please share the maxConn and poolTimeout directives from your config? How many members do you have in the cluster?

Your help is appreciated.

@buraksezer (Owner) commented Dec 20, 2021

Another question: does the cluster work seamlessly despite these errors?

buraksezer added a commit that referenced this issue Dec 20, 2021
@buraksezer (Owner):
I just released v0.4.1. It improves error messages. I hope this version will increase our understanding of this problem.

See https://github.com/buraksezer/olric/releases/tag/v0.4.1

Probably there is nothing wrong with the code, but the network is unstable somehow. Because there is no proper and immediate way to close dead TCP sockets in the pool, Olric may sometimes try to use a dead TCP socket to send messages over the wire. This can explain the unexpected EOF or broken pipe kinds of error messages in the logs. Olric is designed to overcome this kind of problem. For example, the cluster coordinator constantly inspects key counts on the partitions. This check may fail occasionally; that error should be no problem at all.

I need to know the answers to these questions to debug the problem:

  • Does the network between pods/containers/virtual machines function properly? How do you know?
  • Does the cluster continue to function properly despite these errors?

@hulucc (Author) commented Jan 4, 2022

> Hey,
>
> I double-checked the suspected parts of the code. Everything looks fine to me. I'm going to try to create load locally and inject errors into the cluster. It'll take some time. Could you give some info about your setup and configuration? Could you please share the maxConn and poolTimeout directives from your config? How many members do you have in the cluster?
>
> Your help is appreciated.

I think increasing maxConn and poolTimeout will just delay the problem, not solve it. I can reproduce the problem with a single instance, so I don't think it's a network instability problem. And yes, I'm on k8s. And v0.4.0-beta.2 was OK.

@hulucc (Author) commented Jan 4, 2022

> I just released v0.4.1. It improves error messages. I hope this version will increase our understanding of this problem.
>
> See https://github.com/buraksezer/olric/releases/tag/v0.4.1
>
> Probably there is nothing wrong with the code, but the network is unstable somehow. Because there is no proper and immediate way to close dead TCP sockets in the pool, Olric may sometimes try to use a dead TCP socket to send messages over the wire. This can explain the unexpected EOF or broken pipe kinds of error messages in the logs. Olric is designed to overcome this kind of problem. For example, the cluster coordinator constantly inspects key counts on the partitions. This check may fail occasionally; that error should be no problem at all.
>
> I need to know the answers to these questions to debug the problem:
>
>   • Does the network between pods/containers/virtual machines function properly? How do you know?
>   • Does the cluster continue to function properly despite these errors?

The network may have been interrupted, but it is OK eventually and Olric cannot recover (I know this because restarting the cluster fixes the problem). And the cluster cannot work properly while these errors are present.

@buraksezer (Owner):

@hulucc

There is a configuration option for TCP KeepAlives.

If you use the YAML file to configure the nodes:
https://github.com/buraksezer/olric/blob/v0.4.2/cmd/olricd/olricd.yaml#L15

For embedded-member mode:
https://github.com/buraksezer/olric/blob/v0.4.2/config/client.go#L55

It is 300s in the sample configuration file but there is no default value currently.

I think the root cause behind this story is buraksezer/connpool. It somehow fails to remove dead connections from the pool. I checked everything, but I cannot find any problem with it. We can also replace it with jackc/puddle.

You may try to reduce the TCP keep-alive period. It seems that the Kubernetes networking layer drops connections without a proper termination phase.

> And the cluster cannot work properly while these errors are present.

Failed to check key count on partition... is an unimportant error message. The coordinator node constantly tries to inspect the key count on the partitions. If one of the requests fails, it will try again shortly after.

Is there any error message in the logs? I need more details to debug that problem.

@buraksezer (Owner):
I saw this function in the Spiracle repository: https://github.com/LilithGames/spiracle/blob/ed2af92da1d13f1afa11b6a354db33048930655b/infra/db/db.go#L25

If you still use the same function or a similar one, config.Config.KeepAlivePeriod is zero. That means TCP keepalive is disabled by default. I think we need to set a default value.

Could you please try to set 300s or lower? This is how it works in the server code:

if s.config.KeepAlivePeriod != 0 {

This is how Redis handles keepalives: https://redis.io/topics/clients#tcp-keepalive
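
For illustration, a minimal embedded-member sketch of enabling it, assuming the config.Config.KeepAlivePeriod field keeps that name and type (time.Duration) in your version:

package main

import (
	"time"

	"github.com/buraksezer/olric/config"
)

func main() {
	c := config.New("lan")
	// Zero leaves TCP keepalive disabled; 300s mirrors the sample olricd.yaml value.
	c.KeepAlivePeriod = 300 * time.Second
	_ = c // hand this config to olric.New(c) in embedded-member mode
}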

@hulucc (Author) commented Jan 6, 2022

> I saw this function in the Spiracle repository: https://github.com/LilithGames/spiracle/blob/ed2af92da1d13f1afa11b6a354db33048930655b/infra/db/db.go#L25
>
> If you still use the same function or a similar one, config.Config.KeepAlivePeriod is zero. That means TCP keepalive is disabled by default. I think we need to set a default value.
>
> Could you please try to set 300s or lower? This is how it works in the server code:
>
> if s.config.KeepAlivePeriod != 0 {
>
> This is how Redis handles keepalives: https://redis.io/topics/clients#tcp-keepalive

Ok, I will try this. However, I can get this error with a single embedded instance (all reads and writes are from the same process; the port is exposed only for debugging). Does Olric keep some TCP connections to itself?

@ivano85 commented Sep 29, 2022

Hello, I embedded Olric and I have the same issue. At the moment I'm using v0.4.5. This is my log:

2022/09/29 23:09:14 [ERROR] Failed to check key count on partition: 269: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => distribute.go:65
2022/09/29 23:09:17 [ERROR] Failed to check key count on partition: 270: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => distribute.go:65
2022/09/29 23:09:20 [ERROR] Failed to update routing table on 192.168.1.2:3320: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => update.go:58
2022/09/29 23:09:20 [ERROR] Failed to update routing table on cluster: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => routingtable.go:238

I don't understand why there are 270 partitions, because I just use a couple of maps with just a few keys each. Anyway, what happens here is that during this phase any operation seems to hang until the error appears for partition 270; then the pending operations go on and it starts again from the first partition.

But why does Olric need to open a TCP connection to itself?

This is my initialization code:

c := olric_conf.New("lan")
ctx, cancel := context.WithCancel(context.Background())
c.Started = func() {
    defer cancel()
}

olricSD := &OlricEmbeddedDiscovery{}
olricSD.olricConfig = c
sd := make(map[string]interface{})
sd["plugin"] = olricSD
c.ServiceDiscovery = sd

var err error
olricDb, err = olric.New(c)
if err != nil {
    log.Fatal(err)
}

go func() {
    err = olricDb.Start()
    if err != nil {
        fmt.Printf("********* Error starting Olric: %v\n", err)
        olricDb.Shutdown(ctx)
        cancel()
    }
}()

<-ctx.Done()

// Bootstrapping fs
stats, err := olricDb.Stats()
for id, m := range stats.ClusterMembers {
    fmt.Printf("%d -> %s\n", id, m.Name)
}

clusterUuid = stats.Member.ID

ownMap, err = olricDb.NewDMap(fmt.Sprintf("cluster-%d", stats.Member.ID))
if err != nil {
    return err
}
ownMap.Put("UUID", GetConfig().UUID)
ownMap.Put("node.uuid", GetConfig().UUID)
ownMap.Put("node.name", GetConfig().Name)

@buraksezer (Owner):
Hello,

> At the moment I'm using v0.4.5.

Please consider upgrading to the latest stable version. Connection pooling dependency has received a few bug fixes.

> I don't understand why there are 270 partitions because I just use a few maps with just a few keys each.

Olric is a partitioned data store. That mechanism distributes all your data among the cluster members. You can increase or decrease partition numbers, but prime numbers are good for equal distribution.
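
As an aside, the default partition count in v0.4.x appears to be 271 (a prime), which would explain the log lines counting up to partition 270 (partitions are zero-indexed). A hedged sketch of tuning it in embedded-member mode, assuming the field is config.Config.PartitionCount:

package main

import "github.com/buraksezer/olric/config"

func main() {
	c := config.New("lan")
	// Illustrative value only: a smaller prime for a tiny, mostly single-node setup.
	c.PartitionCount = 71
	_ = c // hand this config to olric.New(c)
}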

> But why does Olric need to open a TCP connection to itself?

It is not ideal, I know, but it's harmless. The coordinator node fetches all available nodes from the memberlist instance and sends a request to learn some statistical data from the members. The payload is quite small. I didn't want to add another code path, to keep the complexity low. As I mentioned, the payload is only a few bytes long.

> 2022/09/29 23:09:14 [ERROR] Failed to check key count on partition: 269: failed to acquire a TCP connection from the pool for: 192.168.1.2:3320 timeout exceeded => distribute.go:65

Does Olric run in your local environment or on a Docker network? Something must be different in your environment. I need to know that difference to reproduce the problem on my side.

Did you check the previous messages?

@buraksezer (Owner):
For example, do you use any firewall software which blocks Olric? Something should be different in your network.

@ivano85 commented Sep 29, 2022

Sorry, maybe I should have mentioned that the software embedding Olric runs on a Raspberry Pi, and I'm testing it with up to three nodes (most of the time just one). All nodes run on my home Wi-Fi network (I suppose that causes most of the issues when multiple nodes are involved).

Unfortunately, the issue usually happens only after running for several days (as for the issue's creator), so it's very difficult to understand. To make it harder, it happens even if just one node runs all the time, so it doesn't seem to be directly related to network failures...

Anyway, I will wait for the issue to appear again to analyze it more accurately, because I'm dealing with other changes at the moment. I will let you know.

@buraksezer (Owner):
Hello, I see. Olric v0.4.7 has this fix: buraksezer/connpool#4. This may fix the problem.

@ivano85 commented Sep 29, 2022

It looks promising! I will test it in the next few days and let you know 😉

@buraksezer (Owner):
Closing this issue due to inactivity.
