-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uptake TCP keep-alive support for go-couchbase #859
Comments
See https://issues.couchbase.com/browse/CBSE-1762 for more details. |
Build based on this feature branch was shared with Semgroup for testing (see https://issues.couchbase.com/browse/CBSE-1762#comment-125665). Waiting for feedback on whether it addresses their issue. |
Tested successfully by Semgroup - needs to be ported from master to 1.1.0. Additional work required to pick up latest go-couchbase on 1.1.0, as the 1.1.0 branch doesn't yet include the refactoring of couchbaselabs/go-couchbase to couchbase/go-couchbase. |
I think we may have made a big mistake on the default values. On the go-couchbase side:
This will have the effect of calling Setsockopt with the syscall.TCP_KEEPINTVL flag as seen in tcpsockopt_unix.go The problem though, is that we are now increasing a kernel default from 75 seconds to 30 minutes! Here's why I think 75 seconds is the kernel default: I ssh'd into a box on our perf cluster and ran:
That's also the same default value mentioned in the TCP-Keepalive-HOWTO (search the page for What are the ramifications of changing TCPKeepaliveInterval from 75 seconds to 30 minutes? If you read this section of the TCP-Keepalive-HOWTO:
So basically with the default you get: 2 hours + (75 seconds * 9) == 2 hours and 11 minutes before a dead peer connection is reaped. With our recent changes, you get: 2 hours + (30 minutes * 9) == 6.5 hours before a dead peer connection is reaped. I think we should do the following changes:
There are three TCPKeepalive parameters that can be tuned:
From what I can tell, Go only lets you tune the tcp_keepalive_intvl parameter. In other words, it severely limits the amount of tuning you can do, since |
Correction, after chatting with @adamcfraser, Go is letting you tune these parameters:
With the go-couchbase default, it will still end up making the overall expiration time longer:
Actually it's not that simple, since the Go implementation is using the same value for both the tcp_keepalive_time and tcp_keepalive_intvl values. It's a bit confusing because there are two sets of terminology -- one for Setsockopt and one for the kernel parameters. Here's the mapping from Setsockopt -> kernel:
I do think it's probably not a good thing that the default go-couchbase setting more than doubles the amount of time for dead peer connections to get reaped. I think the default should be changed to 20 minutes, which will keep the total time approximately the same as the Linux kernel defaults, albeit with slightly different behavior:
|
Closing this in favor of couchbase/go-couchbase#48 |
Hey Traun. See : http://felixge.de/2014/08/26/tcp-keepalive-with-golang.html Go allows you to enable TCP keepalive using net.TCPConn's SetKeepAlive. On OSX and Linux this will cause up to 8 TCP keepalive probes to be sent at an interval of 75 seconds after a connection has been idle for 2 hours. Or in other words, Read will return an io.EOF error after 2 hours and 10 minutes (7200 + 8 * 75). Depending on your application, that may be too long of a timeout. In this case you can call SetKeepAlivePeriod. However, this method currently behaves different for different operating systems. On OSX, it will modify the idle time before probes are being sent. On Linux however, it will modify both the idle time, as well as the interval that probes are sent at. So calling SetKeepAlivePeriod with an argument of 30 seconds will cause a total timeout of 10 minutes and 30 seconds for OSX (30 + 8 * 75), but 4 minutes and 30 seconds on Linux (30 + 8 * 30). |
I think this is where there's a misunderstanding, because AFAIK our default SetKeepAlivePeriod appears to be 30 minutes, not seconds. From pools.go:
So on linux, that becomes (30mins + (8 * 30mins)) or (1800secs + (8 * 1800secs)) which equals 4.5 hours. |
Yeah, thats why the idea was to make the keep-alive time configurable. You probably might need to add some OS specific code in your application to ensure that the keepalive timeouts are set correctly. go-couchbase exports a SetTcpKeepalive function using which you can change the value for the keepalive period. couchbase.SetTcpKeepalive(true, 30) |
Regardless of being configurable, it should have sensible defaults, since the majority of the users will be using the defaults. Linux users will probably be caught off guard by the fact that it's being set to 4.5 hours, since that's quite a long time for dead peer sockets to be hanging around. (2 hours is already quite a long time) I think the real bug here is in the way that Go has inconsistently treated the TCPKeepAlive parameters on the different OS'es. |
Related golang-nuts thread: https://groups.google.com/d/msg/golang-dev/W1wRW3V5QrE/wTL0kBmXJDYJ |
Agreed. What do you think should be the default value ? |
I didn't realize that OSX behaved so differently, so it's actually really hard to come up with a sensible default for all platforms. At this point I am just throwing my arms up in despair. Oh well, at least we learned something! |
I guess we can add some platform specific code to set the defaults. if platform == "OSX" |
Good catch! I think those values would make it too short on Linux however. I would try to keep it as close to the kernel defaults as possible. So for Linux, I would suggest using 720 seconds: couchbase.SetTcpKeepalive(true, 720) 720 secs + (720 secs * 9) == 2 hours (from what I can tell, at least on Linux the default |
Cool ! I'll make this change |
Manik has added support for enabling TCP keep-alive for go-memcached connections in go-couchbase. Need to uptake this change on the Sync Gateway side. @zgramana - any preference on whether we enable by default, or add as a config parameter?
The text was updated successfully, but these errors were encountered: