Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go-nats panic #367

Closed
shengery opened this issue May 30, 2018 · 12 comments
Closed

go-nats panic #367

shengery opened this issue May 30, 2018 · 12 comments
Assignees

Comments

@shengery
Copy link

Versions of gnatsd and affected client libraries used:

gnatsd v1.0.4 download from github release binary
client library version 1.3.1

OS/Container environment:
CentOS 6.8 64Bit bare VM

Steps or code to reproduce the issue:
gnatsd is run with single node
The go *nats.Conn is used in multiple goroutines, each of which may publish message and pushsrv subscribe 6 topics with the same connection.
When there are a lot of concurrent users login into the service , the pushsrv panic with the below message :

panic: sync: negative WaitGroup counter

goroutine 9291 [running]:
sync.(*WaitGroup).Add(0xc42121a530, 0xffffffffffffffff)
/usr/local/go/src/sync/waitgroup.go:75 +0x134
sync.(*WaitGroup).Done(0xc42121a530)
/usr/local/go/src/sync/waitgroup.go:100 +0x34
pushsrv/vendor/github.com/nats-io/go-nats.(*Conn).readLoop(0xc4200b6a00, 0xc42121a530)
/data/home/go_workspace/src/pushsrv/vendor/github.com/nats-io/go-nats/nats.go:1554 +0x238
created by pushsrv/vendor/github.com/nats-io/go-nats.(*Conn).spinUpGoRoutines
/data/home/go_workspace/src/pushsrv/vendor/github.com/nats-io/go-nats/nats.go:979 +0xb1

@kozlovic
Copy link
Member

I had fixed a similar issue in 1.3.0, so I will have another look. In the meantime, could you check for me the go client details? There is no 1.3.1, and I looked at few of the releases and the line numbers don't match any.
Again, I am going to try to write a test that can reproduce, but would love some clarification on the client version you are using. Alternatively, I see that you are vendoring: have you tried to update your vendor go-nats directory with the latest release?

@shengery
Copy link
Author

Thanks kozlovic,
There is not a release version named 1.3.1 really in the release category. I clone the client library via go get command in the 2017, I found there is a constant in value Version in the nats.go

package nats

import (
	"bufio"
	"bytes"
	"crypto/tls"
	"crypto/x509"
	"encoding/json"
	"errors"
	"fmt"
	"io/ioutil"
	"math/rand"
	"net"
	"net/url"
	"regexp"
	"runtime"
	"strconv"
	"strings"
	"sync"
	"time"

	"github.com/nats-io/go-nats/util"
	"github.com/nats-io/nuid"
)

// Default Constants
const (
	Version                 = "1.3.1"
	DefaultURL              = "nats://localhost:4222"
	DefaultPort             = 4222
	DefaultMaxReconnect     = 60
	DefaultReconnectWait    = 2 * time.Second
	DefaultTimeout          = 2 * time.Second
	DefaultPingInterval     = 2 * time.Minute
	DefaultMaxPingOut       = 2
	DefaultMaxChanLen       = 8192            // 8k
	DefaultReconnectBufSize = 8 * 1024 * 1024 // 8MB
	RequestChanLen          = 8
	LangString              = "go"
)

// STALE_CONNECTION is for detection and proper handling of stale connections.
const STALE_CONNECTION = "stale connection"
...

Later last night when we were running stress testing on another imsrv, which is using the same copy of the nats client, there was an other panic stack info

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x8763a1]

goroutine 62661 [running]:
imsrv/vendor/github.com/nats-io/go-nats.(*Conn).parse(0xc4200bea00, 0xc42591e000, 0x8000, 0x8000, 0x8000, 0x0)
 /data/home/user_00/go_workspace/src/imsrv/vendor/github.com/nats-io/go-nats/parser.go:278 +0xe61
imsrv/vendor/github.com/nats-io/go-nats.(*Conn).readLoop(0xc4200bea00, 0xc423975e10)
 /data/home/user_00/go_workspace/src/imsrv/vendor/github.com/nats-io/go-nats/nats.go:1545 +0x1d0
created by imsrv/vendor/github.com/nats-io/go-nats.(*Conn).spinUpGoRoutines
 /data/home/user_00/go_workspace/src/imsrv/vendor/github.com/nats-io/go-nats/nats.go:979 +0xb1

@derekcollison
Copy link
Member

We will look into it for sure, but we do recommend using a formal release. If possible use the latest release found here: https://github.com/nats-io/go-nats/releases/tag/v1.5.0

@kozlovic
Copy link
Member

@shengery These are really strange crashes. Would you mind showing how you create the NATS connection and give us as much information as possible as how you use it? Thanks!

@shengery
Copy link
Author

shengery commented Jun 1, 2018

@kozlovic sure ,all the nats library is use in a type Proxy , we set the peerOnMsg and mqOnMsg handle in the New function and the connection is created in Proxy.Register method, the code is below:

func (proxy *Proxy) Register(prefix, selfName string, port, httpPort int) (err error) {
        //selfName meas the server name of application, ex: imsrv or pushsrv
	proxy.selfSrvName = selfName
	localIP, err := proxy.GetLocalAddr()
	if err != nil {
		log.Printf("proxy: register self fail, srvname:%s, error:%s\n", selfName, err.Error())
		return
	}
	selfAddr := fmt.Sprintf("%s:%d", localIP, port)
        //PeerKey is used when the same group of server, like pushsrv to pushsrv, it's a sync action, via Request method .
	proxy.mqPeerKey = proxy.getPeerSubject(selfAddr)
        //MsgKey is used for message route between different server groups such as pushsrv and imsrv
	proxy.mqMsgKey = proxy.getMsgSubject(selfAddr)
	httpAddr := ""
	if httpPort != 0 {
		httpAddr = fmt.Sprintf("%s:%d", localIP, httpPort)
	}
	proxy.selfAddr = selfAddr
	proxy.selfKey = path.Clean(prefix + "/" + selfName + "/" + proxy.selfAddr)
	d := etcdData{
		Addr:     selfAddr,
		Status:   0,
		HTTPAddr: httpAddr,
	}
	bts, _ := json.Marshal(&d)
	err = proxy.etcdClient.Put(proxy.selfKey, string(bts))
	if err != nil {
		log.Printf("proxy: put etcd fail, key:%s, error:%s\n", proxy.selfKey, err.Error())
		err = terror.New(terror.ErrCodeNetwork, err.Error())
		return
	}
	enableMQ := false
	//create the connection here !
	if proxy.natsOpt.natsURL != "" {
		if proxy.natsCli, err = nats.Connect(proxy.natsOpt.natsURL); err != nil {
			log.Printf("proxy: fail to connect mq(nats):%s\n", err.Error())
			err = terror.New(terror.ErrCodeNetwork, err.Error())
			return
		}
		enableMQ = true
	}
	if enableMQ && proxy.natsOpt.enablePeer {
		if proxy.natsOpt.peerOnMsg == nil {
			log.Println("proxy: mq address set but peerOnMsg is nil")
			err = terror.New(terror.ErrCodeBadParam, "peerOnMsg is nil")
			return
		}
		onMsg := proxy.natsOpt.peerOnMsg
		cli := proxy.natsCli
		cli.Subscribe(proxy.mqPeerKey, func(m *nats.Msg) {
			data := onMsg(m.Data)
			cli.Publish(m.Reply, data)
			cli.Flush()
		})
		if err = proxy.ScanDest(prefix, selfName, 100, LoadBalanceHash, false); err != nil {
			log.Printf("proxy : scan dest %s error:%s\n", selfName, err.Error())
			return
		}
	}

	if enableMQ && proxy.natsOpt.onMsg != nil {
		onmsg := proxy.natsOpt.onMsg
		proxy.natsCli.Subscribe(proxy.mqMsgKey, func(m *nats.Msg) {
			data := m.Data
			from := m.Reply
			onmsg(data, from)
		})
		go func() {
                        // the flusher calls nats.Flush() periodically
			proxy.flusher()
		}()
	}
	return
}

The OnMsg handler receive the msg and it's reply key , from which we can know where the message from(the value is mqMsgKey or mqPeerKey) and how we can reply to it.

@kozlovic
Copy link
Member

kozlovic commented Jun 1, 2018

Thanks. Need to dig more into that, but still very confused with the stack. For instance, from your original report, we see that:

goroutine 9291 [running]:
sync.(*WaitGroup).Add(0xc42121a530, 0xffffffffffffffff)
/usr/local/go/src/sync/waitgroup.go:75 +0x134
sync.(*WaitGroup).Done(0xc42121a530)
/usr/local/go/src/sync/waitgroup.go:100 +0x34
pushsrv/vendor/github.com/nats-io/go-nats.(*Conn).readLoop(0xc4200b6a00, 0xc42121a530)
/data/home/go_workspace/src/pushsrv/vendor/github.com/nats-io/go-nats/nats.go:1554 +0x238

Yet, you can see that from the 1.3.0 tag, this line has nothing to do with the report:
nats.go line 1554

The line should point to something like this.

@shengery
Copy link
Author

shengery commented Jun 5, 2018

hi @kozlovic , the commit in my vendor repository should be at the commit a2b049b
I put the client library via go get command , not directly download the release version .

@shengery
Copy link
Author

shengery commented Jun 8, 2018

hi @kozlovic , is there any further investigation ?

@kozlovic
Copy link
Member

Sorry for the delay, I was on vacation. I may refactor a bit the code around that for a fix to #368, but not sure what the problem for this issue is since the rare panic due to waitgroup was already fixed (or at least an attempt to) in 1.3.0.
Have you had a chance to try yourself with the latest release?

@shengery
Copy link
Author

@kozlovic thanks for your reply
We are currently using the latest release version , the problem has not shown yet.

@kozlovic
Copy link
Member

@shengery Thanks for the update. In PR #370, I have fixed an unrelated issue but did again change the way we use the wait group. If you can, it would be nice that you run your tests with the latest from master. Thanks!

@kozlovic
Copy link
Member

kozlovic commented Aug 7, 2018

Closing for now.. we will re-open if you have some updates.

@kozlovic kozlovic closed this as completed Aug 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants