Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't dial error locks up device #66

Open
jaredwolff opened this issue Jan 18, 2018 · 5 comments
Open

Can't dial error locks up device #66

jaredwolff opened this issue Jan 18, 2018 · 5 comments

Comments

@jaredwolff
Copy link

Background:

I have written some code which connects to multiple BLE devices and subscribes to a handful characteristics. Should the connection break or fail at any time, the code periodically checks to see if that client has been disconnected and attempts to reconnect.

The Problem:

When connecting using the snippet below, the device occasionally locks up after the ble.Connect command. The context does not return no matter what. My delay is a 15 second duration.

What's important to mention here is that the lock only happens when I receive a "can't dial" error. That's usually the sign of death and it never returns after its next attempt.

It also seems to happen with more devices. With 3 devices it happens sparingly and with 6 devices it happens much more often.

  // Filter based on address
  filter := func(a ble.Advertisement) bool {
    return strings.ToUpper(a.Address().String()) == m.address
  }

  // Scan for specified durantion, or until interrupted by user.
  log.Infof("[ %s ] Connecting...", m.address)
  ctx := ble.WithSigHandler(context.WithTimeout(context.Background(), *sd))
  cln, err := ble.Connect(ctx, filter)

  //If an error ignore, otherwise set up.
  if err != nil && cln == nil {
    log.Errorf("[ %s ] Cannot connect : %s", m.address, err)
    return err
  } else if err != nil && cln != nil {
    log.Errorf("[ %s ] Weird connect : %s", m.address, err)
    return err
  } else {
    log.Infof("[ %s ] Connected", m.address)

Expected Behavior:

I expect the device to cleanly connect or disconnect and proceed to the next chunk of code. (or return and retry depending on what happens)

Observed Behavior:

Here's a log right before the lockup. Pay attention to where it says Connecting...

time="2018-01-18T04:22:34Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."
time="2018-01-18T04:22:37Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connected"
time="2018-01-18T04:22:37Z" level=error msg="[ C0:F2:1E:9F:B5:C9 ] can't discover profile: can't discover services: ATT request failed: input channel closed: io: read/write on closed pipe\n"
time="2018-01-18T04:22:37Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] is disconnected"
time="2018-01-18T04:22:49Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."
time="2018-01-18T04:22:53Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connected"
time="2018-01-18T04:22:54Z" level=error msg="[ C0:F2:1E:9F:B5:C9 ] can't discover profile: can't discover services: ATT request failed: input channel closed: io: read/write on closed pipe\n"
time="2018-01-18T04:22:54Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] is disconnected"
time="2018-01-18T04:23:03Z" level=info msg="[ E9:A0:4D:07:38:2B ] Update success"
time="2018-01-18T04:23:04Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."
time="2018-01-18T04:23:07Z" level=info msg="[ E9:A0:4D:07:38:2B ] { 1516249387 } Batt:     92% - Raw: (92)"
time="2018-01-18T04:23:12Z" level=info msg="[ E9:A0:4D:07:38:2B ] { 1516249392 } Batt:     91% - Raw: (91)"
time="2018-01-18T04:23:19Z" level=error msg="[ C0:F2:1E:9F:B5:C9 ] Cannot connect : can't dial: can't dial: context deadline exceeded"
time="2018-01-18T04:23:19Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."

It does not return after the last line. Other go routines though continue to run in the background in the same program.

Note: can't discover profile: can't discover services: ATT request failed: input channel closed: io: read/write on closed pipe Doesn't always happen every time. Sometimes, I'll go to connect, get the can't dial error and then it fails right there.

@roylee17
Copy link
Contributor

roylee17 commented Jan 18, 2018

Can you do some experiments and see how many concurrent connections the device can support. And what happens when it exceeds the maximum? Does it handle it gracefully by returning HCI error, or the HCI channel just stuck or act abnormally?

@jaredwolff
Copy link
Author

jaredwolff commented Jan 19, 2018

The BLE device on the gateway is based on an NRF52832 running a S132 soft device. Supposedly, according to Nordic, it supports up to 20 concurrent connections. There's just lot's of layers that could have a potential bug. (UART-BLE firmware, S132 soft device, this library)

In my case, the HCI channel usually sends an error from the previous connect attempt, and then locks up on the next attempt. This morning though I got it to lock up first thing without any previous errors with only 3 devices connecting.

Trying to reproduce the error has been tricky. I have 6 devices up and running and haven't had it quit yet.

Thanks for your insights Roy

@roylee17
Copy link
Contributor

Sorry, I wasn't clear in the previous comment. I meant the device (a usb dongle) that runs the Go BLE library.

For example, I had a USB dongle which only support one connection at a time. If I try to make a second concurrent connection, it returns HCI error, and the user(app) can retry later or disconnect the other connection first.

I had another USB dongle which can support 2 concurrent connections. But if I try to make a 3rd concurrent connection, the device would just hang. The user(app) has no way to recover from it until reset. In this case, we need to add some logic in the app to make sure the concurrent connections never exceeds 2.

An potentially easier way to test is to have a couple of usb dongles plugged to your laptop.

hci0 makes connections to hci1, hci2, ... and find out how many connections hci0 can support.
(test both cases for hci0 as central and peripheral. It might support different connections when acting as central or peripheral)

@jaredwolff
Copy link
Author

Got it. I am trying that a bit right now.

One thing I'm noticing is that this lock-up happens after a scan during the dial phase. It's as if the context is pulling the rug out from underneath the dial phase and its leaving it in a weird state. My guess is if I make the context a bit longer (right now it's 15 seconds) I may run into this problem much less.

@jaredwolff
Copy link
Author

Looking into it a bit more, I am subscribing to a handful of characteristics per device. That's actually done before connecting to subsequent devices. Because of the async nature, I think they may be causing an issue with the following connect or re-connect attempts. Disabling all the subscriptions and then re-enabling after all devices have re-connected seems to be resolving at least part of the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants