Can't dial error locks up device #66

jaredwolff · 2018-01-18T21:13:30Z

Background:

I have written some code which connects to multiple BLE devices and subscribes to a handful characteristics. Should the connection break or fail at any time, the code periodically checks to see if that client has been disconnected and attempts to reconnect.

The Problem:

When connecting using the snippet below, the device occasionally locks up after the ble.Connect command. The context does not return no matter what. My delay is a 15 second duration.

What's important to mention here is that the lock only happens when I receive a "can't dial" error. That's usually the sign of death and it never returns after its next attempt.

It also seems to happen with more devices. With 3 devices it happens sparingly and with 6 devices it happens much more often.

  // Filter based on address
  filter := func(a ble.Advertisement) bool {
    return strings.ToUpper(a.Address().String()) == m.address
  }

  // Scan for specified durantion, or until interrupted by user.
  log.Infof("[ %s ] Connecting...", m.address)
  ctx := ble.WithSigHandler(context.WithTimeout(context.Background(), *sd))
  cln, err := ble.Connect(ctx, filter)

  //If an error ignore, otherwise set up.
  if err != nil && cln == nil {
    log.Errorf("[ %s ] Cannot connect : %s", m.address, err)
    return err
  } else if err != nil && cln != nil {
    log.Errorf("[ %s ] Weird connect : %s", m.address, err)
    return err
  } else {
    log.Infof("[ %s ] Connected", m.address)

Expected Behavior:

I expect the device to cleanly connect or disconnect and proceed to the next chunk of code. (or return and retry depending on what happens)

Observed Behavior:

Here's a log right before the lockup. Pay attention to where it says Connecting...

time="2018-01-18T04:22:34Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."
time="2018-01-18T04:22:37Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connected"
time="2018-01-18T04:22:37Z" level=error msg="[ C0:F2:1E:9F:B5:C9 ] can't discover profile: can't discover services: ATT request failed: input channel closed: io: read/write on closed pipe\n"
time="2018-01-18T04:22:37Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] is disconnected"
time="2018-01-18T04:22:49Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."
time="2018-01-18T04:22:53Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connected"
time="2018-01-18T04:22:54Z" level=error msg="[ C0:F2:1E:9F:B5:C9 ] can't discover profile: can't discover services: ATT request failed: input channel closed: io: read/write on closed pipe\n"
time="2018-01-18T04:22:54Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] is disconnected"
time="2018-01-18T04:23:03Z" level=info msg="[ E9:A0:4D:07:38:2B ] Update success"
time="2018-01-18T04:23:04Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."
time="2018-01-18T04:23:07Z" level=info msg="[ E9:A0:4D:07:38:2B ] { 1516249387 } Batt:     92% - Raw: (92)"
time="2018-01-18T04:23:12Z" level=info msg="[ E9:A0:4D:07:38:2B ] { 1516249392 } Batt:     91% - Raw: (91)"
time="2018-01-18T04:23:19Z" level=error msg="[ C0:F2:1E:9F:B5:C9 ] Cannot connect : can't dial: can't dial: context deadline exceeded"
time="2018-01-18T04:23:19Z" level=info msg="[ C0:F2:1E:9F:B5:C9 ] Connecting..."

It does not return after the last line. Other go routines though continue to run in the background in the same program.

Note: can't discover profile: can't discover services: ATT request failed: input channel closed: io: read/write on closed pipe Doesn't always happen every time. Sometimes, I'll go to connect, get the can't dial error and then it fails right there.

The text was updated successfully, but these errors were encountered:

roylee17 · 2018-01-18T22:46:44Z

Can you do some experiments and see how many concurrent connections the device can support. And what happens when it exceeds the maximum? Does it handle it gracefully by returning HCI error, or the HCI channel just stuck or act abnormally?

jaredwolff · 2018-01-19T15:52:29Z

The BLE device on the gateway is based on an NRF52832 running a S132 soft device. Supposedly, according to Nordic, it supports up to 20 concurrent connections. There's just lot's of layers that could have a potential bug. (UART-BLE firmware, S132 soft device, this library)

In my case, the HCI channel usually sends an error from the previous connect attempt, and then locks up on the next attempt. This morning though I got it to lock up first thing without any previous errors with only 3 devices connecting.

Trying to reproduce the error has been tricky. I have 6 devices up and running and haven't had it quit yet.

Thanks for your insights Roy

roylee17 · 2018-01-19T16:52:03Z

Sorry, I wasn't clear in the previous comment. I meant the device (a usb dongle) that runs the Go BLE library.

For example, I had a USB dongle which only support one connection at a time. If I try to make a second concurrent connection, it returns HCI error, and the user(app) can retry later or disconnect the other connection first.

I had another USB dongle which can support 2 concurrent connections. But if I try to make a 3rd concurrent connection, the device would just hang. The user(app) has no way to recover from it until reset. In this case, we need to add some logic in the app to make sure the concurrent connections never exceeds 2.

An potentially easier way to test is to have a couple of usb dongles plugged to your laptop.

hci0 makes connections to hci1, hci2, ... and find out how many connections hci0 can support.
(test both cases for hci0 as central and peripheral. It might support different connections when acting as central or peripheral)

jaredwolff · 2018-01-19T17:45:31Z

Got it. I am trying that a bit right now.

One thing I'm noticing is that this lock-up happens after a scan during the dial phase. It's as if the context is pulling the rug out from underneath the dial phase and its leaving it in a weird state. My guess is if I make the context a bit longer (right now it's 15 seconds) I may run into this problem much less.

jaredwolff · 2018-01-20T21:51:50Z

Looking into it a bit more, I am subscribing to a handful of characteristics per device. That's actually done before connecting to subsequent devices. Because of the async nature, I think they may be causing an issue with the following connect or re-connect attempts. Disabling all the subscriptions and then re-enabling after all devices have re-connected seems to be resolving at least part of the issue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't dial error locks up device #66

Can't dial error locks up device #66

jaredwolff commented Jan 18, 2018

roylee17 commented Jan 18, 2018 •

edited

Loading

jaredwolff commented Jan 19, 2018 •

edited

Loading

roylee17 commented Jan 19, 2018

jaredwolff commented Jan 19, 2018

jaredwolff commented Jan 20, 2018

Can't dial error locks up device #66

Can't dial error locks up device #66

Comments

jaredwolff commented Jan 18, 2018

roylee17 commented Jan 18, 2018 • edited Loading

jaredwolff commented Jan 19, 2018 • edited Loading

roylee17 commented Jan 19, 2018

jaredwolff commented Jan 19, 2018

jaredwolff commented Jan 20, 2018

roylee17 commented Jan 18, 2018 •

edited

Loading

jaredwolff commented Jan 19, 2018 •

edited

Loading