Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bluetooth connection hangs / too many messages - with v3.3.0b6 #1095

Closed
dlech opened this issue Jun 5, 2023 Discussed in #1091 · 18 comments
Closed

Bluetooth connection hangs / too many messages - with v3.3.0b6 #1095

dlech opened this issue Jun 5, 2023 Discussed in #1091 · 18 comments
Assignees
Labels
bug Something isn't working software: pybricks-micropython Issues with Pybricks MicroPython firmware (or EV3 runtime) topic: bluetooth Issues involving bluetooth

Comments

@dlech
Copy link
Member

dlech commented Jun 5, 2023

Discussed in https://github.com/orgs/pybricks/discussions/1091

Originally posted by Debenben June 3, 2023
I can reproduce something similar to the reported problem with v3.3.0b6 and the following code:

from pybricks.hubs import TechnicHub
from pybricks.parameters import Color

hub = TechnicHub(observe_channels=[0], broadcast_channel=1)

while True:
    hub.ble.observe(0)
    hub.light.on(Color.RED)
    hub.ble.broadcast("hello")
    print("running")

When running the above program the hub stops broadcasting "hello" after around 10 to 40 seconds. If connected to a computer it also stops printing "running" in the pybricks-ide. When connected to the computer and using long-press the hub starts blinking blue very fast and is no longer responsive, I need to remove the batteries to get it back to normal. If not connected to the computer I can stop the program with long-press on the button and then the hub behavior is back to normal.

The hub.light.on and print command is not necessary, the behavior is also there when only using hub.ble.observe and hub.ble.broadcast commands. The crash and weired behavior is gone when adding wait(1) in the loop.

@dlech dlech added bug Something isn't working topic: bluetooth Issues involving bluetooth software: pybricks-micropython Issues with Pybricks MicroPython firmware (or EV3 runtime) labels Jun 5, 2023
@dlech dlech self-assigned this Jun 5, 2023
@dlech
Copy link
Member Author

dlech commented Jun 5, 2023

I am able to reproduce the problem, even with observing disabled. It appears to cause the Bluetooth chip to crash (not only does it stop printing, it also disconnects).

Getting stuck on shutdown seems like a second bug - probably due to use after free due to pybricks/pybricks-micropython@7d34df4 hack.

@NStrijbosch
Copy link

Crashing of the bluetooth chip sounds familiar to me. When I was experimenting with simultaneously scanning and advertising this happened to me as well, and in the end led me to the more stable but less favourable option to switch between scanning and advertising instead. This inherently limited the number of commands to the bt-chip.

Your observation that this also happens with observing disabled on the other hand is new to me. You are still printing to the console and scanning simultaneously?

I unfortunately have never found a way to get feedback from the BT-chip, like an error or warning that indicated that it needed more time to process things before a new command could be send.

@dlech
Copy link
Member Author

dlech commented Jun 5, 2023

Good question. I tested it and it still crashes without the print(). However this is with it connected to a computer, so it is still sending status updates every 500ms which uses the same Bluetooth command as print().

Running the same program while not connected to the computer runs longer but crashes the Bluetooth chip eventually too.

dlech added a commit to pybricks/pybricks-micropython that referenced this issue Jun 5, 2023
Many Bluetooth tasks can get stuck (can't be canceled) if the Bluetooth
chip crashes. Adding checks for this at every yield point would be
very expensive in terms of code size. So for now, we just add another
hack to cancel all tasks on shutdown to avoid undefined behavior.

In some cases, this undefined behavior was causing the hub to not power
off unless the batteries were removed.

Issue: pybricks/support#1095
@dlech
Copy link
Member Author

dlech commented Jun 5, 2023

Getting stuck on shutdown seems like a second bug - probably due to use after free due to pybricks/pybricks-micropython@7d34df4 hack.

This part of the issue is fixed by pybricks/pybricks-micropython@7a8a849.

So at least we don't have to remove the batteries any more. Still investigating how to prevent the Bluetooth chip from crashing in the first place.

@Debenben
Copy link

Debenben commented Jun 6, 2023

Sounds like the bluetooth crash is not an easy topic. Thank you so much @dlech @NStrijbosch for all your continued efforts and not giving up on this.

I did another test and made an observation which I find interesting. Maybe it is something completely different and unrelated, but I imagine in case this is unintentional behavior it might potentially also help with the crash. I use v3.3.0b6 and:

from pybricks.hubs import TechnicHub
from pybricks.tools import wait

hub = TechnicHub(broadcast_channel=1)

count = 0

while(True):
    hub.ble.broadcast("hel", count)
    count = count + 1
    wait(10)

If I run the code while connected to pybricks-ide I recieve a broadcast every 100ms and the number in the message increments by 10 every time. It stays like this when disconnecting the computer while the program is running. This is what I would have expected. When I start the program again, this time with the hub not connected to the computer, the broadcast is 3 to 4 times faster at irregular intervals and eventually crashes. The timestamps that are transmitted until it crashes look like this:
image

To recieve the broadcast I used this nodejs script:
node-pybricks-broadcast.txt
and this is the output of the script which I plotted above:
timings.txt

@Debenben
Copy link

Debenben commented Jun 6, 2023

sorry, one correction: When connected to pybricks-ide it is also irregular, just sometimes the intervals look constant at first glance and in general it seems to take longer until it does:

image

@dlech
Copy link
Member Author

dlech commented Jun 6, 2023

When I start the program again, this time with the hub not connected to the computer, the broadcast is 3 to 4 times faster at irregular intervals and eventually crashes.

The broadcasting 3x as frequently when not connected bug should be fixed now by pybricks/pybricks-micropython@3317369.

@dlech
Copy link
Member Author

dlech commented Oct 28, 2023

Finally managed to get a testing rig setup for this.

IMG_20231028_135933277_HDR

@dlech
Copy link
Member Author

dlech commented Oct 30, 2023

I'm starting to think this is simply not fixable (bug in Bluetooth chip). By reducing the frequency of calling GAP_updateAdvertisingData() we can delay the Bluetooth chip from crashing. For example, adding wait(250) in the loop of the example program here will allow it to run for several hours, but eventually the Bluetooth chip still crashes. I've tried stopping and restarting advertising periodically like in Nard's implementation, but it does seem to make any difference. I've also tried tweaking all sorts of parameters in the various Bluetooth functions.

Is there a known version of Nard's firmware that can run for >5 hours without the Bluetooth chip crashing?

@laurensvalk
Copy link
Member

I checked out pybricks/pybricks-micropython@af8626c and ran the following (outdated/modified) programs to broadcast with just a 10 ms delay:

Technic Hub

from pybricks.hubs import ThisHub
from pybricks.parameters import Color
from pybricks.tools import wait
from pybricks.experimental import Broadcast


hub = ThisHub()
hub.light.on(Color.WHITE)

# Initialize broadcast with two topics.
radio = Broadcast(["counter", "hue"])

while True:
    for i in range(100):

        # Send one byte on the number signal.
        radio.send_bytes("counter", bytes([i]))
        wait(10)

        # Receive two bytes on the hue signal.
        hue_data = radio.receive_bytes("hue")

        # If we received something, set the light.
        if hue_data:
            hue = int.from_bytes(hue_data, "little")
            hub.light.on(Color(hue))

On Prime Hub

from pybricks.hubs import PrimeHub
from pybricks.parameters import Color
from pybricks.tools import wait
from pybricks.experimental import Broadcast


hub = PrimeHub()
hub.light.on(Color.WHITE)

# Initialize broadcast with two topics.
radio = Broadcast(topics=["counter", "hue"])

while True:

    # Read the counter value.
    counter_data = radio.receive_bytes("counter")

    # If we received it, display it, and send a response.
    if counter_data:
        count = counter_data[0] % 100
        hub.display.number(count)

        # Send back a hue value.
        hue_data = int(count * 3.6).to_bytes(2, "little")
        radio.send_bytes("hue", hue_data)

        # Stop if we reach the end.
        if count > 99:
            break

    # Don't need to update all the time.
    wait(10)
test.mp4

I started it 30 minutes ago, and it is still going! To be continued....

@laurensvalk
Copy link
Member

laurensvalk commented Oct 30, 2023

Is there a known version of Nard's firmware that can run for >5 hours without the Bluetooth chip crashing?

Still going after 7 hours! 🕐

At 7 * 60 * 60 * 100 = 2 520 000 messages going both ways this probably shows that it could work?

@dlech
Copy link
Member Author

dlech commented Oct 30, 2023

Would you mind attaching the firmware.zip for that here here?

@laurensvalk
Copy link
Member

primehub firmware.zip

technic firmware.zip

I also pushed https://github.com/pybricks/pybricks-micropython/tree/blebackup

@laurensvalk
Copy link
Member

laurensvalk commented Oct 30, 2023

I picked a build fairly randomly. It could be others work too.

I should probably also test both scripts on the latest firmware.

@laurensvalk
Copy link
Member

Still going after 12 hours. One of the hubs is about to run low on battery though 😄

@NStrijbosch
Copy link

At 7 * 60 * 60 * 100 = 2 520 000 messages going both ways

Just to get the details correct: although you call the broadcast 100 times per second, it will only update the advertising data maximal 20 times per second (depending on which commit you used, I introduced some pseudo random period at some point).

Nonetheless still impressive that it ran for at least twelve hours. This is also about the build Anika was running on a City hub during 7 days of LEGO world without many issues. That one was only receiving data though.

@dlech
Copy link
Member Author

dlech commented Nov 4, 2023

The firmware and test program that Laurens shared still locks up the Bluetooth chip on my Technic hub in about a minute or so when running while connected to Pybricks Code. It seems to be running longer when not connected to Pybricks Code though. So I will need to go back and try the current firmware without being connected to Pybricks code and see how that goes.

dlech added a commit to pybricks/pybricks-micropython that referenced this issue Nov 4, 2023
The Bluetooth chips on the Technic and City hubs seem to lock up if
the advertisement date is updated too many times. So far, we have only
found one condition where this doesn't happen. The Bluetooth chip
can't be connected to any other devices and advertising has to be
stopped and restarted between each update. This is quite limiting,
but seems to be the best we can do for now.

Issue: pybricks/support#1095
@dlech
Copy link
Member Author

dlech commented Nov 4, 2023

I think I finally made a breakthrough. I found some alternate commands to set the advertising data and one of them appears to avoid the lock up!

For those interested in the details, there are GAP Set Advertisement Token, GAP Remove Advertisement Token, and GAP Update Advertisement Tokens commands that update individual advertisement data type values instead of doing the whole thing at once. I tried these commands, but they result in the same lockup. But this leads me to believe that there is some sort of fancy data management in the Bluetooth chip that parses the data for each AD structure. It seems quite plausible that there is a memory leak related to this that is what eventually causes the Bluetooth chip to lock up.

Fortunately though, the chip also supports some standard Bluetooth HCI commands. There is one for setting LE Advertisement Data. I tried this one and it appears to update the data without triggering the memory leak!

Interestingly, with the standard HCI command we a command complete event as if we had called the TI vendor-specific command. But the docs say "Note that while all commands can be sent by the application processor to the Network Processor using HCI, not all events will be returned as they will be trapped and possibly discarded by the Host layer of the
BLE Stack.", so this makes sense. Also interestingly, when calling the token version of the vendor-specific commands we get the same command complete event twice! So maybe another Bluetooth chip bug there too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working software: pybricks-micropython Issues with Pybricks MicroPython firmware (or EV3 runtime) topic: bluetooth Issues involving bluetooth
Projects
None yet
Development

No branches or pull requests

4 participants