Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLE broadcast: communication between pybricks hubs #80

Closed
wants to merge 36 commits into from

Conversation

NStrijbosch
Copy link
Contributor

@NStrijbosch NStrijbosch commented Nov 30, 2021

This PR is a starting point to support communication between Primehubs, Inventorhubs, Essentialhubs, Technichubs and Cityhubs.

The protocol is based on the protocol used by the hub to hub wordblocks of the official LEGO MINDSTORMS app. This implies communication is possible between a hub running pybricks and an inventor hub programmed using wordblock using the official LEGO app.

Code example
When running this code on multiple prime/inventor hubs the screen will sync.

from pybricks.ble import Broadcast
from pybricks.tools import wait
from pybricks.hubs import ThisHub
from pybricks.parameters import Button
from ustruct import pack, unpack

# Initialize the hub.
hub = ThisHub()

# Initialize broadcast
bc = Broadcast(signals=["signal_1"])

pressed = []
i=0
hub.display.number(0)
while True:
    pressed = hub.buttons.pressed()

    # Update screen if updated value is received
    if bc.received("signal_1")!=b'':
        i = unpack('<h', bc.received("signal_1"))[0] # unpack latest received data
        hub.display.number(i)
    
    # Update value if button is pressed
    if Button.LEFT in pressed:
        i=i-1
        hub.display.number(i)
        bc.transmit("signal_1",message=pack("<h",i)) # send updated value of i
        wait(500)
    elif Button.RIGHT in pressed:
        i=i+1
        hub.display.number(i)
        bc.transmit("signal_1",message=pack("<h",i)) # send updated value of i
        wait(500)

    wait(10)

Known issues:
Although this PR contains the essentials to get the communication working. There are a few minor issues:

  • To send data a hub needs to advertise data. To get full support for communication with inventor hubs running official firmware this requires advertising data with the adv_type=ADV_SCAN_IND. Currently, this does not work on the Technic Hub, hence it only sends with adv_type=ADV_IND. Communication between pybricks hubs works fine at the moment, but I am still looking for a fix to advertise with ADV_SCAN_IND. EDIT: BTChip reports error: bleIncorrectMode when sending with ADV_SCAN_IND
  • To have bidirectional communication hubs need to both scan(receive) and advertise (transmit). I am not sure how this is exactly implemented on BT-chip on the Technic Hub/City but the implementation in this PR is as follows:
    • when a broadcast is initiated the hub starts scanning
    • when transmit() is called the hub starts advertising. I think the BT-chip stops scanning temporarily.
    • 1000 ms after calling transmit() the hub stops advertising. Scanning restarts automatically (no explicit call is necessary)
      I implemented the 1000ms delay using an etimer, and it seems to work when starting a program from code.pybricks.com. However when I store the program in main.py the timer never expires, i.e., the hub advertises indefinitely and never restarts scanning. This prevents bidirectional communication when storing the program on the hub. I am not sure what all the process are doing in the background that cause this behaviour. Maybe you have some insights to fix this bug.
  • Some weird behaviour when programs are not started at the same time. This is caused by the following: an index in the advertisement data is used to compare if the received advertisement data is new or old. The current implementation uses the simplest possible comparison. Consequently, my rule of thumb is to restart the programs on each hub at all times to get reproducible behaviour. In the future an update might be useful to allow for a scenario where one hub keeps running the same program while the program on another hub is restarted in the mean time. (The official MINDSTORMS app has similar issues)
  • The current implementation on code.pybricks.com/beta.pybricks.com is (obviously) not yet suitable for multi-hub projects. Hence, when testing my advice is to (where possible considering the above issues) run the program from main.py.

Anyhow, thanks in advance for all feedback you may have!

@laurensvalk
Copy link
Member

Nice work! This could be a great first step in bluetooth communication, and compatibility with the official MINDTORMS app is a nice touch.

It could be a nice addition/alternative to the BLE UART in pybricks/support#262 since both approaches have different use cases.

As you point out, there's still some work to be done before we can merge it, but a PR is just the right spot for that :)

@laurensvalk
Copy link
Member

laurensvalk commented Dec 2, 2021

I've taken a closer look and reworked parts of it to address some of your questions and some stability issues:
https://github.com/pybricks/pybricks-micropython/tree/broadcast2

The aim was to move all the platform-agnostic logic to pbio, while leaving only driver-specific parts in pbdrv. Except for removing the callback setter, nothing has changed in pbdrv yet, so the actual Bluetooth performance should be the same.

The pybricks.ble.Broadcast class is now mostly just a wrapper around the pbio functionality. All it does is convert the qstrs to crc32 hashes on first use, and then it just does a look up when they are used again. pbio uses only the hashes.

With the original PR, the hubs would crash (watchdog timer) quite frequently. While I haven't ruled out all possible causes, I think part of it may be related to how data is allocated and used. When transmitting, I think the process could be trying to access data that no longer exists. When handling advertisement data, it could sometimes write data to heap even after the user program ends.

This update resolves this by allocating a number of read-only signals statically. There is one static object for transmission.

A next step could be to look at how the various processes should work, especially considering (re)starting of scanning and advertising.

Some weird behaviour when programs are not started at the same time. This is caused by the following: an index in the advertisement data is used to compare if the received advertisement data is new or old. The current implementation uses the simplest possible comparison. Consequently, my rule of thumb is to restart the programs on each hub at all times to get reproducible behaviour.

I've added a timestamp to incoming messages, which could be a way to address this.

This is not extensively tested yet, so not everything may work as advertised (ha!).

EDIT: With these updates, the hub is no longer crashing and rebooting with the watchdog timer.

@NStrijbosch NStrijbosch force-pushed the broadcast branch 2 times, most recently from 662a229 to faa1f8c Compare December 3, 2021 13:58
@laurensvalk
Copy link
Member

Thanks for the great (offline) discussion so far @NStrijbosch. I've added my changes on top of your branch.

One of the next steps would be to convert the transmission into a task which updates and waits for the advertising data to be updated. It could (should?) also start another (cancellable) task to stop advertising after a timeout, and not wait for it at the user level.

Combined, this ensures that we 1) don't have to wait a second between transmissions and 2) can safely update the transmitted data without overriding ongoing tasks.

@laurensvalk
Copy link
Member

laurensvalk commented Dec 6, 2021

I've added an update with two additional methods (send and receive) to make communication easier.

You can now send and receive a tuple with up to 8 values at once, as follows:

radio = Broadcast(topics=["info"])
radio.send("info", ("Hello", 1.234, "world!"))

The tuple may contain integers (2 bytes or 4 bytes), floats (4 bytes), or strings ( N + 1 bytes), as long as the total fits within 20 bytes.

This facilitates many use cases, including things like remote control of a vehicle, e.g:

radio.send("info", (steering, speed, light_status))

On the receiving end, you could do:

data = radio.receive("info")
if data is not None:
    steering, speed, light_status = data
    # Apply steering, speed, etc.

The original methods to send raw bytes are still available, though they are renamed to send_bytes and receive_bytes. This can be used with hubs running the official MINDSTORMS firmware, or any other use case where raw bytes are required.

@antonvh
Copy link
Collaborator

antonvh commented Dec 7, 2021

Great stuff going on here. Following with interest...

@NStrijbosch
Copy link
Contributor Author

NStrijbosch commented Dec 9, 2021

To have bidirectional communication hubs need to both scan(receive) and advertise (transmit). I am not sure how this is exactly implemented on BT-chip on the Technic Hub/City but the implementation in this PR is as follows:

  • when a broadcast is initiated the hub starts scanning
  • when transmit() is called the hub starts advertising. I think the BT-chip stops scanning temporarily.
  • 1000 ms after calling transmit() the hub stops advertising. Scanning restarts automatically (no explicit call is necessary)
    I implemented the 1000ms delay using an etimer, and it seems to work when starting a program from code.pybricks.com. However when I store the program in main.py the timer never expires, i.e., the hub advertises indefinitely and never restarts scanning. This prevents bidirectional communication when storing the program on the hub. I am not sure what all the process are doing in the background that cause this behaviour. Maybe you have some insights to fix this bug.

This is fixed through: 7fb0166

Besides this the following claim was false:

I think the BT-chip stops scanning temporarily.

It turns out the BT-chip can scan and transmit at the same time

@laurensvalk
Copy link
Member

We've now added a broadcast process that ensures we don't que up too much stuff if you send things in a tight loop. This makes things quite a bit smoother and more reliable.

That also means some additional work is required to start and stop this background process cleanly. This isn't done yet, so you might see some unexpected behavior after running your first program if you try this build.

@JorgePe
Copy link
Contributor

JorgePe commented Dec 13, 2021

Thanks for this great work!
With some hints from Laurens I got a way to listen to the broadcasts and event broadcast data myself with just the linux Bluez tools (bluetoothctl + btmon on my Ubuntu and hcitool + hcidump on ev3dev because Bluez stack is much older there). Not pretty, not stable, not fast (at least not on EV3) but it works and gives a good idea of what it's possible with connectionless BLE communication.
... but I still think there is a place for Nordic UART Service somewhere on this project ;)

@dlech
Copy link
Member

dlech commented Dec 20, 2021

FYI, LEGO has fixed the advertising data bugs where the length was missing and the byte order of the company ID was wrong in the MINDSTORMS app v10.3.0. Unfortunately, they are still using ADV_SCAN_IND instead of ADV_NONCONN_IND.

@@ -152,7 +153,8 @@ void pbio_broadcast_parse_advertising_data(const uint8_t *data, uint8_t size) {
}

// We only process data with the right header
if (memcmp(data, &transmit_signal.header[0], 3)) {
if ((data[0] != size - 1) && data[1] != transmit_signal.AD_type &&
memcmp(&data[2], &transmit_signal.company_id, 2)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memcmp(&data[2], &transmit_signal.company_id, 2)

Can probably be simplified to something like

pbio_get_uint16_le(&data[2]) != transmit_signal.company_id

@NStrijbosch
Copy link
Contributor Author

I will just leave some issues I am trying to fix here:

When running a program on the technic hub from main.py:

  1. once every about 500 send() commands the hub gets in a blocking state. Some observations in this blocking state:
    • Exiting the program with a press on the button does not work
    • Shutting down the hub with a long press does work
    • When attaching a motor and using the command run_target(50,0) shows the controller is still active when the hub reaches this blocking state.
  2. when using bidirectional communication receiving does not work in some (1 out of 3) tests.
    • using another device I can see that messages are send by the other hub (so something is going wrong on the technic hub side)
    • after stopping and restarting the scan at the start of the program does not seem to improve the success rate.
    • this is something that goes wrong the whole program or goes well the whole program.

Both issues are hard to debug since I am not able to find a case with a 100% chance on either of the issues above. Besides this I only observed the issues when running from main.py, so I cannot print messages to the terminal to see where the program is stuck...

Anyhow, I still have some ideas to test. I will update the observations above accordingly.

@dlech
Copy link
Member

dlech commented Jan 7, 2022

  1. the hub gets in a blocking state.

Does this make the problem go away?

diff --git a/pybricks/util_pb/pb_task.c b/pybricks/util_pb/pb_task.c
index 73b58558..84e20878 100644
--- a/pybricks/util_pb/pb_task.c
+++ b/pybricks/util_pb/pb_task.c
@@ -42,7 +42,7 @@ void pb_wait_task(pbio_task_t *task, mp_int_t timeout) {
         pbio_task_cancel(task);
 
         while (task->status == PBIO_ERROR_AGAIN) {
-            MICROPY_VM_HOOK_LOOP
+            MICROPY_EVENT_POLL_HOOK
         }
 
         nlr_jump(nlr.ret_val);

This isn't a proper fix because it would cause tasks to keep running after an exception and lead to memory corruption, etc But it would tell us that the problem is that a task was requested to be canceled but never properly canceled.

@NStrijbosch
Copy link
Contributor Author

Does this make the problem go away?

The problem is not gone. But it does change the behaviour in the blocking state: With this change a single press will shutdown the hub, instead of the long press before. So at least pb_wait_task is related to the issue.

One of my own test was to change pb_wait_task(&task,-1 to pb_wait_task(&task,1000) to trigger an error if the task wasn't finished within 1 second. Since I expect that the BTchip is not sending the HCI command the driver is waiting for when setting the advertising data. However this timeout error was never triggered, but maybe I interpret the pb_wait_task wrong.

@dlech
Copy link
Member

dlech commented Jan 7, 2022

If the task times out, it still has to be cancelled and wait for cancellation to complete. But some tasks are not currently cancellable at all or in some regions of code, so will cause a lock up. Since the hub powers off/reboots when you make the change I suggested, that indicates that it caused a hard crash, which was expected.

@NStrijbosch
Copy link
Contributor Author

@dlech You are correct. Some of the tasks I added were not cancellable. I am trying to understand what is going on.

I suspect that the assumption that the BTchip will perform all commands without an error is not justified anymore when stretching the limit of fast advertisement updates.

To debug the status of the BTchip I use a printf(). But I also found the following piece of code, and was wondering how to read the debug messages that are made using DBG()? I do not see them in my terminal when starting a program using pybricksdev or do I need other configuration changes other than setting DEBUG==2? Or is extra hardware required?

#elif DEBUG == 2
#include <stdio.h>
#include <stm32l431xx.h>
#define DBG(fmt, ...) { \
char dbg[64]; \
snprintf(dbg, 64, fmt "\r\n",##__VA_ARGS__); \
for (char *d = dbg; *d; d++) { \
while (!(LPUART1->ISR & USART_ISR_TXE)) { } \
LPUART1->TDR = *d; \
} \
}

@dlech
Copy link
Member

dlech commented Jan 9, 2022

I suspect that the assumption that the BTchip will perform all commands without an error is not justified anymore when stretching the limit of fast advertisement updates.

Yes. And in the case of actually connecting to other devices, we need to consider the handle of the remote device as well. Basically, we need to replace all instances of PT_WAIT_UNTIL(pt, hci_command_status); and PT_WAIT_UNTIL(pt, hci_command_complete); with something like

PT_WAIT_UNTIL(pt, {
uint8_t *payload;
uint16_t event;
HCI_StatusCodes_t status;
(payload = get_vendor_event(remote_handle, &event, &status)) && ({
if (event == ATT_EVENT_ERROR_RSP && payload[0] == ATT_READ_BY_TYPE_REQ) {
task->status = PBIO_ERROR_FAILED;
goto disconnect;
}
event == ATT_EVENT_READ_BY_TYPE_RSP;
}) && ({
// hopefully the docs are correct and this is the only possible error
if (status == bleTimeout) {
task->status = PBIO_ERROR_TIMEDOUT;
goto disconnect;
}
// this assumes that there is only one matching characteristic
// (if there is more than one, we will end up with the last)
// it also assumes that it is the only characteristic in the response
if (status == bleSUCCESS) {
remote_lwp3_char_handle = pbio_get_uint16_le(&payload[4]);
}
status == bleProcedureComplete;
});
});
. This does more matching to ensure that the response was actually for this specific request. We can probably get away with leaving the init sequences the way they are for now since no other commands should be running at that time. But anything that can run after that needs this more careful handling.

To debug the status of the BTchip I use a printf().

This is dangerous since printing also uses Bluetooth. It will likely change the behavior of what you are trying to inspect. Also, using printf in any pbio code is dangerous because it can cause reentrany issues because it runs the contiki event loop while we are already in the contiki event loop. One trick I use sometimes is to make the print buffer very big so that printf doesn't block when called. This avoids the reentrancy problems in most cases.

But I also found the following piece of code, and was wondering how to read the debug messages that are made using DBG()?

This uses a UART to print debug messages. This works around the problem of printf using Bluetooth. Although it still isn't perfect because it is blocking and can break things by changing the timing. To use it, you have to change the configuration to make the hub only have 3 ports instead of 4 and then add some code in the early init in platform.c to configure the UART on port D. If you look way back in git history, you can probably find a time where this was implemented.

@NStrijbosch
Copy link
Contributor Author

NStrijbosch commented Jan 10, 2022

This uses a UART to print debug messages.

This sounds like exactly what I hoped for! I look back in the GIT history to check the implementation.

Deep down: 48c442b

@NStrijbosch
Copy link
Contributor Author

  • once every about 500 send() commands the hub gets in a blocking state. Some observations in this blocking state:

    • Exiting the program with a press on the button does not work
    • Shutting down the hub with a long press does work
    • When attaching a motor and using the command run_target(50,0) shows the controller is still active when the hub reaches this blocking state.

Using the UART hack on port D I was able to trace the cause of the blocking state. It seems that the pbdrv_bluetooth_spi_process can get stuck in the PROCESS_WAIT_UNTIL(spi_xfer_complete); here:

// read the remaining message
spi_xfer_complete = false;
pdata->spi_start_xfer(&write_buf[NPI_SPI_HEADER_LEN],
&read_buf[NPI_SPI_HEADER_LEN], xfer_size);
PROCESS_WAIT_UNTIL(spi_xfer_complete);

Consequently, there are no new events registered anymore. When waiting, e.g., for an event that indicates that advertisement data is updated, this event is not triggered. Leading to the blocking state.

@dlech
Copy link
Member

dlech commented Jan 15, 2022

I wonder if we are getting an SPI error. If you override the weak function void HAL_SPI_ErrorCallback(SPI_HandleTypeDef *hspi), is it called?

@NStrijbosch
Copy link
Contributor Author

NStrijbosch commented Jan 15, 2022

I wonder if we are getting an SPI error. If you override the weak function void HAL_SPI_ErrorCallback(SPI_HandleTypeDef *hspi), is it called?

I added the following to lib/pbio/platform/technic_hub/platform.c:

void HAL_SPI_ErrorCallback(SPI_HandleTypeDef *hspi) {
    DBG("SPI error");
}

including the required definition of DBG() (I tested this function in another callback and printing over port D from this file works).

But unfortunately the DBG print in HAL_SPI_ErrorCallback is not exectued, i.e., this function is not called... Or should it be placed somewhere else?

@dlech
Copy link
Member

dlech commented Jan 15, 2022

Hmm... it sounds as if we could be just missing an interrupt then. I don't see how else pbdrv_bluetooth_stm32_cc2640_spi_xfer_irq() would not be called.

@NStrijbosch
Copy link
Contributor Author

NStrijbosch commented Jan 15, 2022

Could it be just a single interrupt?

The hub is scanning, so advertising data should be received by the BTchip all the time. So shouldn't this interrupt be triggered again after missing one? At the moment when it is stuck, the interrupt handler is not called anymore.

In other words: couldnt the cause be that the BTchip just stopped sending stuff?

@dlech
Copy link
Member

dlech commented Jan 15, 2022

The interrupt is for the end of a DMA transfer, so there will never be another interrupt until a new DMA transfer is started. But that doesn't happen because the loop that starts the transfers is locked waiting for the previous transfer to complete.

On the SPI bus, unlike UART, the MCU has to initiate all communications.

@dlech
Copy link
Member

dlech commented Jan 15, 2022

Another thought. We could check the return value of HAL_SPI_TransmitReceive_DMA() to verify that the DMA transfer is actually started.

@NStrijbosch
Copy link
Contributor Author

The interrupt is for the end of a DMA transfer, so there will never be another interrupt until a new DMA transfer is started. But that doesn't happen because the loop that starts the transfers is locked waiting for the previous transfer to complete.

On the SPI bus, unlike UART, the MCU has to initiate all communications.

Clear!

the advertising state `advertising_now` needs to be changed when advertising for broadcasts and connecting to code.pybricks.com.
It also turned out that by changing the maximum number of peripheral devices to two for the primehub removed the implicit advertisement stop after connecting to code.pybricks.com. This is also fixed by an expicit advertisement stop.
when the broadcast process is stopped, also stop scanning and stop advertising.
Since the Bluetooth and BLE roadmap won't be finalized for the
upcoming release, it's better to import this class via experimental.

This means that we can still change it even if we merge it.
Simultaneous advertising and scanning is unreliable with the stm32_cc2640.
This commit introduces a broadcast process in the driver that periodically
switches between scanning (30 ms) and advertising (20 ms).
This enables to get as close to simultaneous broadcast sending and receiving as possible.
For endurance testing this longer test is helpfull
To stop the broadcast process after the program is stopped by the user
stopping the broadcast process is not sufficient. When stopping it it is likely to be still advertising or scanning. Hence, after stopping the process the advertising and scanning also need to be stopped.
Add the info() method. The argument is the name of a topic.
It returns a tuple containing: index, timestamp, and rssi value corresponding to the last received data for this topic.
To comply with the conventional structure of advertising data the number of bytes should correspond to the complete number of bytes (including the meta data).
@massimiliano-mantione
Copy link

massimiliano-mantione commented Dec 28, 2022

Hi!
I have been experimenting with this feature using the build provided in the "hub to hub communication" project page ("https://pybricks.com/projects/tutorials/wireless/hub-to-hub/broadcast/").
Thanks for your work on this, it is amazing!

What I need is bidirectional communication between two technic hubs, ideally with "minimal" lag.

To test responsiveness (a bit empirically) I connected two motors to each hub to ports A and B and used the position of the motor on port A (the angle() reading) to move motor B on the other hub with track_target() (broadcasting positions bidirectionally with two topics).

I noticed that the communications works for a few seconds, then stops for a few seconds, then resumes again...

I load the programs on both hubs from beta.pybricks.com but then I disconnect the browsers to avoid console-related buetooth traffic.

Is there something I can do to troubleshoot this?
Anf BTW, what does "start from main.py" mean?

@NStrijbosch
Copy link
Contributor Author

I noticed that the communications works for a few seconds, then stops for a few seconds, then resumes again...

Thank you for the feedback. I do not recognize the drop in communication that takes seconds. Can you share a small code example such that I can reproduce this?

Anf BTW, what does "start from main.py" mean?

With this implemention it is important to distinguish between the two possible scenario's to run a program:

  • Start from beta.pybricks.com and press the start button on the screen (this way the hub stays connected to your PC)
  • Start by pressing the button on the hub (while being disconnected from your PC). The program that is stored on the hub is used in that case. When I build the experimental firmware I can already store the program in a file name main.py. Hence, I refer to this situation as starting from main.py. But I think this is practically the same as starting the program once from beta.pybricks.com, disconnect from beta.pybricks.com, and then start the program by pressing the button on the hub (This is what you already are doing, which is the most optimal situation for broadcasting)

@massimiliano-mantione
Copy link

massimiliano-mantione commented Dec 28, 2022

Well, the two programs I use are short enough.

Here is the "commander":

from pybricks.hubs import TechnicHub
from pybricks.pupdevices import Motor
from pybricks.parameters import Color, Port
from pybricks.robotics import DriveBase
from pybricks.tools import wait, StopWatch
from pybricks.experimental import Broadcast

hub = TechnicHub()
ma = Motor(Port.A)
mb = Motor(Port.B)

radio = Broadcast(topics=["data", "cmd"])

data_angle = 0
cmd_angle = 0

while True:
    cmd_angle = ma.angle()
    mb.track_target(data_angle)

    radio.send("cmd", (cmd_angle))
    data = radio.receive("data")
    if data:
        data_angle = data
        hub.light.on(Color.GREEN)
    else:
        hub.light.on(Color.RED)

    wait(10)

And here is the "reporter":

from pybricks.hubs import TechnicHub
from pybricks.pupdevices import Motor
from pybricks.parameters import Color, Port
from pybricks.robotics import DriveBase
from pybricks.tools import wait, StopWatch
from pybricks.experimental import Broadcast

hub = TechnicHub()
ma = Motor(Port.A)
mb = Motor(Port.B)

radio = Broadcast(topics=["data", "cmd"])

data_angle = 0
cmd_angle = 0

while True:
    data_angle = ma.angle()
    mb.track_target(cmd_angle)

    radio.send("data", (data_angle))
    cmd = radio.receive("cmd")
    if cmd:
        cmd_angle = cmd
        hub.light.on(Color.GREEN)
    else:
        hub.light.on(Color.RED)

    wait(10)

They are actually mostly identical, with just the topics reversed.
I refer to them as "commander" and "reporter" because in the actual project I plan to keep all the real "logic" on one brick (the "commander") and use the other one like an "extension" to report data and execute commands with the connected motors and sensors (hence the name "reporter").

@massimiliano-mantione
Copy link

BTW, I'd like to know the actual rate limit for bidirectional communication :-)
My project is a robot for a competition, and if at all possible I'd like to push the msg exchange to 100Hz (and even that would be slow for what would be really needed).

It's a bit off-topic, but I wonder if a Technic Hub could be configured as a master for the Nordic UART Service (NUS), and connect to the other hub using the NUS, and if this would result in better performance than this amazing but tricky broadcast machinery (which, I admit, is a stroke of genius in its way of piggybacking on the BLE device advertising data frames...).

@NStrijbosch
Copy link
Contributor Author

Well, the two programs I use are short enough.

Thanks I will try to reproduce your issue in the coming days.

I'd like to know the actual rate limit for bidirectional communication

In the current implementation this is 20 Hz for the Technic/City hubs (these switch between scanning (20ms) and advertising (30ms)). For the Inventor/Prime/Essential hubs this rate might be better, these hubs are able to scan and advertise at the same time. However you have to consider that it is a lossy communication protocol, i.e., increasing the communication rate might lead to a lot of lost data.

It's a bit off-topic, but I wonder if a Technic Hub could be configured as a master for the Nordic UART Service (NUS), and connect to the other hub using the NUS,

I think this should be possible, and it could definitely give you better performance than the current broadcast. At the time this broadcast implementation seemed to be an easier way to get hub-to-hub communication working for a lot of usecases.

@massimiliano-mantione
Copy link

massimiliano-mantione commented Dec 29, 2022

I think I have a reasonable explanation for the behavior I am observing.

In the current implementation this is 20 Hz for the Technic/City hubs (these switch between scanning (20ms) and advertising (30ms)).

The two hubs are not synchronizing.

What I fear is happening is that the 50ms period (20ms scan + 30ms send) is not "exact", and the two hub periods change "phase" periodically (relatively slowly).
So, there are time intervals when the two periods are phased "properly" (one sends and the other receives), and there are time intervals when they are either talking past each other or listening to each other's silence.

I think that this could be verified with a BLE radio scanner on a third device (like a PC or a radio-enabled CPU like the ESP32), and I have seen instructions on how to make it in other related issues (like here) but I still did not have the time to do it.

Now, supposing that I am right, and given the fact that technic hubs are not able to advertise and receive at the same time, to make this work in the bidirectional case we'd need to sync their periods in some way.

I already have a few ideas, but before explaining them I'd like to check a few things about hardware capabilities.
I am reading the code for this PR in full, but there are a few details I am not getting.
Specifically, I have these questions:

  • when the BLE chip is in advertising mode, do we have control over how frequently it sends advertising packets over time?
  • do we have control over every sent packet, in the sense that we can change its contents? (is seems we do, because the 1 byte "counter" is incremented at each retransmission of the same packet, but maybe I am misunderstanding)
  • could we attach a microsecond-resolution timestamp to every received packet (it seems we do it in the last commits)?
  • do we have access to the sender BLE device ID for every received packet?

@massimiliano-mantione
Copy link

massimiliano-mantione commented Dec 29, 2022

My general idea is to give more information to the broadcast system initialization about the "network topology".
Even now, every brick participating in the network must already "share" the topic names with the other bricks.
In the new scheme of things, in every brick, the list of topics should be identical (and ordered), and each topic should have the following info attached:

  • whether this brick is going to publish on it or receive from it
  • a transmission time period (this could have a sensible default, again, shared between brick builds)

Each brick is supposed to follow the topic list cyclically, and for each topic either receive or transmit for the transmission time period of that topic.

Each packet should contain a number that tells how many microseconds the transmitter is inside its transmission time window.
Each brick, during each "reception" time window, should use the timestamps in the received packets to "sync its period" with the transmitter.

Note two more things:

  • also hardware that can receive and transmit at the same time should participate in this dance, otherwise it risks transmitting when others are not listening (and, maybe, produce ratio traffic that can lead to lost packets)
  • in the simple unidirectional case, with a set of topics written by a single brick and received by all others, the transmitter always sends and the readers always receive, and the latency would be the minimal BLE physical latency

I know this complicates things, but I fear that it is the only way to make it work with the technic hub BLE chip.

The technic hub is dear to me because it is likely the best hardware we can have for Lego robotic competitions.
Mindstorms have been discontinued.
Spike Prime Hubs are expensive, and they have built-in 2 cells LiPo batteries that severely limit the performance of the attached motors.
Technic hubs are amazing but they have only four ports, no display, and a single button.
Two technic hubs, properly paired, and talking to an external device for initial robot configuration, would be a perfect replacement for a single EV3 brick :-) (8 ports total)

I care about latency because our robots move at about 1m/s (line followers are even faster).
At those speeds waiting 50ms means reacting 50mm later, which means losing the competition :-/

@massimiliano-mantione
Copy link

One final comment, with one more detail on where we could put the timestamp inside the packet payload.

We are already using every available byte, but I think we could repurpose the topic CRC32 four bytes.
First of all, a few observations/questions:

  • does it really make sense for a topic to be readable and writable by the same device? (I guess no)
  • does it really make sense for a single topic to be writable by more than one device? (I hope not, not for our use cases)
  • do we really want to support more than one published topic per device? (I guess yes :-/ )
  • could we accept filtering messages by sender BLE ID before dealing with topics?
  • could we accept mandating that all the bricks joining a "network" share the full "network topology configuration", including the BLE IDs of all the nodes?

My general idea is to use "sender IDs" to understand "what is being transmitted", and let each sender write ideally to a single topic, and at most a handful of topics (we already have a static limit of four topics for the technic hub...).
This way the 32 bits of the CRC32 would be mostly useless, and they could be repurposed to contain:

  • likely 24 bits of "microseconds inside the time window" timestamp information
  • likely 8 bits to identify the sender topic, if the sending device has more than one

This would leave the whole radio framework unchanged but would support the "sync dance" described in the previous message.

If you think all of this makes sense, I can really help with the implementation on top of this PR!!!

@pbochynski
Copy link

@massimiliano-mantione I am rather skeptical about using broadcast for real-time synchronization between hubs. There is no guarantee that the message is received and many other factors can break the communication. A reliable synchronization protocol would require acknowledging messages and would introduce a substantial delay in the transmission. I see broadcast as more like an asynchronous communication channel. The main benefit of using it is that you don't need to know the topology. New hubs just join the network and you don't need to pair them. So my suggestion would be to support sharing topics by multiple hubs and collect responses from all devices (now responses are collected per topic hash and the last response wins).
@laurensvalk @dlech I am not sure the discussion should happen in the PR itself - maybe it should be moved to a separate issue. Anyway, when version 3.2 is out you promised to come back to the Bluetooth topic again. What is your plan for merging this PR?

@laurensvalk
Copy link
Member

Thanks for all the great discussions and ideas everyone! Feel free to open new discussions if you'd like to explore certain topics further.

This test uses the LED to indicate data dropouts.
The broadcast process periodically switches between scanning and advertising.
Previously this was done with a fixed timing interval (50ms).
This could lead to this process being fully synchronized, i.e., both hubs are scanning at the same time and advertising at the same time, which makes it imposible to transfer data via the broadcast.

This commit introduces a pseudo random timing of both the scanning and advertising interval. Each in the interval [15,35] ms.
This makes it virtually impossible to have fully synchronization for multiple periods in a row.
@NStrijbosch
Copy link
Contributor Author

The two hubs are not synchronizing.

That is correct, and I think the issue you notice is indeed related to this.

Ideally the advertising and scanning are completely opposite, i.e., the broadcast process is asynchronous as possible.
With a constant period length and two hubs this could be possible. The solution for three or more hubs is not as trivial.

In the current version available here the constant period length is not exactly constant, leading to the process getting slowly in sync and slowly out of sync again. This can be observed as data drop outs for multiple seconds.

One solution I could think of to minimize the chance of the process being in sync is to make the period random, which is included by f747335. This solution should work for any number of hubs.

I did some experiments and this greatly improved the bidirectional communication between two technic hub that transfer data at a high rate. Of course there are the expected data dropouts, but they don't take seconds anymore.

@laurensvalk
Copy link
Member

I rebased this branch here: https://github.com/pybricks/pybricks-micropython/tree/broadcast-rebase.

The basics still seem to work, but for Prime Hub it now only works if it is not connected to the computer. So I did not force push this branch yet.

laurensvalk referenced this pull request Apr 14, 2023
This adds new methods to broadcast and observe advertising data using
Bluetooth Low Energy. The communication protocol uses the LEGO
manufacturer-specific data similar to the experimental protocol
from the official Robot Inventor firmware. Multiple Python objects
are serialized using a compressed format to squeeze as many objects
into the 31 byte advertising data as possible.

Since the BLE object is a singleton initialized on the hub object, the
initialization parameters are added to the Hub constructors.

Each hub can only broadcast on a single "channel" and can receive on
multiple channels. Channels are limited to 16 to avoid allocating too
much memory.
@dlech dlech mentioned this pull request Apr 14, 2023
@laurensvalk
Copy link
Member

A pull request with similar functionality inspired by this one has been merged in #158. Thanks everyone for all the contributions, especially @NStrijbosch !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants