Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Repeated calling of hub.display.animate() would cause Spike Hub to hang. #1295

Closed
davidwu226 opened this issue Nov 21, 2023 · 16 comments · Fixed by pybricks/pybricks-micropython#268
Assignees
Labels
bug Something isn't working hub: primehub/inventorhub Issues related to the LEGO SPIKE Prime hub and LEGO MINDSTORMS Robot Invetor hub software: pybricks-micropython Issues with Pybricks MicroPython firmware (or EV3 runtime)

Comments

@davidwu226
Copy link

Describe the bug

If I call hub.display.animate() repeatedly, the Spike Hub sometimes hang. Hitting the center button does not stop the program nor does hitting the stop button in the IDE. I will have to take out the battery in order to reset the hub.

To reproduce

from pybricks.hubs import PrimeHub
from pybricks.tools import wait, Matrix

hub = PrimeHub()
delay = 10

x = Matrix([
    [255,255,255,255,255],
    [255,  0,  0,  0,  0],
    [255,  0,  0,  0,  0],
    [255,  0,  0,  0,  0],
    [255,  0,  0,  0,  0],
])

while True:
    hub.display.animate([x, x], 10)
    wait(delay)

Run the above and try to hit the center button to stop the program. If it works, try again (or turn off the hub and start it again).

Expected behavior
I should be able to stop the hub using the center button or by hitting stop in the IDE.

@davidwu226 davidwu226 added the triage Issues that have not been triaged yet label Nov 21, 2023
@laurensvalk
Copy link
Member

Thank you.

Can you please report the result of this script?

from pybricks import version
print(version)

We recently fixed something like this, so hopefully we're all good when we release this new version.

@laurensvalk laurensvalk added hub: primehub/inventorhub Issues related to the LEGO SPIKE Prime hub and LEGO MINDSTORMS Robot Invetor hub and removed triage Issues that have not been triaged yet labels Nov 22, 2023
@laurensvalk
Copy link
Member

laurensvalk commented Nov 22, 2023

I can confirm that this is still an issue with the latest release candidate as well. Thanks for reporting!

This also appears to be another case where the MicroPython system abort is not working, so maybe we should revert that one at some point too.

@laurensvalk laurensvalk added bug Something isn't working software: pybricks-micropython Issues with Pybricks MicroPython firmware (or EV3 runtime) labels Nov 22, 2023
@laurensvalk
Copy link
Member

The issue seems to only occur if it is called again before the first frame update.

This adapted script always works correctly for any delay > 0.

from pybricks.hubs import PrimeHub
from pybricks.tools import wait, Matrix

hub = PrimeHub()
delay = 5

x = Matrix([
    [255,255,255,255,255],
    [255,  0,  0,  0,  0],
    [255,  0,  0,  0,  0],
    [255,  0,  0,  0,  0],
    [255,  0,  0,  0,  0],
])

while True:
    hub.display.animate([x, x], delay)
    wait(delay + 1)

@laurensvalk
Copy link
Member

Simple color animations are also affected, so we must have had this bug for a long time 😄

from pybricks.hubs import PrimeHub
from pybricks.parameters import Color
from pybricks.tools import wait

hub = PrimeHub()

while True:
    hub.light.animate([Color.RED, Color.GREEN, Color.NONE], interval=10)
    wait(10)

@laurensvalk
Copy link
Member

laurensvalk commented Nov 22, 2023

This also appears to be another case where the MicroPython system abort is not working, so maybe we should revert that one at some point too.

This is only partially true. Running the color animation is escapable with the old system abort (3.3b6) but the matrix animation isn't. This was already broken in Pybricks 3.2.

It could also be another case where running the pbio event loop happens while renewing the memory for the animation in common_LightMatrix_animate.

@laurensvalk
Copy link
Member

This is still happening today. Both the light matrix animation and the basic color animation above.

pbio does keep running, so this is more likely the SystemExit not working.

Shutdown is also triggered as evidenced by the Bluetooth light turning off. But the main task keeps running until the program exists, which doesn't happen.

@laurensvalk
Copy link
Member

laurensvalk commented Oct 2, 2024

Some findings:

from pybricks.hubs import PrimeHub
from pybricks.parameters import Color
from pybricks.tools import wait, StopWatch

hub = PrimeHub()

while True:
    hub.light.animate([Color.RED, Color.GREEN, Color.NONE], interval=10)

    # Variant 1: Exit not handled when pressing button. Have to pull battery.
    wait(10)
from pybricks.hubs import PrimeHub
from pybricks.parameters import Color
from pybricks.tools import wait, StopWatch

hub = PrimeHub()

timer = StopWatch()

while True:
    hub.light.animate([Color.RED, Color.GREEN, Color.NONE], interval=10)

    # Variant 2: Allows exit.
    timer.reset()
    while timer.time() < 10:
        pass

So it seems to get stuck in the mp_hal_delay?

Could it be related to this?

void pb_event_poll_hook_leave(void) {
    // There is a possible race condition where an interrupt occurs and sets the
    // Contiki poll_requested flag after all events have been processed. So we
    // have a critical section where we disable interrupts and check see if there
    // are any last second events. If not, we can call __WFI(), which still wakes
    // up the CPU on interrupt even though interrupts are otherwise disabled.
    mp_uint_t state = disable_irq();
    if (!process_nevents()) {
        __WFI();
    }
    enable_irq(state);
}

And strange things are happening with events here if you try to exit with a button Edit. Unrelated. That part is due to #1863

@dlech
Copy link
Member

dlech commented Oct 2, 2024

So it seems to get stuck in the mp_hal_delay?

mp_hal_delay_ms calls MICROPY_EVENT_POLL_HOOK which checks for pending exceptions, so it doesn't seem like it could be getting stuck.

Could it be related to this?

We should be getting an interrupt at least every 1 ms so this should not be long-blocking (__WFI() still works when interrupts are disabled).

@BertLindeman
Copy link

Laurens,

similar to your example above in remark where you add '1' to the wait.

This also helps this program:

from pybricks.hubs import PrimeHub
from pybricks.parameters import Color
from pybricks.tools import wait, StopWatch

hub = PrimeHub()

while True:
    hub.light.animate([Color.RED, Color.GREEN, Color.NONE], interval=10)

    # Variant 1: Exit not handled when pressing button. Have to pull battery.
    wait(10 + 1)  # allows stop - no need to drop the battery

So the wait (or some other task?) should be longer than the interval.

@laurensvalk
Copy link
Member

laurensvalk commented Oct 2, 2024

Yes, interval+1 is safe. See #1295 (comment). The problem mainly occurs with equal delay.

mp_hal_delay_ms calls mp_hal_delay_ms

Recursive typo?

@laurensvalk
Copy link
Member

laurensvalk commented Oct 2, 2024

So it seems to get stuck in the mp_hal_delay?

mp_hal_delay_ms calls mp_hal_delay_ms which checks for pending exceptions, so it doesn't seem like it could be getting stuck.

Could it be related to this?

We should be getting an interrupt at least every 1 ms so this should not be long-blocking (__WFI() still works when interrupts are disabled).

I'm not sure, but the reason for posting that was this experiment. This gets stuck at 'B':

from pybricks.hubs import PrimeHub
from pybricks.parameters import Color
from pybricks.tools import wait

hub = PrimeHub()

while True:
    hub.display.char('A')
    hub.light.animate([Color.RED, Color.GREEN, Color.NONE], interval=10)
    hub.display.char('B')
    wait(10)
    hub.display.char('C')

@dlech
Copy link
Member

dlech commented Oct 2, 2024

Recursive typo?

something like that, I fixed the comment to say MICROPY_EVENT_POLL_HOOK

@laurensvalk laurensvalk self-assigned this Oct 3, 2024
@laurensvalk
Copy link
Member

laurensvalk commented Oct 3, 2024

It looks like the problem is as follows:

void pbio_light_animation_stop(pbio_light_animation_t *animation) {
     (...)
            process_exit(&pbio_light_animation_process);
     (...)
}

void pbio_color_light_start_animation(pbio_color_light_t *light, uint16_t interval, const pbio_color_compressed_hsv_t *cells) {
    pbio_color_light_stop_animation(light);
    (...)
    pbio_light_animation_start(&light->animation);
}

pbio_color_light_start_animation will call stop, resulting in calling process_exit. But this does not stop synchronously, it just posts an exit event to be handled later. However, pbio_light_animation_start is synchronously starting the process. So we may be starting it and then after that schedule stopping immediately.

I'm not sure how that would lead to MicroPython not exiting, but this race condition does seem like the plausible origin of the issue.

This hypothesis seems to be confirmed as follows.

EDIT: The exit is not the issue. If the light matrix animation keeps spinning then the issue with the light pattern is separate and the process never even has to stop.

To following still works, so now to figure out why: If we have it stop and then handle all pending events before starting the new pattern, it works fine. This patch also works with the light matrix.


+extern void pbio_color_light_stop_animation(pbio_color_light_t *light);
+
 // pybricks._common.ColorLight.animate
 static mp_obj_t common_ColorLight_internal_animate(size_t n_args, const mp_obj_t *pos_args, mp_map_t *kw_args) {
     PB_PARSE_ARGS_METHOD(n_args, pos_args, kw_args,
@@ -98,14 +102,18 @@ static mp_obj_t common_ColorLight_internal_animate(size_t n_args, const mp_obj_t
         PB_ARG_REQUIRED(colors),
         PB_ARG_REQUIRED(interval));
 
+    pbio_color_light_stop_animation(self->light);
+    MICROPY_EVENT_POLL_HOOK;
+
     mp_int_t colors_len = mp_obj_get_int(mp_obj_len(colors_in));
 
     size_t cells_size = sizeof(pbio_color_compressed_hsv_t) * (colors_len + 1);

Instead of hacking it into the module, we can probably fix it in pbio by reworking the process logic a bit.

@laurensvalk
Copy link
Member

laurensvalk commented Oct 3, 2024

Another finding. Can reproduce with a 100ms interval which makes debugging easier.

When it gets stuck, the light turns violet (or any mix of the animation colors). So something is pushing it through the frames at a tight loop instead of respecting the interval.

from pybricks.hubs import PrimeHub
from pybricks.parameters import Color
from pybricks.tools import wait

hub = PrimeHub()

while True:
    hub.light.animate([Color.RED, Color.BLUE], interval=100) # When stuck, turns violet.
    wait(100)

This still happens if the animation interval is 100 and the wait time is 1000.

@laurensvalk
Copy link
Member

laurensvalk commented Oct 3, 2024

It occurred to me that for most etimer loops in pbio we have:

PROCESS_WAIT_EVENT_UNTIL(ev == PROCESS_EVENT_TIMER && etimer_expired(&timer));

We don't have that in the light animation process since we don't have a direct reference to the timer. But we can do:

        PROCESS_WAIT_EVENT_UNTIL(ev == PROCESS_EVENT_TIMER);
        struct etimer *timer = data;
        pbio_light_animation_t *animation = PBIO_CONTAINER_OF(timer, pbio_light_animation_t, timer);
+       if (!etimer_expired(&animation->timer)) {
+           continue;
+       }

And this also fixes the issue! That does still beg the question what was firing timer events in the first place?

Perhaps put differently @dlech, can you remind me why do we have it as (ev == PROCESS_EVENT_TIMER && etimer_expired(&timer) everywhere else? Aren't timer events always only posted to that one process in which the timer was set?

I suppose what could have been happening is that the timer event was fired, but then the timer is stopped or reset before it is handled, so we have to check for expiration again.

@dlech
Copy link
Member

dlech commented Oct 3, 2024

Aren't timer events always only posted to that one process in which the timer was set?

Yes, but there could be multiple timers for that process. So it is always best to check that the timer is expired too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working hub: primehub/inventorhub Issues related to the LEGO SPIKE Prime hub and LEGO MINDSTORMS Robot Invetor hub software: pybricks-micropython Issues with Pybricks MicroPython firmware (or EV3 runtime)
Projects
None yet
4 participants