Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when USB connection is terminated #393

Closed
krichardsson opened this issue Feb 18, 2019 · 21 comments
Closed

Crash when USB connection is terminated #393

krichardsson opened this issue Feb 18, 2019 · 21 comments
Milestone

Comments

@krichardsson
Copy link
Contributor

From issue bitcraze/crazyflie-lib-python#110

How to reproduce:

  1. Start a Roadrunner or Crazyflie
  2. Connect via USB to a computer
  3. Make sure no Crazradio is connected to the computer
  4. Run the https://github.com/bitcraze/crazyflie-lib-python/blob/0.1.7/examples/basiclog.py script and terminate it with CTRL+C. Make sure the script connects via the USB cable (some thing like Connecting to usb://0 in terminal)

Symptoms:

  • The Roadrunner/Crazyflie crashes and restarts
@AlexisTM
Copy link

This is what happens (wireshark) when the device is disconnected.

  • The roadrunner has the ID 1.14.1
  • basiclog.py is stopped (Ctrl+Z)
  • The roadrunner has the ID 1.15.0

pcap file: https://pcap.honeynet.org.my/v1/submission/view.php?id=1111.php

image

@AlexisTM
Copy link

The "crash" comes from the RCC_FLAG_IWDGRST watchdog.

By commenting on the following line, we can experience that the roadrunner suspends when the USB connection suspends; Yet the roadrunner recovers once we start the connection again.

@krichardsson
Copy link
Contributor Author

Yes, the restart is caused by the watchdog. You can turn it off by compiling with DEBUG=1 if you like.

I did a quick test this morning with a DEBUG compiled firmware and it seemed to hand instead of restarting. I did not have time to dig any more into this though.

Interesting that it seems to recover for you.

@whoenig
Copy link
Contributor

whoenig commented Feb 19, 2019

I have seen the same issue with crazyflie_ros/crazyflie_cpp. I did debug the problem and found that it is caused by the logging task keep filling up the message queue with new data even though USB is disconnected (I think one can also reproduce by pulling the cable). Unfortunately, I was not able to figure out why this causes a crash (the queue has a finite size), and how to fix it.

For a CF 2.0, I have not seen recovery after the issue (a reboot is required).

@AlexisTM
Copy link

@whoenig To have a recovery of the CF2.0, I disabled the IWDGRST watchdog (see above). This could also be done using DEBUG=1.

Are you sure it's due to overflow? I experience the CF2.0 to "wait" for the USB connection to come back. I feel it is more like a feature of STM32 for the USB driver.

@whoenig
Copy link
Contributor

whoenig commented Mar 7, 2019

I debugged this more today and found a much simpler reproducible. It's essentially enough to just query for the platformVersion once. There will be a response, but afterwards the heart beat LED is not blinking anymore (DEBUG build, I guess in release the watchdog would kick in). The CF will still respond to other messages, but all other tasks seem to be suspended until messages over USB are exchanged. A debugger just shows that time is spent in various USB related interrupts.

So far, I could not find the underlying issue (and I don't know much about the STM32 USB OTG driver that is being used). However, this shows that this is not necessarily a logging related issue.

@krichardsson
Copy link
Contributor Author

Great! It is good to have a quick way to trigger the problem.

We suspect that it is related to task levels or interrupts taking too long but we have not investigated it further yet. The watchdog is reset on idle and if there is no idle time it will not be reset.

@whoenig
Copy link
Contributor

whoenig commented Mar 7, 2019

Is there an easy way/documentation on how to debug such issues? The task priority for the CRTP tasks is lower compared to some of the other tasks. I am not sure how to profile time spent in interrupts.

@krichardsson
Copy link
Contributor Author

@whoenig unfortunately not. There is some support in FreeRTOS to monitor tasks but nothing for interrupts. Instrumenting the code is maybe the way forward?

@AlexisTM
Copy link

AlexisTM commented Mar 8, 2019

Related issue?: https://www.freertos.org/FreeRTOS_Support_Forum_Archive/November_2015/freertos_After_Running_USB_OTG_Code_FreeRTOS_hangs_WDT_Reboot_306ca43bj.html

Following the proposed solution:

All I do is call this: _getPRIMASK(); before the initialization of the USB interface. Can anyone can explain how that fixes the problem?

@tobbeanton
Copy link
Member

I think this is a ST Library issue. Found this in st forum. I will try the fix and get back.

@tobbeanton
Copy link
Member

Tried the fix and now the it is not able to connect USB at all = FAIL

@whoenig
Copy link
Contributor

whoenig commented Mar 15, 2019

The forum post and the USB device lib used in the crazyflie firmware are both pretty old. Do you think it would help to upgrade to the latest STM libraries? I assume this is a bigger change, since there might be a bunch of dependencies that need to be resolved.

@tobbeanton
Copy link
Member

@whoenig Yes I think updating the libraries is the next thing to try. @krichardsson did a try last Friday but apparently there are quite some things to be resolved so didn't get all the way. We will try again as soon as we have time.

@krichardsson
Copy link
Contributor Author

krichardsson commented Apr 25, 2019

It seems as the problem occurs when the Crazyflie tries to send data over USB when the USB connection is terminated. The USB library continues to request data (from an interrupt) after the USB connection is terminated and this seems to cause some form of infinite loop in the interrupt code (implemented in the USB lib). I think the CF code is OK and that the problem lies in the USB lib.

@whoenig, I have tried to upgrade to V2.2.1 / 17-March-2018 of the USB library (https://www.st.com/content/st_com/en/products/embedded-software/mcu-mpu-embedded-software/stm32-embedded-software/stm32-standard-peripheral-library-expansion/stsw-stm32046.html) but have not yet managed to get it to work.

@whoenig
Copy link
Contributor

whoenig commented Apr 26, 2019

Thanks for the update Kristoffer! What is the issue when you try to upgrade the USB library? Does it compile and simply not run, or are there issues with compiling because they changed their API in the meantime?

@krichardsson
Copy link
Contributor Author

It did compile, the API/lib is fairly similar, the problem was that I not could get normal operation to work correctly. It worked as expected for a few packets (around 5 if I remember correctly) but after that some packets were "stuck" and interleaved with garbage. I got the feeling that it might be related to some form of buffer overrun in a circular buffer or similar, maybe cased by miss-configuration?

@evoggy spent some more time on this yesterday, not sure of the outcome yet.

@evoggy
Copy link
Member

evoggy commented May 2, 2019

I think I've found a fix for this, but there's still some debugging to be done. I'll write more details once it's cleaned up and pushed.

@evoggy
Copy link
Member

evoggy commented May 3, 2019

This should be fixed now, but please test it out a bit and re-open the issue if there's still problems.

The main issue was the fifo empty interrupt flag not being cleared correctly by the ST lib. Previously the flag would be cleared once the transfer is completed. So you would send data tough the USB, the data would be moved to the USB peripheral and the ISR would constantly run until the host had fetched the data. Since this is host dependent the time spent here may vary and we've seen some long latency propagate though the system when running logging on USB.

As a side-effect of this if the cable is pulled or the host program terminated while we're waiting to the transfer to complete the ISR would continue to run which would have nasty side-effects like queues filing up and the SYSLINK assert mentioned above being hit and the watchdog triggering.

The fix for this is detailed here (https://community.st.com/s/question/0D50X00009XkbWBSAZ/bug-in-stm32usbotgdriver-causes-100-cpu-in-fifo-empty-interrupts). Instead of waiting until the transfer is done the fifo empty interrupt is switched off after the data is sent to the USB peripheral.

But even with fixed the above there's some congestion in the system since logging continues to run, lots of data has been moved to the USB peripheral that will not be sent and so on. So together with clearing out the queues for sending data, flushing the TX in the USB peripheral and stopping the logging by fixing #428 the connection performance has been improved and is now very stable on connect/disconnect.

@evoggy evoggy closed this as completed May 3, 2019
@krichardsson
Copy link
Contributor Author

USB connection in the python client does not seem to work on OSX after this change.

Not 100% sure but it seems likely that it is related to this issue

@krichardsson krichardsson reopened this May 16, 2019
@evoggy
Copy link
Member

evoggy commented Jun 5, 2019

The fix for this worked well on Linux, but not at all on Mac OSX or Windows. Looking into this again to see if we can fix the fix...

@evoggy evoggy closed this as completed in 1becef7 Jun 5, 2019
ataffanel added a commit that referenced this issue Jun 5, 2019
@krichardsson krichardsson added this to the next-release milestone Aug 13, 2019
cafeciaojoe pushed a commit to cafeciaojoe/crazyflie-firmware that referenced this issue Sep 27, 2024
cafeciaojoe pushed a commit to cafeciaojoe/crazyflie-firmware that referenced this issue Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants