STM32 I2CSlave race condition causing timeouts #15498

Closed
agausmann opened this issue Mar 12, 2024 · 0 comments · Fixed by #15499

Description of defect

When using I2CSlave on STM32 targets, read or write can time out and return an error even though the transfer succeeded.
This can cause significant stalls (0.3 sec or more) in the thread that called the I2CSlave read or write.

The cause appears to be an ABA problem: the driver uses the same flag value for two distinct states, "addressed, transfer pending" and "transfer in progress", so it cannot distinguish between them. If the hardware transitions from "in progress" to "idle" to "addressed" quickly enough that the driver never observes the idle state, the driver never realizes that the transfer completed, and it times out with an error.

Example execution flow in the write case:

  • i2c_slave_write called by Thread X
    • Enters loop waiting for pending_slave_tx_master_rx to be cleared
  • Transfer completes, HAL_I2C_SlaveTxCpltCallback called from ISR
    • clears pending_slave_tx_master_rx
  • [Some other thread may be executed first, delaying the return to Thread X]
  • Master addresses the slave again, HAL_I2C_AddrCallback gets called from ISR
    • sets pending_slave_tx_master_rx
  • When Thread X resumes, pending_slave_tx_master_rx is set, and is not cleared until timeout
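
The flow above can be reproduced in a small host-side model. This is only a sketch, not the driver code: the struct, the `model_` functions, and the bounded poll count are hypothetical stand-ins, and only the flag name `pending_slave_tx_master_rx` and the callback roles come from the description above. The two callbacks are called directly, in the order the ISRs would fire while Thread X is delayed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical model of the driver state. Only the flag name
 * pending_slave_tx_master_rx is taken from the issue text. */
typedef struct {
    volatile bool pending_slave_tx_master_rx;
} model_i2c_t;

/* Stands in for HAL_I2C_AddrCallback: the master has addressed the slave. */
static void model_addr_callback(model_i2c_t *obj)
{
    assert(obj != NULL);
    obj->pending_slave_tx_master_rx = true;
}

/* Stands in for HAL_I2C_SlaveTxCpltCallback: the transfer has finished. */
static void model_tx_complete_callback(model_i2c_t *obj)
{
    assert(obj != NULL);
    obj->pending_slave_tx_master_rx = false;
}

/* Stands in for the wait loop in i2c_slave_write. A bounded poll count
 * replaces the real timeout. Returns 0 on success, 1 on timeout. */
static int model_wait_for_tx(const model_i2c_t *obj, int max_polls)
{
    assert(obj != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (!obj->pending_slave_tx_master_rx) {
            return 0;
        }
    }
    return 1;
}

/* Thread X is delayed until after both callbacks have run. */
int model_race_scenario(void)
{
    model_i2c_t obj = { .pending_slave_tx_master_rx = true }; /* addressed, write started */
    model_tx_complete_callback(&obj);     /* flag cleared: transfer done */
    model_addr_callback(&obj);            /* flag set again: next request */
    return model_wait_for_tx(&obj, 1000); /* the cleared state was never observed */
}

/* Thread X resumes before the master addresses the slave again. */
int model_normal_scenario(void)
{
    model_i2c_t obj = { .pending_slave_tx_master_rx = true };
    model_tx_complete_callback(&obj);
    return model_wait_for_tx(&obj, 1000);
}
```

In this model, `model_race_scenario()` reports a timeout even though the transfer completed, while `model_normal_scenario()` succeeds. The only difference between the two is whether the second address event arrives before the waiting thread gets to look at the flag.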

This seems to be fixed by using separate flags for "addressed" and "transfer in progress" states. I will be creating a PR in a moment to demonstrate this.

Target(s) affected by this defect ?

STM

Toolchain(s) (name and version) displaying this defect ?

GCC_ARM

What version of Mbed-os are you using (tag or sha) ?

baf6a30

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

  • mbed-tools 7.59.0
  • arm-none-eabi-gcc (GNU Arm Embedded Toolchain 10.3-2021.10) 10.3.1 20210824 (release)

How is this defect reproduced ?

https://github.com/agausmann/mbed-i2c-stall-repro

This has master and slave combined in one device firmware. I've also encountered the problem with separate devices, but this is the easiest way to demonstrate it.

This example logs to the ST-Link console (9600 baud) each time the slave receives a transfer, in the format <returncode> <payload>.

If returncode is 1, then the I2C slave HAL timed out inside the loop. To confirm this, enable DEBUG_STDIO in the file TARGET_STM/i2c_api.c; it will then print "TIMEOUT or error in i2c_slave_read". Before it prints 1, you will also see a noticeable pause in the output.

[asciicast recording of the console output]

The expected behavior is all 0 (success) return codes, and a more consistent output rate in the console with no significant pauses.

agausmann added a commit to agausmann/mbed-os that referenced this issue Mar 28, 2024
Fixes ARMmbed#15498

Adds 2 boolean flags to the STM32 `i2c_s` object
to indicate whether a transfer is in progress,
separate from the existing "transfer pending" flags.

`i2c_slave_write`, `i2c_slave_read` and their associated callbacks
are modified to use these flags in addition to the pending flags.
The original behavior of the pending flags is preserved.
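
A minimal sketch of the idea in this commit message, as a host-side model rather than the actual patch: the flag name `tx_in_progress`, the struct, and all `fixed_` functions are hypothetical, and only the write direction is modeled (the commit adds one flag each for read and write). The wait loop polls the separate in-progress flag, which the address callback never touches, so a new address event can no longer hide a completed transfer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the pending flag keeps its original meaning, and a
 * separate flag tracks whether a write is still in flight. */
typedef struct {
    volatile bool pending_slave_tx_master_rx; /* "addressed, transfer pending" */
    volatile bool tx_in_progress;             /* hypothetical name for the new flag */
} fixed_i2c_t;

/* Stands in for HAL_I2C_AddrCallback: only the pending flag is set. */
static void fixed_addr_callback(fixed_i2c_t *obj)
{
    assert(obj != NULL);
    obj->pending_slave_tx_master_rx = true;
}

/* Stands in for HAL_I2C_SlaveTxCpltCallback: clears both flags. */
static void fixed_tx_complete_callback(fixed_i2c_t *obj)
{
    assert(obj != NULL);
    obj->pending_slave_tx_master_rx = false;
    obj->tx_in_progress = false;
}

/* Stands in for the start of i2c_slave_write. */
static void fixed_write_begin(fixed_i2c_t *obj)
{
    assert(obj != NULL);
    obj->tx_in_progress = true;
}

/* Wait loop that polls the in-progress flag instead of the pending flag.
 * Returns 0 on success, 1 on timeout (bounded polls replace a real timer). */
static int fixed_wait_for_tx(const fixed_i2c_t *obj, int max_polls)
{
    assert(obj != NULL);
    for (int i = 0; i < max_polls; i++) {
        if (!obj->tx_in_progress) {
            return 0;
        }
    }
    return 1;
}

/* Same interleaving as the failing case: completion, then a new address
 * event, both before the waiting thread resumes. */
int fixed_race_scenario(bool *pending_after)
{
    fixed_i2c_t obj = { .pending_slave_tx_master_rx = true, .tx_in_progress = false };
    fixed_write_begin(&obj);
    fixed_tx_complete_callback(&obj);
    fixed_addr_callback(&obj);
    int rc = fixed_wait_for_tx(&obj, 1000);
    if (pending_after != NULL) {
        *pending_after = obj.pending_slave_tx_master_rx;
    }
    return rc;
}
```

In this model the wait returns success, and the pending flag is still set afterwards, so the next request from the master is not lost. That matches the statement that the original behavior of the pending flags is preserved.
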
multiplemonomials pushed a commit to mbed-ce/mbed-os that referenced this issue Jul 20, 2024
multiplemonomials added a commit to mbed-ce/mbed-os that referenced this issue Jul 21, 2024