Handling of mTestIndex in chip-tool tests is racy #7493

bzbarsky-apple · 2021-06-09T17:00:03Z

Problem

https://github.com/cecille/connectedhomeip/runs/2785367140?check_suite_focus=true shows a test failure: We read the on/off attribute value after we should have sent an "off" command and get 1, not 0. The log shows this sequence of command executions:

2021-06-09T15:55:30.8680260Z [1623254130842] [0x74c4] CHIP: [TOO] On/Off - Check on/off attribute value is false when starting: Sending command...
2021-06-09T15:55:30.8700580Z [1623254130843] [0x74c7] CHIP: [TOO] On/Off - Check on/off attribute value is false when starting: Sending command...
2021-06-09T15:55:30.8728420Z [1623254130843] [0x74c7] CHIP: [TOO] On/Off - Send On Command: Sending command...
2021-06-09T15:55:30.8777500Z [1623254130843] [0x74c7] CHIP: [TOO] On/Off - Check on/off attribute value is true after on command: Sending command...
2021-06-09T15:55:30.8801290Z [1623254130846] [0x74c7] CHIP: [TOO] On/Off - Check on/off attribute value is false after off command: Sending command...

Compare that to a passing run:

2021-06-09T15:55:30.5481560Z [1623254130526] [0x749c] CHIP: [TOO] On/Off - Check on/off attribute value is false when starting: Sending command...
2021-06-09T15:55:30.5504230Z [1623254130526] [0x749f] CHIP: [TOO] On/Off - Send On Command: Sending command...
2021-06-09T15:55:30.5554260Z [1623254130527] [0x749f] CHIP: [TOO] On/Off - Check on/off attribute value is true after on command: Sending command...
2021-06-09T15:55:30.5577860Z [1623254130527] [0x749f] CHIP: [TOO] On/Off - Send Off Command: Sending command...
2021-06-09T15:55:30.5629300Z [1623254130527] [0x749f] CHIP: [TOO] On/Off - Check on/off attribute value is false after off command: Sending command...

My best guess is that what happens is this:

The initial RunNextTest happens on the chip-tool main thread. It calls TestSendClusterOnOffCommandReadAttribute_0 and then loses the timeslice before incrementing mTestIndex.
The read response comes in on the CHIP message thread, calls NextTest(). Since mTestIndex is still 0, this runs the initial read-attribute test again and increments mTestIndex to 1.
The second read response comes in on the message thread, triggers running of TestSendClusterOnOffCommandOn_1 and increments mTestIndex to 2.
The status response to the On command comes in on the message thread, triggers running of TestSendClusterOnOffCommandReadAttribute_2 (the "Check on/off attribute value is true after on command") bit. Increments mTestIndex to 3.
The main thread finally wakes up again, and does its increment of mTestIndex that never happened. mTestIndex is now 4.
The read response comes in on the CHIP message thread, and we now run TestSendClusterOnOffCommandReadAttribute_4 (which is "Check on/off attribute value is false after off"), but we never did the part of the test that should send the Off command, so we fail.

Proposed Solution

#7478 would presumably fix this.

If we want a shorter-term fix, need to figure out how to ensure that mTestIndex gets incremented before we start sending packets for that test, not after. Probably just using std::atomic_uint16_t would be enough.

The text was updated successfully, but these errors were encountered:

We could end up sending a message and getting a response to it before we ever incremented mTestIndex (if our call into NextTest() was on a thread other than the message thread). If that happened, we would end up running some subtest twice, and then later whenever we incrememented mTestIndex would end up skipping some subtest. Fixes project-chip#7493

We could end up sending a message and getting a response to it before we ever incremented mTestIndex (if our call into NextTest() was on a thread other than the message thread). If that happened, we would end up running some subtest twice, and then later whenever we incrememented mTestIndex would end up skipping some subtest. Fixes #7493

mrjerryjohns · 2021-06-09T21:34:21Z

#7478 should fix it, since:

The platform lock is held before NextTest() is called from the application thread.
The same lock is also grabbed on response call-flow, which runs on the chip thread.

This mutual exclusion ensures safety.

Probably just using std::atomic_uint16_t would be enough.

It might be enough to protect mTestIndex, but it's not enough to protect all the other accesses to the stack that are being made from the application thread through the NextTest() call.

bzbarsky-apple · 2021-06-09T21:53:09Z

It might be enough to protect mTestIndex, but it's not enough to protect all the other accesses to the stack

Yes, I agree. I absolutely think we should fix #7478 and then can probably make mTestIndex non-atomic. In the meantime we were definitely getting random failures from the race on mTestIndex specifically...

mrjerryjohns · 2021-06-10T04:45:07Z

As part of fixing #7478, can I revert the change in #7494?

bzbarsky-apple · 2021-06-10T13:10:44Z

Yes.

We could end up sending a message and getting a response to it before we ever incremented mTestIndex (if our call into NextTest() was on a thread other than the message thread). If that happened, we would end up running some subtest twice, and then later whenever we incrememented mTestIndex would end up skipping some subtest. Fixes project-chip#7493

bzbarsky-apple mentioned this issue Jun 9, 2021

Fix data race on mTestIndex. #7494

Merged

bzbarsky-apple closed this as completed in #7494 Jun 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handling of mTestIndex in chip-tool tests is racy #7493

Handling of mTestIndex in chip-tool tests is racy #7493

bzbarsky-apple commented Jun 9, 2021

mrjerryjohns commented Jun 9, 2021

bzbarsky-apple commented Jun 9, 2021

mrjerryjohns commented Jun 10, 2021

bzbarsky-apple commented Jun 10, 2021

Handling of mTestIndex in chip-tool tests is racy #7493

Handling of mTestIndex in chip-tool tests is racy #7493

Comments

bzbarsky-apple commented Jun 9, 2021

Problem

Proposed Solution

mrjerryjohns commented Jun 9, 2021

bzbarsky-apple commented Jun 9, 2021

mrjerryjohns commented Jun 10, 2021

bzbarsky-apple commented Jun 10, 2021