Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of mTestIndex in chip-tool tests is racy #7493

Closed
bzbarsky-apple opened this issue Jun 9, 2021 · 4 comments · Fixed by #7494
Closed

Handling of mTestIndex in chip-tool tests is racy #7493

bzbarsky-apple opened this issue Jun 9, 2021 · 4 comments · Fixed by #7494

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

https://github.com/cecille/connectedhomeip/runs/2785367140?check_suite_focus=true shows a test failure: We read the on/off attribute value after we should have sent an "off" command and get 1, not 0. The log shows this sequence of command executions:

2021-06-09T15:55:30.8680260Z [1623254130842] [0x74c4] CHIP: [TOO] On/Off - Check on/off attribute value is false when starting: Sending command...
2021-06-09T15:55:30.8700580Z [1623254130843] [0x74c7] CHIP: [TOO] On/Off - Check on/off attribute value is false when starting: Sending command...
2021-06-09T15:55:30.8728420Z [1623254130843] [0x74c7] CHIP: [TOO] On/Off - Send On Command: Sending command...
2021-06-09T15:55:30.8777500Z [1623254130843] [0x74c7] CHIP: [TOO] On/Off - Check on/off attribute value is true after on command: Sending command...
2021-06-09T15:55:30.8801290Z [1623254130846] [0x74c7] CHIP: [TOO] On/Off - Check on/off attribute value is false after off command: Sending command...

Compare that to a passing run:

2021-06-09T15:55:30.5481560Z [1623254130526] [0x749c] CHIP: [TOO] On/Off - Check on/off attribute value is false when starting: Sending command...
2021-06-09T15:55:30.5504230Z [1623254130526] [0x749f] CHIP: [TOO] On/Off - Send On Command: Sending command...
2021-06-09T15:55:30.5554260Z [1623254130527] [0x749f] CHIP: [TOO] On/Off - Check on/off attribute value is true after on command: Sending command...
2021-06-09T15:55:30.5577860Z [1623254130527] [0x749f] CHIP: [TOO] On/Off - Send Off Command: Sending command...
2021-06-09T15:55:30.5629300Z [1623254130527] [0x749f] CHIP: [TOO] On/Off - Check on/off attribute value is false after off command: Sending command...

My best guess is that what happens is this:

  1. The initial RunNextTest happens on the chip-tool main thread. It calls TestSendClusterOnOffCommandReadAttribute_0 and then loses the timeslice before incrementing mTestIndex.
  2. The read response comes in on the CHIP message thread, calls NextTest(). Since mTestIndex is still 0, this runs the initial read-attribute test again and increments mTestIndex to 1.
  3. The second read response comes in on the message thread, triggers running of TestSendClusterOnOffCommandOn_1 and increments mTestIndex to 2.
  4. The status response to the On command comes in on the message thread, triggers running of TestSendClusterOnOffCommandReadAttribute_2 (the "Check on/off attribute value is true after on command") bit. Increments mTestIndex to 3.
  5. The main thread finally wakes up again, and does its increment of mTestIndex that never happened. mTestIndex is now 4.
  6. The read response comes in on the CHIP message thread, and we now run TestSendClusterOnOffCommandReadAttribute_4 (which is "Check on/off attribute value is false after off"), but we never did the part of the test that should send the Off command, so we fail.

Proposed Solution

#7478 would presumably fix this.

If we want a shorter-term fix, need to figure out how to ensure that mTestIndex gets incremented before we start sending packets for that test, not after. Probably just using std::atomic_uint16_t would be enough.

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Jun 9, 2021
We could end up sending a message and getting a response to it before
we ever incremented mTestIndex (if our call into NextTest() was on a
thread other than the message thread).  If that happened, we would end
up running some subtest twice, and then later whenever we
incrememented mTestIndex would end up skipping some subtest.

Fixes project-chip#7493
bzbarsky-apple added a commit that referenced this issue Jun 9, 2021
We could end up sending a message and getting a response to it before
we ever incremented mTestIndex (if our call into NextTest() was on a
thread other than the message thread).  If that happened, we would end
up running some subtest twice, and then later whenever we
incrememented mTestIndex would end up skipping some subtest.

Fixes #7493
@mrjerryjohns
Copy link
Contributor

#7478 should fix it, since:

  1. The platform lock is held before NextTest() is called from the application thread.
  2. The same lock is also grabbed on response call-flow, which runs on the chip thread.

This mutual exclusion ensures safety.

Probably just using std::atomic_uint16_t would be enough.

It might be enough to protect mTestIndex, but it's not enough to protect all the other accesses to the stack that are being made from the application thread through the NextTest() call.

@bzbarsky-apple
Copy link
Contributor Author

It might be enough to protect mTestIndex, but it's not enough to protect all the other accesses to the stack

Yes, I agree. I absolutely think we should fix #7478 and then can probably make mTestIndex non-atomic. In the meantime we were definitely getting random failures from the race on mTestIndex specifically...

@mrjerryjohns
Copy link
Contributor

As part of fixing #7478, can I revert the change in #7494?

@bzbarsky-apple
Copy link
Contributor Author

Yes.

nikita-s-wrk pushed a commit to nikita-s-wrk/connectedhomeip that referenced this issue Sep 23, 2021
We could end up sending a message and getting a response to it before
we ever incremented mTestIndex (if our call into NextTest() was on a
thread other than the message thread).  If that happened, we would end
up running some subtest twice, and then later whenever we
incrememented mTestIndex would end up skipping some subtest.

Fixes project-chip#7493
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants