Race condition in "select" test #707

jphickey · 2020-12-28T17:56:44Z

Describe the bug
Running the OSAL select test, I ran into a deadlock situation where the "multi" test got stuck and never finished.

To Reproduce
Hit or miss... Run test repeatedly on a system with other loads (e.g. parallel builds)

Expected behavior
Test should complete

Code snips
Checking the test status and backtrace, it looks like two tasks (main and "Server_Fn") are each waiting on a binary sem. In particular, Server_Fn is stuck here:

osal/src/tests/select-test/select-test.c

Line 162 in d698a4d

status = OS_BinSemTake(bin_sem_id);

While the main task is waiting in the teardown code (the TestSelectMultipleRead has completed, and it has invoked Teardown_Multi which in turn invokes Teardown_Single here):

osal/src/tests/select-test/select-test.c

Line 273 in d698a4d

status = OS_BinSemTake(bin_sem_id2);

System observed on:
Ubuntu 20.04

Additional context
This is likely related to the use of OS_BinSemFlush. We should probably deprecate this function, as I cannot see how it can ever be used safely without introducing a race condition. VxWorks offers it, which (I think) is why OSAL also offers it, but it's a fundamentally broken concept.

Looking at the traceback in gdb, I can confirm that flush_count is indeed already 1, meaning the flush had already happened by the time Server_Fn entered the bin sem take routine.
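To illustrate why a flush that arrives before the waiter is lost, here is a minimal sketch using plain pthreads. It is a simplified model, not the OSAL source: the type `model_binsem_t` and all `model_*` / `demo_*` names are hypothetical, and the take uses a timeout only so the demonstration cannot hang. The taker snapshots `flush_count` on entry, so a flush that already happened looks the same as no flush at all, whereas a "give" latches the value and is still visible to a later take.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Simplified model of a binary semaphore with a "flush" operation.
 * NOT the OSAL implementation; all names here are illustrative. */
typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int             value;       /* 0 = empty, 1 = available */
    unsigned        flush_count; /* incremented by every flush */
} model_binsem_t;

static void model_init(model_binsem_t *s)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cv, NULL);
    s->value       = 0;
    s->flush_count = 0;
}

/* Flush: wake whoever is waiting right now; the value stays unchanged. */
static void model_flush(model_binsem_t *s)
{
    pthread_mutex_lock(&s->lock);
    ++s->flush_count;
    pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->lock);
}

/* Give: latch the semaphore so that a later take still succeeds. */
static void model_give(model_binsem_t *s)
{
    pthread_mutex_lock(&s->lock);
    s->value = 1;
    pthread_cond_signal(&s->cv);
    pthread_mutex_unlock(&s->lock);
}

/* Timed take. Returns 0 if released by a give or a flush, ETIMEDOUT otherwise. */
static int model_timed_take(model_binsem_t *s, long timeout_ms)
{
    struct timespec deadline;
    unsigned        seen;
    int             rc = 0;

    assert(timeout_ms >= 0);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&s->lock);
    seen = s->flush_count; /* a flush that already happened is invisible from here on */
    while (s->value == 0 && s->flush_count == seen && rc == 0)
    {
        rc = pthread_cond_timedwait(&s->cv, &s->lock, &deadline);
    }
    if (s->value != 0)
    {
        s->value = 0; /* consumed a give */
        rc       = 0;
    }
    else if (s->flush_count != seen)
    {
        rc = 0; /* released by a flush that arrived while waiting */
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}

/* Flush first, take second: the wake-up is lost and the take times out. */
int demo_flush_before_take(void)
{
    model_binsem_t s;
    model_init(&s);
    model_flush(&s);
    return model_timed_take(&s, 50);
}

/* Give first, take second: the give is latched, so the take succeeds. */
int demo_give_before_take(void)
{
    model_binsem_t s;
    model_init(&s);
    model_give(&s);
    return model_timed_take(&s, 50);
}
```

With an untimed take, the first case would block indefinitely, which matches the hang described above. Whether a flush is safe therefore depends entirely on the waiter already being blocked when it is issued, and the caller has no way to guarantee that.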

Reporter Info
Joseph Hickey, Vantage Systems, Inc.


jphickey · 2020-12-28T21:51:55Z

Discovered another race condition in here - this uses the same TCP port number as the "network-api-test" does. This means when running ctest with a "-j8" option (parallel) that it might run at the same time as network-api-test, which causes it to fail.

Still not clear on why this wasn't part of the network-api-test as I originally suggested - which would have avoided this - but nonetheless it has been implemented separately so this becomes a race condition issue. Should use a different port number for this test so it doesn't interfere with the other test.
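The port collision can be reproduced in isolation. The following is a minimal sketch with POSIX sockets, not code from either test: `bind_loopback` and `second_bind_errno` are hypothetical helpers, and the port is chosen by the kernel (bind to port 0) only so the demonstration does not depend on the actual port number the tests use. It shows that a second bind to a port that already has a listener fails, which is what two tests sharing a port would hit when run in parallel.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP socket to 127.0.0.1:port. Returns the fd, or -1 with errno set. */
static int bind_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int                fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
    {
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
    {
        int saved = errno; /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Returns the errno from a second bind to an already-bound port,
 * 0 if the second bind unexpectedly succeeded, or -1 on setup failure. */
int second_bind_errno(void)
{
    struct sockaddr_in addr;
    socklen_t          len = sizeof(addr);
    int                first, second, result;

    first = bind_loopback(0); /* port 0: let the kernel pick a free port */
    if (first < 0)
    {
        return -1;
    }
    if (listen(first, 1) != 0 ||
        getsockname(first, (struct sockaddr *)&addr, &len) != 0)
    {
        close(first);
        return -1;
    }

    /* A "second test" now tries to use the same port number. */
    second = bind_loopback(ntohs(addr.sin_port));
    result = (second < 0) ? errno : 0;
    if (second >= 0)
    {
        close(second);
    }
    close(first);
    return result;
}
```

On Linux the second bind is expected to fail with `EADDRINUSE`. Giving each test its own port number, as suggested above, avoids the collision.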

zanzaben · 2020-12-29T14:51:30Z

It was not put into the network-api-test because not all the select tests use network features. As for selecting a new port number, is there a list of all the ports that are already being used, or do I just have to search through the tests to find them?

jphickey added the bug label Dec 28, 2020

skliper assigned zanzaben Dec 29, 2020

skliper added the unit-test Tickets related to the OSAL unit testing (functional and/or coverage) label Dec 29, 2020

skliper added this to the 6.0.0 milestone Dec 29, 2020

zanzaben added a commit to zanzaben/osal that referenced this issue Dec 29, 2020

Fix nasa#707 change sem flush to solve race condition

15de179

change port numbers to be different from network test

zanzaben mentioned this issue Dec 30, 2020

Fix #707 change sem flush to solve race condition #721

Merged

zanzaben added a commit to zanzaben/osal that referenced this issue Jan 4, 2021

Fix nasa#707 Resolve issues of auto tests in parallel build

ae888d4

change sem flush to solve race condition change port numbers to be different from network test

astrogeco added a commit that referenced this issue Jan 6, 2021

Merge pull request #721 from zanzaben/fix707_Select_Test_Race_condition

13461bd

Fix #707 change sem flush to solve race condition

astrogeco mentioned this issue Jan 6, 2021

osal Integration Candidate: 2021-01-05 #744

Merged

astrogeco closed this as completed in #744 Jan 7, 2021

zanzaben mentioned this issue Jan 12, 2021

"Select Test" still hanging #755

Closed

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Fix nasa#707, Resolve highest MsgID of 0xFFFF bug

71f3242

Changes Message Key from uint16 to uint32 to avoid rollover

jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022

Merge pull request nasa#708 from skliper/fix707-msgkey-type-bug

4d78023

Fix nasa#707, Resolve highest MsgID of 0xFFFF bug
