Race condition in "select" test #707

Closed
jphickey opened this issue Dec 28, 2020 · 2 comments · Fixed by #721 or #744
Labels: bug, unit-test (Tickets related to the OSAL unit testing (functional and/or coverage))


@jphickey
Contributor

Describe the bug
Running the OSAL select test, I ran into a deadlock situation where the "multi" test got stuck and never finished.

To Reproduce
Intermittent. Run the test repeatedly on a system that is under other load (e.g. parallel builds).

Expected behavior
Test should complete

Code snips
Checking the test status/backtrace it looks like two tasks (main + "Server_Fn") are waiting on the binary sem. In particular the Server_Fn is stuck here:

status = OS_BinSemTake(bin_sem_id);

While the main task is waiting in the teardown code (the TestSelectMultipleRead has completed, and it has invoked Teardown_Multi which in turn invokes Teardown_Single here):

status = OS_BinSemTake(bin_sem_id2);

System observed on:
Ubuntu 20.04

Additional context
This is likely related to the use of OS_BinSemFlush. We should probably deprecate this function, as I cannot see how it can ever be used safely without introducing a race condition. VxWorks offers it, which (I think) is why OSAL also offers it, but it's a fundamentally broken concept.

I can confirm that looking at the traceback in gdb, the flush_count is indeed already 1 - meaning the flush had already happened by the time the Server_Fn entered the bin sem take routine.

Reporter Info
Joseph Hickey, Vantage Systems, Inc.

jphickey added the bug label Dec 28, 2020
@jphickey
Contributor Author

Discovered another race condition in here - this uses the same TCP port number as the "network-api-test" does. This means when running ctest with a "-j8" option (parallel) that it might run at the same time as network-api-test, which causes it to fail.

Still not clear on why this wasn't part of the network-api-test as I originally suggested - which would have avoided this - but nonetheless it has been implemented separately so this becomes a race condition issue. Should use a different port number for this test so it doesn't interfere with the other test.
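
The port collision can be reproduced in isolation with plain POSIX sockets. The sketch below is an assumption-laden illustration, not code from either test: `bind_loopback` and `demo_port_conflict` are hypothetical helpers. It binds one TCP socket to a kernel-chosen loopback port and then tries to bind a second socket to the same port, which is what happens when two tests that share a fixed port run in parallel.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP socket to 127.0.0.1:port.
 * Returns the descriptor, or -1 with errno set by socket() or bind(). */
static int bind_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
    {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
    {
        int saved = errno; /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Returns the errno of the second bind, 0 if it unexpectedly
 * succeeded, or -1 if the setup itself failed. */
int demo_port_conflict(void)
{
    struct sockaddr_in bound;
    socklen_t len = sizeof(bound);
    int first;
    int second;
    int result;

    first = bind_loopback(0); /* port 0: the kernel picks a free port */
    if (first < 0)
    {
        return -1;
    }
    if (getsockname(first, (struct sockaddr *)&bound, &len) != 0)
    {
        close(first);
        return -1;
    }
    assert(bound.sin_family == AF_INET);

    /* Second "test" tries to use the port the first one already holds. */
    second = bind_loopback(ntohs(bound.sin_port));
    result = (second < 0) ? errno : 0;

    if (second >= 0)
    {
        close(second);
    }
    close(first);
    return result;
}
```

On a typical Linux system the second bind fails with `EADDRINUSE`. Giving each test its own port number avoids the collision; binding to port 0 and reading the port back with `getsockname`, as the first bind does here, is an alternative technique that is not proposed in this issue.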

skliper added the unit-test (Tickets related to the OSAL unit testing (functional and/or coverage)) label Dec 29, 2020
skliper added this to the 6.0.0 milestone Dec 29, 2020
@zanzaben
Contributor

It was not put into the network-api-test because not all the select tests use network features. As for selecting a new port number, is there a list of all the ports that are already being used, or do I just have to search through the tests to find them?

zanzaben added a commit to zanzaben/osal that referenced this issue Dec 29, 2020
change port numbers to be different from network test
zanzaben added a commit to zanzaben/osal that referenced this issue Jan 4, 2021
change sem flush to solve race condition
change port numbers to be different from network test
astrogeco added a commit that referenced this issue Jan 6, 2021
Fix #707 change sem flush to solve race condition
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
Changes Message Key from uint16 to uint32 to avoid rollover
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
3 participants