[BUG] org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighCpu is flaky #7972
Comments
This should be resolved by #7978.
Flaky test failures for
Sadly, not fixed: #7988
Tragic... Gonna take a further look.
@PritLadani and @ketanv3, I hope all is well. I am tagging you because you were two of the main authors of the SearchBackpressure changes. I believe the issue we are running into is a race condition. After the initial change to the threshold time did not work, I did some digging and concluded that the most likely cause is a thread switch between two operations that are assumed to happen in order. Here is an example of some manual debugging I did to try to identify the issue:
Notice that the statement goes from having a reason to not having one and back.
If the operations were thread-safe, their ordering should not allow this. We can compare this to the log from a different run:
If you have any knowledge of the thread-safety implementations you added, sharing it would be appreciated so we can try to diagnose this issue.
Hi @scrawfor99, can you please share the files and methods where you added the below logs?
Hi @PritLadani, the locations of those logging messages were Hopefully this helps. I am still having trouble diagnosing the issue, so any input from you would be appreciated :)
This seems to be due to multiple threads trying to access a volatile variable
Thanks @scrawfor99 for working on a previous fix. Reducing the CPU time threshold from 1000 ms to 50 ms is not the correct fix, because this code executes a busy-wait loop to simulate CPU cycles: it keeps running the loop until the threshold is breached and an exception is thrown. We should revert it to 1000 ms. As @PritLadani rightly pointed out, the race condition is due to the integration test thread reading the cancellation reason (here) before the server has updated it (here). We need to fix this by making the updates to the fields
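For readers following along, the race described in this comment can be illustrated with a minimal sketch. It assumes the cancellation state lives in more than one field, so that a reader can see "cancelled" before the reason is visible. One common remedy is to publish the whole state as a single immutable object through one atomic reference. The class and method names below (`CancellableTaskSketch`, `Cancellation`, `cancel`, `getReasonCancelled`) are hypothetical and are not the actual OpenSearch classes or the fix that was merged.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a cancellable task; not the OpenSearch class. */
final class CancellableTaskSketch {

    /** Immutable snapshot: the reason and the timestamp always travel together. */
    static final class Cancellation {
        final String reason;
        final long startTimeNanos;

        Cancellation(String reason, long startTimeNanos) {
            this.reason = reason;
            this.startTimeNanos = startTimeNanos;
        }
    }

    // null means "not cancelled"; a non-null value becomes visible in one atomic step,
    // so no reader can observe a cancelled task without its reason.
    private final AtomicReference<Cancellation> cancellation = new AtomicReference<>();

    /** Records the first cancellation only; returns false if already cancelled. */
    boolean cancel(String reason) {
        return cancellation.compareAndSet(null, new Cancellation(reason, System.nanoTime()));
    }

    boolean isCancelled() {
        return cancellation.get() != null;
    }

    /** Returns the recorded reason, or null if the task has not been cancelled. */
    String getReasonCancelled() {
        Cancellation c = cancellation.get();
        return c == null ? null : c.reason;
    }
}
```

With this shape, a test thread that sees `isCancelled()` return true is guaranteed to also see a non-null reason, because both come from the same published object.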
Hi @ketanv3 and @PritLadani, thank you for following up so quickly. I initially decreased the threshold because I noticed the optional would be thrown if the threshold was not reached, so I attempted a quick fix locally. Since it passed, I assumed that had been the issue. That did not work, though, so I did an actual RCA and found the thread issue. I will go ahead and try to correct the thread issue as you suggested.
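To make the threshold discussion concrete, here is a rough sketch of a busy-wait loop of the kind mentioned earlier in the thread: it burns CPU on the current thread until a CPU-time limit is crossed and then throws. This is only an illustration. `BusyWaitSketch` and `burnCpu` are hypothetical names, and throwing from inside the loop is a simplification of whatever the real test does. It shows why a smaller threshold only ends the loop sooner and does nothing about the ordering of reads and writes between threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical illustration of a CPU-burning loop; not the OpenSearch test code. */
final class BusyWaitSketch {

    /** Spins until this thread has consumed more than thresholdMillis, then throws. */
    static void burnCpu(long thresholdMillis) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Prefer per-thread CPU time; fall back to wall-clock time if it is unavailable.
        boolean useCpuTime = bean.isCurrentThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled();
        long start = useCpuTime ? bean.getCurrentThreadCpuTime() : System.nanoTime();
        long limitNanos = thresholdMillis * 1_000_000L;
        long sink = 0;

        while (true) {
            sink += System.nanoTime() & 1; // trivial work so the loop keeps the CPU busy
            long now = useCpuTime ? bean.getCurrentThreadCpuTime() : System.nanoTime();
            if (now - start > limitNanos) {
                throw new IllegalStateException(
                    "CPU time threshold of " + thresholdMillis + " ms breached (sink=" + sink + ")");
            }
        }
    }
}
```

Calling `BusyWaitSketch.burnCpu(1000)` spins for roughly one second of CPU time before throwing, while `burnCpu(50)` throws much sooner.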
* fix thread issue
* fix thread issue
* Fix thresholds
* Swap to object based
* Spotless
* Swap to preserve nulls
* Spotless
* Resolve npe
* remove final declerations
* spotless
* add annotations
* push to rerun tests
* Fix idea
* Fix idea
---------
Signed-off-by: Stephen Crawford <[email protected]>
(cherry picked from commit 63dc6aa) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Describe the bug
org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighCpu is flaky: https://build.ci.opensearch.org/job/gradle-check/17145/
#7969 (comment)