[FIXED] cleaning up sanitize=thread found several races #771
Conversation
Codecov Report

@@            Coverage Diff             @@
##             main     #771      +/-   ##
==========================================
+ Coverage   68.71%   68.79%   +0.08%
==========================================
  Files          39       39
  Lines       15207    15227      +20
  Branches     3143     3153      +10
==========================================
+ Hits        10449    10476      +27
+ Misses       1700     1684      -16
- Partials     3058     3067       +9
All good catches with the missing unlocking in tests. I have a question regarding the need for verbose output for sanitize=thread, and about the reconnect implementation.
Remove the redirect to a temp file? By the way, had you preferred separate PRs, you could have had one with just the close->shutdown change and, once that was merged, rebased the PR that contained only the CI/test changes.
Finally, I had disabled sanitize=thread in Travis due to some bug in the Ubuntu/Travis combination. We may want to reactivate it from time to time to see if that works now.
test/CMakeLists.txt (outdated)

@@ -45,6 +45,10 @@ foreach(name ${listOfTestNames})
    # Make sure the test passes
    set_tests_properties(Test_${name} PROPERTIES PASS_REGULAR_EXPRESSION "ALL PASSED")

    # Set TSAN_OPTIONS for the test
    set_tests_properties(Test_${name} PROPERTIES
Does this have an effect on any other builds? (where we don't need/want sanitize)
I don't think so: TSAN stands for Thread SANitizer, and TSAN_OPTIONS is an environment variable used when running the tests, not a build flag. But I put it inside an if() on NATS_SANITIZE anyway, just in case.
@kozlovic The Travis build still does sanitize=thread for clang. I tried to enable it for gcc, got the same cmake configuration-time failure, and disabled it again. I added a fuller sanitize matrix to GH Actions; we'll see what it produces.
@kozlovic The fuller matrix found a couple more issues in the code, missing libDlvWorker locks; please check.
    sanitize: address
    server_version: main

  san-thread:
nit: Not sure why you would shorten to san- here while you used sanitize- for the address matrix run above :-)
shorter is better? :) (will change for consistency)
@@ -1137,17 +1137,29 @@ _resendSubscriptions(natsConnection *nc)

    adjustedMax = 0;
    natsSub_Lock(sub);
    if (sub->libDlvWorker != NULL)
There are macros for lock/unlock of this in sub.c. Maybe we move their definitions to sub.h (we include that header in conn.c), and then you could replace these 2 lines with:

    SUB_DLV_WORKER_LOCK(sub);
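For reference, a sketch of what such macros could look like, inferred from the NULL-check-then-lock pattern spelled out in the diff; the actual definitions in sub.c may differ.

// Sketch only, not the real sub.c definitions: lock/unlock the shared
// library delivery worker, but only when this subscription uses one.
#define SUB_DLV_WORKER_LOCK(s)                          \
    do {                                                \
        if ((s)->libDlvWorker != NULL)                  \
            natsMutex_Lock((s)->libDlvWorker->lock);    \
    } while (0)

#define SUB_DLV_WORKER_UNLOCK(s)                        \
    do {                                                \
        if ((s)->libDlvWorker != NULL)                  \
            natsMutex_Unlock((s)->libDlvWorker->lock);  \
    } while (0)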
@kozlovic I stayed away from further exposing these macros. I am about to submit a refactor (it just passed all tests without kicking!) that hopefully eliminates the need for double-locking, leaving the worker lock almost exclusively to protect the message queue. Ditto for the sub->closed comment.
It's ok with me if we keep this PR "hanging" for now until you see the upcoming solution, and then we can decide which way to fix it? I just want to have it as my base branch to get the CI/unrelated test fixes out of the way.
Not sure I understand the last part: do you want to merge this PR so it becomes the base for the other work, or will the other work be the (re)base for the CI changes? Either way, I am going to LGTM this PR so that you can proceed as you wish.
    // If JS ordered consumer, trigger a reset. Don't check the error
    // condition here. If there is a failure, it will be retried
    // at the next HB interval.
    if ((sub->jsi != NULL) && (sub->jsi->ordered))
    {
        jsSub_resetOrderedConsumer(sub, sub->jsi->sseq+1);
        if (sub->libDlvWorker != NULL)
and:
SUB_DLV_WORKER_UNLOCK(sub);
@@ -2045,6 +2045,10 @@ _hbTimerFired(natsTimer *timer, void* closure)
    natsStatus s = NATS_OK;

    natsSub_Lock(sub);
    if (sub->libDlvWorker != NULL)
Same as for conn.c. We also include sub.h here, so that would work.
    if (sub->libDlvWorker != NULL)
    {
        natsMutex_Lock(sub->libDlvWorker->lock);
    }
    if (!sub->closed)
Humm... there are many places where we check this under the sub's lock only. Sure, it is set under both locks and checked only under the lib's lock in nats.c (the code that deals with the delivery pool's thread), but if we find many other reports, then I think it would be simpler to modify _deliverMsgs so that we capture sub->closed under the sub's lock after releasing the delivery lock. Anyway, the change is ok for now, but again, let's keep that in mind.
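A minimal sketch of that alternative, assuming a hypothetical helper name; the actual _deliverMsgs code is not shown in this PR and may differ.

// Hypothetical helper, not the actual nats.c code: release the delivery
// worker's lock first, then capture sub->closed under the sub's own lock.
static bool _subStillOpen(natsSubscription *sub, natsMutex *dlvLock)
{
    bool closed;

    natsMutex_Unlock(dlvLock);  // release the delivery worker's lock

    natsSub_Lock(sub);          // capture the flag under the sub's lock
    closed = sub->closed;
    natsSub_Unlock(sub);

    return !closed;
}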
LGTM
LGTM
note: I've been working on a bigger PR for async fetch (aka Consume); it is not ready yet, but since it will be a big PR I'll try to send the parts that are ready, to simplify the review. This is part 1.
- Lock msgDlvWorker->lock where we were not before.
- Use natsSock_Shutdown rather than natsSock_Close from natsConnection_Reconnect to avoid a potential data race with the readloop.
- Related: -fsanitize=thread #672

Before this PR, warnings generated by -fsanitize=thread in CI were ignored, since the detailed test logs were dropped and not scanned. This PR adds the log scanning.

It also uncovered several issues with the existing tests; all but one of them were relatively trivial. In Test_ForceReconnect it detected a race caused by us forcefully close-ing an fd which is still being read by the readloop. I could not think of a simple solution that would not have performance implications, so I disabled the test for the thread sanitizer. The code seems to be doing the right thing anyway, and the readloop thread errors out, as designed.

The following tests were fixed:

This PR also adds a few more CI jobs, including (experimentally for now) DEV_MODE. I will remove the DEV_MODE jobs if they prove flaky and not easily fixable. Same for the (added) debug/sanitize jobs: I will monitor and remove/fix them if they are too slow or flaky.
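For illustration, a sketch of the general shutdown-vs-close idea behind the natsSock_Shutdown change described above, using plain POSIX socket calls rather than the library's natsSock_* wrappers; the function names here are hypothetical.

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

// Sketch only, not nats.c code. Closing the fd while the readloop thread is
// still blocked in recv() lets the descriptor number be reused by another
// thread, which is the kind of race TSAN reports. shutdown() keeps the fd
// valid but forces the pending recv() to return, so the readloop errors out
// as designed and the single close() happens afterwards.
static void forceDisconnect(int fd)
{
    shutdown(fd, SHUT_RDWR);   // wake up the blocked reader; fd stays valid
}

static void readLoop(int fd)
{
    char    buf[4096];
    ssize_t n;

    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0)
    {
        // ... process incoming protocol bytes ...
    }
    // recv() returned 0 (EOF) or -1 (error) after shutdown(); it is now
    // safe to close the descriptor exactly once.
    close(fd);
}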