[dbnode] Get rid of excessive locking when adding new namespaces #3765
…pleted, because it can take a while.
Codecov Report
@@           Coverage Diff           @@
##           master   #3765    +/-   ##
========================================
- Coverage    57.1%   56.8%   -0.4%
========================================
  Files         552     552
  Lines       63468   63100    -368
========================================
- Hits        36303   35874    -429
- Misses      23961   24020     +59
- Partials     3204    3206      +2
…`onCompleteFn` might be invoked for the second bootstrap while another bootstrap is still in progress. This is bad because bootstrap should work when fileOps are disabled and `onCompleteFn` might be enabling fileOps.
Looks like there is a related test failure @soundvibe: https://buildkite.com/uberopensource/m3-monorepo-public/builds/10113#be0da0a5-486d-413a-ae61-0a6bdd257285/773-953
Yes, already working on a fix for |
If the db is not yet bootstrapped when a new shardSet is assigned, assign it immediately (no need to enqueue). Add a new test for adding a new namespace using enqueue.
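The branch described in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions: `node`, `shardSet`, `assignShardSet`, and `drain` are hypothetical names, not the m3 types, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// shardSet is a hypothetical stand-in for the real shard set type.
type shardSet struct{ shards []uint32 }

// node is a hypothetical, single-goroutine model of a dbnode.
type node struct {
	bootstrapped bool
	current      shardSet
	pending      []func() // assignments deferred until a bootstrap runs
}

func newNode(bootstrapped bool) *node { return &node{bootstrapped: bootstrapped} }

// assignShardSet applies the shard set right away when the node has not
// bootstrapped yet; otherwise it enqueues the assignment.
// It returns true when the assignment was applied immediately.
func (n *node) assignShardSet(s shardSet) bool {
	if !n.bootstrapped {
		n.current = s
		return true
	}
	n.pending = append(n.pending, func() { n.current = s })
	return false
}

// drain runs the queued assignments, standing in for the point at which
// an enqueued bootstrap actually executes.
func (n *node) drain() {
	for _, fn := range n.pending {
		fn()
	}
	n.pending = nil
}

func main() {
	fresh := newNode(false)
	fmt.Println("applied immediately:", fresh.assignShardSet(shardSet{shards: []uint32{1, 2}}))

	running := newNode(true)
	fmt.Println("applied immediately:", running.assignShardSet(shardSet{shards: []uint32{3}}))
	running.drain()
	fmt.Println("shards after drain:", running.current.shards)
}
```

The design point is that a node with nothing bootstrapped has no in-flight work to coordinate with, so queueing would only add latency.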
…e enqueueing and waiting is fully async.
…he callback functions (#3810)
* Simplify callback logic by having the mediator own the lifecycle of the callback functions
* Remove no longer required BootstrapAsyncResult
* Remove unused require.NoError
* Fix lint
Fixed a possible race in `context.StartSampledTraceSpan()`. Use zap error logging where needed. Make sure fileOps are disabled/enabled only when really needed (bootstraps > 0 and mediator is opened); otherwise just use simpler logic.
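A minimal sketch of the "only when really needed" condition from this commit message. Everything here is an assumption made for illustration: `mediator`, `needsFileOpsToggle`, and `runUpdate` are hypothetical names, and `bootstraps` is assumed to be a simple counter.

```go
package main

import "fmt"

// mediator is a hypothetical, single-goroutine model; it is not the m3 type.
type mediator struct {
	opened         bool
	fileOpsEnabled bool
	toggles        int // how many times file ops were disabled
}

// needsFileOpsToggle mirrors the condition in the commit message:
// bootstraps > 0 and the mediator is opened.
func needsFileOpsToggle(bootstraps int, mediatorOpen bool) bool {
	return bootstraps > 0 && mediatorOpen
}

// runUpdate runs update with file ops disabled only when the condition
// holds; otherwise it takes the simpler path and just runs update.
func (m *mediator) runUpdate(bootstraps int, update func()) {
	if !needsFileOpsToggle(bootstraps, m.opened) {
		update()
		return
	}
	m.fileOpsEnabled = false
	m.toggles++
	// Re-enable file ops once the update has finished.
	defer func() { m.fileOpsEnabled = true }()
	update()
}

func main() {
	m := &mediator{opened: true, fileOpsEnabled: true}
	m.runUpdate(0, func() { fmt.Println("simple path, file ops enabled:", m.fileOpsEnabled) })
	m.runUpdate(1, func() { fmt.Println("guarded path, file ops enabled:", m.fileOpsEnabled) })
	fmt.Println("times file ops were disabled:", m.toggles)
}
```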
I have a few minor suggestions before merging: some simplifications, and more use of lock/defer unlock, since enqueueing bootstraps and turning file ops back on can both be done while holding a lock (both are async and very fast):
#3818
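The lock/defer unlock suggestion could look roughly like the sketch below. It assumes, as the comment states, that enqueueing is asynchronous and returns quickly, so holding the lock for the whole function is cheap. The `mediator` type and its methods are hypothetical and are not taken from #3818.

```go
package main

import (
	"fmt"
	"sync"
)

// mediator is a hypothetical model used only to show the locking style.
type mediator struct {
	mu             sync.Mutex
	fileOpsEnabled bool
	pending        []chan struct{}
}

// enqueueBootstrap records a bootstrap request and returns at once.
// The returned channel is closed when the bootstrap completes, so callers
// wait outside the lock.
func (m *mediator) enqueueBootstrap() <-chan struct{} {
	m.mu.Lock()
	defer m.mu.Unlock()
	done := make(chan struct{})
	m.pending = append(m.pending, done)
	return done
}

// enableFileOps is a fast state flip, so lock/defer unlock is enough.
func (m *mediator) enableFileOps() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.fileOpsEnabled = true
}

// completeAll simulates all queued bootstraps finishing and reports
// how many were completed.
func (m *mediator) completeAll() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := len(m.pending)
	for _, ch := range m.pending {
		close(ch)
	}
	m.pending = nil
	return n
}

func main() {
	m := &mediator{}
	done := m.enqueueBootstrap() // returns immediately
	go m.completeAll()
	<-done // waiting happens without holding m.mu
	m.enableFileOps()
	fmt.Println("file ops enabled:", m.fileOpsEnabled)
}
```

Using `defer` keeps every return path unlocked, which matters more as the functions grow early returns.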
LGTM logic-wise. However, before we land this PR I would like to merge in the simplifications from the suggested PR I opened targeting this branch: #3818
Stamping the PR so it can land asynchronously.
…aces (#3818)
* Simplify some code paths further and use lock/defer unlock in more places
* Fix test
Merged the simplifications into this branch.
What this PR does / why we need it:
Currently, when a new namespace is added to an m3db node, the database-level lock is held throughout the whole namespace update. This can take a long time, potentially several minutes, because the update needs to enqueue a bootstrap and wait for it to actually start. During this time dbnode is unable to handle new requests, so goroutine count and memory usage grow, eventually leading to OOM.
This PR ensures that namespaces are updated only once all background file ops are disabled and completed, and also reduces database-level locking (the lock is held only while namespaces are being updated in the internal map).
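The narrower locking described above can be illustrated with a small sketch. It is not the actual m3 code: `db`, `namespace`, `prepareNamespace`, and `addNamespaces` are hypothetical names, and the `time.Sleep` only stands in for the slow work (such as waiting on file ops or a bootstrap to start).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// namespace is a hypothetical stand-in for a dbnode namespace.
type namespace struct{ id string }

// db is a hypothetical, heavily simplified database.
type db struct {
	mu         sync.RWMutex
	namespaces map[string]*namespace
}

func newDB() *db { return &db{namespaces: make(map[string]*namespace)} }

// prepareNamespace simulates the slow part of adding a namespace.
// It runs without holding the database lock.
func prepareNamespace(id string, delay time.Duration) *namespace {
	time.Sleep(delay)
	return &namespace{id: id}
}

// addNamespaces does the slow preparation first and takes the write lock
// only for the short map update.
func (d *db) addNamespaces(ids []string, delay time.Duration) {
	prepared := make([]*namespace, 0, len(ids))
	for _, id := range ids {
		prepared = append(prepared, prepareNamespace(id, delay))
	}

	d.mu.Lock()
	defer d.mu.Unlock()
	for _, ns := range prepared {
		d.namespaces[ns.id] = ns
	}
}

// hasNamespace models a request that needs only a read lock.
func (d *db) hasNamespace(id string) bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	_, ok := d.namespaces[id]
	return ok
}

func main() {
	d := newDB()
	done := make(chan struct{})
	go func() {
		d.addNamespaces([]string{"metrics"}, 50*time.Millisecond)
		close(done)
	}()

	// This read is not blocked by the slow preparation step.
	fmt.Println("visible during update:", d.hasNamespace("metrics"))
	<-done
	fmt.Println("visible after update:", d.hasNamespace("metrics"))
}
```

With the lock held for the whole update instead, every `hasNamespace` call would queue behind the slow step, which is the request pile-up the description attributes the OOM to.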
Before:
After:
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
Does this PR require updating code package or user-facing documentation?: