Issue3373 storage exit crash #3553
Good job.
- I think `localCacheLock_` is useless now; maybe we could remove it.
- Move `killedPlans_` and `killedPlans_` to the same RCU.
Just for test
A long story... Good job~~ LGTM
* use rcu replace thread local fix storage exit crash format address some comment
* fix bug
* fix bug
What type of PR is this?
What does this PR do?
Use RCU to replace ThreadLocal in MetaClient.
Which issue(s)/PR(s) this PR relates to?
#3373
#3497
Special notes for your reviewer, ex. impact of this fix, etc:
Initially, we found that storage would crash after running for a long time under a heavy load of concurrent insert-edge operations. Storage's memory usage was also very high, so we suspected a memory leak that ended in an OOM. That turned out not to be the cause: even without an OOM, storage crashes when it is stopped. Every coredump stack shows a static thread-local variable being destructed as its thread exits. The variable is of type folly::SingletonThreadLocal, which was introduced in MetaClient.
In another scenario, if compaction is triggered while storage is starting, storage crashes immediately, with the same coredump stack as the crash on stop.
After a long investigation we did not find the specific cause, but the problem only appeared after folly::SingletonThreadLocal was introduced, so we chose to stop using folly::SingletonThreadLocal and replace it with RCU.
With RCU, the crash no longer occurs. I am not sure whether it is really fixed or whether the crash has only become less likely and I have not hit it yet.
In addition, RCU should perform better than ThreadLocal, because readers take no read-write lock and therefore do not block.
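For reviewers unfamiliar with the pattern, here is a minimal sketch of the read-copy-update idea using only the standard library. It is not the code in this PR and does not use folly's RCU API; `MetaSnapshot`, `SnapshotHolder`, `read` and `update` are hypothetical names chosen for illustration. It assumes a single writer thread, and it relies on the `std::shared_ptr` atomic free functions, which are deprecated since C++20 and which a standard library may implement with internal locks.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the metadata that a client caches locally.
struct MetaSnapshot {
  int64_t version{0};
  std::unordered_map<int32_t, std::string> spaceNames;
};

// Holds the current immutable snapshot. Readers grab a reference to it,
// the writer publishes a modified copy. Assumes one writer at a time.
class SnapshotHolder {
 public:
  SnapshotHolder() : current_(std::make_shared<const MetaSnapshot>()) {}

  // Reader side: no read-write lock in user code. The returned pointer
  // keeps the snapshot alive even if a newer one is published meanwhile.
  std::shared_ptr<const MetaSnapshot> read() const {
    return std::atomic_load_explicit(&current_, std::memory_order_acquire);
  }

  // Writer side: copy the current snapshot, mutate the copy, publish it.
  // The old snapshot is freed once the last reader drops its reference.
  template <typename Fn>
  void update(Fn&& mutate) {
    std::shared_ptr<const MetaSnapshot> old = read();
    auto next = std::make_shared<MetaSnapshot>(*old);
    mutate(*next);
    ++next->version;
    std::atomic_store_explicit(
        &current_,
        std::shared_ptr<const MetaSnapshot>(std::move(next)),
        std::memory_order_release);
  }

 private:
  std::shared_ptr<const MetaSnapshot> current_;
};
```

A reader that called `read()` before an `update()` keeps seeing its old, consistent snapshot, while later readers see the new one. Unlike a per-thread cache, nothing is destructed at thread exit, which is the property this PR is after.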
Additional context/ Design document:
Checklist:
Release notes:
Please confirm whether to be reflected in release notes and how to describe: