Segmentation faults in moka-cht under heavy workloads on a many-core machine #34
Comments
Based on my test results, it might be worth downgrading crossbeam-epoch from v0.9.5 to v0.8.2 to work around the issue. I am preparing the Moka v0.5.2 release with moka-cht v0.4.2 and crossbeam-epoch v0.8.2.
Released Moka v0.5.2 with moka-cht v0.4.2 and crossbeam-epoch v0.8.2. Unfortunately, the same segmentation fault (pattern 1) occurred when I was running mokabench on Moka v0.5.2. I released v0.5.2 anyway, as earlier versions of Moka may already have the same issue, and I feel segmentation faults are less frequent with crossbeam-epoch v0.8.2.
Just to be sure, I tried Rust 1.53.0 to compile mokabench + Moka v0.5.3. I did it because I had not tried Rust 1.53.0 since Moka v0.5.1 was released. The result was the same; it got a segmentation fault after running mokabench for ~2 hours. I used an EC2 instance type with 36 vCPUs.
Any progress on this? Does it ever happen on something with 18 vCPUs?
No 😞. I spent a few more days running different tests, doing code review, etc., but could not find any clue. I am currently constrained by time (I have to run the test for at least a few hours to reproduce) and money (a 36 vCPU instance is expensive; $1.926/hour). I will revisit this issue when I have more time.
No. It has never happened on a
Are you holding off on using Moka because of this problem? If so, perhaps I will add an optional Cargo feature to use an alternative hash table. It will hurt concurrent performance but will be safer.
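To make the trade-off concrete, here is a minimal sketch of what a lock-based alternative table could look like. This is an assumption for illustration only, not Moka's API or the actual optional feature: `LockedMap` and its method are hypothetical names, and the sketch uses only `std`. Every insertion takes one exclusive lock, which is why such a table would cost concurrent performance while avoiding lock-free memory reclamation altogether.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical lock-based fallback table (illustration only, not part of Moka).
struct LockedMap<K, V> {
    inner: RwLock<HashMap<K, V>>,
}

impl<K: Eq + Hash, V: Clone> LockedMap<K, V> {
    fn new() -> Self {
        Self { inner: RwLock::new(HashMap::new()) }
    }

    /// Returns a clone of the value for `key`, inserting `init()` if it is absent.
    fn get_or_insert_with(&self, key: K, init: impl FnOnce() -> V) -> V {
        {
            // Fast path: many readers may hold the shared lock at once.
            let guard = self.inner.read().unwrap();
            if let Some(v) = guard.get(&key) {
                return v.clone();
            }
        } // The read lock is released here, before the write lock is taken.

        // Slow path: exclusive lock. `entry` re-checks the key, so `init`
        // runs at most once per key even if several threads race to here.
        let mut guard = self.inner.write().unwrap();
        guard.entry(key).or_insert_with(init).clone()
    }
}

fn main() {
    let map: Arc<LockedMap<u32, u32>> = Arc::new(LockedMap::new());
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..1_000u32 {
                    let key = i % 16;
                    let v = map.get_or_insert_with(key, || key * 10);
                    assert_eq!(v, key * 10);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    println!("entries: {}", map.inner.read().unwrap().len());
}
```

All threads serialize on the write lock whenever a key is missing, so throughput under heavy insert load would be expected to drop compared with a lock-free table.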
Here are some updates on this issue. It has been five months since I first saw this issue, but (fortunately) no user of this crate has reported segfaults:
On January 5th, 2022, I ran the same load tests (mokabench) against Moka v0.7.0 on the following EC2 instances and had some segfaults only on the instances with 32 vCPUs:
I ran the same but shorter load tests as a part of pre-release testing for v0.7.1 (January 12th, 2022) and v0.7.2 (February 6th, 2022). There was no segfault for v0.7.2:
v0.7.2 has fixes and enhancements for #72. It might have mitigated the issue, but I am not 100% sure because I still have not figured out the root cause of those segfaults.
Here are some updates on this issue.
Our internal HashMap is a lock-free container and heavily depends on atomic operations such as compare-and-swap (CAS). It seems parallelism is the key to triggering the issue; e.g. more than one processor core executing CAS on the same memory location at the same time. It also heavily depends on crossbeam-epoch's epoch-based memory reclamation (garbage collection, GC), which also relies on CAS. I think the most suspicious area is rehashing, which is used to extend HashMap capacity and to run epoch-GC on deleted keys. There should be lots of CAS conflicts and retries, and epoch-GCs occur during rehashing.
Action Plans
This will reduce the chance of issue #34 occurring.
This workaround is added via #129.
Lines 52 to 55 in 8f61b35
crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by GHSA-qc84-gqf4-9926.
Is the workaround in #129 to upgrade moka’s dependency of crossbeam-epoch?
Hi @SimonSapin,
Thank you for the information.
No. I do not think so, unfortunately. I have another Moka repository here and it has crossbeam-epoch upgraded to v0.9.9, and I ran the same test on Moka with both crossbeam-epoch v0.8.2 and v0.9.9. I found Moka with crossbeam-epoch v0.9.9 still has the same issue.
Moka with crossbeam-epoch v0.9.9
Had segfaults four times in about four hours.
$ rg '(Segmentation fault|Bus error)' epoch09-2022-0618.log
271:./run-tests-insert-once.sh: line 26: 94446 Segmentation fault: 11 ./target/release/mokabench --invalidate --insert-once
283:./run-tests-insert-once.sh: line 30: 94453 Segmentation fault: 11 ./target/release/mokabench --invalidate-entries-if --insert-once
$ rg '(Segmentation fault|Bus error)' epoch09-2022-0619A.log
243:./run-tests-insert-once.sh: line 18: 99154 Segmentation fault: 11 ./target/release/mokabench --insert-once --size-aware
326:./run-tests-insert-once.sh: line 30: 99301 Segmentation fault: 11 ./target/release/mokabench --invalidate-entries-if --insert-once
$ cat epoch09-2022-0618.log
...
cargo tree --all-features
...
│ ├── crossbeam-epoch v0.9.9
│ │ ├── cfg-if v1.0.0
│ │ ├── crossbeam-utils v0.8.9 (*)
Moka with crossbeam-epoch v0.8.2
Had segfaults three times in about four hours.
$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619.log
349:./run-tests-insert-once.sh: line 26: 95369 Segmentation fault: 11 ./target/release/mokabench --invalidate --insert-once
$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619B.log
339:./run-tests-insert-once.sh: line 30: 478 Segmentation fault: 11 ./target/release/mokabench --invalidate-entries-if --insert-once
385:./run-tests-insert-once.sh: line 38: 536 Segmentation fault: 11 ./target/release/mokabench --ttl 3 --tti 1 --invalidate --insert-once --size-aware
$ cat epoch08-2022-0619.log
...
cargo tree --all-features
...
│ ├── crossbeam-epoch v0.8.2
│ │ ├── cfg-if v0.1.10
│ │ ├── crossbeam-utils v0.7.2
NOTE: To make segfaults occur more often, I used a modified Moka to set the number of
Anyway, I will continue evaluating crossbeam-epoch v0.9.9 in parallel with v0.8.2, and will upgrade Moka's dependency to v0.9.9 once I feel v0.9.9 will not increase the chance of segfaults. I am also watching every release of the crossbeam-* and parking_lot crates, and testing them if they have any fixes for memory safety issues. I am reviewing the source code of Moka and those crates when I have time. I hope I can isolate the code causing the issue.
FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9. I scheduled it for the next patch release, Moka v0.8.7. As I wrote in the PR, I will run some mokabench tests before merging it. I will be able to run mokabench for 6 hours a day (during the night), so if everything goes well, the tests will complete in 4 days (24 hours in total).
Unfortunately, I found that upgrading crossbeam-epoch to v0.9.9 would actually make this issue worse on Linux x86_64. It occurred ~15% more often with v0.9.9 than with v0.8.2. So I am hesitant to merge the PR. Just to be sure, I will do the same test again this weekend.
- Add a lock to the rehash function of the concurrent hash table (`moka::cht`) to ensure only one thread can participate in rehashing at a time.
- To prevent potential inconsistency issues on non-x86 systems, strengthen the memory ordering used for `compare_exchange_weak` (`Release` to `AcqRel`).
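The two changes listed above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code from `moka::cht`: `Table`, its fields, and `run` are hypothetical names, and the "table" is reduced to two counters so that no unsafe code is needed. It shows a mutex that admits one thread at a time into the rehash step, and a CAS retry loop whose `compare_exchange_weak` uses `AcqRel` on success.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy stand-in for a concurrent table (hypothetical; not `moka::cht`).
struct Table {
    len: AtomicUsize,
    capacity: AtomicUsize,
    /// Serializes rehashing: only the lock holder may grow the table.
    rehash_lock: Mutex<()>,
}

impl Table {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self {
            len: AtomicUsize::new(0),
            capacity: AtomicUsize::new(capacity),
            rehash_lock: Mutex::new(()),
        }
    }

    /// "Rehash" by doubling the capacity, one thread at a time.
    fn rehash(&self, observed_len: usize) {
        let _guard = self.rehash_lock.lock().unwrap();
        // Re-check under the lock: another thread may already have grown it.
        let cap = self.capacity.load(Ordering::Acquire);
        if observed_len < cap {
            return;
        }
        self.capacity.store(cap * 2, Ordering::Release);
    }

    /// Claims one slot with a CAS retry loop.
    fn insert_one(&self) {
        let mut current = self.len.load(Ordering::Acquire);
        loop {
            if current >= self.capacity.load(Ordering::Acquire) {
                self.rehash(current);
                current = self.len.load(Ordering::Acquire);
                continue;
            }
            // AcqRel on success both publishes this write and observes
            // earlier writes; a failed (or spuriously failed) CAS retries
            // with the value it actually saw.
            match self.len.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return,
                Err(actual) => current = actual,
            }
        }
    }
}

/// Runs `threads` workers that each insert `per_thread` items.
/// Returns the final (len, capacity).
fn run(threads: usize, per_thread: usize, initial_capacity: usize) -> (usize, usize) {
    let table = Arc::new(Table::new(initial_capacity));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    table.insert_one();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    (
        table.len.load(Ordering::Acquire),
        table.capacity.load(Ordering::Acquire),
    )
}

fn main() {
    let (len, capacity) = run(8, 1_000, 4);
    println!("len = {len}, capacity = {capacity}");
}
```

The re-check inside `rehash` matters: several threads can notice a full table at the same moment, and without it each of them would grow the table again after waiting for the lock.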
Finally, I believe I fixed this issue via #157. Last week, I got a new x86_64 based Linux PC with 20 logical cores (Intel Core i7-12700F), and it helped me a lot to reproduce and investigate the issue. I found the cause of the issue last night and fixed it. After the fix, I have not been able to reproduce the issue on either the PC (Linux x86_64) or the Mac (macOS arm64).
The cause was race conditions when many threads are concurrently rehashing (extending or shrinking) internal hash table
Also I found the memory ordering used for
#157 also upgrades crossbeam-epoch to the latest version (v0.9.9).
I have published v0.9.2 with this fix to crates.io.
I have seen segmentation faults a few times when running mokabench on Moka v0.5.1. They seem to happen randomly while the `get_or_insert_with` method is heavily called concurrently from many threads.
I am using Amazon EC2 for running mokabench. After spending a few days, I found it is related to the version of crossbeam-epoch and the number of CPU cores.
crossbeam-epoch is used by moka-cht, the concurrent hash table used by Moka.
I examined stack traces from core dumps and found there are two patterns. I have not identified the root cause yet. Perhaps a `crossbeam_epoch::Owned<T>`, which is very similar to `Box<T>`, stored in moka-cht became a dangling pointer for some reason?
Pattern 1: At `Arc::ne()`
Pattern 2: At `atomic_sub()` in `Arc::drop()`