Use mmap on MARF connections #2900
… signal handler report the signal received to the user-supplied callback.
…trlc crate's platform-specific deps
…ng -- this is an inevitable consequence of having multiple runloops
Hey Jude.. Thanks for a fast change. How are you testing this?
So, there is one thing that warrants further investigation here -- a thread that triggers SIGBUS needs to be terminated immediately. Not at a rendezvous or cancellation point, but at the CPU instruction which caused SIGBUS. This is because SIGBUS is triggered by an unaligned memory access, or in …

We have to be this strict because of how SIGBUS works. When the hardware raises the page fault to the kernel, the kernel will suspend the executing task at the offending CPU instruction, set up and run the signal handler right then and there (i.e. SIGBUS is handled synchronously in the program execution), and on signal handler return, attempt to re-run the offending CPU instruction.

So if the thread's execution isn't terminated immediately, we'd be setting ourselves up for an infinite loop -- the offending thread will re-attempt to load data from an unbacked page, triggering another SIGBUS, causing the signal handler to run and exit, causing the same offending instruction to be re-run, over and over forever. The offending thread will never reach a cancellation point, nor will an attempt to join with it work. So, while the other threads will gracefully terminate, the offending thread never will.

There are a couple ways we can address this:

Let me know if not crashing-and-burning is still something we want to do here.
I think crash-and-burn is the preferred behavior here. If possible, though, the node operator should be able to figure out that there was a system I/O error that led to the crash. In terms of testing, I think this PR needs two things:
Hey @jcnelson, I see I was re-added for review, but my question last time was about the testing plan. Are you able to recreate these problem cases by interacting with the server? Or in tests?
Oh whoops.. I thought I was re-added but I guess GitHub just leaves me with the "yellow dot" status if I just add comments at the bottom. :/ Never mind my last comment.
Just to try this again (sorry for the spam)..
I'm wondering how we are going to test this.
I'm unaware of a safe way to test signal handlers that doesn't also break the test runner, but I'll try. I could add test-specific paths in the signal handler that cause it to just set a global somewhere instead of crashing the process, but that doesn't really test the "crash-and-burn" property.
…ibc methods for recording that a signal has been caught.
Okay, I updated the tests to run the original ctrlc tests. But, we can only set the signal handler once in the test runner's execution lifetime, so I think we should just leave it at that. I've manually tested that the node will print out what kind of signal caused it to die.
LGTM!
Before merging, can you add an entry to the CHANGELOG.md?
Thanks; added.
thanks for the change!
This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This PR against `develop` activates the mmap pragma on all sqlite connections. In support of this, this PR also vendors the `ctrlc` crate we had been using, and extends it to handle a SIGBUS signal.

Sqlite may trigger SIGBUS signals if the underlying database file is mmap'ed and becomes unavailable at runtime (e.g. suppose it's on a network drive and the network goes down). SIGBUS is only triggered on an attempt to read the unavailable file; an attempt to write to an invalid address that was mapped will (correctly) trigger a SIGSEGV and lead to a crash. This PR makes it so that the node treats SIGBUS like SIGTERM, SIGINT, and SIGHUP -- it triggers a graceful shutdown.
I'm open to making the node simply crash with a panic as well. In fact, I think that it would be preferable if the process synchronously terminated on SIGBUS. But, I'd like confirmation that this is desired before making this happen (since it can lead to chainstate corruption).
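For illustration only, here is a minimal sketch of what synchronous termination on SIGBUS could look like. It is not the code in this PR. It assumes a Unix target, declares the C library's `signal`, `write`, and `_exit` by hand instead of going through a crate, and hard-codes the SIGBUS number (7 on Linux x86/ARM, 10 on macOS and the BSDs). The names `on_sigbus` and `install_sigbus_handler` are hypothetical, and the `128 + signal` exit status is only the usual shell convention.

```rust
use std::os::raw::c_int;

// Assumption: SIGBUS is 7 on Linux (x86, ARM) and 10 on macOS and the BSDs.
#[cfg(target_os = "linux")]
const SIGBUS: c_int = 7;
#[cfg(not(target_os = "linux"))]
const SIGBUS: c_int = 10;

// signal(2) returns (sighandler_t)-1 on failure.
const SIG_ERR: usize = usize::MAX;

extern "C" {
    fn signal(signum: c_int, handler: usize) -> usize;
    fn write(fd: c_int, buf: *const u8, count: usize) -> isize;
    fn _exit(status: c_int) -> !;
}

/// Runs on the faulting thread, at the faulting instruction.
/// Only async-signal-safe calls (write, _exit) are used here.
extern "C" fn on_sigbus(_signum: c_int) {
    const MSG: &[u8] = b"fatal: caught SIGBUS (I/O error on an mmap'ed file?)\n";
    unsafe {
        // Best-effort report to stderr so the operator can see the cause.
        let _ = write(2, MSG.as_ptr(), MSG.len());
        // Never return: returning would re-run the faulting instruction.
        _exit(128 + SIGBUS);
    }
}

/// Hypothetical helper: installs the handler and reports whether that worked.
pub fn install_sigbus_handler() -> bool {
    let handler = on_sigbus as extern "C" fn(c_int) as usize;
    unsafe { signal(SIGBUS, handler) != SIG_ERR }
}
```

A node would call `install_sigbus_handler()` once at startup, before opening any mmap'ed database. Because the handler exits the whole process without unwinding, no destructors or shutdown hooks run, which is the source of the chainstate-corruption risk mentioned above.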
I chose to vendor `ctrlc` because (a) it's a pretty stable crate at this point -- it's only been receiving PRs to update dependency versions -- and (b) it's very simple, especially compared to alternative signal handler crates, and it already does 99% of what we need.
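To make the idea of reporting the caught signal to a user-supplied callback concrete, here is a small sketch in plain Rust. Everything in it is hypothetical: `SignalId`, `shutdown_message`, and `dispatch` are illustrative names, not the vendored crate's actual API, and no real signal handling is installed.

```rust
/// Hypothetical signal kinds; the vendored crate's real type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SignalId {
    Interrupt, // SIGINT
    Terminate, // SIGTERM
    Hangup,    // SIGHUP
    Bus,       // SIGBUS
}

impl SignalId {
    /// Conventional POSIX name of the signal.
    pub fn name(self) -> &'static str {
        match self {
            SignalId::Interrupt => "SIGINT",
            SignalId::Terminate => "SIGTERM",
            SignalId::Hangup => "SIGHUP",
            SignalId::Bus => "SIGBUS",
        }
    }

    /// True when the signal points at an I/O fault, not an operator request.
    pub fn is_io_fault(self) -> bool {
        matches!(self, SignalId::Bus)
    }
}

/// Builds the line a node could log so the operator sees why it stopped.
pub fn shutdown_message(sig: SignalId) -> String {
    if sig.is_io_fault() {
        format!(
            "Caught {}: possible system I/O error on an mmap'ed database file; shutting down",
            sig.name()
        )
    } else {
        format!("Caught {}: shutting down", sig.name())
    }
}

/// Hands the caught signal to the user-supplied callback.
pub fn dispatch<F: FnMut(SignalId)>(sig: SignalId, mut callback: F) {
    callback(sig);
}
```

For example, `dispatch(SignalId::Bus, |sig| eprintln!("{}", shutdown_message(sig)))` would print a message that names SIGBUS, which lets an operator tell an I/O failure apart from a normal SIGTERM or SIGINT shutdown.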