-
Notifications
You must be signed in to change notification settings - Fork 581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2.11 RC1: Nessus Scan crash the Windows-Client. #7431
Comments
Addition: To be sure, we have scanned 3 times. |
Since we don't have access to your environment (this implies an NDA/support contract), we need all the logs, as always. Also, the entries from the windows event console. Are you 100% certain that you did restart the agent after upgrading to 2.11 RC? |
Yes I did also a reinstall of the windows Client to be sure. |
My college did the security scan yesterday between 15:31 and 15:44 for the test environment. I filtered and anonymize the icinga2.log for this period. There was also only one entry in the Windows Event log. Icinga didn't produce a crash log. For me it looks like, that the windows client has problems with the GET-requests at 15:39:21, like GET %24%7B%7B57550614+16044095%7D%7D//. After that the client stopped working. |
Thanks, the logs unfortunately don't tell much except for the connections and requests made. The last log line before the restart doesn't unveil much. I've performed the HTTP GET request against my Windows 10 VM, works as expected.
Just our of interest - I read that Nessus is a free scanner, but obviously has enterprise modules and so on. Which version of it are you using and do you think that there is a chance to reproduce this just with using the free scanner? |
Ok, than I ask my college for another scan and I try to "clean" (and stop all other Services from our other software on that Server) to get good logs; inclusive debug.log. Also I ask him which version and modules we use from Nessus. What I know we use the commercial version. |
So this is the Information I got from my colleague about the version:
logs are coming later |
So my colleague from the security department and I did a lot of scans with various scenarios. note: Scans Logs for this: an interesting fact from yesterday's scan (before wie read the issue with the debuglog) Conclusion: I hope this helps you to find the bug. If you need more, pls tell us. |
I did a little reasearch about this dll. It's from Microsoft so if this is the fault auf the crash, it's maybe a little bit difficult to fix. With google I found some postings in forums, that also other people has problems with this DLL. They also have crashes. They didn't write the reason. But maybe this is helpful for you:
|
Thanks, I'll talk to you on Tuesday when back from vacation. |
We'll continue offline. I'll update the issue once all data is collected and whether we've found something from the analysis. |
Both Stefan and Philipp are doing a marvelous job here - they provided me with a full self-written PDF on how to install and configure Nessus Essentials to trigger the full scan. Previously I tested with a macOS install, but with a wrong subnet Nessus locked me out of the trial (yeah, enterprise software which always wants you to buy something). Now I am using a CentOS 7 Vagrant box. I can see that Icinga stops running when such a scan happens. This is good news, so I can dig deeper. For security reasons, I am not sharing any details about Nessus and its configuration nor the exact logs for triggering things until I have found and implemented a fix. |
The referenced PR doesn't entirely solve the problem. We're now somewhere in the event polling loops in Boost Asio on Windows with socket cancel actions. So this definitely only is a Windows specific problem, only with the RC1 and Git Master. |
That are a Little bit bad news. Because that means for you more hard work to fix. But you can do that, we're sure. |
http::async_read_header() may return an error during liveness disconnect. For some reason, reading the header doesn't return in this case. Calling cancel() in Disconnect() does not cancel running operations, nor would it react shuttingDown. Either we're overlapping buffers here (which I highly doubt), or there's a race condition involved where Boost asio fails. Or something fails with coroutines and strands here. Upgrade to Boost 1.71 would also be a test option for next week. Code resides in bugfix/windows-nessus-crash-http now @Al2Klimov. Last resort: Disable IOCP on Windows and force to use select() for IO handling with Boost Asio. The error is somewhat different, at some point the thrown error is somehow broken. Maybe there is an overlap with the error codes as well. Looked into ceph's handling here too https://github.com/ceph/ceph/blob/6171399fdedd928b4249d135b4036e3de25079aa/src/rgw/rgw_asio_frontend.cc Btw, you cannot even go into the Windows specific DLLs to read source code. Only Assembler. Closed source sucks. |
When run within a coroutine, exceptions on Windows may influence bad behaviour here. Instead, we'll check for the error code and extract the message from memory. In contrast to exceptions which are stored on the stack frame and then return, this costs a little more memory but simplifies the logic. This doesn't fix the linked issue, but is related to the analysis. refs #7431
…them in sbin/ This comes with the motivation to avoid bundling Boost context into our main binary. Whenever the context library allocates stack memory, or does something weird, we want to see this in the stack traces as origin from an external DLL and not the main icinga2.exe binary. TL;DR - this doesn't fix #7431 but makes debugging and development a bit more easy. And it follows the Windows approach with DLLs.
Exceptions in Disconnect() might be thrown (this has been reworked into error_code locally) which are swallowed inside the Destructor for being dangerous. On the other hand, swallowing them may corrupt the stack unwinding operation from the coroutine layer. The best is to avoid Defer inside lib/remote and call Disconnect() directly after breaking from other operations. refs #7351 refs #7431
…error::operation_aborted refs #7431
…eptions This is required to - catch all exceptions and wrap them into Boost exceptions. They are the only ones allowed with Boost.Coroutine. - set a dedicated coroutine stack size for Windows. refs #7431
…eptions This is required to - catch all exceptions and wrap them into Boost exceptions. They are the only ones allowed with Boost.Coroutine. - set a dedicated coroutine stack size for Windows. refs #7431
TL;DR - 2 weeks with emotional ups and downs and a lot of technical background learned. Icinga now survives a Nessus scan. AnalysisSetupScanner SetupBoth Nessus and OpenVAS work. Thanks to Stefan & Philipp for providing a setup documentation for Nessus Essential. Won't be provided for security reasons here. Wireshark TCP Dumphttps://www.netways.de/blog/2018/07/26/ssl-geknackt-naja-fast/ Change the cipher_list, add the node's private key for following the TLS traffic, extract the sent malformed requests. Request AnalysisOnly failing requests would lead to stack buffer underrun or memory corruption errors. Memory Analysis
OriginBoost Beast HTTP Parserhttp::async_read_header may throw exceptions leading into crashes. In order not to populate the exception Boost ASIO Socket IOSame strand in m_IoStrand as with other operations. No special influence here. boostorg/asio#154 http://www.crazygaze.com/blog/2016/03/17/how-strands-work-and-why-you-should-use-them/ Boost CoroutinesASIO wraps them and makes them easier to use. https://stackoverflow.com/questions/30557582/what-does-boostasiospawn-do Coroutine Stack Unwinding on Complete
Swallowing exceptions is evil with Boost Coroutines. Root CauseCoroutines work in the way that they allocate their own stack. Whenever an exception is thrown, Exceptions are asynchronous and put onto the stack. This stack is also used by the Coroutine to At some point, the thrown exceptions corrupt the stack. Only on Windows, only with Visual Studio as compiler. In order to trigger this problem, either fire many Nessus Scan Attacks against the HTTP API. The easiest way to reproduce this is to use the debug console and evaluate a script expression
Coroutines and Context StackOur (exception) stack is foobar at some point. Either Beast exceptions throw, our own ScriptError exceptions throw and something asio related itself. One can manipulate the coroutine stack size allocated by Boost.Context. This defaults to 64KB.
https://gist.github.com/chfast/bfc1e1af4176960b4722#file-context-cpp-L85
Sets the default stack size to 8 MB. 64MB. Does not immediately crash, but later. https://www.boost.org/doc/libs/1_68_0/libs/context/doc/html/context/stack.html Exceptions
https://asio-users.narkive.com/QP9g9dj9/can-t-catch-boost-asio-exceptions-inside-coroutines
http://boost.2283326.n4.nabble.com/context-Crash-in-case-of-exception-from-a-context-td4672953.html
https://www.acodersjourney.com/top-15-c-exception-handling-mistakes-avoid/
https://docs.microsoft.com/en-us/cpp/cpp/exceptions-and-stack-unwinding-in-cpp?view=vs-2019
https://groups.google.com/forum/#!topic/boost-list/6oHRdLmuQ5k
Incompatibility between std, boost and ScriptError exceptions?I've found a library which wraps the coroutine exceptions from their different types. The implementation is safe and good to use - since I've now learned that coroutines only support Compiler OptimizationThe Visual Studio Compiler optimizes several bits in a way we as developers do not expect it. In terms of Boost, we need to disable certain flags already:
These optimizations don't have much of influence here. Instead, I've found some compiler optimizations going on to reduce the binary sizes. https://habr.com/en/company/microsoft/blog/442996/
ASIO IO Threads manipulating memory?https://stackoverflow.com/questions/14981291/boostasio-internal-threads
Tested this, but it did not fix the crash. Won't apply the patch on Windows therefore. Windows StackWell, turns out that the default allocated stack size for a coroutine is 64 KB. Whenever it has exceptions inside, this stack may be too low. Raising the default coroutine stack on Windows actually fixes the problem. https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019 Final Analysis
Whenever valid requests are made, everything is fine.
Boost Beast http::async_read_header can throw an exception. We can re-build this into error codes. Doesn't change. Not all of them are boost compatible exceptions.
This includes a stack deallocate() call which calls std::free(). This crashes with pointing to memory which is corrupted.
Yes. Rewriting EnsureValidHeaders() to not use these exceptions allows the code to complete. No fucking idea what the Visual Studio compiler renders from this code bit. It may originate from static INLINE too.
The stack size with 64 KB is too low. Why? Look into what's stored inside the exception. The whole DebugInfo struct is massively bloated. It may also be the case that exceptions inside coroutines cause an overhead which is
Yes, Boost Coroutine doesn't work with the unwind exception.
Yes, these two patterns are incompatible within the new network stack. Avoid them at all cost.
Tests have proven something between 8 and 64 MB. This still is too high, but mitigates the issue. FixesEnsure that Coroutine Stack unwinding is fully supported within the API.
Ensure that HTTP Disconnects don't fire I/O.
Wrap thrown exceptions into Boost compatible exceptions in boost::asio::spawn().
Side Fixes
References and URLsEverything I have had opened and shed some light into finding the cause. https://geidav.wordpress.com/2013/03/21/anatomy-of-dynamic-stack-allocations/ Librarieshttps://github.com/owt5008137/libcopp Google SearchesNewest first.
|
Thank you very much @dnsmichi, @Al2Klimov and everybody in background who helped to fix this. That's really amazaing that you find the cause of this Problem. Thank you for your hard and awesome work! You have our big respect! Chapeau! |
Thanks for the kind feedback, now we can start preparing the 2.11 release. As written on Twitter already, the intention with all the details is to not only help an Icinga developer, but also everyone developing with Boost, Coroutines, etc. on Windows and debugging the same issue. Sharing knowledge is what makes open source strong, not only the source code. |
Describe the bug
We tested the 2.11 RC1 against the Nessus Scan again. So it's a litte bit a follow up from #6559: the Linux Client survived on CentOS 7 - Great! But not on our Windows System. Here the service is stopped after the scan. 2.11 RC1 is installed on a Windows Server 2012 R2 Standard.
If you Need something more pls tell us.
To Reproduce
Scan with a security scanner against a Windows system
Expected behavior
The Windows Client should also not Crash like the Linux Client.
Your Environment
Include as many relevant details about the environment you experienced the problem in
icinga2 --version
): 2.11 RC1The text was updated successfully, but these errors were encountered: