-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nallocfuzz: fuzzing engine to test allocations failure #9902
base: master
Are you sure you want to change the base?
Conversation
Very cool!! We've used a similar technique in fluent-bit as well https://github.com/fluent/fluent-bit/blob/55fc5673fc6a5e52fa49ba37cd21c44c2680408f/include/fluent-bit/flb_mem.h#L67-L74 where it has found probably 10+ null dereferences. An interesting experience was it also increased code coverage a lot, e.g. this calltree has a lot of red areas: https://storage.googleapis.com/oss-fuzz-introspector/fluent-bit/inspector-report/20230110/fuzz_report.html#Fuzzer:-flb-it-fuzz-config_map_fuzzer_OSSFUZZ but after adding possibility of failing allocations the callgraph has a lot more code explored https://storage.googleapis.com/oss-fuzz-introspector/fluent-bit/inspector-report/20230310/fuzz_report.html#Fuzzer:-flb-it-fuzz-config_map_fuzzer_OSSFUZZ From the fluent-bit experience, we rolled this technique out gradually as it can trigger a bit of issues. Another OSS-Fuzz project has used it too #8302 and from the comment it seems the technique found some issues. Should be an engine though? Maybe a custom-sanitizer would be more appropriate? or, simply something that will be enabled in ASAN runs? |
:-)
I have also done this kind of thing for one project, but this nalloc fuzz is now generic
I find that fuzz targets often have more bugs mishandling allocations failures than the projects being fuzzed
Thanks for the reference
The complexity to make it a sanitizer (or an ASAN option) is the need to hook/wrap |
By the way @DavidKorczynski you should also add that to |
Hmm -- am not sure tbh. In this engine, is the probability of failing an allocation determined by some value in the fuzzer input (i.e. In the Fluent Bit case the two important things we looked for were (1) that the fuzzer can control the probability of when to fail allocations and (2) that the forcing of allocation failures are included in the code coverage reports as it's useful to see which error handling code has been covered/not covered. We achieved (1) in the Fluent Bit case by using the first bytes of In order to get this included in the code coverage report ((2) above) would the code coverage generation have to be run using the new engine? |
Kind of, a crc32 is computed from the fuzzing data to seed some pseudo random for the allocations failures
There is an environment variable to do that with nallocfuzz Coverage reports are indeed not tested yet. |
Agreed. Those codepaths are usually undertested (or aren't tested at all). I haven't taken a look at how it's implemented yet but the implementation details aside I agree that it would be really useful.
I think there are a lot of fuzz targets where something like m = malloc(...)
assert(m) is used because they don't expect Also there are projects where allocation failures are considered fatal and handled by panicking or something like that. There should probably be a way to turn this off for projects like that. |
Thanks @evverx
Indeed like openssl for instance.
I think this should be off by default, and enabled by only the projects which are ready for it |
Agreed. I just thought that it would be opt-out but indeed it would be better if it was opt-in. cc @mrc0mmand just in case. Looking at systemd/systemd#21872 it seems you played with allocation failures in systemd. |
cc @IvanNardi as I see that there is fuzz_ndpi_reader_alloc_fail |
Thanks for pointing me this thread: interesting stuff. nDPI already uses some simple logic to fuzz allocation failures (see ntop/nDPI@ada4fe4 and ntop/nDPI@5e8c1eb); an important detail is the ability to enable/disable this feature at runtime via a trivial API: this way, for example, you can be sure that the "init/configuration" phase is always fine and trigger the failures only when processing the data (if you want that) As expected, testing allocations failure triggered a huge number of issues in the error paths (NULL deferences or simple leaks) and greatly improved the coverage. +1 on trying to integrate somehow alloc failures into oss-fuzz/asan/sanitizer/..., but definitely above my area of expertise |
Friendly ping oss-fuzz people @jonathanmetzman @oliverchang @alan32liu ? |
I finally got time to play around with this and it's definitely an interesting tool! I attempted to do something similar (albeit in a very dumbed down way) for systemd a while back and it yielded quite interesting results. A quick note: I had to monkey-patch[0] the infra helper, otherwise it kept fetching the "official" systemd build image instead of using the just built one when running The first main issue I see is that the fuzzers need to be written in a way that accepts allocation failure as a "success". For example, in systemd we make heavy use of
One option would be to ignore SIGABRTs and only look for SIGSEGVs, but unfortunately ASan throws SIGABRT if it finds something interesting as well. Now for some results: when poking around, Nallocfuzz managed to trigger a couple of interesting issues in systemd (systemd/systemd#27719), so this is definitely something worth pursuing further. Apart from the issues above I encountered a couple of segfaults I'm not sure who to blame for (yet), but I'll try to sort them out in the next couple of days. Anyway, that's all for now, thanks for the nifty tool! [0] diff --git a/infra/helper.py b/infra/helper.py
index 050f1ed5..414f66ec 100755
--- a/infra/helper.py
+++ b/infra/helper.py
@@ -679,8 +679,7 @@ def docker_run(run_args, print_output=True, architecture='x86_64'):
"""Calls `docker run`."""
platform = 'linux/arm64' if architecture == 'aarch64' else 'linux/amd64'
command = [
- 'docker', 'run', '--rm', '--privileged', '--shm-size=2g', '--platform',
- platform
+ 'docker', 'run', '--rm', '--privileged', '--shm-size=2g'
]
# Support environments with a TTY.
if sys.stdin.isatty(): |
My guess would be that |
Bingo :) And I have a weird sense of deja-vu that you've already told me about this issue in the past, hah. Anyway, as for the segfaults I don't now who to blame for, all of them have the same root cause - if int foo(...) {
...
f = fopen("foo", "r");
if (!f)
return -errno;
... and later on checks only if #include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "alloc.h"
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
FILE *f = NULL;
errno = 0;
f = fopen("foo", "r");
if (!f) {
assert(errno != 0);
return 0;
}
fclose(f);
return 0;
}
Which is both interesting and unfortunate, as it breaks a lot of stuff. |
Looks like
|
Interesting, didn't know about this. And that would also explain why I didn't encounter it with my experiments, since I always returned So, with a simple Nallocfuzz patch[0] and by replacing asserts with less aggressive error handling in one of the systemd fuzzers, everything seems to be working and running correctly, nice! [0] diffdiff --git a/nallocfuzz.c b/nallocfuzz.c
index 21f9330..fb1e254 100644
--- a/nallocfuzz.c
+++ b/nallocfuzz.c
@@ -1,3 +1,4 @@
+#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
@@ -235,6 +236,7 @@ static bool fuzz_nalloc_fail(size_t size, const char *op) {
void *calloc(size_t nmemb, size_t size) {
if (fuzz_nalloc_fail(size, "calloc")) {
+ errno = ENOMEM;
return NULL;
}
return __interceptor_calloc(nmemb, size);
@@ -242,6 +244,7 @@ void *calloc(size_t nmemb, size_t size) {
void *malloc(size_t size) {
if (fuzz_nalloc_fail(size, "malloc")) {
+ errno = ENOMEM;
return NULL;
}
return __interceptor_malloc(size);
@@ -249,6 +252,7 @@ void *malloc(size_t size) {
void *realloc(void *ptr, size_t size) {
if (fuzz_nalloc_fail(size, "realloc")) {
+ errno = ENOMEM;
return NULL;
}
return __interceptor_realloc(ptr, size); |
@mrc0mmand it would be great if you could point it to the "networkd" fuzz targets. With issues like systemd/systemd#25883 and systemd/systemd#25891 (where huge memory leaks can be triggered by friendly neighbours sending router advertisements) the scenario where memory allocations fail isn't far-fetched. I think another scenario where the OOM-killer destroys networkd (or everything slows down) is more likely though but who knows how people configure their systems? :-) |
I'm slowly going through all the systemd fuzzers and managed hit another non-systemd issue, this time in #define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
void freep(void *p) {
free(*(void**) p);
}
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
__attribute__((__cleanup__(freep))) char *str = NULL, *a = NULL;
str = calloc(size + 1, sizeof(*str));
if (!str)
return 0;
memcpy(str, data, size);
str[size] = 0;
assert(strlen(str) >= 0);
asprintf(&a, "foo %s bar", str);
return 0;
}
(This was hit by one of the resolved fuzzers - fuzz-resource-record.) Looks like we're getting all the good stuff :) |
I wonder if it's always 101? Looking at |
Yep, it's always 101. |
I guess before rolling it out globally issues like that should be weeded out first. Until then |
To judge from https://sourceware.org/git/?p=glibc.git;a=commit;h=af7f4165512ea242b5f711ee03a04f6afe22232d that whole thing was rewritten recently so whatever it is it is no longer relevant upstream :-) |
Having taking a closer look it seems it's a bug in glibc after all. That being said I'm not sure I can say that it's easy to track down the sequences of mallocs/reallocs leading to crashes and the ASan backtraces aren't particularly helpful. I'm not sure how it can be improved though. |
I guess another question would be how to make its findings reproducible. As far as I can tell in its current form it can't reliably reproduce crashes so if it was integrated into, say, some sort of CI its findings couldn't be used for regression testing (or for reproducing the same issues with the same inputs). |
The mysterious crash with no backtrace I mentioned was from 2021 as far as I can remember and I didn't mean to say that they are still like that :-) I actually somewhat keep track of Suricata because it's the only project on OSS-Fuzz where both Rust and C are used so it paved the way and it's probably the only reason why fuzzing stuff written in Rust and C isn't totally painful.
I think that "global" coverage reports on OSS-Fuzz can be built by merging several coverage reports. I'm not sure about the local "helper.py coverage" command (where building projects twice would be annoying). Then again I'm not sure whether it makes much sense to special-case it at this point because it isn't the most common scenario. Anyway let's wait for the OSS-Fuzz maintainers to weigh in here. |
In the meantime @mrc0mmand keeps finding various systemd issues using Nallocfuzz. They keep coming so instead of linking this PR to all the them https://github.com/search?q=repo%3Asystemd%2Fsystemd+Nallocfuzz&type=code can be used. Hopefully it should help to decide whether Nallocfuzz should be integrated into OSS-Fuzz or not. |
See https://github.com/search?q=nallocfuzz&type=pullrequests for more projects ;-) |
@catenacyber I wonder - given my past experiments, would it be possible to cover |
Nice, I did not know this one, just added ;-) |
/gcbrun trial_build.py flac fluent-bit libpng ndpi suricata systemd --fuzzing-engine nallocfuzz |
Running a trial build to test it in the gcloud environment : ) |
Some low hanging fruits found using nallocfuzz. See: https://github.com/catenacyber/nallocfuzz See: google/oss-fuzz#9902 Most of these errors are quite trivial to fix; the only exception is the stuff in the uthash. If the insertion fails (because of an allocation failure), we need to avoid some memory leaks. But the only way to check if the `HASH_ADD_*` failed, is to perform a new lookup: a bit costly, but we don't use that code in any critical data-path.
Some low hanging fruits found using nallocfuzz. See: https://github.com/catenacyber/nallocfuzz See: google/oss-fuzz#9902 Most of these errors are quite trivial to fix; the only exception is the stuff in the uthash. If the insertion fails (because of an allocation failure), we need to avoid some memory leaks. But the only way to check if the `HASH_ADD_*` failed, is to perform a new lookup: a bit costly, but we don't use that code in any critical data-path.
Some low hanging fruits found using nallocfuzz. See: https://github.com/catenacyber/nallocfuzz See: google/oss-fuzz#9902 Most of these errors are quite trivial to fix; the only exception is the stuff in the uthash. If the insertion fails (because of an allocation failure), we need to avoid some memory leaks. But the only way to check if the `HASH_ADD_*` failed, is to perform a new lookup: a bit costly, but we don't use that code in any critical data-path.
@alan32liu looks good, right ? |
Yep, it passed the GCB trial build. |
It's quite costly to support custom fuzzing engines on our end. Could we ask that this be integrated into FuzzBench instead? We have further work planned for this year to get the benefits of all FuzzBench engines into OSS-Fuzz. |
Thanks for your answer Oliver
Oh. How so ?
Does this include optional opt-in for some of these FuzzBench engines ? |
Can't argue with that. But it's costly for projects like systemd to support this locally as well. I think @mrc0mmand already maintains stuff that can be replaced with #7343 and that's not even a fuzzing engine. Stuff like systemd/systemd#26151, systemd/systemd#23894, systemd/systemd#23873 doesn't go through OSS-Fuzz either because custom stuff was rejected (though in that particular case I agree it would be too costly).
I'm not sure I understand how it would work but if the same issues would be reported by OSS-Fuzz as usual I think it should be fine. |
init_decoder() should not leave ctx->frame_worker partially allocated. It should fully allocate ctx->frame_worker on success, and set ctx->frame_worker to NULL on failure. This bug was found by Philippe Antoine <[email protected]> using nallocfuzz (see google/oss-fuzz#9902). Bug: aomedia:3458 Change-Id: I1ab5bb26e396f2f1d9f7e42f570563403f0e2be2
Any updates on this topic? |
@IvanNardi I have no news on oss-fuzz side, but I have been running this on some projects lately like google/wuffs#135 (comment) |
init_decoder() should not leave ctx->frame_worker partially allocated. It should fully allocate ctx->frame_worker on success, and set ctx->frame_worker to NULL on failure. This bug was found by Philippe Antoine <[email protected]> using nallocfuzz (see google/oss-fuzz#9902). Change-Id: I1ab5bb26e396f2f1d9f7e42f570563403f0e2be2
init_decoder() should not leave ctx->frame_worker partially allocated. It should fully allocate ctx->frame_worker on success, and set ctx->frame_worker to NULL on failure. This bug was found by Philippe Antoine <[email protected]> using nallocfuzz (see google/oss-fuzz#9902). Change-Id: I1ab5bb26e396f2f1d9f7e42f570563403f0e2be2
``` SCARINESS: 12 (1-byte-read-heap-buffer-overflow) #0 0x557f3a5b5100 in ndpi_get_host_domain /src/ndpi/src/lib/ndpi_domains.c:158:8 ntop#1 0x557f3a59b561 in ndpi_check_dga_name /src/ndpi/src/lib/ndpi_main.c:10412:17 ntop#2 0x557f3a51163a in process_chlo /src/ndpi/src/lib/protocols/quic.c:1467:7 ntop#3 0x557f3a469f4b in LLVMFuzzerTestOneInput /src/ndpi/fuzz/fuzz_quic_get_crypto_data.c:44:7 ntop#4 0x557f3a46abc8 in NaloFuzzerTestOneInput (/out/fuzz_quic_get_crypto_data+0x4cfbc8) ``` Some notes about the leak: if the insertion into the uthash fails (because of an allocation failure), we need to free the just allocated entry. But the only way to check if the `HASH_ADD_*` failed, is to perform a new lookup: a bit costly, but we don't use that code in the fast-path. See also efb261a Credits for finding the issues to Philippe Antoine (@catenacyber) and its `nallocfuzz` fuzzing engine See: https://github.com/catenacyber/nallocfuzz See: google/oss-fuzz#9902
``` SCARINESS: 12 (1-byte-read-heap-buffer-overflow) #0 0x557f3a5b5100 in ndpi_get_host_domain /src/ndpi/src/lib/ndpi_domains.c:158:8 #1 0x557f3a59b561 in ndpi_check_dga_name /src/ndpi/src/lib/ndpi_main.c:10412:17 #2 0x557f3a51163a in process_chlo /src/ndpi/src/lib/protocols/quic.c:1467:7 #3 0x557f3a469f4b in LLVMFuzzerTestOneInput /src/ndpi/fuzz/fuzz_quic_get_crypto_data.c:44:7 #4 0x557f3a46abc8 in NaloFuzzerTestOneInput (/out/fuzz_quic_get_crypto_data+0x4cfbc8) ``` Some notes about the leak: if the insertion into the uthash fails (because of an allocation failure), we need to free the just allocated entry. But the only way to check if the `HASH_ADD_*` failed, is to perform a new lookup: a bit costly, but we don't use that code in the fast-path. See also efb261a Credits for finding the issues to Philippe Antoine (@catenacyber) and his `nallocfuzz` fuzzing engine See: https://github.com/catenacyber/nallocfuzz See: google/oss-fuzz#9902
@oliverchang do you want to do something with this ? |
Hello @inferno-chromium @jonathanmetzman @oliverchang @alan32liu
Here is a new pull request to find new vulnerabilities : when allocations fail.
This is proposed through a fuzzing engine cf https://github.com/catenacyber/nallocfuzz
This fuzzing engine is simply libFuzzer using
LLVMFuzzerRunDriver
withalloc
hooks where we sometimes return NULL, and the rest of times alloc normallyLLVMFuzzerTestOneInput
is wrapped into a function that makes the fuzz crash and alloc failures reproducible from the second run (hence the choice of a fuzzing engine)This work is inspired by this issue in Suricata https://redmine.openinfosecfoundation.org/issues/5701 (still private for now) where an allocation failure incomplete handling led to a NULL pointer dereference (in rust code).
The issue was fixed here OISF/suricata#8379 (by setting some size to 0 as well as the pointer to NULL)
After fixing the fuzz targets cf curl/curl-fuzzer#74 this engine has found a first bug in curl after a few hours of fuzzing from the seed corpus cf curl/curl#10733 (and led to many variant findings)
What do you think of it ?
You can test it with
(or with other projects)