Fuzz tests for Arrow/Parquet #5332

Open
emkornfield opened this issue Jan 25, 2024 · 7 comments
Labels
enhancement Any new improvement worthy of an entry in the changelog

Comments

@emkornfield
Contributor

I'm fairly new to the code base, but it looks like the existing fuzz tests might be generating random data and making sure it can be read back; we don't seem to have any fuzz tests for malformed data. I think in the context of Rust the goal would be to avoid panics?

If this is accurate, I'd propose creating fuzz tests that succeed as long as there is no panic. A starting corpus could be taken from the arrow-testing repo for each file type.

A second step would be to integrate with oss-fuzz (Arrow C++ already does this).

If this is of interest, and I'm correct that this isn't already done, I can try to prototype something.

@tustvold
Contributor

It isn't an issue IMO if the reader panics on malformed data; this is perfectly safe and well-defined behaviour. We should try to avoid it, but it's not like UB, where it would indicate a bug. Panics are just exceptions.

The bigger issue with untrusted/malicious inputs is avoiding the reader getting stuck in infinite loops or exploding the memory usage. I'm not sure how easy such things are to catch using a fuzz testing framework.

With regard to parquet, I can't help feeling the format is sufficiently complex that supporting untrusted input is essentially a fool's errand though...

That's all to say that adding fuzzing support would be a nice addition. I'm not too familiar with the Rust ecosystem's support for it, but @crepererum may know more.


@crepererum
Contributor

Fuzzing is a good thing, even when you accept panics as an outcome. The fuzzer then has to wrap the method call accordingly.

Regarding the toolchain: we should use cargo-fuzz, which gives us the option to use multiple fuzzers. Then you need to choose a fuzzer. I would suggest libFuzzer, which ships with LLVM, since it is the least invasive option; however, it has entered maintenance mode. I see the following options:

That said, the choice can easily be changed later, since the fuzzer is effectively just a "run some code on this blackbox [u8] input" (like "parse parquet from bytes").
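To make the "run some code on this blackbox [u8] input" idea concrete, here is a rough sketch of what such a cargo-fuzz/libFuzzer target could look like. It assumes the target lives under fuzz/fuzz_targets/ and uses the parquet crate's ParquetRecordBatchReaderBuilder; treat it as a starting point rather than the final harness.

```rust
// fuzz/fuzz_targets/parquet_read.rs
// Sketch: feed arbitrary bytes to the Parquet -> Arrow reader and discard
// all errors; only crashes (and panics, if not caught) are reported.
#![no_main]

use bytes::Bytes;
use libfuzzer_sys::fuzz_target;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fuzz_target!(|data: &[u8]| {
    // Parsing the footer/metadata happens in try_new(); errors are expected
    // for malformed input and are simply ignored.
    if let Ok(builder) = ParquetRecordBatchReaderBuilder::try_new(Bytes::copy_from_slice(data)) {
        if let Ok(reader) = builder.build() {
            for batch in reader {
                // Decode errors are also ignored; we only care about crashes.
                let _ = batch;
            }
        }
    }
});
```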

Footnotes

  1. Note that AFL was abandoned for a while, but development is now open and active under the AFL++ project.

@emkornfield
Contributor Author

We've had a good experience in parquet-cpp of integrating with https://github.com/google/oss-fuzz. I haven't looked into what supporting Rust would look like exactly, but that seems like a reasonable place to start?

I think one important item is the philosophical question of whether malformed data should trigger a panic. Can I interpret the comment above:

"We should try to avoid it, but it's not like UB, where it would indicate a bug. Panics are just exceptions."

as meaning that if the fuzzer uncovers a panic, then a patch to avoid the panic would be accepted?

@crepererum
Contributor

We've had a good experience in parquet-cpp of integrating with google/oss-fuzz. I haven't looked into what supporting Rust would look like exactly, but that seems like a reasonable place to start?

Before we integrate anything into oss-fuzz, we need to write the fuzz harness. We also need to fix the bugs that occur when you run it for an hour or so on your developer machine (which can be a laptop), just to make sure we don't end up with spam when handing it over to oss-fuzz.

Have a look at https://rust-fuzz.github.io/book/introduction.html. Despite what I said in #5332 (comment), I think combining that with libFuzzer is still the best choice¹. It results in a reasonably efficient fuzzer with a fairly low barrier to entry. You may want to combine this with the arbitrary crate if you need more than just a few bytes, e.g. if you also want to fuzz parameters or something.
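For the "fuzz parameters" case, here is a rough sketch using the arbitrary crate (with its derive feature) so the fuzzer mutates a reader setting alongside the raw bytes; the Input struct and the choice of batch size as the fuzzed parameter are purely illustrative.

```rust
// fuzz/fuzz_targets/parquet_read_params.rs
// Sketch: structured input via `arbitrary`, so the fuzzer varies both the
// file bytes and a reader parameter (batch size here, purely illustrative).
#![no_main]

use arbitrary::Arbitrary;
use bytes::Bytes;
use libfuzzer_sys::fuzz_target;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

#[derive(Debug, Arbitrary)]
struct Input {
    // Bytes interpreted as a Parquet file.
    file: Vec<u8>,
    // Fuzzed reader parameter; clamped below so it stays non-zero.
    batch_size: u16,
}

fuzz_target!(|input: Input| {
    let batch_size = usize::from(input.batch_size).max(1);
    let data = Bytes::from(input.file);
    if let Ok(builder) = ParquetRecordBatchReaderBuilder::try_new(data) {
        if let Ok(reader) = builder.with_batch_size(batch_size).build() {
            for batch in reader {
                let _ = batch; // decode errors are expected and ignored
            }
        }
    }
});
```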

I think one important item is the philosophical question of whether malformed data should trigger a panic. Can I interpret the comment above:

"We should try to avoid it, but it's not like UB, where it would indicate a bug. Panics are just exceptions."

as meaning that if the fuzzer uncovers a panic, then a patch to avoid the panic would be accepted?

I think there are two levels of safety here:

  1. no crash: It may panic, but it must not segfault or similar. We have some unsafe code in our code base, and even with safe code you can still trigger out-of-memory errors (e.g. when the input file header says there are 10000 TB of data but the file is only 1 kB large, you shouldn't try to allocate a 10000 TB vector/buffer). I think everyone agrees that this is the absolute minimum we have to provide.
  2. no panic: Here different developers may have different opinions. I personally think avoiding panics in a bunch of seemingly random kernel computations (esp. in the DataFusion context) is hard, but merely deserializing a parquet file to arrow should never panic and should only return proper errors.

That said, I would be OK if we start with a fuzz harness that catches panics, stabilize that (e.g. it doesn't crash if you run it on your machine for a while), and then in a follow-up remove the panic-catch and see how far we get.
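A rough sketch of that first, panic-tolerant stage follows. It assumes the target is built with the default panic = unwind; note that, as far as I know, libfuzzer-sys aborts in its panic hook so that libFuzzer sees panics as crashes, so that hook has to be silenced for catch_unwind to have any effect. Removing the wrapper (and the hook override) later turns this into the stricter "no panic" harness.

```rust
// fuzz/fuzz_targets/parquet_read_no_crash.rs
// Sketch of the "no crash" level: panics are tolerated, so the fuzzer only
// reports real crashes (aborts, OOM, issues found by sanitizers).
#![no_main]

use std::panic::{self, AssertUnwindSafe};
use std::sync::Once;

use bytes::Bytes;
use libfuzzer_sys::fuzz_target;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

static SILENCE_PANIC_HOOK: Once = Once::new();

fuzz_target!(|data: &[u8]| {
    // The default panic hook would abort before unwinding begins; replace it
    // once so that catch_unwind below actually gets a chance to run.
    SILENCE_PANIC_HOOK.call_once(|| panic::set_hook(Box::new(|_| {})));

    let data = Bytes::copy_from_slice(data);
    // Any panic is swallowed here; dropping this wrapper (and the hook
    // override above) later upgrades the harness to the "no panic" level.
    let _ = panic::catch_unwind(AssertUnwindSafe(|| {
        if let Ok(builder) = ParquetRecordBatchReaderBuilder::try_new(data) {
            if let Ok(reader) = builder.build() {
                for batch in reader {
                    let _ = batch;
                }
            }
        }
    }));
});
```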

Footnotes

  1. I was somewhat hoping that libAFL would evolve a bit faster, but you still have to build it manually, and the last time I checked -- which was a few months ago -- it was rather complicated to integrate into cargo-fuzz, even with the libFuzzer shim.

@jp0317
Contributor

jp0317 commented Nov 6, 2024

avoiding panics in a bunch of seemingly random kernel computations (esp. in the DataFusion context) is hard,

I agree that it's tricky to fully eliminate panics, though fuzzing can help reduce them. IMHO the current code includes many unnecessary panics: e.g., overflow panics from not using checked operations (shift, add, etc.), and explicit panics from unimplemented!(), assert!(), and panic!().

It seems we can either replace those with proper errors, or maybe just add some sanity checks while keeping the panicking code, e.g., for the bit_width panic we could reject an invalid bit_width earlier so that the panicking code never gets executed.
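To make the two options concrete, here is a small self-contained sketch; the error type, the bit-width check, and the mask helper are invented for illustration and do not mirror the crate's actual code paths.

```rust
// Illustrative only: the error type and helpers below are made up and do not
// correspond to specific functions in the parquet crate.

#[derive(Debug)]
struct DecodeError(String);

/// Second option: validate untrusted metadata up front, so later code that
/// still uses plain (panicking) operations never sees an invalid value.
fn validate_bit_width(bit_width: u8) -> Result<u8, DecodeError> {
    if u32::from(bit_width) > u64::BITS {
        return Err(DecodeError(format!("invalid bit width: {bit_width}")));
    }
    Ok(bit_width)
}

/// First option: use checked operations and return a proper error instead of
/// relying on the debug-mode overflow panic.
fn mask_for_bit_width(bit_width: u8) -> Result<u64, DecodeError> {
    if u32::from(bit_width) == u64::BITS {
        return Ok(u64::MAX);
    }
    1u64.checked_shl(u32::from(bit_width))
        .map(|v| v - 1)
        .ok_or_else(|| DecodeError(format!("bit width {bit_width} out of range")))
}

fn main() {
    assert_eq!(mask_for_bit_width(3).unwrap(), 0b111);
    assert!(mask_for_bit_width(200).is_err()); // error instead of a panic
    assert!(validate_bit_width(65).is_err()); // rejected before any shift runs
}
```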

@westonpace
Member

With regard to parquet, I can't help feeling the format is sufficiently complex that supporting untrusted input is essentially a fool's errand though...

FWIW, I have always assumed proper handling of untrusted inputs to be an intentional feature of Parquet. Many data services accept Parquet as input, and being an open format it has become a de facto interchange format between systems.

I do agree with Rafael, though, that infinite loops and memory overflows are in a more severe category than panics, and that it would be acceptable (though mildly unfortunate) to clearly state that invalid input may cause panics and that users should wrap parquet routines in catch_unwind.

In a way, this is actually similar to C++, where Parquet throws exceptions while the rest of the Arrow library uses Result. Arrow C++ has the equivalent of "catch unwinds" (try/catch) at all the boundaries.
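A rough sketch of what such a boundary could look like on the Rust application side, converting a potential panic inside the Parquet path into an ordinary error at the call site; read_batches_no_panic is an illustrative helper (assuming arrow-array and the parquet crate's arrow reader), not an API of the crate itself.

```rust
// Application-side boundary sketch: translate a panic in the Parquet path
// into a regular error, analogous to the try/catch boundaries in Arrow C++.
use std::fs::File;
use std::panic::{self, AssertUnwindSafe};

use arrow_array::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::errors::ParquetError;

/// Illustrative helper, not part of the parquet crate's API.
fn read_batches_no_panic(path: &str) -> Result<Vec<RecordBatch>, ParquetError> {
    let file = File::open(path).map_err(|e| ParquetError::General(e.to_string()))?;
    // AssertUnwindSafe because the file handle is moved into the closure.
    panic::catch_unwind(AssertUnwindSafe(|| {
        let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
        reader
            .collect::<Result<Vec<_>, _>>()
            .map_err(|e| ParquetError::General(e.to_string()))
    }))
    .unwrap_or_else(|_| Err(ParquetError::General("panicked while reading parquet".into())))
}

fn main() {
    // Malformed or truncated input surfaces as Err, never as an unwinding panic.
    match read_batches_no_panic("untrusted.parquet") {
        Ok(batches) => println!("read {} record batches", batches.len()),
        Err(e) => eprintln!("failed to read parquet: {e}"),
    }
}
```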

@tustvold
Contributor

Many data services accept Parquet as input, and being an open format it has become a de facto interchange format between systems.

Different applications will have different threat models and should make their own judgements, but I would certainly encourage any application accepting truly untrusted parquet data to rewrite the files in a sandboxed environment before handing them off to other systems. This is fairly standard practice for other media files, e.g. images, video, etc., even where there are extremely mature and well-tested transcoders.

However, many systems will instead be accepting files from other internal systems, at which point perhaps the threat model is different.

Security concerns aside, I would recommend rewriting parquet files anyway because of the sheer variety of parquet implementations: two files with the same data may behave very differently depending on how they've been written.
