
[FEA] Implement cuIO fuzz tests #6001

Closed · 7 tasks done
vuule opened this issue Aug 17, 2020 · 14 comments · Fixed by #6571
Labels: cuIO (cuIO issue) · feature request (New feature or request) · Python (Affects Python cuDF API) · tests (Unit testing for project)

Comments

@vuule (Contributor) commented Aug 17, 2020

Depends on #6000

Python tests:

To get started, tests can vary only the data profile and keep the reader/writer parameters at their defaults.

- cuIO fuzz tests reader
- cuIO fuzz tests

@vuule added the feature request and Needs Triage labels on Aug 17, 2020
@harrism added the cuIO and tests labels on Aug 17, 2020
@vuule added the tech debt label and removed the Needs Triage label on Aug 18, 2020
@vuule (Contributor, Author) commented Aug 18, 2020

In this design, the data generators are implemented only in C++, including the parameter serialization to/from an XML file. The tests implemented in Python/Java are supposed to use the generator API through interop. The idea is to avoid duplicating the serialization code and to get the best performance when generating inputs. @kkraus14, does this sound viable, at least for Python?
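
To make the interop contract concrete, here is a pure-Python mock of the profile round-trip. `DataProfile`, its fields, and the XML layout are all illustrative stand-ins; the real serializer would live on the C++ side.

```python
# Hypothetical sketch: the shape of a data profile and its XML round-trip.
# Field names and XML structure are assumptions, not the actual cuIO API.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class DataProfile:
    seed: int
    num_rows: int
    null_probability: float

def profile_to_xml(profile: DataProfile) -> str:
    root = ET.Element("data_profile")
    for name, value in vars(profile).items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

def profile_from_xml(text: str) -> DataProfile:
    root = ET.fromstring(text)
    return DataProfile(
        seed=int(root.findtext("seed")),
        num_rows=int(root.findtext("num_rows")),
        null_probability=float(root.findtext("null_probability")),
    )

xml_text = profile_to_xml(DataProfile(seed=42, num_rows=1000, null_probability=0.1))
assert profile_from_xml(xml_text) == DataProfile(42, 1000, 0.1)  # lossless round-trip
```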

@vuule (Contributor, Author) commented Aug 18, 2020

@nvdbaranec, you suggested that there needs to be a way to replay failing tests in a C++ environment to facilitate debugging. The design here should allow that, since the data generators are deterministic and the data profile and seed can be saved and passed to a C++ API to regenerate the same input.
Do you have any feedback/concerns about the approach?
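
A minimal sketch of the determinism property this replay workflow relies on, with numpy standing in for the real C++ generators; `generate_column` and its parameters are hypothetical.

```python
# Sketch: the same profile + seed must regenerate bit-identical input,
# so a failing case can be replayed from just (profile, seed).
import numpy as np

def generate_column(seed: int, num_rows: int, null_probability: float):
    rng = np.random.default_rng(seed)
    values = rng.integers(0, 1_000_000, size=num_rows)
    null_mask = rng.random(num_rows) < null_probability
    return values, null_mask

a = generate_column(seed=1234, num_rows=10_000, null_probability=0.2)
b = generate_column(seed=1234, num_rows=10_000, null_probability=0.2)
assert (a[0] == b[0]).all() and (a[1] == b[1]).all()  # identical replay
```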

@nvdbaranec (Contributor) commented:

That seems fine to me. It would be particularly useful if any failure message also included a quick blurb about how to reproduce. Something like:

"Test XYZ failed:
[test output]

Test was run with these parameters [5236783, 34217, etc.]. To reproduce, call ABC() with these values."
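
A hedged sketch of building such a failure blurb; `format_failure`, `replay_fuzz_case`, and the parameter names are hypothetical.

```python
# Sketch: attach a ready-to-paste reproduction hint to every failure message.
def format_failure(test_name: str, output: str, seed: int, profile_id: int) -> str:
    return (
        f"Test {test_name} failed:\n"
        f"{output}\n\n"
        f"Test was run with seed={seed}, profile_id={profile_id}. "
        f"To reproduce, call replay_fuzz_case(seed={seed}, profile_id={profile_id})."
    )

print(format_failure("parquet_reader_fuzz", "column 3: values mismatch", 5236783, 34217))
```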

@vuule (Contributor, Author) commented Aug 18, 2020

Yeah, I left the call parameters out of the diagram and the description here because we can get started with random data + defaults. You're right, the API and its parameters should definitely be included.
In the ideal case we would save the call parameters in a machine-readable format and have a utility that executes the test based on all of this input.
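
A minimal sketch of that idea, assuming a simple JSON layout; the file format and field names here are illustrative, not a committed design.

```python
# Sketch: persist everything needed to rerun one failing fuzz case,
# then load it back for a replay utility to dispatch on.
import json

def save_repro(path, api, seed, call_params):
    """Record the API under test, the generator seed, and the call parameters."""
    with open(path, "w") as f:
        json.dump({"api": api, "seed": seed, "params": call_params}, f, indent=2)

def load_repro(path):
    with open(path) as f:
        return json.load(f)

save_repro("repro_001.json", "read_parquet", 5236783, {"skiprows": 10, "num_rows": 2})
print(load_repro("repro_001.json"))
```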

@nvdbaranec (Contributor) commented:

A random comment on the kinds of things that seem to dig up bugs:

  • Nulls and empties, especially in combination.

  • Tweakable format meta-parameters such as page size and row group size in Parquet: things that don't necessarily affect the format itself but squeeze the data into interestingly small or large blobs (see the sketch after this list). The classic example from the Parquet reader was page sizes sufficiently small to cause entire nested rows to neither begin nor end within a single page. I also strongly suspect we will see some (fixable) performance issues with very large page sizes.

  • Very large and very small files.

  • Tiny or otherwise fiendishly constructed reads: reading 1 row out of millions in the middle of a file, or reading 2 rows that deliberately span units of parallelization (e.g., 1 block per page in Parquet).

  • Super-compressible vs. terribly-compressible data: billions of rows all containing the same value from a dictionary, that kind of thing.

  • Things that deliberately break the reader, like files that will cause out-of-memory issues. The idea is to figure out how gracefully things fail, and how we might produce more useful output when they do.

  • Files with huge numbers of columns vs. huge numbers of rows.
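
As a sketch of the meta-parameter bullet above: pyarrow's Parquet writer exposes `row_group_size` and `data_page_size`, which can be shrunk so that rows straddle page and row-group boundaries. The specific values are illustrative.

```python
# Sketch: write the same data with deliberately tiny pages and row groups
# so the reader must handle rows that span parallelization units.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "a": list(range(100_000)),
    "s": ["x" * 100] * 100_000,  # wide strings inflate page counts further
})

pq.write_table(
    table,
    "tiny_pages.parquet",
    row_group_size=500,    # many small row groups
    data_page_size=1024,   # ~1 KB pages force frequent page breaks
)
```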

@nvdbaranec (Contributor) commented:

"In the ideal case we would save the call parameters in a machine readable format and have a utility that executes the test based on all of this input."

I am a big fan of being able to reproduce the testing outside of the testing framework itself. I like working in my own little (debuggable) seperate test app. So consideration for providing simple code hooks that can be called would be great.

@vuule (Contributor, Author) commented Aug 18, 2020

Out of the cases you listed, I see a potential problem with one:

"Reading 1 row out of millions in the middle of a file, or reading 2 rows that deliberately span units of parallelization (e.g., 1 block per page in Parquet)."

For this type of testing (en masse) we have to rely on the reference implementation to validate the output, and I don't think the other Parquet implementations support row selection (outside of row group selection). I guess in this case we can crop the reference output before comparison (see the sketch below).
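
A sketch of that cropping approach, using pyarrow as the reference and assuming the cuDF reader exposes `skiprows`/`num_rows` row selection (as the Python `read_parquet` API did at the time).

```python
# Sketch: read the whole file with the reference reader, slice out the
# requested row window, and compare against cudf's row-selected read.
import cudf
import pyarrow.parquet as pq

def check_row_window(path, skiprows, num_rows):
    got = cudf.read_parquet(path, skiprows=skiprows, num_rows=num_rows)
    ref = pq.read_table(path).slice(skiprows, num_rows).to_pandas()
    assert got.to_pandas().reset_index(drop=True).equals(ref.reset_index(drop=True))
```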

I think we will hit all of the listed corner cases eventually with random parameter sampling as long as we set up the data profile parameters' ranges correctly.

@nvdbaranec (Contributor) commented Aug 18, 2020

True. If this really saturates the parameter space (implying long run times at scale, as you mentioned), it'll probably catch the interesting stuff. I would reiterate the point about making sure we parameterize things that aren't specifically the data itself (e.g., page sizes).

@nvdbaranec (Contributor) commented:

Thought of another thing. This is probably not super relevant right now, but if we ever did overlapped reads with decoding (that is, reading/decompressing blob N+1 while decoding blob N), having tests that use both memory-mapped files and true spinning-disk files might shake out some race conditions.
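
A sketch of exercising both I/O paths against the same file; it assumes the reader accepts an in-memory buffer as well as a path, which `cudf.read_parquet` does for file-like objects.

```python
# Sketch: feed the same reader through a plain-path read and a
# memory-mapped read, and check the results agree.
import io
import mmap
import cudf

def read_both_ways(path):
    from_path = cudf.read_parquet(path)
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        from_mmap = cudf.read_parquet(io.BytesIO(m[:]))
    assert from_path.to_pandas().equals(from_mmap.to_pandas())
```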

@nvdbaranec (Contributor) commented:

Item: One of the tests I added to the Python code for the Parquet reader ran long enough that it occasionally caused the build machines to time out. The conclusion was that in the future we should be doing this kind of exhaustive/stress testing in the C++ tests instead of Python, which makes sense.

As such, it would be pretty handy if whatever reference reader/writer code you use for this new test framework were also readily available in the normal cudf test framework.

@kkraus14 (Collaborator) commented:

"Item: One of the tests I added to the Python code for the Parquet reader ran long enough that it occasionally caused the build machines to time out. The conclusion was that in the future we should be doing this kind of exhaustive/stress testing in the C++ tests instead of Python, which makes sense."

Just a note that the 5-minute timeout is specifically defined for the Python unit tests and isn't a limitation of the CI framework/machine.
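
For reference, one common way such a per-test limit is declared is the pytest-timeout plugin; whether cuDF's suite uses this exact mechanism is an assumption here, and the test body is a placeholder.

```python
# Sketch: a per-test cap via pytest-timeout; 300 s mirrors the 5-minute
# limit mentioned above. Requires the pytest-timeout plugin to be installed.
import pytest

@pytest.mark.timeout(300)
def test_parquet_fuzz_smoke():
    ...  # hypothetical fuzz smoke test
```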

galipremsagar added a commit that referenced this issue on Nov 12, 2020

Partially resolves #6001, #6260. This PR:

- Adds support for ORC fuzz workers (both reader and writer)
- Utilizes pyorc to write a pandas DataFrame to ORC files
- Adds varying test parameter combinations for cudf.read_orc and df.to_orc
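
A minimal sketch of the pyorc writer path the PR describes: write a pandas DataFrame to ORC with pyorc, then read it back with cudf. The hand-written schema string stands in for schema inference the real worker would do.

```python
# Sketch: pyorc as the reference ORC writer, cudf as the reader under test.
import pandas as pd
import pyorc
import cudf

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

with open("ref.orc", "wb") as f:
    with pyorc.Writer(f, "struct<a:bigint,b:string>") as writer:
        # Convert numpy scalars to plain Python types for pyorc.
        writer.writerows(
            (int(a), str(b)) for a, b in df.itertuples(index=False, name=None)
        )

got = cudf.read_orc("ref.orc")
assert got.to_pandas().equals(df)  # round-trip check
```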