Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #1181: use apache/parquet-testing files #1183

Merged
merged 5 commits into from
Mar 10, 2022
Merged

Conversation

reuster986
Copy link
Collaborator

@reuster986 reuster986 commented Mar 7, 2022

This PR incorporates standard test files from github.com/apache/parquet-testing and adds a unit tests that checks ak.get_datasets and ak.read_parquet against them. Attribution and license are included.

Currently, the new unit test is skipped by default because it fails, but it can be run manually with python3 -m pytest tests/parquet_test.py --optional-parquet.

@reuster986 reuster986 requested a review from bmcdonald3 March 7, 2022 16:02
@reuster986 reuster986 linked an issue Mar 7, 2022 that may be closed by this pull request
2 tasks
Copy link
Contributor

@Ethan-DeBandi99 Ethan-DeBandi99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pytest.mark.optional_parquet needs to be updated to @pytest.mark.run_optional_parquet, or config.addinivalue_line("markers", "run_optional_parquet: mark test as slow to run") should be updated to config.addinivalue_line("markers", "optional_parquet: mark test as slow to run"). Otherwise, you will get the following warning that the custom marker is not registered:

tests/parquet_test.py:124
  /Users/ethandebandi/Documents/git/arkouda/tests/parquet_test.py:124: PytestUnknownMarkWarning: Unknown pytest.mark.optional_parquet - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
    @pytest.mark.optional_parquet

tests/conftest.py Outdated Show resolved Hide resolved
Copy link
Member

@stress-tess stress-tess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good once the marker and decorator are updated to match like Ethan suggests

tests/parquet_test.py Show resolved Hide resolved
tests/parquet_test.py Outdated Show resolved Hide resolved
* Add UsedModules.cfg to .gitignore (#1179)

The UsedModules.cfg file, which is generated from --saveUsedModules
was not being ignored by git, and I don't think we would ever want
a UsedModules.cfg pushed into the repo, so moving it into the
gitignore will resolve any annoyance from having to selectively not
include that file.

* Add line to pass Arrow float to Chapel side
This PR adds a check for the Arrow float type on the Chapel side,
which was left out previously, allowing float data types to be
read with `ak.read_parquet()`. With this addition, the file
`alltypes_plain.parquet` from the Apache website now runs through
`ak.read_parquet()` without fail.

* Closes #1165 drop on axis (#1177)

* Modified drop() to allow for row and column dropping. Added functionality for dropping rows in new function called by drop. Moved column drop to new function called by drop.

Updated the drop row algorithm to set the new column values in place using the superclass's __setitem__.

Updated docstrings to match new functionality.

Set default axis to 0 to conform to pandas.

* Added ak.DataFrame.drop() documentation.

* Resolving Merge Conflicts

* Added test to validate axis can only be 1 or 0.

* Updated code based on PR comments. Removed unneeded if blocks and a few other style changes.

* Updating unknown axis error.

* Correcting typo

* Update Arkouda to work with CTypes (#1185)

In Chapel 1.26, various C types from 'SysCTypes', 'SysBasic' and
'CPtr' are being brought together into a single (new) module,
'CTypes'.  In this PR, I'm adding a new user-level 'CTypes'
compatibility module that brings the pieces that Arkouda relies upon
together so that it can use the new names and organization yet still
be compiled with older compilers.

---
Signed-off-by: Brad Chamberlain <[email protected]>

* Catch all Parquet errors and report them to the client (#1188)

* Catch all Parquet errors and report them to the client
In the initial Parquet error handling code, the only errors that
were being reported were errors that came from status codes of
calls to the Parquet API, but all errors in the C++ code went
unhandled and would result in a server crash.

This PR wraps all the C++ functions in a try/catch to catch all
errors to be reported back to the client rather than crashing the
server. This approach seems to align with the philosophy of
"The server should never crash, errors should just be reported to
the client".

Additionally, a bug was identified with throwing errors in `forall`
loops within a function, which would cause a try/catch wrapping
the throwing function to not catch the resulting `TaskErrors` error
throw, but a workaround for that is to wrap `forall` loops in a
try/catch which just throws the error, which then allows the
overarching try/catch to catch and report the error, where previously
it would crash the server.

* Add unsupported type error message to getType function

* Remove the ARROWUNDEFINED variable and add some throws

* add parquet test files with license and readme

* added and skipped test against parquet standard files

* add option for running parquet file tests

* Update marker name to match decorator

* updated error message string in ParquetMsg with wrong format specifier and updated columns1 to pass assertion against ans

Co-authored-by: Ben McDonald <[email protected]>
Co-authored-by: Ben McDonald <[email protected]>
Co-authored-by: reuster986 <[email protected]>
Co-authored-by: Ethan-DeBandi99 <[email protected]>
Co-authored-by: Brad Chamberlain <[email protected]>
Co-authored-by: Pierce Hayes <[email protected]>
Copy link
Contributor

@mhmerrill mhmerrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok to me now

tests/parquet_test.py Show resolved Hide resolved
tests/parquet_test.py Outdated Show resolved Hide resolved
tests/parquet_test.py Show resolved Hide resolved
Copy link
Contributor

@Ethan-DeBandi99 Ethan-DeBandi99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now

@mhmerrill mhmerrill merged commit ec22bf8 into master Mar 10, 2022
@reuster986 reuster986 deleted the parquet-testing branch April 21, 2022 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test parquet against data from apache/parquet-testing
5 participants