Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to PR#1183: use apache/parquet-testing files #1194

Merged
merged 12 commits into from
Mar 10, 2022

Conversation

stress-tess
Copy link
Member

@stress-tess stress-tess commented Mar 9, 2022

I rebased on top of master to get Ben's latest arrow error handling which is showing up here as a lot of noise (both in commits and files changed) but all of this is from master and not new functionality (this noise shouldn't show up in 1183 because it's the same git history as master)

The only changes I've made are:

  • 12d968a
    • changing "--run-optional-parquet" flag to "--optional-parquet"
  • 629636d
    • changing columns1 to not fail self.assertListEqual(columns, ans)
    • changes Parquet error message to use the right chapel format specifier

This PR updates the parquet-testing branch, so merging it will update PR #1183

bmcdonald3 and others added 6 commits March 7, 2022 11:14
The UsedModules.cfg file, which is generated from --saveUsedModules
was not being ignored by git, and I don't think we would ever want
a UsedModules.cfg pushed into the repo, so moving it into the
gitignore will resolve any annoyance from having to selectively not
include that file.
This PR adds a check for the Arrow float type on the Chapel side,
which was left out previously, allowing float data types to be
read with `ak.read_parquet()`. With this addition, the file
`alltypes_plain.parquet` from the Apache website now runs through
`ak.read_parquet()` without fail.
Add line to pass Arrow float to Chapel side
* Modified drop() to allow for row and column dropping. Added functionality for dropping rows in new function called by drop. Moved column drop to new function called by drop.

Updated the drop row algorithm to set the new column values in place using the superclass's __setitem__.

Updated docstrings to match new functionality.

Set default axis to 0 to conform to pandas.

* Added ak.DataFrame.drop() documentation.

* Resolving Merge Conflicts

* Added test to validate axis can only be 1 or 0.

* Updated code based on PR comments. Removed unneeded if blocks and a few other style changes.

* Updating unknown axis error.

* Correcting typo
In Chapel 1.26, various C types from 'SysCTypes', 'SysBasic' and
'CPtr' are being brought together into a single (new) module,
'CTypes'.  In this PR, I'm adding a new user-level 'CTypes'
compatibility module that brings the pieces that Arkouda relies upon
together so that it can use the new names and organization yet still
be compiled with older compilers.

---
Signed-off-by: Brad Chamberlain <[email protected]>
* Catch all Parquet errors and report them to the client
In the initial Parquet error handling code, the only errors that
were being reported were errors that came from status codes of
calls to the Parquet API, but all errors in the C++ code went
unhandled and would result in a server crash.

This PR wraps all the C++ functions in a try/catch to catch all
errors to be reported back to the client rather than crashing the
server. This approach seems to align with the philosophy of
"The server should never crash, errors should just be reported to
the client".

Additionally, a bug was identified with throwing errors in `forall`
loops within a function, which would cause a try/catch wrapping
the throwing function to not catch the resulting `TaskErrors` error
throw, but a workaround for that is to wrap `forall` loops in a
try/catch which just throws the error, which then allows the
overarching try/catch to catch and report the error, where previously
it would crash the server.

* Add unsupported type error message to getType function

* Remove the ARROWUNDEFINED variable and add some throws
Copy link
Contributor

@bmcdonald3 bmcdonald3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this locally and the change that is being made to switch to --optional-parquet works as expected, but there are still some failures in the file.

I would assume that is not part of this PR, so I think this is good to go, but should I be working to resolve those failures at the moment?

@stress-tess
Copy link
Member Author

stress-tess commented Mar 10, 2022

I rebased on top of master to get Ben's latest arrow error handling which is showing up here as a lot of noise (both in commits and files changed) but all of this is from master and not new functionality (this noise shouldn't show up in 1183 because it's the same git history as master)

The only changes I've made are:

  • 12d968a
    • changing "--run-optional-parquet" flag to "--optional-parquet"
  • 629636d
    • changing columns1 to not fail self.assertListEqual(columns, ans)
    • changes Parquet error message to use the right chapel format specifier

I'm not sure the column names change is what we want but with these changes it looks like we can get through the 'alltypes_plain.parquet' and 'alltypes_plain.snappy.parquet' files. We're still failing on the 'delta_byte_array.parquet' file with the following error:

[Arkouda Client] Line 333 DEBUG: sending message RequestMessage(user='', token=None, cmd='readAllParquet', format=STRING, args='True 9 1 False ["c_customer_id", "c_salutation", "c_first_name", "c_last_name", "c_preferred_cust_flag", "c_birth_country", "c_login", "c_email_address", "c_last_review_date"] | ["resources/parquet-testing/delta_byte_array.parquet"]')
[Arkouda Client] Line 447 DEBUG: [Python] Sending request: disconnect
[Arkouda Client] Line 333 DEBUG: sending message RequestMessage(user='', token=None, cmd='disconnect', format=STRING, args=None)
[Arkouda Client] Line 449 DEBUG: [Python] Received response: disconnected from arkouda server tcp://*:5555
disconnected from arkouda server tcp://*:5555
=================================================================================================== short test summary info ===================================================================================================
FAILED tests/parquet_test.py::ParquetTest::test_against_standard_files - RuntimeError: 1 errors: Error: ParquetError 161 getStrColSize:ParquetMsg Unexpected end of stream: Read 0 values, expected 1

@stress-tess stress-tess requested a review from bmcdonald3 March 10, 2022 02:49
@stress-tess stress-tess changed the title Update marker name to match decorator Changes to PR#1183 Mar 10, 2022
@stress-tess stress-tess changed the title Changes to PR#1183 Changes to PR#1183: use apache/parquet-testing files Mar 10, 2022
Copy link
Contributor

@mhmerrill mhmerrill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, look ok to me, i guess these are a bunch of changes from master into this branch.

@mhmerrill mhmerrill merged commit f4d834a into Bears-R-Us:parquet-testing Mar 10, 2022
mhmerrill pushed a commit that referenced this pull request Mar 10, 2022
* add parquet test files with license and readme

* added and skipped test against parquet standard files

* add option for running parquet file tests

* Update marker name to match decorator (#1193)

Co-authored-by: Pierce Hayes <[email protected]>

* Changes to PR#1183: use apache/parquet-testing files (#1194)

* Add UsedModules.cfg to .gitignore (#1179)

The UsedModules.cfg file, which is generated from --saveUsedModules
was not being ignored by git, and I don't think we would ever want
a UsedModules.cfg pushed into the repo, so moving it into the
gitignore will resolve any annoyance from having to selectively not
include that file.

* Add line to pass Arrow float to Chapel side
This PR adds a check for the Arrow float type on the Chapel side,
which was left out previously, allowing float data types to be
read with `ak.read_parquet()`. With this addition, the file
`alltypes_plain.parquet` from the Apache website now runs through
`ak.read_parquet()` without fail.

* Closes #1165 drop on axis (#1177)

* Modified drop() to allow for row and column dropping. Added functionality for dropping rows in new function called by drop. Moved column drop to new function called by drop.

Updated the drop row algorithm to set the new column values in place using the superclass's __setitem__.

Updated docstrings to match new functionality.

Set default axis to 0 to conform to pandas.

* Added ak.DataFrame.drop() documentation.

* Resolving Merge Conflicts

* Added test to validate axis can only be 1 or 0.

* Updated code based on PR comments. Removed unneeded if blocks and a few other style changes.

* Updating unknown axis error.

* Correcting typo

* Update Arkouda to work with CTypes (#1185)

In Chapel 1.26, various C types from 'SysCTypes', 'SysBasic' and
'CPtr' are being brought together into a single (new) module,
'CTypes'.  In this PR, I'm adding a new user-level 'CTypes'
compatibility module that brings the pieces that Arkouda relies upon
together so that it can use the new names and organization yet still
be compiled with older compilers.

---
Signed-off-by: Brad Chamberlain <[email protected]>

* Catch all Parquet errors and report them to the client (#1188)

* Catch all Parquet errors and report them to the client
In the initial Parquet error handling code, the only errors that
were being reported were errors that came from status codes of
calls to the Parquet API, but all errors in the C++ code went
unhandled and would result in a server crash.

This PR wraps all the C++ functions in a try/catch to catch all
errors to be reported back to the client rather than crashing the
server. This approach seems to align with the philosophy of
"The server should never crash, errors should just be reported to
the client".

Additionally, a bug was identified with throwing errors in `forall`
loops within a function, which would cause a try/catch wrapping
the throwing function to not catch the resulting `TaskErrors` error
throw, but a workaround for that is to wrap `forall` loops in a
try/catch which just throws the error, which then allows the
overarching try/catch to catch and report the error, where previously
it would crash the server.

* Add unsupported type error message to getType function

* Remove the ARROWUNDEFINED variable and add some throws

* add parquet test files with license and readme

* added and skipped test against parquet standard files

* add option for running parquet file tests

* Update marker name to match decorator

* updated error message string in ParquetMsg with wrong format specifier and updated columns1 to pass assertion against ans

Co-authored-by: Ben McDonald <[email protected]>
Co-authored-by: Ben McDonald <[email protected]>
Co-authored-by: reuster986 <[email protected]>
Co-authored-by: Ethan-DeBandi99 <[email protected]>
Co-authored-by: Brad Chamberlain <[email protected]>
Co-authored-by: Pierce Hayes <[email protected]>

Co-authored-by: pierce314159 <[email protected]>
Co-authored-by: Pierce Hayes <[email protected]>
Co-authored-by: Ben McDonald <[email protected]>
Co-authored-by: Ben McDonald <[email protected]>
Co-authored-by: Ethan-DeBandi99 <[email protected]>
Co-authored-by: Brad Chamberlain <[email protected]>
@stress-tess stress-tess deleted the parquet_testing branch March 16, 2022 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants