Return empty dataframe when reading a Parquet file using empty columns option (#11018)
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           branch-22.08    #11018   +/- ##
===============================================
  Coverage              ?    86.34%
===============================================
  Files                 ?       144
  Lines                 ?     22741
  Branches              ?         0
===============================================
  Hits                  ?     19635
  Misses                ?      3106
  Partials              ?         0
```
Continue to review the full report at Codecov.
|
I've tried out your changes locally; libcudf behavior now looks as expected.
These are the corresponding additional Cython/Python changes needed to achieve feature parity with pandas.
Co-authored-by: GALI PREM SAGAR <[email protected]>
…om/andygrove/cudf into bug-pq-reader-empty-col-names
Minor nit, but this looks good to me. A clean solution to the problem.
```diff
- * @return Names of column to be read
+ * @return Names of column to be read; `nullopt` if the option is not set
```
Should this be `std::nullopt`, to be clear?
```java
/**
 * Read parquet formatted data.
 * @param opts various parquet parsing options.
 * @param buffer raw parquet formatted bytes.
 * @return the data parsed as a table on the GPU.
 */
public static Table readParquet(ParquetOptions opts, byte[] buffer) {
  if (opts.getIncludeColumnNames().length == 0) {
```
This looks like an undesirable change, and it is reflected in the corresponding update to the tests. If the caller must specify all columns to read from the file, how does the caller read a Parquet file when they don't know the columns in the file (e.g., dynamic schema discovery)? This used to be possible with the Java API but now seems like it isn't.
```java
.includeColumn("loan_id")
.includeColumn("orig_channel")
.includeColumn("orig_interest_rate")
.includeColumn("orig_upb")
.includeColumn("orig_loan_term")
.includeColumn("orig_date")
.includeColumn("first_pay_date")
.includeColumn("orig_ltv")
.includeColumn("orig_cltv")
.includeColumn("num_borrowers")
.includeColumn("dti")
.includeColumn("borrower_credit_score")
.includeColumn("first_home_buyer")
.includeColumn("loan_purpose")
.includeColumn("property_type")
.includeColumn("num_units")
.includeColumn("occupancy_status")
.includeColumn("property_state")
.includeColumn("zip")
.includeColumn("mortgage_insurance_percent")
.includeColumn("product_type")
.includeColumn("coborrow_credit_score")
.includeColumn("mortgage_insurance_type")
.includeColumn("relocation_mortgage_indicator")
.includeColumn("quarter")
.includeColumn("seller_id")
```
This should not be necessary. There needs to be a way to read all columns of a Parquet file without prior knowledge of all column names.
I'm not sure how to achieve this using `includeColumn`, but having `includeColumns` take an array of names allows this behavior: skip the call to `includeColumns` if you want to read all columns. If you want to read none, pass an empty array.
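The distinction described above (no call reads all columns, an explicitly empty list reads none) can be sketched with `std::optional`. This is a standalone illustration of the idea only, not the actual cudf API; the `ReaderOptions` struct and `select_columns` helper below are hypothetical names.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reader options distinguishing three states:
// unset (read all), empty (read none), non-empty (read the named columns).
struct ReaderOptions {
  std::optional<std::vector<std::string>> columns;  // nullopt = never set

  void include_columns(std::vector<std::string> names) {
    columns = std::move(names);  // an empty vector is a valid, distinct state
  }
};

// Decide which columns to read from a file with the given schema.
std::vector<std::string> select_columns(const ReaderOptions& opts,
                                        const std::vector<std::string>& schema) {
  if (!opts.columns.has_value()) {
    return schema;  // option never set: read every column in the file
  }
  return *opts.columns;  // set (possibly empty): read exactly what was asked
}
```

With this layout, skipping `include_columns` entirely yields all columns, while an explicit empty call yields an empty result, which matches the distinction argued for in this thread.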
Ideally this change should not have a significant impact on the API. If no columns are specified (i.e., no `includeColumn` or `includeColumns` calls) then it should read all columns as it did before. That means all the existing tests should run without modification. However, the code explicitly throws if no columns were ever specified, which required many existing tests to be updated. That's the thing we need to fix; otherwise there's no way to load a Parquet file without knowing the columns up-front.
> having `includeColumns` that takes an array of names allows this behavior - skip the call to `includeColumns` if you want to read all columns. If you want to read none, pass an empty array.
Or don't call `includeColumn` or `includeColumns` at all, which is the way it used to work. If we want to add special behavior for the case where the caller explicitly does not want to load any columns (e.g., `opts.includeColumns(new String[0])`), then we can do so. However, I'm not sure we can actually load an empty `Table`. `Table` wraps a `cudf::table_view`, and the latter does not allow construction with zero columns. It seems like the Java change, if any, should check for no columns specified when `includeColumns` is called and throw there, rather than checking for no columns when trying to load the Parquet file, since the latter means we want to load all columns rather than no columns. If that's done then no existing tests would need to be modified; all we would do is add a new test that verifies the API throws in that corner case.
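The validate-in-the-setter approach proposed above can be sketched as follows. This is a hypothetical builder, not the real cudf Java or C++ API: an explicitly empty column list is rejected at the moment it is supplied, so an unset option can keep meaning "read all columns" at load time.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical options builder: rejects an explicitly empty column list
// up front, so the read path treats "unset" as "read all columns".
class ParquetOptionsBuilder {
 public:
  ParquetOptionsBuilder& include_columns(const std::vector<std::string>& names) {
    if (names.empty()) {
      throw std::invalid_argument("at least one column name must be specified");
    }
    columns_ = names;
    return *this;
  }

  // True when no columns were ever specified, i.e. read the whole file.
  bool reads_all_columns() const { return !columns_.has_value(); }

 private:
  std::optional<std::vector<std::string>> columns_;
};
```

Because the empty-list case throws at configuration time rather than at read time, existing load-time tests would be untouched; only one new test covering the corner case would be added.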
I misunderstood what was needed on the Java side here. I have re-implemented this in my branch at https://github.com/andygrove/cudf/tree/bug-pq-reader-empty-col-names-java2 so that the JNI code will not add columns to the
@vuule, could you merge those changes into this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the Java tests pass, lgtm.
@gpucibot merge
Some recently merged PRs (#11018 and #11036) do not include enough headers, which may cause compile errors on some systems (in particular, CUDA 11.7 + gcc 11.2). This PR adds the missing header (`<optional>`) to fix the compile issue.
Authors:
- Nghia Truong (https://github.com/ttnghia)
Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Yunsong Wang (https://github.com/PointKernel)
URL: #11126
… option (#11446)
Changes are mostly equivalent to the Parquet changes in #11018. Store the `columns` option as `optional`:
- `nullopt` when columns are not passed by the caller - read all columns.
- Empty vector when the caller explicitly passes an empty list/vector - return an empty dataframe.
- Vector of column names - read the columns with the given names.
Also includes a small cleanup of the equivalent code in the Parquet reader.
Fixes #11021
Authors:
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)
URL: #11446
Fixes #8668
Store the `columns` option as `optional`:
- `nullopt` when columns are not passed by the caller - read all columns.