Correctly construct `data_column` variable when `drop_nan == False` in `_drop_na_rows` #10123
drop_nan: bool
    `nan` is also considered as `NA`
- drop_nan: bool
-     `nan` is also considered as `NA`
+ drop_nan : bool, optional
+     If True, `NaN` values are also dropped.
From what I can see, the `drop_nan` flag should not exist at all (it should always be `True`). `drop_nan` defaults to `False`, but this internal method is only called in one place. In that call, `drop_nan` is hard-coded as `True`.
cudf/python/cudf/cudf/core/indexed_frame.py
Lines 1301 to 1303 in baff5cf
    result = self._drop_na_rows(
        how=how, subset=subset, thresh=thresh, drop_nan=True
    )
Moreover, the corresponding method `_drop_na_columns` does not have such a flag.
cudf/python/cudf/cudf/core/frame.py
Line 1242 in baff5cf
    def _drop_na_columns(self, how="any", subset=None, thresh=None):
When `drop_nan` is `True`, a row containing `nan` is not always dropped; the flag only makes `nan` values count as `NA`. The row is dropped only when it then meets the `any`/`all` criterion.
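To make the distinction concrete, here is a minimal pure-Python sketch of the semantics described above. It is not cudf's implementation: `is_missing`, `drop_na_rows`, and the sample `rows` are hypothetical stand-ins, and `None` plays the role of `NA`.

```python
import math

NAN = float("nan")


def is_missing(value, nan_as_na):
    """Return True if `value` counts as NA (None always, NaN only on request)."""
    if value is None:
        return True
    return nan_as_na and isinstance(value, float) and math.isnan(value)


def drop_na_rows(rows, how="any", thresh=None, drop_nan=False):
    """Hypothetical stand-in for a row-wise dropna; not the cudf code."""
    kept = []
    for row in rows:
        n_missing = sum(is_missing(v, drop_nan) for v in row)
        if thresh is not None:
            # Keep rows with at least `thresh` non-missing values.
            drop = (len(row) - n_missing) < thresh
        elif how == "any":
            drop = n_missing > 0
        elif how == "all":
            drop = n_missing == len(row)
        else:
            raise ValueError(f"invalid how: {how!r}")
        if not drop:
            kept.append(row)
    return kept


rows = [(1.0, 2.0), (NAN, 3.0), (None, None), (NAN, NAN)]

# With how="all", the row (NAN, 3.0) survives even when drop_nan=True,
# because only one of its two values counts as NA.
kept_all = drop_na_rows(rows, how="all", drop_nan=True)
kept_any = drop_na_rows(rows, how="any", drop_nan=True)
```

In this sketch `drop_nan` only widens what counts as missing; whether a row goes away is still decided by `how` or `thresh`.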
Users cannot depend on the functionality of internal APIs. I understand the goal of avoiding breakage, but if we defer we'll need to make a follow-up PR for the next release.
I created #10125 to track this. 👍
After this PR is merged into 22.02 and forward-merged into 22.04, I'll work on this fix.
Co-authored-by: Bradley Dice <[email protected]>
Approving as-is. Follow-up work can be done in later PRs.
rerun tests
…sVoid/cudf into fix/drop_na_data_column_not_referenced
Codecov Report
@@             Coverage Diff              @@
##           branch-22.02   #10123    +/-   ##
================================================
- Coverage         10.49%   10.38%   -0.11%
================================================
  Files               119      119
  Lines             20305    20219     -86
================================================
- Hits               2130     2099     -31
+ Misses            18175    18120     -55
This PR removes the `drop_nan` parameter from the internal API `IndexedFrame._drop_na_rows`. Its behavior was unused internally in cudf (always set to `True` in the public API `IndexedFrame.dropna`).

The behavior of `drop_nan=False` was untested until 22.02 hotfix #10123, when an issue was found in gpu-bdb. However, that code can use the public API `df.dropna(axis=0)` instead. See rapidsai/gpu-bdb#228.

This is marked as a non-breaking change because it only affects internal APIs.

Resolves #10125, follows up on #10123.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10140
Currently, when `drop_nan == False`, the variable `data_columns` is never created but is still referenced further down. This PR fixes that.
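The failure mode described here is a name bound in only one branch. The sketch below reproduces that pattern in plain Python under stated assumptions: `nans_to_none`, `count_missing_buggy`, and `count_missing_fixed` are hypothetical helpers written for illustration, and the `else` branch is one plausible shape for the fix, not necessarily the change made in this PR.

```python
import math


def nans_to_none(column):
    """Replace NaN entries with None so they count as missing."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in column]


def count_missing_buggy(columns, drop_nan=False):
    if drop_nan:
        data_columns = [nans_to_none(c) for c in columns]
    # Bug: when drop_nan is False, data_columns was never bound,
    # so the next line raises UnboundLocalError.
    return [sum(v is None for v in c) for c in data_columns]


def count_missing_fixed(columns, drop_nan=False):
    if drop_nan:
        data_columns = [nans_to_none(c) for c in columns]
    else:
        # Assumed fix: fall back to the unmodified columns.
        data_columns = columns
    return [sum(v is None for v in c) for c in data_columns]


columns = [[1.0, float("nan"), None], [None, 2.0, 3.0]]

try:
    count_missing_buggy(columns, drop_nan=False)
    buggy_raised = False
except UnboundLocalError:
    buggy_raised = True
```

Because the only internal caller passed `drop_nan=True`, a bug of this shape can go unnoticed until some external code exercises the other branch.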