Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly construct data_column variable when drop_nan == False in _drop_na_rows #10123

Merged

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Jan 25, 2022

Currently when drop_nan == False, variable data_columns was not created and referenced below. This PR fixes that.

@isVoid isVoid added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround labels Jan 25, 2022
@isVoid isVoid requested a review from a team as a code owner January 25, 2022 19:43
@isVoid isVoid requested review from trxcllnt and skirui-source and removed request for a team January 25, 2022 19:43
@galipremsagar galipremsagar added ! - Release 5 - Ready to Merge Testing and reviews complete, ready to merge non-breaking Non-breaking change bug Something isn't working and removed 3 - Ready for Review Ready for review by team ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround labels Jan 25, 2022
Comment on lines +1327 to +1328
drop_nan: bool
`nan` is also considered as `NA`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
drop_nan: bool
`nan` is also considered as `NA`
drop_nan : bool, optional
If True, `NaN` values are also dropped.

Copy link
Contributor

@bdice bdice Jan 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see, the drop_nan flag should not exist at all (it should always be True).

drop_nan defaults to False but this internal method is only called in one place. In that call, drop_nan is hard-coded as True.

result = self._drop_na_rows(
how=how, subset=subset, thresh=thresh, drop_nan=True
)

Moreover, the corresponding method _drop_na_columns does not have such a flag.

def _drop_na_columns(self, how="any", subset=None, thresh=None):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When drop_nan is True, it doesn't always drop that row, it only makes it consider the nan row as NA. Only when the criteria of the row meets any/all the row is dropped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users cannot depend on the functionality of internal APIs. I understand the goal of avoiding breakage, but if we defer we'll need to make a follow-up PR for the next release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created #10125 to track this. 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this PR is merged into 22.02 and forward-merged into 22.04, I'll work on this fix.

python/cudf/cudf/core/indexed_frame.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as-is. Follow-up work can be done in later PRs.

@galipremsagar
Copy link
Contributor

rerun tests

@codecov
Copy link

codecov bot commented Jan 26, 2022

Codecov Report

Merging #10123 (3ec5ebe) into branch-22.02 (967a333) will decrease coverage by 0.10%.
The diff coverage is n/a.

❗ Current head 3ec5ebe differs from pull request most recent head 193506f. Consider uploading reports for the commit 193506f to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02   #10123      +/-   ##
================================================
- Coverage         10.49%   10.38%   -0.11%     
================================================
  Files               119      119              
  Lines             20305    20219      -86     
================================================
- Hits               2130     2099      -31     
+ Misses            18175    18120      -55     
Impacted Files Coverage Δ
python/custreamz/custreamz/tests/conftest.py 71.42% <0.00%> (-7.15%) ⬇️
python/custreamz/custreamz/tests/test_kafka.py 38.46% <0.00%> (-4.40%) ⬇️
...ython/custreamz/custreamz/tests/test_dataframes.py 96.97% <0.00%> (-2.42%) ⬇️
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/backends.py 82.53% <0.00%> (-0.61%) ⬇️
python/dask_cudf/dask_cudf/core.py 70.68% <0.00%> (-0.34%) ⬇️
python/dask_cudf/dask_cudf/accessors.py 92.00% <0.00%> (-0.31%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.66% <0.00%> (-0.25%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
... and 69 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 893f540...193506f. Read the comment docs.

@sevagh sevagh merged commit 270772f into rapidsai:branch-22.02 Jan 26, 2022
rapids-bot bot pushed a commit that referenced this pull request Jan 27, 2022
This PR removes the `drop_nan` parameter from the internal API `IndexedFrame._drop_na_rows`. Its behavior was unused internally in cudf (always set to `True` in the public API `IndexedFrame.dropna`). The behavior of `drop_nan=False` was untested until 22.02 hotfix #10123, when an issue was found in gpu-bdb. However, that code can use the public API `df.dropna(axis=0)` instead. See rapidsai/gpu-bdb#228.

This is marked as a non-breaking change because it only affects internal APIs.

Resolves #10125, follows up on #10123.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10140
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants