Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework Scalar imports #10791

Merged
merged 4 commits into from
May 10, 2022
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion python/cudf/cudf/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,6 @@
register_series_accessor,
)
from cudf.core.scalar import (
brandon-b-miller marked this conversation as resolved.
Show resolved Hide resolved
NA,
Scalar,
)
from cudf.core.index import (
Expand All @@ -45,6 +44,7 @@
)
from cudf.core.dataframe import DataFrame, from_pandas, merge, from_dataframe
from cudf.core.series import Series
from cudf.core.missing import NA
from cudf.core.multiindex import MultiIndex
from cudf.core.cut import cut
from cudf.core.algorithms import factorize
Expand Down
7 changes: 7 additions & 0 deletions python/cudf/cudf/core/missing.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Copyright (c) 2018-2022, NVIDIA CORPORATION.


# Pandas NAType enforces a single instance exists at a time
# instantiating this class will yield the existing instance
# of pandas._libs.missing.NAType, id(cudf.NA) == id(pd.NA).
from pandas import NA # noqa: F401
brandon-b-miller marked this conversation as resolved.
Show resolved Hide resolved
bdice marked this conversation as resolved.
Show resolved Hide resolved
40 changes: 14 additions & 26 deletions python/cudf/cudf/core/scalar.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,12 @@

import numpy as np
import pyarrow as pa
from pandas._libs.missing import NAType as pd_NAType

import cudf
from cudf.core.column.column import ColumnBase
from cudf.api.types import is_scalar
from cudf.core.dtypes import ListDtype, StructDtype
from cudf.core.index import BaseIndex
from cudf.core.missing import NA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since cudf is already imported, do we want to use cudf.NA rather than import this name? I see quite a few files that use cudf.NA rather than importing NA separately. I don't think any other files use this convention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really great question. We use cudf.NA almost everywhere, but should we? In some ways it feels like it might fit the "onion model" a bit better if scalar.py didn't mention the top level cudf namespace anywhere.

I think there's probably plenty of places I have used cudf.NA throughout the codebase as a convenience since those places already have cudf imported, but I realize now that it has the negative consequence of further entrenching the need for import cudf in those places, which is part of what I think gives rise to these import issues.

Is this way off base? I am happy to put in a PR that addresses this for NA across the codebase as a follow up if that's what we think should happen, and just imports from missing.

Copy link
Contributor

@vyasr vyasr May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the long run I would love to get rid of import cudf almost everywhere. It is a code smell that almost always (there are always exceptions) indicates circular dependencies that should not exist. Unfortunately a lot of those circular dependencies are baked deeply into our code right now and are hard to excise. If you could get rid of these on a case-by-case basis (in this instance, for cudf.NA), I think that would be good for us in the long run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @vyasr I think that lines up with my understanding. I am going to leave this as-is for now and look into making the change everywhere. I'll leave this thread open though and let @bdice resolve if this conclusion seems satisfactory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing import cudf sounds fine to me! Could we make an issue to document the plan before resolving this? (I would do it but I’m on mobile at the moment.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised #10820 regarding this. I'll tackle this for NA first after this PR is merged, after which it should be importable from missing. From there I'll start probing what it takes to address this more generally.

from cudf.core.mixins import BinaryOperand
from cudf.core.series import Series
from cudf.utils.dtypes import (
get_allowed_combinations_for_operator,
to_cudf_compatible_scalar,
Expand Down Expand Up @@ -273,19 +271,19 @@ def _binop_result_dtype_or_error(self, other, op):
return cudf.dtype(out_dtype)

def _binaryop(self, other, op: str):
if isinstance(other, (ColumnBase, Series, BaseIndex, np.ndarray)):
# dispatch to column implementation
return NotImplemented
other = to_cudf_compatible_scalar(other)
out_dtype = self._binop_result_dtype_or_error(other, op)
valid = self.is_valid and (
isinstance(other, np.generic) or other.is_valid
)
if not valid:
return Scalar(None, dtype=out_dtype)
if is_scalar(other):
other = to_cudf_compatible_scalar(other)
out_dtype = self._binop_result_dtype_or_error(other, op)
valid = self.is_valid and (
isinstance(other, np.generic) or other.is_valid
)
if not valid:
return Scalar(None, dtype=out_dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize it's not in scope, so I'd be perfectly happy with a follow-up, but this seems like an important case to have tested somewhere. It's basically for any binary op between two scalars where one is NA, right? I'm surprised that's not tested, I guess because we only test column + scalar?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Approving but would also like to see tests for this at some point.)

else:
result = self._dispatch_scalar_binop(other, op)
return Scalar(result, dtype=out_dtype)
else:
result = self._dispatch_scalar_binop(other, op)
return Scalar(result, dtype=out_dtype)
return NotImplemented

def _dispatch_scalar_binop(self, other, op):
if isinstance(other, Scalar):
Expand Down Expand Up @@ -323,13 +321,3 @@ def _dispatch_scalar_unaop(self, op):

def astype(self, dtype):
return Scalar(self.value, dtype)


class _NAType(pd_NAType):
# Pandas NAType enforces a single instance exists at a time
# instantiating this class will yield the existing instance
# of pandas._libs.missing.NAType, id(cudf.NA) == id(pd.NA).
pass


NA = _NAType()