
API/BUG: np.dtype equality comparisons versus string-like is inconsistent #5329

Closed
jreback opened this issue Nov 30, 2014 · 16 comments
@jreback

jreback commented Nov 30, 2014

xref: pandas-dev/pandas#8814

using numpy 1.9.1/macosx

np.dtype equality checks against string-likes should return a boolean rather than raising. The current behavior is inconsistent: comparing against a string that parses as a valid dtype returns a boolean, but an unparseable string raises.

e.g.

In [12]: np.dtype('i8') == 'int64'
Out[12]: True

In [14]: np.dtype('i8') == 'int32'
Out[14]: False

In [13]: np.dtype('i8') == 'foo'
TypeError: data type "foo" not understood
@njsmith
Member

njsmith commented Nov 30, 2014

The semantics of dtype == str are that it interprets the str as a dtype and
then checks whether the dtypes are equivalent. So the current numpy
behavior is certainly defensible, though I can see the argument the other
way too.

What worries me more is that according to the linked bug report, the reason
you're asking for this is that you want to be able to write dtype ==
"categorical" to check whether or not a dtype-like object is one of
pandas's categorical pseudo-dtype objects. This api seems simply wrong to
me, and it certainly contradicts numpy's current semantics. The problem is
that two categorical dtypes are equivalent only if they have the same
levels. When you write == "categorical", therefore, you're not doing any
kind of equality check at all -- you're doing an isinstance (or issubdtype)
kind of check.

For comparison, consider that
np.dtype(">f8") == "float64"
returns false (on little-endian machines), because even though dtype(">f8")
is a specific kind of float64, it is not interchangeable with
dtype("float64"). Similarly, if/when numpy gets a native categorical dtype
it'll certainly return false or error out when compared to the string
"categorical", because returning true would violate numpy's established
semantics.

Why not just define an is_categorical function or something? Strings are
sometimes convenient but they're a pretty terrible way to represent and
work with structured representations like dtype objects.
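The distinction njsmith draws, that comparing against "categorical" is really an isinstance-style check rather than an equality check, can be sketched with a stand-in class (CategoricalDtype and is_categorical are hypothetical names here, not pandas's actual implementation):

```python
class CategoricalDtype:
    """Stand-in for a categorical pseudo-dtype (hypothetical class)."""
    def __init__(self, levels):
        self.levels = list(levels)

    def __eq__(self, other):
        # Equality requires identical levels: two categorical dtypes
        # with different levels are NOT equal...
        return (isinstance(other, CategoricalDtype)
                and self.levels == other.levels)

def is_categorical(dtype):
    # ...while an isinstance-style kind check answers the question the
    # pandas API actually wants to ask.
    return isinstance(dtype, CategoricalDtype)

a = CategoricalDtype(["x", "y"])
b = CategoricalDtype(["y", "z"])
assert a != b                                  # different levels: not equal
assert is_categorical(a) and is_categorical(b) # but both are categorical
```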

@jreback
Author

jreback commented Nov 30, 2014

@njsmith

pandas has an additional convenience for testing s.dtype == 'category', which is de-facto equivalent to s.dtype == 'float64' type testing. It is testing whether a dtype has a .name field that matches (which I think is something akin to what np.dtype does). The bug report is actually about why numpy doesn't support a simple string comparison on dtypes.

The problem in numpy-land is that

np.dtype('i8') == 'int64' works, but np.dtype('i8') == 'foo' just raises. So now you have to guard against that (and I do), but it's a bit odd from a user perspective.

You expect a boolean if the rhs happens to be a dtype (or a duck-typed dtype object, which numpy doesn't really handle...), and it raises TypeError otherwise.

This is a straight equality check; I don't think Python EVER raises here, e.g. type(obj) == '.....' will ALWAYS return a boolean (I could be wrong though).

defensible maybe, just not sensible.

I don't need numpy to try to solve "is this an xxx dtype". I don't think this is possible in numpy, nor likely to ever happen (this is NOT a test of whether 2 categoricals are equal at all, rather that the TYPE of the object matches something).

Simply put, I want numpy not to raise on a simple type-equality operation versus a string (it can of course return the appropriate boolean depending on the architecture or whatever).
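The guard described above can be sketched as a small wrapper (dtype_matches is a hypothetical helper name, not an actual pandas or numpy function):

```python
import numpy as np

def dtype_matches(dtype, candidate):
    # Return the comparison result when the string parses as a dtype,
    # and False (instead of propagating an exception) when it does not.
    try:
        return bool(dtype == candidate)
    except TypeError:
        # numpy 1.9 raises here for unparseable strings like 'foo'.
        return False

assert dtype_matches(np.dtype("i8"), "int64")
assert not dtype_matches(np.dtype("i8"), "int32")
assert not dtype_matches(np.dtype("i8"), "foo")  # no TypeError escapes
```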

@njsmith
Member

njsmith commented Nov 30, 2014

pandas has an additional convenience for testing s.dtype=='category', which is de-facto equivalent to s.dtype=='float64' type testing. It is testing whether a dtype has a .name field that matches (which I think is something akin to what np.dtype does).

......not sure how to respond to this? I just wrote a whole long message about how pandas's current s.dtype == "category" is not equivalent to numpy's current s.dtype == "float64" (specifically using float64 as the example!), and how testing the .name field is exactly what dtype.__eq__ doesn't do? dtype.__eq__ tests equality, with the extra convenience that it attempts to coerce to a dtype object before checking for equality.

In [9]: np.dtype(">f8") == "float64"
Out[9]: False

In [10]: np.dtype(">f8").name
Out[10]: 'float64'

It's true that most Python __eq__ methods don't raise errors, but most of them don't attempt any automatic coercion either; if we followed the usual Python conventions then dtype("float64") == "float64" would return false, just like {} == "{}" does. And the exception does serve a useful purpose: consider dtype("float64") == "flaot64". Numpy's API for checking whether the "TYPE of the object matches something" is issubdtype. (Which I do recognize is totally inadequate for your needs.)
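A minimal demonstration of the coerce-then-compare semantics described above, written defensively so it holds whether an unparseable string raises TypeError (as on the numpy 1.9 of this thread) or merely compares unequal (as on some later releases):

```python
import numpy as np

# Coerce-then-compare: the string is parsed into a dtype before checking.
assert np.dtype("float64") == "float64"

# Ordinary Python __eq__ does no coercion; mismatched types are just unequal.
assert ({} == "{}") is False

# An unparseable typo like "flaot64" either raises TypeError (numpy 1.9)
# or compares unequal; either way, it never compares equal.
try:
    mismatch = (np.dtype("float64") == "flaot64")
except TypeError:
    mismatch = False
assert not mismatch
```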

This message probably sounds more negative than I intend -- I am totally willing to have a discussion about this! I'd just like to have a discussion that starts from an accurate picture of what numpy's API actually is, and why it is that way, so we can go from there.

@jankatins

See also #4820 (comment)

@jreback
Author

jreback commented Nov 30, 2014

I am not objecting to this at all.

In [1]: np.dtype(">f8") == "float64"
Out[1]: False

# obviously on 64-bit
In [2]: np.dtype("f8") == "float64"
Out[2]: True

And I raised .name as a user-facing alternative.

What I AM objecting to is simply that this equality check raises if the rhs is non-convertible to a dtype. (I think a reasonable alternative is then to check if it matches the .name attribute, but that might be a bit controversial).

It simply should not raise at all.

And I'm not sure why you are saying that s.dtype == 'category' is NOT exactly equivalent to s.dtype == 'float64'. It IS, full stop (the difference is that I can intercept the call and handle it properly).

In [9]: Series([1.]).dtype=='float64'
Out[9]: True

In [10]: Series([1.]).dtype=='category'
TypeError: data type "category" not understood

In [11]: Series(list('a'),dtype='category').dtype=='category'
Out[11]: True

In [12]: Series(list('a'),dtype='category').dtype=='float64'
Out[12]: False

I am also not trying to be negative! This is a dtype-comparison issue (now, if you said that you won't support strings on the rhs, then I would withdraw my report here).

@njsmith
Member

njsmith commented Nov 30, 2014

What I AM objecting to is simply that this equality check raises if the rhs is non-convertible to a dtype.

Like I said, this has the benefit that it raises an error for typos like arr.dtype == "flaot64", and it follows the general principle that we should decide which operation to perform (in this case, coerce-then-compare or just compare) based on the type of the arguments rather than their value.

You can argue that there's some benefit to not raising an error here that outweighs these, but you have to actually say what this is, not just assert that no errors should be raised. I understand what you're asking for, what I need is for you to convince us that it's the right thing to do :-)

(I think a reasonable alternative is then to check if it matches the .name attribute, but that might be a bit controversial).

Numpy already has way too many operations that follow this kind of "oh, well, that didn't work, so let's try again with some other random semantics". I'm not a fan :-/ "When in doubt, refuse the temptation to guess" and all that.

And I'm not sure why you are saying that s.dtype == 'category' is NOT exactly equivalent to s.dtype == 'float64'. It IS, full stop

Here's another example of what I mean: np.dtype("S4") == "S" -> False. In fact, np.array(["abcd"], dtype="S") == "S" also returns False, even though we specifically said dtype="S". Numpy simply doesn't use == to test for the overall kind; it only checks for exact equality: arr.dtype == x if and only if arr.view(np.dtype(x)) is a no-op. A specific instance of a categorical dtype is like "S4": there are a lot of different categorical dtypes with different categories and integer codings that are not equivalent to each other.
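The "S4" example above can be reproduced directly:

```python
import numpy as np

# The generic "S" spec yields an S4 dtype for this data, yet the
# resulting dtype does not compare equal to "S" itself: == tests
# exact equivalence, not overall kind.
arr = np.array(["abcd"], dtype="S")
assert arr.dtype == np.dtype("S4")
assert (arr.dtype == "S") is False
assert (np.dtype("S4") == "S") is False
```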

Seriously, why not tell people to use pd.is_categorical instead of == "categorical"? It's a better API in its own right, and even if we implemented your requested change tomorrow then it'd still be a year before you could assume everyone had the new numpy version.

@jreback
Author

jreback commented Nov 30, 2014

@njsmith

we don't need to give anyone in pandas land a test for categorical. It's already there,
e.g.
df.dtypes == 'categorical'
df.select_dtypes(include=['categorical'])
df.a_column.dtype == 'categorical'
work just fine

the PROBLEM is that

df.a_column.dtype == 'foo' breaks
which is unfortunate for the user.

so are you suggesting that we tell people to:

is_float_dtype(s), is_integer_dtype(s) and so on?

I don't NEED numpy to do anything to support categorical. I am suggesting (as others have in
the past) that this is clearly an upstream fix that makes sense.

Making a user 'manually' figure out whether the string they are passing is a valid dtype string
before testing for a dtype is a horrible API.

strong enough language?

@njsmith
Member

njsmith commented Dec 1, 2014

we don't need to give anyone in pandas land a test for categorical. It's already there

Except... your current solution doesn't actually work well for users, which is why you're filing a bug report. So we have to decide how to fix that, either by changing numpy or by changing pandas. My point is that it's not really a bug in numpy that you have decided to make an API that's inconsistent with numpy's semantics in theory and that doesn't work well in practice. And even if we do change numpy to do what you want, it still won't actually fix your problem anytime soon.

so are you suggesting that we tell people to: is_float_dtype(s), is_integer_dtype(s) and so on?

Well... I mean... yes? That's... how it's always worked. Numpy has never had any way to tell whether a dtype is floating or integer by doing s == "some string". Instead of writing is_float_dtype(s) and is_integer_dtype(s) they're called np.issubdtype(s, np.floating) and np.issubdtype(s, np.integer), but that's just a spelling detail.
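For reference, the issubdtype spelling mentioned above looks like this:

```python
import numpy as np

# numpy's existing kind-of-dtype checks are spelled with issubdtype
# rather than string equality:
assert np.issubdtype(np.dtype("float64"), np.floating)
assert np.issubdtype(np.dtype("int64"), np.integer)
assert not np.issubdtype(np.dtype("float64"), np.integer)
```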

To have to have a user 'manually' have to figure out whether passing a correct dtype string
to test for a dtype is a horrible API

Yeah, the string comparison API is kinda horrible all around. Personally I avoid it and would recommend not using it at all, though numpy's stuck with supporting it b/c of backcompat. But silently ignoring uninterpretable dtype strings would hide real errors, and you haven't really convinced me that your use case is so compelling that we should do it anyway. Mostly this is because you haven't tried :-)

Why is s.dtype == "categorical" so much better than is_categorical(s.dtype) that we should add hacks to numpy to enable it? What benefit do you get from the wacky string comparison API specifically?

@jreback
Author

jreback commented Dec 1, 2014

@njsmith

Do you now see how hard it would have been to try to make ANY even small changes to numpy, e.g. adding categorical types?

so much for pushing things upstream.

@jreback jreback closed this as completed Dec 1, 2014
@njsmith
Member

njsmith commented Dec 1, 2014

:-(

It's easy to make changes to numpy if you're willing to talk about the
design and, like... even try to justify it. All you've done in this bug
report is say repeatedly that we have to do something, while refusing to
write any words about why it's the right thing to do :-/. I very much want
pandas and numpy to play well together, but I don't know how to do that if
we can't talk about design trade-offs :-/.

I'm sorry if our inability to communicate here was somehow my fault.


@juliantaylor
Contributor

So if I understand correctly, you want either:

  • dtype equality to always return False unless the string can be parsed as a numpy dtype (so it would return False for "categorical", since numpy does not know this dtype); or
  • dtype equality to check the name attribute when the string cannot be parsed, so that checking against the string "categorical" would return True.

The former I think is not unreasonable, though I do agree with @njsmith that you should not be doing this at all; the categorical example alone is, to me at least, not a good enough argument for this change.
The latter cannot be done: as already mentioned, ">f8" != "<f8", yet both have the name float64. We could change name to be more descriptive, but that would need some design work for structured types (which currently just report void) and would likely break backward compatibility.
One may have more luck with the str attribute, but that also does not handle structured types.
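The properties of .name and .str described above are easy to confirm:

```python
import numpy as np

# Byte order makes the dtypes unequal under ==, while .name collapses it:
assert np.dtype(">f8") != np.dtype("<f8")
assert np.dtype(">f8").name == "float64"
assert np.dtype("<f8").name == "float64"

# .str preserves byte order...
assert np.dtype(">f8").str == ">f8"
# ...but a structured dtype still reports an uninformative void name:
assert np.dtype([("a", "i4")]).name.startswith("void")
```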

@jankatins

One fix I can think of would be allowing the dtype system to be extended from outside, e.g. a registry mapping a string to a class, so that numpy can use it to resolve the dtype.

@jreback
Author

jreback commented Dec 1, 2014

I do appreciate both @juliantaylor and @njsmith comments.

However, I have a couple of issues:

  • You both seem to be defending a backward-compat dtype-string comparison API, which IMHO is quite useful, but quite inconsistent: given an arbitrary string it raises, yet a stringified dtype like 'int64' works.
  • Its current implementation violates several core Python axioms:
    • it's not simple
    • errors should not be fatal
    • it's not generalizable / customizable

An equality comparison should always return a boolean.

I am not suggesting that numpy change what float64 or float actually return; rather, I am asking that option a) from above simply work.

The bottom line is that I am dismayed by the approach taken here. The issue, rather than being addressed as a typical bug report, is attacked on its original motivation / design decisions (which are actually a result of the inflexibility of np.dtype, but that is a separate discussion).

This does not encourage one to contribute or even push issues upstream to numpy. It feels like pulling teeth.

I have often been obstinate when I know something is wrong but am met with the need to rationalize its existence. It is then upon the library, and not the issue writer, to justify that existence.

Hence I think numpy should justify why, if string-like dtype comparisons exist (even if only as a backward-compatible but de-facto API), they should not behave like normal Python comparisons.

@endolith
Contributor

endolith commented Apr 7, 2016

So what's the recommended way to check a.dtype != 'complex256' on a system that doesn't have complex256?

scipy/scipy#5608 (comment)

@njsmith
Member

njsmith commented Apr 7, 2016

@endolith: I guess there are a few options -- not hasattr(np, "complex256") or a.dtype != np.complex256 would be one approach.
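That guard can be sketched as follows (dtype_is_complex256 is a hypothetical helper name):

```python
import numpy as np

def dtype_is_complex256(a):
    # np.complex256 only exists on platforms with extended precision,
    # so check for the attribute before comparing against it.
    return hasattr(np, "complex256") and a.dtype == np.complex256

a = np.zeros(3, dtype=np.complex128)
assert not dtype_is_complex256(a)  # False whether or not complex256 exists
```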

@solsword

Just ran into this issue when attempting the following:

vtype = data[col].dtype
if vtype in (bool, "category"):
  do_something()

Getting a TypeError there was absolutely not what I expected, and it took a bit of effort to figure out what the hell was going on. (Bonus: you don't get the type error if vtype is bool, of course, because the checks are sequential).
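A tolerant membership test in the spirit of the snippet above can be sketched as (dtype_in is a hypothetical helper name, not part of numpy or pandas):

```python
import numpy as np

def dtype_in(dtype, candidates):
    # Membership test in which an unparseable dtype string counts as
    # "no match" instead of raising TypeError mid-scan.
    for c in candidates:
        try:
            if dtype == c:
                return True
        except TypeError:
            continue
    return False

assert dtype_in(np.dtype(bool), (bool, "category"))
assert not dtype_in(np.dtype("i8"), (bool, "category"))
```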

Reading through the comments here, I'm inclined to agree with @njsmith though: telling users to use the string "category" for dtype comparisons in pandas is a real abuse of the numpy dtype concept, especially because, as pointed out, two values from columns with categorical dtypes are quite often not comparable at all, whereas that property holds for all other existing dtypes (and is the whole point of the dtype system).

I'm going to switch to using pandas.core.dtypes.common.is_categorical_dtype and the like, but IMO the proper solution to this is for pandas folks to make the dtypes they want to surface easy to get at in their API as objects instead of telling users to plug in strings and then ask numpy to support that with a hack that breaks numpy abstractions.

As a compromise, the numpy behavior could be changed to a warning and return False instead of an error, because I think the point about comparisons not throwing errors is valid, but as pointed out above, just that change alone doesn't help fix the pandas problem at all, because the == comparison with "category" will then return False.
