API/BUG: np.dtype equality comparisons versus string-like is inconsistent #5329
Comments
The semantics of dtype == str are that it interprets the str as a dtype and What worries me more is that according to the linked bug report, the reason For comparison, consider that Why not just define an is_categorical function or something? Strings are
pandas has an additional convenience for testing The problem is in numpy-land in that
You are expecting a boolean if it happens to be a dtype (or is a duck-typed dtype object, which numpy doesn't really handle....). and raises This is a straight equality check, I don't think python EVER does this e.g. defensible maybe, just not sensible. I don't need numpy to try to solve simply put to have numpy not raise on a simple type equality operation vs a string (it can of course return the appropriate boolean depending on the architecture or whatever). |
......not sure how to respond to this? I just wrote a whole long message about how pandas's current
It's true that most Python This message probably sounds more negative than I intend -- I am totally willing to have a discussion about this! I'd just like to have a discussion that starts from an accurate picture of what numpy's API actually is, and why it is that way, so we can go from there. |
See also #4820 (comment) |
I am not objecting to this at all.
And I raised the What I AM objecting to is simply that this equality check raises if the rhs is non-convertible to a dtype. (I think a reasonable alternative is then to check if it matches the It simply should not raise at all. And not sure why are saying that
I am also not trying to be negative! This is a dtype comparison issue (now if you said that you won't support strings on the rhs, then I will withdraw my report here). |
Like I said, this has the benefit that it will raise an error for typos like You can argue that there's some benefit to not raising an error here that outweighs these, but you have to actually say what this is, not just assert that no errors should be raised. I understand what you're asking for, what I need is for you to convince us that it's the right thing to do :-)
Numpy already has way too many operations that follow this kind of "oh, well, that didn't work, so let's try again with some other random semantics". I'm not a fan :-/ "When in doubt, refuse the temptation to guess" and all that.
Here's another example of what I mean: Seriously, why not tell people to use |
we don't need to give anyone in pandas land a test for categorical. It's already there the PROBLEM is that
so are you suggesting that we tell people to:
I don't NEED numpy to do anything to support categorical. I am suggesting (as others have in To have to have a user 'manually' have to figure out whether passing a correct dtype string strong enough language? |
Except... your current solution doesn't actually work well for users, which is why you're filing a bug report. So we have to decide how to fix that, either by changing numpy or by changing pandas. My point is that it's not really a bug in numpy that you have decided to make an API that's inconsistent with numpy's semantics in theory and that doesn't work well in practice. And even if we do change numpy to do what you want, it still won't actually fix your problem anytime soon.
Well... I mean... yes? That's... how it's always worked. Numpy has never had any way to tell whether a dtype is floating or integer by doing
Yeah, the string comparison API is kinda horrible all around. Personally I avoid it and would recommend not using it at all, though numpy's stuck with supporting it b/c of backcompat. But silently ignoring uninterpretable dtype strings would hide real errors, and you haven't really convinced me that your use case is so compelling that we should do it anyway. Mostly this is because you haven't tried :-) Why is |
Do you now see how hard it would have been to try to make ANY even small changes to numpy, e.g. adding categorical types? so much for pushing things upstream. |
:-( It's easy to make changes to numpy if you're willing to talk about the I'm sorry if our inability to communicate here was somehow my fault. On Mon, Dec 1, 2014 at 12:35 AM, jreback [email protected] wrote:
Nathaniel J. Smith |
so if I understand correctly you want either: the former I think is not unreasonable, though I do agree with @njsmith that you should not be doing this at all. The categorical example alone is to me at least not a good enough argument for this change. |
One fix I can think of would be to allow extending the dtype system from outside, e.g. by adding a registry that maps a string to a class so that numpy can use it to return the dtype. |
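A registry of this kind might look roughly like the following sketch. It is pure Python and purely illustrative: `DtypeRegistry`, `register`, `lookup`, and the `CategoricalDtype` stand-in are hypothetical names, not part of numpy's or pandas's API.

```python
class DtypeRegistry:
    """Hypothetical mapping from dtype names to externally defined dtype classes."""

    def __init__(self):
        self._by_name = {}

    def register(self, name, dtype_cls):
        # Refuse silent overwrites so two libraries cannot claim the same name.
        if name in self._by_name:
            raise ValueError(f"dtype name already registered: {name!r}")
        self._by_name[name] = dtype_cls

    def lookup(self, name):
        # Return the registered class, or None if the name is unknown.
        return self._by_name.get(name)


class CategoricalDtype:
    """Stand-in for an externally defined dtype (hypothetical)."""
    name = "category"


registry = DtypeRegistry()
registry.register("category", CategoricalDtype)
```

With something like this, a string such as "category" could resolve to a known class, while an unregistered or misspelled name would still be detectable as unknown.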
I do appreciate both @juliantaylor and @njsmith comments. However, I have a couple of issues:
An equality comparison should always return a boolean. I am not suggesting that numpy change what The bottom line is I am dismayed by the approach that seems to be happening here. The issue, rather than being addressed as a typical bug report, is attacked as to the original motivation / design decisions (which actually are a result of the inflexibility of This does not encourage one to contribute or even push issues upstream to numpy. It feels like pulling teeth. I have often been obstinate when I know something is wrong but feel the need to rationalize its existence. It is then upon the library and not the issue writer to justify its existence. Hence I think that numpy should justify why, if the string-like dtype comparisons exist (even if only as a backward compatible, but de-facto API), they should not behave like normal python comparisons. |
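For reference, the "normal Python comparison" behavior referred to here can be sketched as follows: when `__eq__` returns `NotImplemented` for an operand it does not understand, Python tries the reflected comparison and then falls back to identity, so `==` yields `False` instead of raising. The `Tag` class is a hypothetical stand-in, not numpy's dtype implementation.

```python
class Tag:
    """Hypothetical value type that can also be compared against its string name."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        if isinstance(other, Tag):
            return self.name == other.name
        if isinstance(other, str):
            return self.name == other
        # Unknown operand: let Python try the reflected comparison,
        # then fall back to identity, which yields False here.
        return NotImplemented

    def __hash__(self):
        return hash(self.name)


same = Tag("int64") == "int64"
different = Tag("int64") == "category"
unrelated = Tag("int64") == 42
```

In this sketch none of the three comparisons raises; whether silently returning `False` for an unrecognized string is desirable is exactly what the thread disagrees on.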
So what's the recommended way to check |
@endolith: I guess there are a few options -- |
Just ran into this issue when attempting the following:

    vtype = data[col].dtype
    if vtype in (bool, "category"):
        do_something()

Getting a

Reading through the comments here, I'm inclined to agree with @njsmith though: telling users to use the string "category" for dtype comparisons in pandas is a real abuse of the numpy dtype concept, especially because, as pointed out, two values from columns with categorical dtypes are quite often not at all comparable, whereas that property holds for all other existing dtypes (and is the whole point of the dtype system). I'm going to switch to using pandas.core.dtypes.common.is_categorical_dtype and the like, but IMO the proper solution to this is for pandas folks to make the dtypes they want to surface easy to get at in their API as objects, instead of telling users to plug in strings and then asking numpy to support that with a hack that breaks numpy abstractions. As a compromise, the numpy behavior could be changed to emit a warning and return False instead of raising an error, because I think the point about comparisons not throwing errors is valid. But as pointed out above, that change alone doesn't help fix the pandas problem at all, because the == comparison with "category" will then return False. |
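One way to keep a membership check like the one above from raising is to wrap each comparison, as in this sketch. `dtype_in` is a hypothetical helper, not a numpy or pandas function. It treats any candidate that the dtype refuses to compare against as a non-match, which also means a typo in a dtype string is silently ignored.

```python
import numpy as np


def dtype_in(dtype, candidates):
    """Return True if dtype equals any candidate; comparisons that raise count as non-matches."""
    for candidate in candidates:
        try:
            if dtype == candidate:
                return True
        except TypeError:
            # Raised by numpy versions that cannot interpret the candidate as a dtype.
            continue
    return False


bool_match = dtype_in(np.dtype(bool), (bool, "category"))
float_match = dtype_in(np.dtype("float64"), (bool, "category"))
```

Note that this only avoids the exception. For a plain numpy dtype, a comparison with "category" still never matches, so it does not by itself detect categorical columns.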
xref: pandas-dev/pandas#8814
using numpy 1.9.1/macosx
np.dtype equality checking versus string-likes should return a boolean rather than raising in dtype comparisons. This is currently inconsistent: the comparison returns a boolean when the string parses as a valid dtype, but raises if it's not valid.
e.g
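A minimal sketch of the inconsistency described (not the reporter's original snippet): `compare_outcome` is a hypothetical helper that records whether a comparison returned a boolean or raised. With the numpy version reported above, the uninterpretable string raises. Other versions may behave differently, so no output is claimed for that case.

```python
import numpy as np


def compare_outcome(dtype, other):
    """Return the boolean result of dtype == other, or the name of the exception raised."""
    try:
        return bool(dtype == other)
    except Exception as exc:
        return type(exc).__name__


dt = np.dtype("int64")
valid_same = compare_outcome(dt, "int64")     # string parses to the same dtype
valid_other = compare_outcome(dt, "float64")  # string parses to a different dtype
invalid = compare_outcome(dt, "category")     # string does not parse as a numpy dtype
```

The first two cases yield `True` and `False` respectively; the third is the one this report is about.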