DISCUSS: disambiguation of NA and "NA" in reprs #30415
Comments
Thanks for opening the issue! It is indeed true that currently, there is no distinction in the repr:
Up to now, we had the same problem with strings like "None" or "NaN", but I agree that "NA" might be more common to have as a string. For me, it is not so much a question of There are 2 main reprs used in pandas: the plain text repr (e.g. in the console or when printing) and the html repr (e.g. in the notebook). As comparison, R also uses
> data.frame(a=c("NA", NA))
     a
1   NA
2 <NA>
The tibble uses a more rich display with coloring: So there might be options that we can solve this ambiguity between |
str
dtype
behaviour with missing value
In your edit I fear you have misrepresented my position with regards missing
In your note above you do not address the first point other than to say: "For me, it is not so much a question of None vs pd.NA as missing value indicator (None also has several disadvantages, which is one of the reasons to go with a pd.NA)" but without explanation as to what these disadvantages are. Please would you explain the advantages of With regards representation, point 2, as I typically work with command line tools and see plain-text representation as key I would suggest for missing scalar representation to be either the existing |
You can use isna to distinguish pd.NA from the string "NA". We'd prefer to use NA rather than None for consistent behavior across dtypes. We have more control over NA. |
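A minimal sketch of the isna check mentioned above, assuming pandas 1.0 or later (where pd.NA and the "string" dtype exist). The Series contents are made up for illustration.

```python
import pandas as pd

# A string column holding the literal text "NA" and a genuinely missing value.
s = pd.Series(["NA", pd.NA, "x"], dtype="string")

# isna() flags only the missing entry, not the two-character string "NA".
missing_flags = s.isna().tolist()

# The scalar form of the same check.
literal_is_missing = bool(pd.isna("NA"))
sentinel_is_missing = bool(pd.isna(pd.NA))
```

This distinguishes the two cases in code, but it does not change how they are displayed, which is the question the rest of the thread is about.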
@anisotropi4 I'm sorry if I changed the intent of your issue without explicitly saying or asking that. You're correct there can be two questions, but I understood your comment on twitter to be your second item ("how to display"). Moreover, I don't think it is a problem that the string "NA" can also mean different things compared to the object
That's what you can do in code, yes. But I think we should still think about the display, where you can't use |
Right you covered the display stuff :) I think we should explore the color stuff at least within IPython and jupyter. |
Just for fun, wrote this up
diff --git a/pandas/core/arrays/string_.py b/pandas/core/arrays/string_.py
index de254f662b..7b929adb9d 100644
--- a/pandas/core/arrays/string_.py
+++ b/pandas/core/arrays/string_.py
@@ -219,6 +219,16 @@ class StringArray(PandasArray):
         arr[mask] = -1
         return arr, -1

+    def _formatter(self, boxed=False):
+        def fmt(x):
+            if x is libmissing.NA:
+                return "\033[91m" + "NA" + '\033[0m'
+            elif boxed:
+                return str(x)
+            else:
+                return repr(x)
+        return fmt
+
     def __setitem__(self, key, value):
         value = extract_array(value, extract_numpy=True)
         if isinstance(value, type(self)):
This worked for the array repr, but not for Series or DataFrame |
right we likely need to ask the column for its repr of nulls (rather than hard code based on dtype) |
Just FYI, this will take a bit more work to get working inside Series / DataFrame. Things like truncating / aligning columns get broken because the length of the "value" "\033[91m" + "NA" + "\033[0m" doesn't match its display length of 2. I don't think this should necessarily be a blocker for 1.0. |
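To illustrate the mismatch described above, here is a small sketch using only the standard library. The helper names display_len and pad_left are hypothetical and are not part of pandas; the regex only covers the simple color (SGR) escape codes used in the diff.

```python
import re

# Matches SGR escape sequences such as "\033[91m" and "\033[0m".
ANSI_SGR = re.compile(r"\033\[[0-9;]*m")

def display_len(text):
    """Number of visible characters once color escape codes are removed."""
    return len(ANSI_SGR.sub("", text))

def pad_left(text, width):
    """Right-align text to `width` visible characters, ignoring escape codes."""
    return " " * max(width - display_len(text), 0) + text

colored = "\033[91m" + "NA" + "\033[0m"

raw_length = len(colored)              # counts the escape characters too
visible_length = display_len(colored)  # counts only "NA"
aligned = pad_left(colored, 6)
```

Any code that pads or truncates with len() would see 11 characters here rather than 2, which is why column alignment breaks once colored values are mixed into a Series or DataFrame repr.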
for sure not a blocker |
@jorisvandenbossche thank you for your clarification. I get now that I was rather woolly in the way that I framed the question and should've been clearer about which of the two issues this was looking at. Then, for those of us that are still working in black-and-white, I would ask that another representation of NA rather than simply the text |
I have a branch started, but it’s quite a bit of work. Probably not happening for 1.0. |
Don't use color, because that will have an effect on the color blind (not a problem for me, but we've had other comments from the color blind in the past) But here's a suggestion that I think will look nice and easily handles the issue of length of the representation.
|
This will need to be configurable. And we'll want an option that works without colors too. Is this something we want to pursue for 1.0? It's already a surprisingly large diff, and I haven't written thorough tests, and I haven't implemented the option handling yet. |
@TomAugspurger You also have to worry about the documentation impacts, because will the color show up in the docs? And even if it does, that's not good for the color blind. That's why I suggested the '«NA»' option. Visible to everybody (and probably a smaller diff?) |
Yes, the color should show up in the docs.
|
@Dr-Irv is there a reason to use the special utf to have '«NA»' instead of a simpler ''? Or just because you think it looks better? -- If we want to go with a different text repr (like '«NA»'), I think it would be nice to include this in 1.0, as it impacts quite a bit the "look" of the new feature (but if we want that, maybe not a blocker for 1.0). |
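A rough sketch of the plain-text idea discussed here, using only the standard library. MISSING is a hypothetical stand-in for a sentinel such as pd.NA, and format_cell with its missing_repr parameter is an illustrative name, not an existing pandas option.

```python
class _Missing:
    """Hypothetical stand-in for a missing-value sentinel such as pd.NA."""

    def __repr__(self):
        return "<NA>"

MISSING = _Missing()

def format_cell(value, missing_repr="«NA»"):
    """Format one cell, marking missing values so they differ from the string 'NA'."""
    if value is MISSING:
        return missing_repr
    return str(value)

cells = [format_cell(v) for v in ["NA", MISSING]]
widths = [len(c) for c in cells]
```

Because the marker consists of ordinary printable characters, its string length equals its display width, so existing padding and truncation logic keeps working without color support.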
@jorisvandenbossche I chose it because it looked better. But we could also use |
Change to str dtype behaviour for missing elements
Following comments in the discussion about how to handle missing NA scalar values in #28778, I was asked to raise my question as this separate issue.
My rather prosaic question is: if missing str elements are given the value NA, how would I distinguish between a missing str value and the two-character string 'NA'?
I ask as NA is a common abbreviation for 'Not Applicable', 'North America' et al, in a way that, in my experience, 'NaN' or 'Not a Number' isn't.
That is, if 'NA' were generated as the default missing str dtype value, especially if introduced as a change rather than as an opt-in, it risks becoming a UX developer issue as I (for one) would no longer know if 'NA' is a valid or a missing data value.
For what it's worth, current idiomatic behaviour is that missing values would be replaced by None dtype:
The dtypes here are:
Given this, my thought is that NA is not a suitable default replacement for missing str dtype elements, rather None of NoneType dtype.