Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add built-in function for Styler to format the text displayed for missing values #29118

Conversation

immaxchen
Copy link
Contributor

@immaxchen immaxchen commented Oct 20, 2019

As described in the issues, user who wants to control how NA values are printed
while applying styles to the output will have to implement their own formatter.
(so that the underlying data will not change and can be used for styling)

Since the behavior is common in styling (for reports etc.), suggest to add this
shortcut function to enable users format their NA values as something like '--'
or 'Not Available' easily.

EDIT:

  • Change implementation to integrate with the original .format() using na_rep argument
  • Add a new table-wise default na_rep setting, which can be set through .set_na_rep()
  • Minor fix: formatter should be wrapped outside of locs-loop
  • Added a few user guide examples and test cases

example usage: df.style.highlight_max().format(None, na_rep="-")

…ing values

As described in GH pandas-dev#28358, user who wants to control how NA values are printed
while applying styles to the output will have to implement their own formatter.
(so that the underlying data will not change and can be used for styling)

Since the behavior is common in styling (for reports etc.), suggest to add this
shortcut function to enable users format their NA values as something like '--'
or 'Not Available' easily.

example usage: `df.style.highlight_max().format_null('--')`
@@ -110,6 +110,7 @@ Other enhancements
- :meth:`DataFrame.to_json` now accepts an ``indent`` integer argument to enable pretty printing of JSON output (:issue:`12004`)
- :meth:`read_stata` can read Stata 119 dta files. (:issue:`28250`)
- Added ``encoding`` argument to :func:`DataFrame.to_html` for non-ascii text (:issue:`28663`)
- :meth:`Styler.format_null` is now added into the built-in functions to help formatting missing values (:issue:`28358`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u add this into the user guide as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like the name format_nans better, to be similar to fillna, hasnans etc.

@jreback?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll be glad to!

@topper-123 topper-123 changed the title Add built-in function for Styler to format the text displayed for missing values ENH: Add built-in function for Styler to format the text displayed for missing values Oct 20, 2019
@topper-123 topper-123 added Code Style Code style, linting, code_checks Enhancement labels Oct 20, 2019
@jbrockmendel
Copy link
Member

@topper-123 My understanding is that the "Style" tag is refers to code style, linting stuff, not the Styler class. Do I have this backwards?

@immaxchen
Copy link
Contributor Author

The doc has been added, please have a look! 😄
How about the method name? do we have a conclusion?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been moving away from null and to NA. So I think this should be called format_na if anything.

I'm a bit concerned about the implementation. I suspect this would be better handled as an argument to Styler and a method like set_na_format (see how uuid is handled). Then the default_display_func` would check if the value is NA and use the na format if it's NA.

-------
self : Styler
"""
self.format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it will overwrite the formatting of a previously applied formatter for non-NA values. Something like

df.style.format("hi-{}".format).format_null()

is that the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @TomAugspurger, I like the name format_na! and yes, I was intended to make an overwriting implementation. Actually, I have considered the set_* approach as you, but it seems confusing for the case:

df.style.format('{:.2%}').set_na_format('-') # got 'nan%' instead of '-'

I've got a new idea, how about interface like this?

.format_na('-', subset=['col1','col2'])
.format('{:.2%}', na_rep='-', subset=['col3','col4'])

And the docstring for format_na rephrase to:

Format the text display value using default formatter but represent nan as `na_rep`.
For more advanced formatting, use Styler.format() with your custom formatter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may not have been clear about my concern. It's fine that na_format overwrites the formatting for NA values. I'm concerned tht it overwrites the formatting for non-NA values. In my .format("hi-{}".format).format_na('NA') example, the NA values should be formatted as 'NA' and the non-NA values should be formatted as hi-<value>. But I suspect that right now the non-NA formatting is lost (though perhaps it's not).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about adding an na_rep to the .format function... That's probably fine. I think it'd still be useful for users to have a way to control the default NA formatting at the table level.

But if we add an na_rep to format, then we wouldn't need a new format_na method, right?

Copy link
Contributor Author

@immaxchen immaxchen Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, so a setting at the table level: self.na_rep
and the def format(self, formatter, subset=None): becomes
def format(self, formatter=None, subset=None, na_rep=None):
drop .format_na('-'), use .format(na_rep='-') instead, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that sounds correct. I'm not sure what the default should be, but probably just None (no special formatting for NA values).

@simonjayhawkins simonjayhawkins added IO HTML read_html, to_html, Styler.apply, Styler.applymap and removed Code Style Code style, linting, code_checks labels Oct 21, 2019
…r.format method

Change the implementation to integrate with the original `.format()` method by `na_rep` parameter
Add a new table-wise default `na_rep` setting, which can be set through the new `.set_na_rep()` method
Also enhanced the `.highlight_null()` method to be able to use `subset` parameter
Add a few user guide examples and test cases
@immaxchen
Copy link
Contributor Author

Revision done, please help to review, thanks! :D

@immaxchen
Copy link
Contributor Author

I think this PR can also resolve #21527 (ENH: Styler display of NaN / null Values)

@immaxchen
Copy link
Contributor Author

Hi @TomAugspurger, would you be available to review this PR? thanks! 😉

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 29, 2019 via email

@immaxchen
Copy link
Contributor Author

Ouch. 😂
@jreback can you help to review? 😁

@@ -416,16 +423,20 @@ def format_attr(pair):
table_attributes=table_attr,
)

def format(self, formatter, subset=None):
def format(self, formatter=None, subset=None, na_rep=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the point of changing formatter here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you William!
Making formatter optional can enable this syntax .format(na_rep="-") for "only" formatting NA display value. otherwise will have to use .format(None, na_rep="-") -- less intuitive.

@@ -891,6 +908,23 @@ def set_table_styles(self, table_styles):
self.table_styles = table_styles
return self

def set_na_rep(self, na_rep):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate this function? I think should be

Suggested change
def set_na_rep(self, na_rep):
def set_na_rep(self, na_rep: str) -> "Styler":

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem! will do.

@@ -935,19 +969,21 @@ def _highlight_null(v, null_color):
"background-color: {color}".format(color=null_color) if pd.isna(v) else ""
)

def highlight_null(self, null_color="red"):
def highlight_null(self, null_color="red", subset=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this related to na_rep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not related. 😂
Just making it consistent with highlight_min and highlight_max, also for completeness of missing values styling and formatting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea but would prefer if you take this out then. We try to keep PRs to changing one thing at a time as it helps speed up review process

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can always do as a follow up PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that is a great advice!

…ortcut-for-missing-values-formatting-gh28358

sync with the latest upstream/master
1. keep formatter as mandatory in `.format` method
2. annotate the new method `.set_na_rep`
3. remove changes in `.highlight_null` to another PR
4. minor refinement to the whats new and user guide
@immaxchen
Copy link
Contributor Author

Revision done, please have a look. 😁

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can you add

  1. A test exercising passing na_rep to Styler, Styler(df, na_rep="-")
  2. A test ensuring that non-numeric NA values are handled correctly? At least NaT for datetime data.

@@ -71,6 +71,11 @@ class Styler:
The ``id`` takes the form ``T_<uuid>_row<num_row>_col<num_col>``
where ``<uuid>`` is the unique identifier, ``<num_row>`` is the row
number and ``<num_col>`` is the column number.
na_rep : str or None, default None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be str, optional.

@@ -1480,3 +1518,13 @@ def _maybe_wrap_formatter(formatter):
"instead".format(formatter=formatter)
)
raise TypeError(msg)


def _maybe_wrap_na_formatter(formatter, na_rep):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be folded into _maybe_wrap_formatter? I'm finding the multiple layers of wrapping hard to follow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, all done!

@immaxchen
Copy link
Contributor Author

Hi Tom, William, please have a look at the revision when you are available, thanks~

@immaxchen immaxchen requested a review from WillAyd November 9, 2019 09:31
@@ -114,6 +114,8 @@ Other enhancements
- Added ``encoding`` argument to :meth:`DataFrame.to_string` for non-ascii text (:issue:`28766`)
- Added ``encoding`` argument to :func:`DataFrame.to_html` for non-ascii text (:issue:`28663`)
- :meth:`Styler.background_gradient` now accepts ``vmin`` and ``vmax`` arguments (:issue:`12145`)
- :class:`Styler` added :meth:`Styler.set_na_rep` method to set default missing values representation for the entire table.
:meth:`Styler.format` added the ``na_rep`` parameter to help format the missing values (:issue:`21527`, :issue:`28358`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just this note is fine (can delete the line above)

@@ -126,6 +131,7 @@ def __init__(
caption=None,
table_attributes=None,
cell_ids=True,
na_rep=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate this? I think Optional[str]

@@ -416,16 +425,22 @@ def format_attr(pair):
table_attributes=table_attr,
)

def format(self, formatter, subset=None):
def format(self, formatter, subset=None, na_rep=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

annotation here as well

@@ -1487,14 +1523,22 @@ def _get_level_lengths(index, hidden_elements=None):
return non_zero_lengths


def _maybe_wrap_formatter(formatter):
def _maybe_wrap_formatter(formatter, na_rep):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

elif isinstance(na_rep, str):
return lambda x: na_rep if pd.isna(x) else formatter_func(x)
else:
msg = "Expected a string, got {na_rep} instead".format(na_rep=na_rep)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test case that hits this?

@immaxchen
Copy link
Contributor Author

immaxchen commented Nov 12, 2019

After adding the annotation for na_rep, mypy complained as below, so I have no choice but to add more annotations. 😂

pandas/io/formats/style.py:136: error: Need type annotation for 'ctx'
pandas/io/formats/style.py:137: error: Need type annotation for '_todo' (hint: "_todo: List[<type>] = ...")
pandas/io/formats/style.py:158: error: Need type annotation for 'hidden_columns' (hint: "hidden_columns: List[<type>] = ...")
pandas/io/formats/style.py:173: error: Need type annotation for '_display_funcs'
pandas/io/formats/style.py:470: error: "None" not callable

@immaxchen immaxchen requested a review from WillAyd November 17, 2019 07:02
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Minor issue with the new test rest of the comments are optional and can be done as follow ups

self.ctx = defaultdict(list)
self._todo = []
self.ctx = defaultdict(list) # type: DefaultDict[Tuple[int, int], List[str]]
self._todo = [] # type: List[Tuple[Callable, Tuple, Dict]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add sub-types for Callable, Tuple, Dict here?

@@ -416,16 +427,22 @@ def format_attr(pair):
table_attributes=table_attr,
)

def format(self, formatter, subset=None):
def format(self, formatter, subset=None, na_rep: Optional[str] = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required here but if you wanted to in a follow up annotate other arguments would be super helpful

def test_format_with_bad_na_rep(self):
# GH 21527 28358
df = pd.DataFrame([[None, None], [1.1, 1.2]], columns=["A", "B"])
df.style.format(None, na_rep=-1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of marking this as fail should use pytest.raises as a context manager (you'll find other examples throughout tests)

@immaxchen
Copy link
Contributor Author

Thanks @WillAyd, I have revised using pytest.raises and resolve merge conflict. Since this PR is getting quite lengthy, I will try to add annotations in the follow-ups.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK last edit - I know you want to refine types further later but in the interim can you convert to the Py36 syntax? We have #29741 outstanding which is doing that for existing types, so want to use that in new PRs as well

@immaxchen
Copy link
Contributor Author

Hi @WillAyd, revision done for py36 type syntax, please have a look :)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR this lgtm. @TomAugspurger care to look?

@TomAugspurger TomAugspurger merged commit cc3daa6 into pandas-dev:master Nov 25, 2019
@TomAugspurger
Copy link
Contributor

Thanks @immaxchen!

@TomAugspurger TomAugspurger added this to the 1.0 milestone Nov 25, 2019
keechongtan added a commit to keechongtan/pandas that referenced this pull request Nov 25, 2019
…ndexing-1row-df

* upstream/master: (185 commits)
  ENH: add BooleanArray extension array (pandas-dev#29555)
  DOC: Add link to dev calendar and meeting notes (pandas-dev#29737)
  ENH: Add built-in function for Styler to format the text displayed for missing values (pandas-dev#29118)
  DEPR: remove statsmodels/seaborn compat shims (pandas-dev#29822)
  DEPR: remove Index.summary (pandas-dev#29807)
  DEPR: passing an int to read_excel use_cols (pandas-dev#29795)
  STY: fstrings in io.pytables (pandas-dev#29758)
  BUG: Fix melt with mixed int/str columns (pandas-dev#29792)
  TST: add test for ffill/bfill for non unique multilevel (pandas-dev#29763)
  Changed description of parse_dates in read_excel(). (pandas-dev#29796)
  BUG: pivot_table not returning correct type when margin=True and aggfunc='mean'  (pandas-dev#28248)
  REF: Create _lib/window directory (pandas-dev#29817)
  Fixed small mistake (pandas-dev#29815)
  minor cleanups (pandas-dev#29798)
  DEPR: enforce deprecations in core.internals (pandas-dev#29723)
  add test for unused level raises KeyError (pandas-dev#29760)
  Add documentation linking to sqlalchemy (pandas-dev#29373)
  io/parsers: ensure decimal is str on PythonParser (pandas-dev#29743)
  Reenabled no-unused-function (pandas-dev#29767)
  CLN:F-string in pandas/_libs/tslibs/*.pyx (pandas-dev#29775)
  ...

# Conflicts:
#	pandas/tests/frame/indexing/test_indexing.py
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
…r missing values (pandas-dev#29118)

* Add built-in funcion for Styler to format the text displayed for missing values

As described in GH pandas-dev#28358, user who wants to control how NA values are printed
while applying styles to the output will have to implement their own formatter.
(so that the underlying data will not change and can be used for styling)
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
…r missing values (pandas-dev#29118)

* Add built-in funcion for Styler to format the text displayed for missing values

As described in GH pandas-dev#28358, user who wants to control how NA values are printed
while applying styles to the output will have to implement their own formatter.
(so that the underlying data will not change and can be used for styling)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Styler display of NaN / null Values
7 participants