ENH: Fix `by` in DataFrame.plot.hist and DataFrame.plot.box #28373

charlesdong1991 · 2019-09-10T16:41:14Z

closes DataFrame.plot.box ignores by argument #15079
xref: API: consider deprecating DataFrame.hist in favor of DataFrame.plot.hist #11053, DEPR: Clean up of pandas.plotting #28177
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

simonjayhawkins · 2019-09-10T17:12:47Z

@charlesdong1991 Thanks for the PR! What issue does this relate to/close?

pep8speaks · 2019-09-10T18:28:17Z

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-12 06:22:22 UTC

charlesdong1991 · 2019-09-10T18:29:59Z

ahh, sorry, it's still in progress, so i didn't add that

Added! @simonjayhawkins

MarcoGorelli

Looking through the reviews here, I noticed some in which `if self.by is None` was asked to be changed to `if self.by` for the sake of readability.

Just for reference, here's an example of bugs in pandas visualisation that were going unnoticed because `if labelsize` was being used instead of `if labelsize is None`: #34768. So, strong preference for checking `is None` from me :)

Anyway, I didn't notice tests here in which `by` isn't None but would pass `if self.by` (e.g. an empty list), could one be added?
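
For readers following the thread, a minimal plain-Python sketch of why the two checks are not interchangeable. The helper names `groups_if_truthy` and `groups_if_not_none` are hypothetical and do not appear in the PR; they only contrast a truthiness check with an identity check against None.

```python
def groups_if_truthy(by):
    """Hypothetical check: treat `by` as a grouping key only when it is truthy."""
    return bool(by)


def groups_if_not_none(by):
    """Hypothetical check: treat `by` as a grouping key unless it is None."""
    return by is not None


# Both checks agree for None and for an ordinary column name.
agree = [
    value
    for value in (None, "gender")
    if groups_if_truthy(value) == groups_if_not_none(value)
]

# They disagree for an empty list: it is falsy, yet it is not None.
disagree = [
    value
    for value in (None, "gender", [])
    if groups_if_truthy(value) != groups_if_not_none(value)
]

print(agree)
print(disagree)
```

A test that only passes None or a non-empty column name cannot tell the two checks apart, which is why a case such as an empty list is worth covering.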

charlesdong1991 · 2021-07-01T08:32:24Z

Hi, @MarcoGorelli Thanks very much for your reviews!

I agree we should be cautious here; I hadn't noticed that kind of bug. However, in this case I think it is fine to use `if self.by`, because both None and `[]` (an empty list) should give the same result: no grouped data is needed for plotting.

As you mentioned, there were indeed no tests for this, so I have added two tests, `test_hist_plot_with_none_empty_list_by` and `test_box_plot_with_none_empty_list_by`, in the test file, with 5 parametrizations to cover different scenarios. Please let me know your thoughts.

MarcoGorelli · 2021-07-01T08:33:05Z

Awesome, thanks!

datapythonista · 2021-07-01T14:24:12Z

Agree with @MarcoGorelli about `if self.by is None:`, thanks for the comment. Not sure why I was thinking of `if not self.by:`, where what I suggested would make sense. Sorry for making you change those twice, @charlesdong1991.

charlesdong1991 · 2021-07-01T14:41:52Z

Thank you both for your comments, @datapythonista @MarcoGorelli. Just double-checking before making the change.

To summarize: in this case, whether `by` is None or an empty list, we should return un-grouped data and thus un-grouped plots. I guess that might be your reason for suggesting `if (not) self.by` instead of `if self.by is (not) None`, @datapythonista?

Or do we expect to raise an error if an empty list is assigned to `by`?

Hope we can reach consensus on the expected behaviour ^^

datapythonista · 2021-07-01T20:09:17Z

The empty list could be fine if it's treated the same as None, but 0 should be considered a column name, and without the `is None` check it won't be.

But I'd check what the current behavior of `df.hist()` is and try to do the same, if it's reasonable. That way, when we make `df.hist` an alias of `df.plot.hist` and start its deprecation, we don't change our users' code behavior much.
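
The point about 0 as a column name can be sketched without pandas. In the sketch below, a plain dict stands in for a DataFrame with integer column labels, and `group_values` with its `strict` flag is a hypothetical helper made up for illustration, not code from the PR.

```python
# Stand-in for a DataFrame whose column labels are the integers 0 and 1
# (made-up data for illustration).
columns = {0: ["a", "b", "a"], 1: [1.0, 2.0, 3.0]}


def group_values(columns, value_col, by=None, strict=True):
    """Group columns[value_col] by the labels found in columns[by].

    strict=True decides "no grouping" with `by is None`;
    strict=False decides it with a truthiness check (`not by`).
    """
    no_grouping = (by is None) if strict else (not by)
    if no_grouping:
        # Everything ends up in a single un-grouped bucket.
        return {None: list(columns[value_col])}
    groups = {}
    for key, value in zip(columns[by], columns[value_col]):
        groups.setdefault(key, []).append(value)
    return groups


with_is_none = group_values(columns, value_col=1, by=0, strict=True)
with_truthiness = group_values(columns, value_col=1, by=0, strict=False)

print(with_is_none)
print(with_truthiness)
```

With the identity check, column 0 is used as the grouping key; with the truthiness check, the request to group by column 0 is silently ignored because 0 is falsy.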

charlesdong1991 · 2021-07-02T12:01:58Z

ahh, okay, yeah 0 can be a column name, didn't think of this scenario. I will adjust the implementation a bit then.

charlesdong1991 · 2021-07-10T14:38:23Z

Hi, thanks for your comments @MarcoGorelli @datapythonista. Sorry for the late update again, I was still busy with some exams this past week.

As discussed, I have changed it to `is not None` to reflect your comment, and also added the tests `test_box_plot_by_0` and `test_hist_plot_by_0` to cover cases where the column name is 0.

The behavior now is: when users pass a non-None value for `by`, an empty list or tuple is treated the same as None, since it means no group-by is needed; otherwise, we use the input (whether it is 0, a string, or a list/tuple of column names) to group and make the corresponding plots.

Please let me know if you have further comments.

MarcoGorelli · 2021-07-11T13:25:02Z

> Please let me know if you have further comments

From a quick look, no obvious objections from me - I won't have a chance to do a full review til next week though

jreback

looks pretty reasonable, a couple of questions.

do the viz docs / doc-strings need updating with examples of this? (could be a followon, if so, can you open an issue)

jreback · 2021-05-31T16:50:11Z

pandas/plotting/_core.py

@@ -1277,6 +1277,9 @@ def hist(self, by=None, bins=10, **kwargs):
        ----------
        by : str or sequence, optional
            Column in the DataFrame to group by.
+
+            .. versionadded:: 1.3.0


versionchanged, right? Can you add a line on what is changing here?

charlesdong1991 · 2021-07-12

Yeah, I was in doubt about this. We do accept `by` right now but don't do anything with it. Now we start supporting `by` to make plots for groups, so I put `versionadded` here instead of `versionchanged`.

I think I will change it to `versionchanged` then.

jreback · 2021-05-31T16:50:25Z

pandas/plotting/_core.py

+            :context: close-figs
+
+            >>> age_list = [8, 10, 12, 14, 72, 74, 76, 78, 20, 25, 30, 35, 60, 85]
+            >>> df = pd.DataFrame({"gender": list("MMMMMMMMFFFFFF"), "age": age_list})


was `by` allowed before?

charlesdong1991 · 2021-07-12

We do allow `by`, but don't do anything with it. I will also change this to `versionchanged`.

jreback · 2021-07-12T01:36:48Z

pandas/plotting/_matplotlib/core.py

-        self.by = by
+
+        # if users assign an empty list or tuple, treat them as None
+        # then no group-by will be conducted.


why is `by` allowed to be an empty list/tuple?

charlesdong1991 · 2021-07-12

You are right. I initially thought that was the behaviour of the current `df.hist` and `df.box`, but I was wrong. I changed it to raise an error, exactly the same as the current `df.hist` and `df.box`.

I also changed the inline comment above this line to align with the change, and added tests to reflect it.
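
The behaviour settled on in this thread (None means no grouping, an empty list or tuple is an error, anything else is a grouping key) might look roughly like the sketch below. `validate_by` is a hypothetical helper, and the exception type and message are assumptions for illustration; the thread only says that an error is raised.

```python
def validate_by(by):
    """Hypothetical helper: reject empty group keys, normalise the rest to a list."""
    if by is None:
        # No grouping requested.
        return None
    if isinstance(by, (list, tuple)):
        if len(by) == 0:
            # Assumed error type and message, for illustration only.
            raise ValueError("No group keys passed!")
        return list(by)
    # A single label, including falsy ones such as 0, is a valid key.
    return [by]


rejected = []
for empty in ([], ()):
    try:
        validate_by(empty)
    except ValueError:
        rejected.append(empty)

print(validate_by(None), validate_by(0), validate_by(("gender", "age")))
print(rejected)
```

Note that the check for None uses identity, so a column labelled 0 still reaches the final branch and is kept as a grouping key.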

jreback · 2021-07-12T01:38:06Z

pandas/plotting/_matplotlib/groupby.py

+        level = 1
+    else:
+        raise ValueError(
+            f"create_iter_data_given_by can only be used with "


is this hit in a test? this looks internal, yes? or is it user facing?

charlesdong1991 · 2021-07-12

Yes, this is just an internal function, only used in the hist and box plot functions, so this line should never be hit. Maybe let me just remove it!

datapythonista · 2021-07-12

The point of the comment was whether we should add a test to make sure the exception is raised as expected. Probably not needed, but feel free to add one. But better not to remove this: if this is ever called for the wrong plot, it's better to have this clear message than one about `level` not being defined.

charlesdong1991 · 2021-07-12

I agree! Will change it later and maybe add a small test for this as well.
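
A small sketch of the trade-off discussed here, in plain Python. The functions `pick_level` and `pick_level_without_else` are hypothetical stand-ins for the branch in the diff, and the mapping of "hist" to 0 and "box" to 1 is an assumption; only `level = 1` followed by the `else: raise ValueError(...)` branch is visible in the quoted hunk.

```python
def pick_level(kind):
    """Hypothetical stand-in: explicit error for unsupported plot kinds."""
    if kind == "hist":
        level = 0  # assumed value, not shown in the quoted diff
    elif kind == "box":
        level = 1
    else:
        raise ValueError(
            f"pick_level can only be used with 'hist' and 'box', got {kind!r}"
        )
    return level


def pick_level_without_else(kind):
    """Same logic with the else branch removed."""
    if kind == "hist":
        level = 0
    elif kind == "box":
        level = 1
    # For any other kind, `level` was never assigned, so this line fails
    # with an UnboundLocalError that says nothing about the real mistake.
    return level


errors = {}
for func in (pick_level, pick_level_without_else):
    try:
        func("line")
    except Exception as err:
        errors[func.__name__] = type(err).__name__

print(errors)
```

Keeping the `else` branch costs a few lines but turns an accidental misuse into a message that names the supported plot kinds.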

charlesdong1991 · 2021-07-12T06:20:39Z

@jreback thanks very much for your reviews! I made several changes to reflect all your comments! And I will open a PR to update docs/vis with examples of this, and can work on it as a follow-up!

@MarcoGorelli No rush at all. I completed my exams last week ^^, so I will have time to respond to reviews quickly. It is nice to have several pairs of eyes!

jreback · 2021-07-12T13:10:38Z

going to merge this, but @MarcoGorelli pls have a look when you get a chance.

jreback · 2021-07-12T13:10:59Z

thanks @charlesdong1991 sorry for the long delay. nice feature!

datapythonista · 2021-07-12T18:32:51Z

Great job @charlesdong1991. Would be great if you can create issues, or work on the PRs if you've got time, to delete all the duplicate code between `df.hist()` and `df.plot.hist()`... I think we're close now, and it will be a really nice cleanup to not have two implementations of those plots.

charlesdong1991 added 5 commits December 3, 2018 17:43

remove \n from docstring

7e461a1

fix conflicts

1314059

Merge remote-tracking branch 'upstream/master'

8bcb313

Merge remote-tracking branch 'upstream/master' into fix_by_plot

e36592c

fix by in hist

b2f45a6

make plot work

8b6e00a

charlesdong1991 added 2 commits September 10, 2019 21:05

add _group_plot function

dc0c2ec

check function

d803938

charlesdong1991 changed the title ~~Fix by in hist plot (will change once pr is ready for review)~~ ENH: Fix by in hist plot Sep 10, 2019

charlesdong1991 marked this pull request as ready for review September 10, 2019 19:16

charlesdong1991 added 9 commits September 10, 2019 21:18

reformat

33dd762

put import up

d59d642

add comments

66eb06c

Mimic group plot

ea267ad

fix import failure

8095224

reformat

31decc1

fix test

e4bdbd0

hacky fix

4033159

fix isrot

57a3bdf

charlesdong1991 changed the title ~~ENH: Fix by in hist plot~~ [WIP] ENH: Fix by in hist plot Sep 10, 2019

charlesdong1991 added 5 commits September 11, 2019 11:26

fix tests

8060223

fix import failure

d666334

fix import error

3216d59

Update imports

45f4b7f

test imports

2b0785b

charlesdong1991 changed the title ~~[WIP] ENH: Fix by in hist plot~~ ENH: Fix by in hist plot Sep 11, 2019

charlesdong1991 changed the title ~~ENH: Fix by in hist plot~~ [WIP] ENH: Fix by in hist plot Sep 12, 2019

charlesdong1991 added 2 commits June 30, 2021 22:42

better doc string

6896546

fixup doc fail

3c54302

MarcoGorelli requested changes Jul 1, 2021

charlesdong1991 added 2 commits July 1, 2021 10:26

code change on Macro reviews

2d20178

Add more tests

a169dfd

MarcoGorelli self-requested a review July 1, 2021 08:33

fixup

d0b56ff

code change on reviews

dec313c

jreback added this to the 1.4 milestone Jul 12, 2021

jreback requested changes Jul 12, 2021

charlesdong1991 added 3 commits July 12, 2021 08:13

changes based on Jeff review

143f286

doc

283286f

Merge remote-tracking branch 'upstream/master' into fix_by_plot

f2a0736

fix flake8

f1aeee0

charlesdong1991 mentioned this pull request Jul 12, 2021

DOC: Add updated docstring for examples of by argument in df.plot.hist and df.plot.box #42492

Closed

jreback approved these changes Jul 12, 2021

jreback merged commit d3a018d into pandas-dev:master Jul 12, 2021

charlesdong1991 deleted the fix_by_plot branch July 12, 2021 13:47
