ENH: Fix by in DataFrame.plot.hist and DataFrame.plot.box #28373

Merged
merged 178 commits into from
Jul 12, 2021

charlesdong1991
Member

@charlesdong1991 charlesdong1991 commented Sep 10, 2019

@simonjayhawkins
Member

@charlesdong1991 Thanks for the PR! What issue does this relate to/close?

@pep8speaks

pep8speaks commented Sep 10, 2019

Hello @charlesdong1991! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-12 06:22:22 UTC

@charlesdong1991
Member Author

ahh, sorry, it's still in progress, so i didn't add that

Added! @simonjayhawkins

@charlesdong1991 charlesdong1991 changed the title Fix by in hist plot (will change once pr is ready for review) ENH: Fix by in hist plot Sep 10, 2019
@charlesdong1991 charlesdong1991 marked this pull request as ready for review September 10, 2019 19:16
@charlesdong1991 charlesdong1991 changed the title ENH: Fix by in hist plot [WIP] ENH: Fix by in hist plot Sep 10, 2019
@charlesdong1991 charlesdong1991 changed the title [WIP] ENH: Fix by in hist plot ENH: Fix by in hist plot Sep 11, 2019
@charlesdong1991 charlesdong1991 changed the title ENH: Fix by in hist plot [WIP] ENH: Fix by in hist plot Sep 12, 2019
Member

@MarcoGorelli MarcoGorelli left a comment

Looking through reviews here, I noticed some in which

if self.by is None

was asked to be changed to

if self.by

for the sake of readability

Just for reference, here's an example of bugs in pandas visualisation which were going unnoticed because if labelsize was being used instead of if labelsize is None: #34768 - so, strong preference for checking is None from me :)

Anyway, I didn't notice any tests here in which by isn't None but is falsy (e.g. an empty list), so that if self.by would behave differently. Could one be added?
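
To make the concern above concrete, here is a minimal Python sketch (not code from this PR) listing values for which a truthiness check and an explicit is not None check disagree. The candidate values and helper names are illustrative assumptions.

```python
# Illustrative only: these helpers are hypothetical and not part of pandas.
candidates = [None, [], (), 0, "", "gender", ["gender"]]


def truthy_check(by):
    # Mirrors the style `if self.by:`
    return bool(by)


def explicit_check(by):
    # Mirrors the style `if self.by is not None:`
    return by is not None


# Values for which the two styles disagree
disagreements = [by for by in candidates if truthy_check(by) != explicit_check(by)]
print(disagreements)
```

Every falsy value other than None ends up in that list, so a test covering at least one of them pins down which check the code actually relies on.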

@charlesdong1991
Member Author

Hi, @MarcoGorelli Thanks very much for your reviews!

I agree we should be cautious here; I hadn't noticed that kind of bug. However, in this case I think it is fine to use if self.by, because both None and [] (an empty list) should give the same result: no grouping is needed for plotting.

As you mentioned, there were indeed no tests for this, so I have added two: test_hist_plot_with_none_empty_list_by and test_box_plot_with_none_empty_list_by in the test file, with 5 parametrizations to cover different scenarios. Please let me know your thoughts.

@MarcoGorelli
Member

Awesome, thanks!

@MarcoGorelli MarcoGorelli self-requested a review July 1, 2021 08:33
@datapythonista
Member

Agree with @MarcoGorelli about if self.by is None:, thanks for the comment. Not sure why I was thinking of if not self.by:, where my suggestion would have made sense. Sorry for making you change those twice @charlesdong1991

@charlesdong1991
Member Author

charlesdong1991 commented Jul 1, 2021

Thanks to both of you for your comments @datapythonista @MarcoGorelli. Just double-checking before making the change:

To summarize: in this case, whether by is None or an empty list, we should return un-grouped data and therefore un-grouped plots. I guess that was your reason for suggesting if (not) self.by instead of if self.by is (not) None? @datapythonista

Or do we expect to return an error if an empty list is assigned to by?

Hope we have consensus on the expected behaviour ^^

@datapythonista
Member

Treating the empty list the same as None could be fine, but 0 should be considered a column name, and without the is None check it won't be.

But I'd check what the current behavior of df.hist() is and try to do the same, if it's reasonable. That way, when we make df.hist an alias of df.plot.hist and start its deprecation, we don't change the behavior of our users' code much.

@charlesdong1991
Member Author

ahh, okay, yeah 0 can be a column name, didn't think of this scenario. I will adjust the implementation a bit then.

@charlesdong1991
Member Author

Hi, thanks for your comments @MarcoGorelli @datapythonista. Sorry for the late update again, I was busy with exams this past week.

As discussed, I have changed the check to is not None to reflect your comments, and also added the tests test_box_plot_by_0 and test_hist_plot_by_0 to cover cases where the column name is 0.

The behavior now is: when users pass a non-None value for by, an empty list or tuple is treated the same as None, since it means no group-by is needed; otherwise, we use the input (whether it is 0, a string, or a list/tuple of column names) to group the data and make the corresponding plots.
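
The behaviour described above could be sketched roughly as follows. normalize_by and needs_grouping are hypothetical helpers written for illustration; the PR's actual logic lives inside the plotting classes and may differ in detail.

```python
def normalize_by(by):
    """Hypothetical helper: map an empty list/tuple to None, keep other values."""
    if by is not None and isinstance(by, (list, tuple)) and len(by) == 0:
        return None
    return by


def needs_grouping(by):
    # Explicit None check, so a column named 0 still triggers grouping.
    return normalize_by(by) is not None


for value in (None, [], (), 0, "gender", ["gender", "age"]):
    print(repr(value), needs_grouping(value))
```

The explicit is not None comparison is what keeps 0 usable as a column name here, while the empty containers are filtered out in one place.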

Please let me know if you have further comments.

@MarcoGorelli
Member

Please let me know if you have further comments

From a quick look, no obvious objections from me - I won't have a chance to do a full review til next week though

@jreback jreback added this to the 1.4 milestone Jul 12, 2021
Contributor

@jreback jreback left a comment

looks pretty reasonable, a couple of questions.

do the viz docs / doc-strings need updating with examples of this? (could be a followon, if so, can you open an issue)

@@ -1277,6 +1277,9 @@ def hist(self, by=None, bins=10, **kwargs):
        ----------
        by : str or sequence, optional
            Column in the DataFrame to group by.

            .. versionadded:: 1.3.0
Contributor

versionchanged right? can you add a line on what is changing here

Member Author

@charlesdong1991 charlesdong1991 Jul 12, 2021

Yeah, I was unsure about this. We already accept by but don't do anything with it. Now we start supporting by to make plots for groups, so I put versionadded here instead of versionchanged.

I will change it to versionchanged then.

:context: close-figs

>>> age_list = [8, 10, 12, 14, 72, 74, 76, 78, 20, 25, 30, 35, 60, 85]
>>> df = pd.DataFrame({"gender": list("MMMMMMMMFFFFFF"), "age": age_list})
Contributor

was by allowed before?

Member Author

we do allow by, but don't do anything on that. i will also change to versionchanged

self.by = by

# if users assign an empty list or tuple, treat them as None
# then no group-by will be conducted.
Contributor

why is by allowed to be an empty list/tuple?

Member Author

You are right. I initially thought that was the behaviour of the current df.hist and df.box, but I was wrong. I have changed it to raise an error, exactly as the current df.hist and df.box do.

I also updated the inline comment above this line to match, and added tests to reflect the changes.
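
A rough sketch of the revised behaviour, assuming a standalone validation function. validate_by and its error message are placeholders for illustration; the message pandas actually raises is not shown in this thread.

```python
def validate_by(by):
    """Hypothetical check: reject an empty list/tuple, accept everything else."""
    if isinstance(by, (list, tuple)) and len(by) == 0:
        # Placeholder message, not the text used by pandas.
        raise ValueError("by must not be an empty list or tuple")
    return by


for value in (None, 0, "gender", ["gender"], [], ()):
    try:
        print(repr(value), "->", repr(validate_by(value)))
    except ValueError as err:
        print(repr(value), "-> ValueError:", err)
```

Compared with silently treating empty containers as None, raising makes a likely caller mistake visible and keeps the behaviour aligned with the existing API mentioned in the reply above.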

level = 1
else:
raise ValueError(
f"create_iter_data_given_by can only be used with "
Contributor

is this hit in a test? this looks internal yes? or is it user facing

Member Author

Yes, this is just an internal function, used only in the hist and box plot functions, so this line should never be hit. Maybe let me just remove it!

Member

The point of the comment was whether we should add a test to make sure the exception is raised as expected. Probably not needed, but feel free to add one. But better not to remove this. If this is ever called for the wrong plot, it's better to have this clear message than one about level not being defined.
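
The trade-off can be illustrated with a small standalone sketch. Both functions below are hypothetical; the kind parameter and the mapping of "hist" to level 0 are assumptions for illustration, since the diff excerpt only shows level = 1 and the start of the ValueError.

```python
def pick_level_unguarded(kind):
    # Without an else branch, an unexpected kind leaves `level` unassigned.
    if kind == "hist":
        level = 0
    elif kind == "box":
        level = 1
    return level  # raises UnboundLocalError for any other kind


def pick_level_guarded(kind):
    if kind == "hist":
        level = 0
    elif kind == "box":
        level = 1
    else:
        # Explicit guard: the caller learns exactly what went wrong.
        raise ValueError(
            f"this helper can only be used with 'hist' or 'box', got {kind!r}"
        )
    return level


for func in (pick_level_unguarded, pick_level_guarded):
    try:
        func("line")
    except Exception as err:
        print(func.__name__, type(err).__name__, err)
```

The unguarded version fails with an error about a local variable, which says nothing about the unsupported plot kind, whereas the guarded version names the problem directly.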

Member Author

I agree! Will change later and maybe add a small test for this also

@charlesdong1991
Member Author

@jreback thanks very much for your reviews! I made several changes to reflect all your comments! And I will open a PR to update docs/vis with examples of this, and can work on it as a follow-up!

@MarcoGorelli No rush at all. I completed my exams last week ^^, so I will have time to respond to reviews quickly. It is nice to have several pairs of eyes!

@jreback
Contributor

jreback commented Jul 12, 2021

going to merge this, but @MarcoGorelli pls have a look when you get a chance.

@jreback jreback merged commit d3a018d into pandas-dev:master Jul 12, 2021
@jreback
Contributor

jreback commented Jul 12, 2021

thanks @charlesdong1991 sorry for the long delay. nice feature!

@charlesdong1991 charlesdong1991 deleted the fix_by_plot branch July 12, 2021 13:47
@datapythonista
Member

Great job @charlesdong1991. It would be great if you could create issues, or work on the PRs if you've got time, to delete all the duplicate code between df.hist() and df.plot.hist()... I think we're close now, and it will be a really nice clean-up to not have two implementations of those plots.

Successfully merging this pull request may close these issues.

DataFrame.plot.box ignores by argument
9 participants