Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: update the pandas.DataFrame.plot.hist docstring #20155

Merged
merged 12 commits into from
Mar 19, 2018
41 changes: 33 additions & 8 deletions pandas/plotting/_core.py
Original file line number Diff line number Diff line change
Expand Up @@ -2951,21 +2951,46 @@ def box(self, by=None, **kwds):

def hist(self, by=None, bins=10, **kwds):
"""
Histogram
Draw one histogram of the DataFrame's columns.

A histogram is a representation of the distribution of data.
This function groups the values of all given Series in the DataFrame
into bins, and draws all bins in only one matplotlib.AxesSubplot. This
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matplotlib.AxesSubplot isn't a think. I'd say either "matplotlib axes" or :ref:`matplotlib.axes.Axes`.

is useful when the DataFrame's Series are in a similar scale.

Parameters
----------
by : string or sequence
by : str or sequence, optional
Column in the DataFrame to group by.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to indicate how this affects the resulting plot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this parameter is passed on to DataFrame.plot(), which is passed on to matplotlib's plot() later... but apparently it's not used in any place at all. I tried with different examples, but in all cases this parameter is ignored. So I'm keeping the old documentation just for the sake of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, you seem to be correct. Will open an issue to discuss in general.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read about a similar issue with pandas.DataFrame.plot.box: #15079

cc/ @lkxz

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dukebody ah, yes, thanks for linking that

bins: integer, default 10
Number of histogram bins to be used
`**kwds` : optional
Additional keyword arguments are documented in
:meth:`pandas.DataFrame.plot`.
bins : int, default 10
Number of histogram bins to be used.
**kwds : optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove : optional.

Keyword arguments to pass on to :meth:`pandas.DataFrame.plot`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep the original sentence here? (it was originally how you have it now, but we changed it in one of the previous PRs to say that ""Additional keywords are documented in DataFrame.plot"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche We will have to revise this after all the sprint PRs have been merged, since people have been changing this and it might not be consistent anymore. But it will be fast and easy.

See #20348


Returns
-------
axes : :class:`matplotlib.axes.Axes` or numpy.ndarray of them
axes : matplotlib.AxesSubplot histogram.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "or ndarray of them" was not correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it was not correct. This function always returns one matplotlib.AxesSubplot. It's different than DataFrame.hist(), which returns an ndarray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looked, and the key to get this is passing subplots=True, then the different columns are each plotted in a subplot in instead of in overlay, en then you get an ndarray of axes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. However this is actually a parameter that is passed on to the plot() function, but it's not part of the default plot.hist() parameters... so IMHO it's ok to say that the return will be just one. Don't you think so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should still document it here, since this is what the return value of this function can be.

(btw, you are welcome to add subplots to the list of Parameters, we have a general issue about that we should document more of those common parameters in each of the methods)


See Also
--------
pandas.DataFrame.hist : Draw histograms per DataFrame's Series.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove pandas.

pandas.Series.hist : Draw a histogram with Series' data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove pandas.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add DataFrame.plot

Examples
--------
When using values between 0 and 3, calling hist() with bins = 3 will
create three bins: one that groups values between 0 and 1, another
for values between 1 and 2, and another for values between 2 and 3.
We use alpha parameter to be able to see overlapping columns.

.. plot::
:context: close-figs

>>> df = pd.DataFrame({
... 'length': [ 1.5, 0.5, 1.2, 0.9, 3],
... 'width': [ 0.7, 0.2, 0.15, 0.2, 1.1]
... }, index = ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.plot.hist(bins = 3, xticks = range(4), alpha = 0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inside the function call, no spaces around = for PEP8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I like the fact you used some more "real world" data, the problem is that with only 5 values, the histrogram will look rather strange. So I would maybe still generate some more random data in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the sprint some of us actually discussed about it: if you generate random data, each time the histogram will be different, and it will be kind of difficult to understand and to explain. So I tried to create some "real" data that can be small enough to generate a clear histogram that shows how it works. But if you think this data is not enough, I could try to improve it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want reproducible data, you can set a random seed. Also, if you use quite some data eg from a normal distrubution, the plot will look very similar each time.

I plotted the data above, and indeed I think it is too small, IMO it gives not a very good "sense" of what a good/useful histogram looks like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using random data, as it imparts when histograms are useful (summarizing the distribution of a larger dataset).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been thinking about it but I can't imagine an easy way to generate random data that, at the same time, it's easy to understand in a compact form (this is just an Example section) and also shows the direct relation with the resulting histogram. Could you tell me any suggestion?

On the other hand, in pandas.DataFrame.hist() docstring they used another "toy" example (which actually was my work too) and it's already merged...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [64]: df = pd.DataFrame({'a': np.random.randn(1000), 'b': np.random.randn(1000)*1.5 + 2})

In [65]: df.plot.hist(alpha=0.5)
Out[65]: <matplotlib.axes._subplots.AxesSubplot at 0x7fbaadcea6d8>

gives me

figure_1-1

which I think is a reasonable plot to show functionality of hist ?

On the other hand, in pandas.DataFrame.hist() docstring they used another "toy" example (which actually was my work too) and it's already merged...

That's true, but IMO we should change it there as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I found a random example that sounds quite natural and understandable. I hope you'll like it.

"""
return self(kind='hist', by=by, bins=bins, **kwds)

Expand Down