Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: update the pandas.DataFrame.plot.hist docstring #20155

Merged
merged 12 commits into from
Mar 19, 2018

Conversation

liopic
Copy link
Contributor

@liopic liopic commented Mar 10, 2018

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
#################### Docstring (pandas.DataFrame.plot.hist) ####################
################################################################################

Draw one histogram of the DataFrame's Series using matplotlib.

A histogram is a representation of the distribution of data.
This function groups the values of all given Series in the DataFrame
into bins, and draws all bins in only one matplotlib.AxesSubplot. This
is useful when the DataFrame's Series are in a similar scale.

Parameters
----------
by : str
    Column in the DataFrame to group by.
bins : int, default 10
    Number of histogram bins to be used.
**kwds : optional
    Parameters to pass on to :meth:`pandas.DataFrame.plot`.

Returns
-------
axes : matplotlib.AxesSubplot or np.array of them

See Also
--------
:meth:`pandas.DataFrame.hist` : Draw histograms per DataFrame's Series.
:meth:`pandas.Series.hist` : Draw a histogram with Series' data.

Examples
--------

When using values between 0 and 3, calling hist() with bins = 3 will
create three bins: one that groups values between 0 and 1, another
for values between 1 and 2, and another for values between 2 and 3.
We use alpha parameter to be able to see overlapping columns.

.. plot::
    :context: close-figs

    >>> df = pd.DataFrame({
    ...     'length': [ 1.5, 0.5, 1.2, 0.9, 3],
    ...     'width': [ 0.7, 0.2, 0.15, 0.2,  1.1]
    ...     }, index= ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
    >>> hist = df.plot.hist(bins = 3, xticks = range(4), alpha = 0.5)

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.DataFrame.plot.hist" correct. :)


Copy link
Contributor

@igorcadelima igorcadelima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, remove the blank line after

Examples
--------

For the comment on **kwds, remove the backticks around **kwds and replace the description with "Keyword arguments to pass on to :meth:pandas.DataFrame.plot.".

For the "Returns" section, indicate the type the method returns and provide a clear and brief explanation. Avoid the use of "them" if not referring to the subject of the sentence.


Parameters
----------
by : string or sequence
by : str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing sequence? A sequence of string can be passed to select the columns to be plotted.

@codecov
Copy link

codecov bot commented Mar 12, 2018

Codecov Report

Merging #20155 into master will increase coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20155      +/-   ##
==========================================
+ Coverage   91.77%   91.79%   +0.02%     
==========================================
  Files         152      152              
  Lines       49205    49205              
==========================================
+ Hits        45159    45169      +10     
+ Misses       4046     4036      -10
Flag Coverage Δ
#multiple 90.18% <ø> (+0.02%) ⬆️
#single 41.85% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/plotting/_core.py 82.27% <ø> (ø) ⬆️
pandas/util/testing.py 83.74% <0%> (-0.21%) ⬇️
pandas/core/indexes/base.py 96.68% <0%> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.22% <0%> (ø) ⬆️
pandas/plotting/_converter.py 66.81% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7273ea0...627170e. Read the comment docs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Added some comments

@@ -2951,21 +2951,46 @@ def box(self, by=None, **kwds):

def hist(self, by=None, bins=10, **kwds):
"""
Histogram
Draw one histogram of the DataFrame's Series using matplotlib.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the DataFrame's Series" -> "the DataFrame's columns" ?

The "using matplotlib" is not needed in this summary line, as all our plotting machinery is based on matplotlb


Parameters
----------
by : string or sequence
by : str or sequence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add , optional


Parameters
----------
by : string or sequence
by : str or sequence
Column in the DataFrame to group by.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to indicate how this affects the resulting plot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this parameter is passed on to DataFrame.plot(), which is passed on to matplotlib's plot() later... but apparently it's not used in any place at all. I tried with different examples, but in all cases this parameter is ignored. So I'm keeping the old documentation just for the sake of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, you seem to be correct. Will open an issue to discuss in general.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read about a similar issue with pandas.DataFrame.plot.box: #15079

cc/ @lkxz

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dukebody ah, yes, thanks for linking that


Returns
-------
axes : :class:`matplotlib.axes.Axes` or numpy.ndarray of them
axes : matplotlib.AxesSubplot histogram.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "or ndarray of them" was not correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it was not correct. This function always returns one matplotlib.AxesSubplot. It's different than DataFrame.hist(), which returns an ndarray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looked, and the key to get this is passing subplots=True, then the different columns are each plotted in a subplot in instead of in overlay, en then you get an ndarray of axes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. However this is actually a parameter that is passed on to the plot() function, but it's not part of the default plot.hist() parameters... so IMHO it's ok to say that the return will be just one. Don't you think so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should still document it here, since this is what the return value of this function can be.

(btw, you are welcome to add subplots to the list of Parameters, we have a general issue about that we should document more of those common parameters in each of the methods)

See Also
--------
:meth:`pandas.DataFrame.hist` : Draw histograms per DataFrame's Series.
:meth:`pandas.Series.hist` : Draw a histogram with Series' data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need the :meth`..` part here, as sphinx will make those links automatically

... 'length': [ 1.5, 0.5, 1.2, 0.9, 3],
... 'width': [ 0.7, 0.2, 0.15, 0.2, 1.1]
... }, index= ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.plot.hist(bins = 3, xticks = range(4), alpha = 0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inside the function call, no spaces around = for PEP8

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Added few more comments, almost there!

bins : int, default 10
Number of histogram bins to be used.
**kwds : optional
Keyword arguments to pass on to :meth:`pandas.DataFrame.plot`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep the original sentence here? (it was originally how you have it now, but we changed it in one of the previous PRs to say that ""Additional keywords are documented in DataFrame.plot"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche We will have to revise this after all the sprint PRs have been merged, since people have been changing this and it might not be consistent anymore. But it will be fast and easy.

See #20348

... 'length': [ 1.5, 0.5, 1.2, 0.9, 3],
... 'width': [ 0.7, 0.2, 0.15, 0.2, 1.1]
... }, index = ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.plot.hist(bins = 3, xticks = range(4), alpha = 0.5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I like the fact you used some more "real world" data, the problem is that with only 5 values, the histrogram will look rather strange. So I would maybe still generate some more random data in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the sprint some of us actually discussed about it: if you generate random data, each time the histogram will be different, and it will be kind of difficult to understand and to explain. So I tried to create some "real" data that can be small enough to generate a clear histogram that shows how it works. But if you think this data is not enough, I could try to improve it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want reproducible data, you can set a random seed. Also, if you use quite some data eg from a normal distrubution, the plot will look very similar each time.

I plotted the data above, and indeed I think it is too small, IMO it gives not a very good "sense" of what a good/useful histogram looks like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using random data, as it imparts when histograms are useful (summarizing the distribution of a larger dataset).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been thinking about it but I can't imagine an easy way to generate random data that, at the same time, it's easy to understand in a compact form (this is just an Example section) and also shows the direct relation with the resulting histogram. Could you tell me any suggestion?

On the other hand, in pandas.DataFrame.hist() docstring they used another "toy" example (which actually was my work too) and it's already merged...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [64]: df = pd.DataFrame({'a': np.random.randn(1000), 'b': np.random.randn(1000)*1.5 + 2})

In [65]: df.plot.hist(alpha=0.5)
Out[65]: <matplotlib.axes._subplots.AxesSubplot at 0x7fbaadcea6d8>

gives me

figure_1-1

which I think is a reasonable plot to show functionality of hist ?

On the other hand, in pandas.DataFrame.hist() docstring they used another "toy" example (which actually was my work too) and it's already merged...

That's true, but IMO we should change it there as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I found a random example that sounds quite natural and understandable. I hope you'll like it.

@pep8speaks
Copy link

pep8speaks commented Mar 15, 2018

Hello @liopic! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 19, 2018 at 20:56 Hours UTC


A histogram is a representation of the distribution of data.
This function groups the values of all given Series in the DataFrame
into bins, and draws all bins in only one matplotlib.AxesSubplot. This
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matplotlib.AxesSubplot isn't a think. I'd say either "matplotlib axes" or :ref:`matplotlib.axes.Axes`.

`**kwds` : optional
bins : int, default 10
Number of histogram bins to be used.
**kwds : optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove : optional.


See Also
--------
pandas.DataFrame.hist : Draw histograms per DataFrame's Series.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove pandas.

See Also
--------
pandas.DataFrame.hist : Draw histograms per DataFrame's Series.
pandas.Series.hist : Draw a histogram with Series' data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove pandas.

--------
pandas.DataFrame.hist : Draw histograms per DataFrame's Series.
pandas.Series.hist : Draw a histogram with Series' data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add DataFrame.plot

... 'length': [ 1.5, 0.5, 1.2, 0.9, 3],
... 'width': [ 0.7, 0.2, 0.15, 0.2, 1.1]
... }, index = ['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.plot.hist(bins = 3, xticks = range(4), alpha = 0.5)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using random data, as it imparts when histograms are useful (summarizing the distribution of a larger dataset).

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, nice example!

(made a small amendment mainly for PEP8 fixup)

@jorisvandenbossche jorisvandenbossche merged commit c726c12 into pandas-dev:master Mar 19, 2018
@jorisvandenbossche
Copy link
Member

@liopic Thanks for the PR 👍

nehiljain added a commit to nehiljain/pandas that referenced this pull request Mar 21, 2018
…ame_describe

* upstream/master: (158 commits)
  Add link to "Craft Minimal Bug Report" blogpost (pandas-dev#20431)
  BUG: fixed json_normalize for subrecords with NoneTypes (pandas-dev#20030) (pandas-dev#20399)
  BUG: ExtensionArray.fillna for scalar values (pandas-dev#20412)
  DOC" update the Pandas core window rolling count docstring" (pandas-dev#20264)
  DOC: update the pandas.DataFrame.plot.hist docstring (pandas-dev#20155)
  DOC: Only use ~ in class links to hide prefixes. (pandas-dev#20402)
  Bug: Allow np.timedelta64 objects to index TimedeltaIndex (pandas-dev#20408)
  DOC: add disallowing of Series construction of len-1 list with index to whatsnew (pandas-dev#20392)
  MAINT: Remove weird pd file
  DOC: update the Index.isin docstring (pandas-dev#20249)
  BUG: Handle all-NA blocks in concat (pandas-dev#20382)
  DOC: update the pandas.core.resample.Resampler.fillna docstring (pandas-dev#20379)
  BUG: Don't raise exceptions splitting a blank string (pandas-dev#20067)
  DOC: update the pandas.DataFrame.cummax docstring (pandas-dev#20336)
  DOC: update the pandas.core.window.x.mean docstring (pandas-dev#20265)
  DOC: update the api.types.is_number docstring (pandas-dev#20196)
  Fix linter (pandas-dev#20389)
  DOC: Improved the docstring of pandas.Series.dt.to_pytimedelta (pandas-dev#20142)
  DOC: update the pandas.Series.dt.is_month_end docstring (pandas-dev#20181)
  DOC: update the window.Rolling.min docstring (pandas-dev#20263)
  ...
nikoskaragiannakis pushed a commit to nikoskaragiannakis/pandas that referenced this pull request Mar 25, 2018
dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants