-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: Updating operators docstrings #20415
Conversation
pandas doc ops.py update for Comparison Methods df examples
Codecov Report
@@ Coverage Diff @@
## master #20415 +/- ##
==========================================
- Coverage 42.46% 42.46% -0.01%
==========================================
Files 161 161
Lines 51557 51556 -1
==========================================
- Hits 21892 21891 -1
Misses 29665 29665
Continue to review full report at Codecov.
|
This looks close to ready. Pls rebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments inline.
pandas/core/ops.py
Outdated
@@ -582,6 +642,14 @@ def _get_op_name(op, special): | |||
DataFrame.{reverse} | |||
""" | |||
|
|||
_flex_comp_doc_FRAME = """ | |||
Wrapper for flexible comparison methods {name} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could use a description of what “flexible” means in this context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We maintained original text and only added examples. Please advise on text changes.
pandas/core/ops.py
Outdated
'ge': {'op': '>=', | ||
'desc': 'Greater than or equal to', | ||
'reverse': None, | ||
'df_examples': None}} | ||
'df_examples': _ge_example_FRAME}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why you chose these two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was part of the Pandas Doc Sprint some months ago. I was tasked with ge/gt functions with teammates handling others. But we can incorporate le/lt and eq/ne pairs with similar set of examples.
pandas/core/ops.py
Outdated
4 5 10 17 | ||
>>> df2 = pd.DataFrame({'num1': range(6,11), | ||
... 'num2': range(1,10,2), | ||
... 'num3': range(1,20,4)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case the doc linter cares,about it, add spaces after commas here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls add a space after each of the commas in the range
calls.
pandas/core/ops.py
Outdated
if op_desc['df_examples'] is not None: | ||
base_doc = _flex_comp_doc_FRAME | ||
doc = base_doc.format(name=op_name, | ||
df_examples=op_desc['df_examples']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pls add a .strip()
to op_desc['df_examples']
Needs rebase. Aside from that and two small requested edits, this is ready pending OK from @datapythonista . |
Thanks for the contribution @ParfaitG. Your examples look good, but if you have time it'd be cool to add some extra stuff to this PR (sorry we couldn't provide more clear information during the sprint). I'd personally reuse a single docstring for Then, it'd be good to have the Also, the For the examples, what you did looks good, but it could help the users if the dataframes are a bit smaller (something like 2 cols and 3 rows), and the example is a bit more meaningful (e.g. instead of columns How does this sound? |
Hello @ParfaitG! Thanks for updating the PR.
Comment last updated on September 27, 2018 at 23:44 Hours UTC |
Understood @datapythonista ... please see updated PR with a more comprehensive revision on the comparison operators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes @ParfaitG, looking much better now. I'd still change some things, mainly to simplify the code.
pandas/core/ops.py
Outdated
_eq_example_FRAME = """ | ||
>>> df1 = pd.DataFrame({'tool': ['python', 'r', 'julia'], | ||
... 'score': [100, 200, 300]}, | ||
... columns=['tool', 'score']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to specify columns, they are already taken from the dictionary keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Columns specify order here as score will precede tool due to alphabetical order. But I will adjust with new examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, in that case I'd use a list of tuples, with the columns
argument. But just a personal opinion, whatever you find clearer.
pandas/core/ops.py
Outdated
tool score | ||
0 python 100 | ||
1 r 200 | ||
2 julia 300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to create controversy with other languages, or give the idea that Python is competing with them. I know it's a bit exagerated, but let's better avoid this example.
I like how it makes easier to understand the point, than with the random data. But I find slightly misleading that the languages are not the labels, and that the dataframes are compared with the rows sorted in a different way.
I know it's very hard to find great examples (I know by experience, still couldn't find good ones for some docstrings I'm working on). But let's see if we can find something better.
It just came to my mind that we could have two dataframes, predicted_gdp
and actual_gdp
, with couple of countries as rows, and couple of years as columns. I think that could be a quite realistic example. Or something similar, but that solves the resorting of the columns trickiness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. Will add a new example. Just thought in the spirit of open source!
pandas/core/ops.py
Outdated
tool score | ||
0 True False | ||
1 True True | ||
2 True True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think that more than having a simple example for each docstring, we could also reuse the examples. So, we create the dataframes once, and we show the methods (not necessarily all, I'm sure the users will be able to extrapolate).
And what it would be nice, is to show in examples what the parameters do, and how they change the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All examples uses the same df1 and df2. But I will work on showing the use of rows and columns axis args.
pandas/core/ops.py
Outdated
Parameters | ||
---------- | ||
other : DataFrame | ||
axis : {{0, 1, 'columns', 'rows'}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you can check an example on how we usually document the axis values. Also, it would be nice to have description for both other
and axis
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I pulled these two args from Series.ge
doc (which in hindsight needs work as well since its example uses .add
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, still plenty of work in many docstrings, but we're getting closer. :)
pandas/core/ops.py
Outdated
|
||
Returns | ||
------- | ||
result : DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can leave just the type, DataFrame of bool
would be perfect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
pandas/core/ops.py
Outdated
Returns | ||
------- | ||
result : DataFrame | ||
Consisting of boolean values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a bit redundant, but a bit more explanation, like "Result of the comparison.". Note the period at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even better.
pandas/core/ops.py
Outdated
DataFrame.gt : Return DataFrame of boolean values strictly greater | ||
than to the elementwise rows or columns of one DataFrame to | ||
another | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I'm missing something, but is this always the same for all methods? It could go directly to the doctring and not in a separate variable if that's the case. Besides good docstrings, it's good if we keep the code as simple as possible. So we usually leave the same See Also
for all methods in this case, even if the method being documented is self referencing.
Also, do you think we can find something less verbose for the descriptions. Something like Compare DataFrames for equality elementwise
may be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though much of the words are the same, they differ slightly in the middle. Are you advising removing descriptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, what I meant is that if it's possible, it'd be good that each of these descriptions is shorter. Just making them more concise if you find the way.
pandas/core/ops.py
Outdated
|
||
See also | ||
-------- | ||
{reverse} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See Also
should go before examples. Also, note the capital A
.
Can you run ./scripts/validate_docstring.py pandas.DataFrame.eq
after the next set of changes to validate these details?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. And strangely errors did not mention the order of See Also (possibly capitalization of A was the issue?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to add several validations to that script. The order of the sections is one of them.
Thanks for the update @ParfaitG. It looks better, but as we discussed before, I don't think it is as useful to have the example of each comparison in each page, than to show well why the methods exists, how they are different from the operators, and how you can use them on different axis. Yesterday I was working on a tutorial that among other things covers this. You can take a look at the What I think we should do: Get rid of all the Then, the example should show:
As there are many cases to show, you can keep using the different methods in each of them, so more or less all them are covered. Meaning that when you compare with a scalar you can use Let me know how does this sound, and if there is anything I can help with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better now. Added some comments, but I think this new version adds much more value for users.
pandas/core/ops.py
Outdated
@@ -436,27 +436,33 @@ def _get_op_name(op, special): | |||
'eq': {'op': '==', | |||
'desc': 'Equal to', | |||
'reverse': None, | |||
'df_examples': None}, | |||
'df_examples': None, | |||
'others': None}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't use reverse
, df_examples
and others
, can we get rid of them instead of having all them to None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df_examples
was kept from original docs and not removed for code consistency if used elsewhere. Can remove and test docstrings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverse
is needed for compatibility with arithmetic operators as they are included in same dictionary, _op_descriptions
pandas/core/ops.py
Outdated
_flex_comp_doc_FRAME = """ | ||
Flexible wrappers to comparison operators (`eq`, `ne`, `le`, `lt`, `ge`, `gt`). | ||
|
||
Equivalent to `==`, `=!`, `<=`, `<`, `>=`, `>` with support to choose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused... In the previous dictionaries, you define op
, desc
... and you also have name
which is the key. Shouldn't we use them here? So, we explain in each page the documented method and not all them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with examples that run through several of these sibling methods, the intro lists all of them. It seems counterintuitive to have an intro that describes one method with examples from several methods.
pandas/core/ops.py
Outdated
|
||
Parameters | ||
---------- | ||
other : constant, list, tuple, ndarray, Series, or DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think scalar
is preferred over constant
. For list
, tuple
and ndarray
I think you can abbreviate them as sequence
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pandas/core/ops.py
Outdated
axis : int or str, optional | ||
Axis to target. Can be either the axis name ('index', 'rows', | ||
'columns') or number (0, 1). | ||
level : int or name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use object
than name
, as in this line we document the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pandas/core/ops.py
Outdated
|
||
Returns | ||
------- | ||
result : DataFrame of bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need result
here, you can just leave the type DataFrame of bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pandas/core/ops.py
Outdated
result : DataFrame of bool | ||
Result of the comparison. | ||
|
||
See also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the A
should be upper case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no current docs page that has Also
as title case.
pandas/core/ops.py
Outdated
B 300 175 | ||
C 220 225 | ||
|
||
Compare to a constant and operator version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain that both are equivalent in this case? And may be finish the sentence with a period?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pandas/core/ops.py
Outdated
0 A 300 | ||
1 B 250 | ||
2 C 100 | ||
3 D 150 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead if defining all the DataFrames first, I'd define the first, show the examples that don't need the others, and define the others immediately before they are used.
For the naming, I'd name df
the first, other
the second used to be compared. And something like df_multiindex
the last one. I think that will make things easier for the users.
I also think that would make things easier for the user having smaller DataFrames (like 2x2). These big True
/False
DataFrame results are not immediate to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood on names. As for lengths, df1 and df2 have different number of rows and columns to show results of across different lengths. And it is just one more row and one more column. Surely not that big. I can add a note to mention results of different lengths.
pandas/core/ops.py
Outdated
0 True True False | ||
1 True True False | ||
2 True True False | ||
>>> df1.ne([100, 250, 300], axis=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer index
and columns
for the axis
values, it's more explicit and less confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, I think there are few things in the examples that can be more clear, but it's looking great.
pandas/core/ops.py
Outdated
Any single or multiple element data structure, or list-like object. | ||
axis : int or str, optional | ||
Axis to target. Can be either the axis name ('index', 'rows', | ||
'columns') or number (0, 1). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent, I'd document the axis as we usually do. See this docstring: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
pandas/core/ops.py
Outdated
company revenue | ||
0 False False | ||
1 False False | ||
2 False True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious to me what we're showing here.
pandas/core/ops.py
Outdated
0 False False False | ||
1 False False False | ||
2 False False True | ||
3 False False False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use axis='index'
and columns
instead of 0 and 1. Or may be even better add in the text explaining the example that the parameter axis
is not used when comparing two DataFrame
objects.
pandas/core/ops.py
Outdated
company | ||
A True True | ||
B True True | ||
C True True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we're comparing two DataFrame
objects. This is more useful to show how .loc
works with a multiindex than how operators behave. I'd get rid of it.
B True False | ||
C True False | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to explain what's going on here. And I'd probably set company
as the index when creating the DataFrame
, to avoid adding extra complexity here. I don't think it's a problem for other examples.
The blank line after the docstring is not required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add this explanation? So users can understand in an easier way what we are showing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Just few last comments, mainly of failing examples. Please run the validation script to make sure the examples work, after you make these last changes. Cheers!
pandas/core/ops.py
Outdated
'columns' return same results. | ||
|
||
>>> other = pd.DataFrame({{'revenue': [300, 250, 100, 150]}}, | ||
... index = ['A', 'B', 'C', 'D']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we would use index=
instead of index =
. Is that different here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ParfaitG can you fix the index = [...]
spaces (in all the cases). I think it's the only missing thing to merge this.
@datapythonista ... Did you get my note above regarding the need for double curly braces? The entire docstring is being run with |
@ParfaitG I see that the doctest that fails in
Being such a small change, if you want to change it in this PR, instead of skipping the |
@ParfaitG can you please take care of the conflicts. Merging after that, and when it's all green. Sorry about the delay. |
thanks for the update @ParfaitG @jbrockmendel I think you reviewed this before. Do you mind taking another look and merge it if you're happy? |
pandas/core/ops.py
Outdated
|
||
>>> df_multindex = pd.DataFrame({{'cost': [250, 150, 100, 150, 300, 220], | ||
... 'revenue': [100, 250, 300, 200, 175, 225]}}, | ||
... index = [['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index=
?
@datapythonista a handful of questions checking that your comments have been resolved. If those are all OK, then LGTM. |
You're right, didn't see those @ParfaitG I think you should be able to catch those with |
@datapythonista Your suggested |
In PEP-8, this is the valid syntax:
So yes, the spaces around the The call to |
Hmmm...the call Aside, with every docstirng change I run Please advise how to find these PEP-8 syntax issues. Otherwise, I can manually check and hope they pass on your end. |
Sorry, you're right, seems like I created #23154, that is anyway the way I think we should validate it. For now, if you can fix the reported errors, and check manually that nothing else is pep-8 invalid, that would be great. |
Understood. I am noticing some items outside of comparison and arithmetic docstrings that do not adhere to PEP 8 such as the recurring assignment to lambda expressions. According to PEP 8 - Style Guide:
Unfortunately, it is not easy to switch as some if special:
dunderize = lambda x: '__{name}__'.format(name=x.strip('_'))
else:
dunderize = lambda x: x
new_methods = {dunderize(k): v for k, v in new_methods.items()} To be replaced as: def dunderize(x):
if special:
return '__{name}__'.format(name=x.strip('_'))
else:
return x
new_methods = {dunderize(k): v for k, v in new_methods.items()} |
if you check |
@ParfaitG can you merge master? |
@ParfaitG can you merge master and make sure the CI is green, so we can merge. @jbrockmendel can you take another look and see if you're happy to merge this? |
|
||
Examples | ||
-------- | ||
{df_examples} | ||
>>> df = pd.DataFrame({{'angles': [0, 3, 4], | ||
... 'degrees': [360, 180, 360]}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more space here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, my mistake
LGTM |
Thanks @ParfaitG, and sorry it took so long to get this merged. And thanks @jbrockmendel for the review. |
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>
Please include the output of the validation script below between the "```" ticks:
If the validation script still gives errors, but you think there is a good reason to deviate in this case (and there are certainly such cases), please state this explicitly.
This PR follows up on previous attempt: #20291. Please see discussion.
Specifically, comparison methods in ops.py including
.gt()
and.ge()
do not have standard docstring like other methods. They use a generalized wrapper summary line. This pull request simply adds examples by conditionally building docstring with examples. General changes include: adding global string variable, _flex_comp_doc_FRAME andif
logic to the method, _flex_comp_method_FRAME.