-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: update the pandas.Series.str.split docstring #20282
DOC: update the pandas.Series.str.split docstring #20282
Conversation
2f3094e
to
f97aea1
Compare
pandas/core/strings.py
Outdated
@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '): | |||
|
|||
def str_split(arr, pat=None, n=None): | |||
""" | |||
Split strings around given separator/delimiter. | |||
|
|||
Split each string (a la re.split) in the Series/Index by given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can put reference to re
module in backticks, so `re.split`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's already in the extended summary.
Split each string (a la re.split) in the Series/Index by given pattern, propagating NA values. Equivalent to :meth:`str.split`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I just mean actually surrounding the reference with backticks so it renders as inline code. So simply change any occurrence of re.split to `re.split`
pandas/core/strings.py
Outdated
Split each string (a la re.split) in the Series/Index by given | ||
pattern, propagating NA values. Equivalent to :meth:`str.split`. | ||
|
||
Parameters | ||
---------- | ||
pat : string, default None | ||
String or regular expression to split on. If None, splits on whitespace | ||
String or regular expression to split on. If None, split on whitespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`None`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd small feedback: it's either no backticks or double backticks for small code snippets, single backticks are for parameter names.
The question for such short things like None if we see them as 'code', but I think typically we do if it references a value that is passed.
(we should probably make a better overview of this in the docstring guidelines)
pandas/core/strings.py
Outdated
n : int, default -1 (all) | ||
None, 0 and -1 will be interpreted as return all splits | ||
* None, 0 and -1 will be interpreted as return all splits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't need periods at the end of bullet points
pandas/core/strings.py
Outdated
n : int, default -1 (all) | ||
None, 0 and -1 will be interpreted as return all splits | ||
* None, 0 and -1 will be interpreted as return all splits. | ||
* Vary output dimensionality if `expand` is True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`True`
pandas/core/strings.py
Outdated
@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '): | |||
|
|||
def str_split(arr, pat=None, n=None): | |||
""" | |||
Split strings around given separator/delimiter. | |||
|
|||
Split each string (a la re.split) in the Series/Index by given | |||
pattern, propagating NA values. Equivalent to :meth:`str.split`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
`NaN` instead of NA would be better
pandas/core/strings.py
Outdated
Returns | ||
------- | ||
split : Series/Index or DataFrame/MultiIndex of objects | ||
|
||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The examples are good but I think the section can use a short sentence or two to introduce the examples and clue the reader in on what they should be looking at. Also, is it possible to show an example that deals with missing values?
pandas/core/strings.py
Outdated
@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '): | |||
|
|||
def str_split(arr, pat=None, n=None): | |||
""" | |||
Split strings around given separator/delimiter. | |||
|
|||
Split each string (a la re.split) in the Series/Index by given | |||
pattern, propagating NA values. Equivalent to :meth:`str.split`. | |||
|
|||
Parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment - rather than putting sub-bullets under the parameters would it be clearer to move those into a dedicated Notes section? They do explain some of the implementation details so wonder if they are better served there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be better. I'll update it.
pandas/core/strings.py
Outdated
* If False, return Series/Index. | ||
|
||
return_type : deprecated, use `expand` | ||
|
||
Returns | ||
------- | ||
split : Series/Index or DataFrame/MultiIndex of objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this return type rather confusing - how does the Index play a part? I think this could be clarified better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was confused about that as well. Should I keep it in the doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be Series or Index
with a description that says something like "Return type matches caller unless expand=True
(always returns a DataFrame
)"
pandas/core/strings.py
Outdated
@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '): | |||
|
|||
def str_split(arr, pat=None, n=None): | |||
""" | |||
Split strings around given separator/delimiter. | |||
|
|||
Split each string (a la re.split) in the Series/Index by given | |||
pattern, propagating NA values. Equivalent to :meth:`str.split`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a reference to this in a See Also section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reference to re.split?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry you can ignore this. I was thinking of linking to str.split
since you mention it as equivalent here, but it's the same exact docstring as noted in other comments
pandas/core/strings.py
Outdated
* None, 0 and -1 will be interpreted as return all splits. | ||
* Vary output dimensionality if `expand` is True: | ||
- If n >= default splits, return all splits. | ||
- If n < default splits, makes first n splits only. | ||
expand : bool, default False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this in more detail, I see why you have this documented here in spite of the fact that str_split
doesn't actually accept this parameter (it's docstring is copied to split
). Ref some of the work in #10085
@jreback is there any reason why we wouldn't want to deprecate items like str_split
and move to private internal methods like _str_split
that the front-end facing API uses instead? If so may be a logical follow up to this docstring update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd the str_split
methods are not considered public, it's just how it is implemented right now. I agree this one is a bit confusing as some keyword are used in str_split
and some only in the method, but generally in the file, the docstrings are located at the functions (so str_split
), although they are written as if they were documenting the method.
Feel free to open an issue if you have a proposal how this could be improved.
pandas/core/strings.py
Outdated
Examples | ||
-------- | ||
>>> s = pd.Series(["this is good text", "but this is even better"]) | ||
>>> s.str.split() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref my other comment - kind of strange that the method here is actually str_split
but we are showing how to use str.split
. While technically the same, from an API perspective I feel like this docstring belongs under the split
method and we should make this method private internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same is the case for str.rsplit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep thanks - might be a few other instances in the module as well. I don't want it to hold up what you've done here but could be a good follow up for you to clean up the actual functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I work on it in this same pr, or should I make a separate issue for this and work on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate issue
pandas/core/strings.py
Outdated
* If False, return Series/Index. | ||
|
||
return_type : deprecated, use `expand` | ||
|
||
Returns | ||
------- | ||
split : Series/Index or DataFrame/MultiIndex of objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be Series or Index
with a description that says something like "Return type matches caller unless expand=True
(always returns a DataFrame
)"
f97aea1
to
afa6c42
Compare
1c79389
to
92f43d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does everything seem fine now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still a few minor things but it's getting there
pandas/core/strings.py
Outdated
Split strings around given separator/delimiter. | ||
|
||
Split each string in the Series/Index by given | ||
pattern, propagating NaN values. Equivalent to :meth:`str.split`. | ||
|
||
Parameters | ||
---------- | ||
pat : string, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use str
instead of string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and ", default None" -> ", optional"
pandas/core/strings.py
Outdated
|
||
Parameters | ||
---------- | ||
pat : string, default None | ||
String or regular expression to split on. If None, splits on whitespace | ||
String or regular expression to split on.\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm does this render the backslash in the output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh OK good to know. That said, I don't think the backslash is required or part of the standard (see the **kwargs example in first instance of class Series
in the sprint documentation https://python-sprints.github.io/pandas/guide/pandas_docstring.html)
Should be fine just to place on separate line and ensure proper indentation
pandas/core/strings.py
Outdated
- If n < default splits, makes first n splits only | ||
- Appends `None` for padding. | ||
|
||
Examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might have missed mentioning this but while I think the examples are good you should add a sentence (or a few) to call out what users should be looking at with the examples. It would also be nice to show one or two with missing data to illustrate how NaN
gets propagated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like this?
>>> s.str.split("is", n=1, expand=True)
0 1
0 th is good text
1 but th is even better
see notes about expand=True
Also, I wrote an example to show NaN propagation but None is being propagated instead of NaN.
>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(expand=True)
0 1 2 3 4
0 this is good text None
1 but this is even better
2 NaN None None None None
Shouldn't output be?
0 1 2 3 4
0 this is good text None
1 but this is even better
2 NaN NaN NaN NaN NaN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put your comments before the example and just highlight what the user should look at. So for your first example say something like "By default, split will return an object of the same size containing lists to hold the split elements" and then introduce the second with something like "By contrast, when using expand=True
the split elements will expand out into separate columns." Doesn't need to be exactly those words but something along those lines - make sense?
As far as your example is concerned, make sure you run everything on the master branch. My guess is you are using an older version of pandas as the fix to propagate NaN
was released in v0.21.1 (see #18462)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, right.
thanks!
pandas/core/strings.py
Outdated
pattern, propagating NA values. Equivalent to :meth:`str.split`. | ||
Split strings around given separator/delimiter. | ||
|
||
Split each string in the Series/Index by given |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally writing Series/Index
and DataFrame/MultiIndex
is not very clear. I'd suggest saying "in the caller's values"
92f43d4
to
53713a6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvements to the docstring! Added some more comments
pandas/core/strings.py
Outdated
None, 0 and -1 will be interpreted as return all splits | ||
Vary dimensionality of output. | ||
|
||
* `None`, 0 and -1 will be interpreted as return all splits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not put this in a bullet point, as this breaks the flow (i.e. need to have a blank line for the list to render..). Lists with one bullet point are a bit strange anyhow
pandas/core/strings.py
Outdated
expand : bool, default False | ||
* If True, return DataFrame/MultiIndex expanding dimensionality. | ||
* If False, return Series/Index. | ||
Expand the split strings into separate columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split -> splitted
pandas/core/strings.py
Outdated
|
||
return_type : deprecated, use `expand` | ||
* If `True`, return DataFrame/MultiIndex expanding dimensionality. | ||
* If `False`, return Series/Index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add "containing lists of strings"
pandas/core/strings.py
Outdated
|
||
Returns | ||
------- | ||
Type matches caller unless `expand=True` (return type is `DataFrame`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return type for expand=True can also be MultiIndex I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the wording around Index
and MultiIndex
is confusing here - how does this return a MultiIndex
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry @WillAyd, merged maybe a bit too quick
When you use this on an index, and do expand=True
, you get a MultiIndex (similar as if Series -> DataFrame).
Maybe it would also be good to add an example of this.
@mananpal1997 Feel free to open a new PR to do this small follow-up change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK makes sense. Yes I just find the text like "Series/Index" and "DataFrame/MultiIndex" wording to be rather confusing. Part of me wants to interpret those as being bound together, so that a Series is always indexed normally and a DataFrame is indexed with a MultiIndex. Obviously that's not the case, but I think that we could more clearly delineate.
@mananpal1997 an example would certainly help for your next contribution!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd @jorisvandenbossche
Any edit suggestions to this change before I push it?
Parameter ``expand=True`` returns DataFrame and MultiIndex objects for
Series and Index objects respectively.
>>> type(s.str.split(expand=True))
<class 'pandas.core.frame.DataFrame'>
>>> i = pd.Index(["ba 100 001", "ba 101 002", "ba 102 003"])
>>> i.str.split(expand=True)
MultiIndex(levels=[['ba'], ['100', '101', '102'], ['001', '002', '003']],
labels=[[0, 0, 0], [0, 1, 2], [0, 1, 2]])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can ping us both in the new PR, so let's further discss there.
But I think I would just show the actual result, instead of the types, it will be clear from the result what the different types are
pandas/core/strings.py
Outdated
Split strings around given separator/delimiter. | ||
|
||
Split each string in the Series/Index by given | ||
pattern, propagating NaN values. Equivalent to :meth:`str.split`. | ||
|
||
Parameters | ||
---------- | ||
pat : string, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and ", default None" -> ", optional"
pandas/core/strings.py
Outdated
1 but this even better | ||
|
||
Parameter `n` can be used to limit the number of columns in | ||
expansion of output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say here to "limit the number of splits". Of course that directly maps to the number of columns, but for n=1
you actually have two columns, which might be confusing
pandas/core/strings.py
Outdated
Notes | ||
----- | ||
If `expand` parameter is `True` and: | ||
- If n >= default splits, makes all splits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is also true when expand=False ? (you just get less elements in the list)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad. I'll correct that
53713a6
to
3ab9f24
Compare
Codecov Report
@@ Coverage Diff @@
## master #20282 +/- ##
=========================================
Coverage ? 91.73%
=========================================
Files ? 150
Lines ? 49168
Branches ? 0
=========================================
Hits ? 45102
Misses ? 4066
Partials ? 0
Continue to review full report at Codecov.
|
3ab9f24
to
0815c43
Compare
@mananpal1997 general comment for future reference: can you add commits when updating instead of squashing it into one commit? That makes it a bit easier to see what has changed after a review |
pandas/core/strings.py
Outdated
|
||
Returns | ||
------- | ||
Type matches caller unless `expand=True` (return type is `DataFrame` or | ||
`MultiIndex`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this one below the line below?
Like
split : Series/Index or DataFrame/MultiIndex
Type matches caller unless `expand=True` (return type is `DataFrame` or
`MultiIndex`)
@mananpal1997 I added a small commit with some edits regarding the usage of quotes (due to some in-clarities in the guidelines) |
pandas/core/strings.py
Outdated
----- | ||
- If n >= default splits, makes all splits | ||
- If n < default splits, makes first n splits only | ||
- Appends `None` for padding if ``expand=True`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One final comment: I find this list not fully clear. What is 'n' and what is 'default splits' ?
I suppose it details how n
is handles depending on whether the number of found splits is bigger/smaller than the specified value for n
?
Proposal (but not fully sure this is what you meant):
The handling of the `n` keyword depends on the number of found splits:
- If found splits > `n`, make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`, append `None` for padding up to `n` if ``expand=True``
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right!
@jorisvandenbossche wanted to ask about an issue I faced while setting up pandas.
I followed the same steps as mentioned in the environment setup guide. I tried it fresh 2 times and both times, my html doc won't generate throwing error nbsphinx couldn't be loaded
and then I would manually install it with pip install nbsphinx
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, nbsphinx is missing from the dev requirements, they are only included in the optional ones. Will open an issues about that.
Hello @mananpal1997! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 12, 2018 at 15:14 Hours UTC |
b662388
to
da27e5f
Compare
@mananpal1997 Thanks a lot! |
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>