Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: update the pandas.Series.str.split docstring #20282

Conversation

mananpal1997
Copy link
Contributor

@mananpal1997 mananpal1997 commented Mar 11, 2018

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • PR title is "DOC: update the docstring"
  • The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • The html version looks good: python doc/make.py --single <your-function-or-method>
  • It has been proofread on language by another sprint participant
################################################################################
##################### Docstring (pandas.Series.str.split)  #####################
################################################################################

Split strings around given separator/delimiter.

Split each str in the caller's values by given
pattern, propagating NaN values. Equivalent to :meth:`str.split`.

Parameters
----------
pat : string, default None
    String or regular expression to split on.
    If `None`, split on whitespace.
n : int, default -1 (all)
    Vary dimensionality of output.

    * `None`, 0 and -1 will be interpreted as return all splits
expand : bool, default False
    Expand the split strings into separate columns.

    * If `True`, return DataFrame/MultiIndex expanding dimensionality.
    * If `False`, return Series/Index.

Returns
-------
Type matches caller unless `expand=True` (return type is `DataFrame`)
split : Series/Index or DataFrame/MultiIndex of objects

Notes
-----
If `expand` parameter is `True` and:
  - If n >= default splits, makes all splits
  - If n < default splits, makes first n splits only
  - Appends `None` for padding.

Examples
--------
>>> s = pd.Series(["this is good text", "but this is even better"])

By default, split will return an object of the same size
having lists containing the split elements

>>> s.str.split()
0           [this, is, good, text]
1    [but, this, is, even, better]
dtype: object
>>> s.str.split("random")
0          [this is good text]
1    [but this is even better]
dtype: object

When using `expand=True`, the split elements will
expand out into separate columns.

>>> s.str.split(expand=True)
      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
>>> s.str.split(" is ", expand=True)
          0            1
0      this    good text
1  but this  even better

Parameter `n` can be used to limit the number of columns in
expansion of output.

>>> s.str.split("is", n=1, expand=True)
        0                1
0      th     is good text
1  but th   is even better

If NaN is present, it is propagated throughout the columns
during the split.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(n=3, expand=True)
      0     1     2            3
0  this    is  good         text
1   but  this    is  even better
2   NaN   NaN   NaN          NaN

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	Errors in parameters section
		Parameter "n" description should finish with "."
	See Also section not found

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 2f3094e to f97aea1 Compare March 11, 2018 15:34
@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

def str_split(arr, pat=None, n=None):
"""
Split strings around given separator/delimiter.

Split each string (a la re.split) in the Series/Index by given
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can put reference to re module in backticks, so `re.split`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already in the extended summary.

Split each string (a la re.split) in the Series/Index by given pattern, propagating NA values. Equivalent to :meth:`str.split`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right I just mean actually surrounding the reference with backticks so it renders as inline code. So simply change any occurrence of re.split to `re.split`

Split each string (a la re.split) in the Series/Index by given
pattern, propagating NA values. Equivalent to :meth:`str.split`.

Parameters
----------
pat : string, default None
String or regular expression to split on. If None, splits on whitespace
String or regular expression to split on. If None, split on whitespace.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`None`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd small feedback: it's either no backticks or double backticks for small code snippets, single backticks are for parameter names.
The question for such short things like None if we see them as 'code', but I think typically we do if it references a value that is passed.

(we should probably make a better overview of this in the docstring guidelines)

n : int, default -1 (all)
None, 0 and -1 will be interpreted as return all splits
* None, 0 and -1 will be interpreted as return all splits.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't need periods at the end of bullet points

n : int, default -1 (all)
None, 0 and -1 will be interpreted as return all splits
* None, 0 and -1 will be interpreted as return all splits.
* Vary output dimensionality if `expand` is True:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`True`

@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

def str_split(arr, pat=None, n=None):
"""
Split strings around given separator/delimiter.

Split each string (a la re.split) in the Series/Index by given
pattern, propagating NA values. Equivalent to :meth:`str.split`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`NaN` instead of NA would be better

Returns
-------
split : Series/Index or DataFrame/MultiIndex of objects

Examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The examples are good but I think the section can use a short sentence or two to introduce the examples and clue the reader in on what they should be looking at. Also, is it possible to show an example that deals with missing values?

@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

def str_split(arr, pat=None, n=None):
"""
Split strings around given separator/delimiter.

Split each string (a la re.split) in the Series/Index by given
pattern, propagating NA values. Equivalent to :meth:`str.split`.

Parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment - rather than putting sub-bullets under the parameters would it be clearer to move those into a dedicated Notes section? They do explain some of the implementation details so wonder if they are better served there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better. I'll update it.

* If False, return Series/Index.

return_type : deprecated, use `expand`

Returns
-------
split : Series/Index or DataFrame/MultiIndex of objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this return type rather confusing - how does the Index play a part? I think this could be clarified better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was confused about that as well. Should I keep it in the doc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be Series or Index with a description that says something like "Return type matches caller unless expand=True (always returns a DataFrame)"

@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

def str_split(arr, pat=None, n=None):
"""
Split strings around given separator/delimiter.

Split each string (a la re.split) in the Series/Index by given
pattern, propagating NA values. Equivalent to :meth:`str.split`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a reference to this in a See Also section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference to re.split?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry you can ignore this. I was thinking of linking to str.split since you mention it as equivalent here, but it's the same exact docstring as noted in other comments

* None, 0 and -1 will be interpreted as return all splits.
* Vary output dimensionality if `expand` is True:
- If n >= default splits, return all splits.
- If n < default splits, makes first n splits only.
expand : bool, default False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this in more detail, I see why you have this documented here in spite of the fact that str_split doesn't actually accept this parameter (it's docstring is copied to split). Ref some of the work in #10085

@jreback is there any reason why we wouldn't want to deprecate items like str_split and move to private internal methods like _str_split that the front-end facing API uses instead? If so may be a logical follow up to this docstring update

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd the str_split methods are not considered public, it's just how it is implemented right now. I agree this one is a bit confusing as some keyword are used in str_split and some only in the method, but generally in the file, the docstrings are located at the functions (so str_split), although they are written as if they were documenting the method.

Feel free to open an issue if you have a proposal how this could be improved.

Examples
--------
>>> s = pd.Series(["this is good text", "but this is even better"])
>>> s.str.split()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref my other comment - kind of strange that the method here is actually str_split but we are showing how to use str.split. While technically the same, from an API perspective I feel like this docstring belongs under the split method and we should make this method private internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same is the case for str.rsplit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep thanks - might be a few other instances in the module as well. I don't want it to hold up what you've done here but could be a good follow up for you to clean up the actual functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I work on it in this same pr, or should I make a separate issue for this and work on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate issue

* If False, return Series/Index.

return_type : deprecated, use `expand`

Returns
-------
split : Series/Index or DataFrame/MultiIndex of objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be Series or Index with a description that says something like "Return type matches caller unless expand=True (always returns a DataFrame)"

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from f97aea1 to afa6c42 Compare March 11, 2018 20:19
@mananpal1997 mananpal1997 reopened this Mar 11, 2018
@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 1c79389 to 92f43d4 Compare March 11, 2018 21:08
Copy link
Contributor Author

@mananpal1997 mananpal1997 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does everything seem fine now?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still a few minor things but it's getting there

Split strings around given separator/delimiter.

Split each string in the Series/Index by given
pattern, propagating NaN values. Equivalent to :meth:`str.split`.

Parameters
----------
pat : string, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use str instead of string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and ", default None" -> ", optional"


Parameters
----------
pat : string, default None
String or regular expression to split on. If None, splits on whitespace
String or regular expression to split on.\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm does this render the backslash in the output?

Copy link
Contributor Author

@mananpal1997 mananpal1997 Mar 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No

Copy link
Member

@WillAyd WillAyd Mar 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh OK good to know. That said, I don't think the backslash is required or part of the standard (see the **kwargs example in first instance of class Series in the sprint documentation https://python-sprints.github.io/pandas/guide/pandas_docstring.html)

Should be fine just to place on separate line and ensure proper indentation

- If n < default splits, makes first n splits only
- Appends `None` for padding.

Examples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might have missed mentioning this but while I think the examples are good you should add a sentence (or a few) to call out what users should be looking at with the examples. It would also be nice to show one or two with missing data to illustrate how NaN gets propagated

Copy link
Contributor Author

@mananpal1997 mananpal1997 Mar 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this?

>>> s.str.split("is", n=1, expand=True)
        0                1
0      th     is good text
1  but th   is even better
see notes about expand=True

Also, I wrote an example to show NaN propagation but None is being propagated instead of NaN.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(expand=True)
      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
2   NaN  None  None  None    None

Shouldn't output be?

      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
2   NaN   NaN   NaN   NaN     NaN

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put your comments before the example and just highlight what the user should look at. So for your first example say something like "By default, split will return an object of the same size containing lists to hold the split elements" and then introduce the second with something like "By contrast, when using expand=True the split elements will expand out into separate columns." Doesn't need to be exactly those words but something along those lines - make sense?

As far as your example is concerned, make sure you run everything on the master branch. My guess is you are using an older version of pandas as the fix to propagate NaN was released in v0.21.1 (see #18462)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, right.
thanks!

pattern, propagating NA values. Equivalent to :meth:`str.split`.
Split strings around given separator/delimiter.

Split each string in the Series/Index by given
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally writing Series/Index and DataFrame/MultiIndex is not very clear. I'd suggest saying "in the caller's values"

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 92f43d4 to 53713a6 Compare March 11, 2018 23:05
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements to the docstring! Added some more comments

None, 0 and -1 will be interpreted as return all splits
Vary dimensionality of output.

* `None`, 0 and -1 will be interpreted as return all splits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not put this in a bullet point, as this breaks the flow (i.e. need to have a blank line for the list to render..). Lists with one bullet point are a bit strange anyhow

expand : bool, default False
* If True, return DataFrame/MultiIndex expanding dimensionality.
* If False, return Series/Index.
Expand the split strings into separate columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split -> splitted


return_type : deprecated, use `expand`
* If `True`, return DataFrame/MultiIndex expanding dimensionality.
* If `False`, return Series/Index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add "containing lists of strings"


Returns
-------
Type matches caller unless `expand=True` (return type is `DataFrame`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return type for expand=True can also be MultiIndex I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the wording around Index and MultiIndex is confusing here - how does this return a MultiIndex?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry @WillAyd, merged maybe a bit too quick

When you use this on an index, and do expand=True, you get a MultiIndex (similar as if Series -> DataFrame).

Maybe it would also be good to add an example of this.
@mananpal1997 Feel free to open a new PR to do this small follow-up change.

Copy link
Member

@WillAyd WillAyd Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK makes sense. Yes I just find the text like "Series/Index" and "DataFrame/MultiIndex" wording to be rather confusing. Part of me wants to interpret those as being bound together, so that a Series is always indexed normally and a DataFrame is indexed with a MultiIndex. Obviously that's not the case, but I think that we could more clearly delineate.

@mananpal1997 an example would certainly help for your next contribution!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it 👍

Copy link
Contributor Author

@mananpal1997 mananpal1997 Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd @jorisvandenbossche
Any edit suggestions to this change before I push it?

Parameter ``expand=True`` returns DataFrame and MultiIndex objects for
    Series and Index objects respectively.

    >>> type(s.str.split(expand=True))
    <class 'pandas.core.frame.DataFrame'>
    >>> i = pd.Index(["ba 100 001", "ba 101 002", "ba 102 003"])
    >>> i.str.split(expand=True)
    MultiIndex(levels=[['ba'], ['100', '101', '102'], ['001', '002', '003']],
           labels=[[0, 0, 0], [0, 1, 2], [0, 1, 2]])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can ping us both in the new PR, so let's further discss there.
But I think I would just show the actual result, instead of the types, it will be clear from the result what the different types are

Split strings around given separator/delimiter.

Split each string in the Series/Index by given
pattern, propagating NaN values. Equivalent to :meth:`str.split`.

Parameters
----------
pat : string, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and ", default None" -> ", optional"

1 but this even better

Parameter `n` can be used to limit the number of columns in
expansion of output.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say here to "limit the number of splits". Of course that directly maps to the number of columns, but for n=1 you actually have two columns, which might be confusing

Notes
-----
If `expand` parameter is `True` and:
- If n >= default splits, makes all splits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is also true when expand=False ? (you just get less elements in the list)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bad. I'll correct that

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 53713a6 to 3ab9f24 Compare March 12, 2018 11:56
@codecov
Copy link

codecov bot commented Mar 12, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@74e6c78). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #20282   +/-   ##
=========================================
  Coverage          ?   91.73%           
=========================================
  Files             ?      150           
  Lines             ?    49168           
  Branches          ?        0           
=========================================
  Hits              ?    45102           
  Misses            ?     4066           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.11% <ø> (?)
#single 41.86% <ø> (?)
Impacted Files Coverage Δ
pandas/core/strings.py 98.32% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 74e6c78...da27e5f. Read the comment docs.

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 3ab9f24 to 0815c43 Compare March 12, 2018 12:14
@mananpal1997 mananpal1997 reopened this Mar 12, 2018
@jorisvandenbossche
Copy link
Member

@mananpal1997 general comment for future reference: can you add commits when updating instead of squashing it into one commit? That makes it a bit easier to see what has changed after a review


Returns
-------
Type matches caller unless `expand=True` (return type is `DataFrame` or
`MultiIndex`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this one below the line below?
Like

split : Series/Index or DataFrame/MultiIndex
    Type matches caller unless `expand=True` (return type is `DataFrame` or
    `MultiIndex`)

@jorisvandenbossche
Copy link
Member

@mananpal1997 I added a small commit with some edits regarding the usage of quotes (due to some in-clarities in the guidelines)

-----
- If n >= default splits, makes all splits
- If n < default splits, makes first n splits only
- Appends `None` for padding if ``expand=True``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final comment: I find this list not fully clear. What is 'n' and what is 'default splits' ?

I suppose it details how n is handles depending on whether the number of found splits is bigger/smaller than the specified value for n ?

Proposal (but not fully sure this is what you meant):

The handling of the `n` keyword depends on the number of found splits:

- If found splits > `n`,  make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`, append `None` for padding up to `n` if ``expand=True``
```

Copy link
Contributor Author

@mananpal1997 mananpal1997 Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right!

@jorisvandenbossche wanted to ask about an issue I faced while setting up pandas.
I followed the same steps as mentioned in the environment setup guide. I tried it fresh 2 times and both times, my html doc won't generate throwing error nbsphinx couldn't be loaded and then I would manually install it with pip install nbsphinx.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, nbsphinx is missing from the dev requirements, they are only included in the optional ones. Will open an issues about that.

@pep8speaks
Copy link

pep8speaks commented Mar 12, 2018

Hello @mananpal1997! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 12, 2018 at 15:14 Hours UTC

@mananpal1997 mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from b662388 to da27e5f Compare March 12, 2018 15:14
@jorisvandenbossche jorisvandenbossche merged commit 8f24748 into pandas-dev:master Mar 12, 2018
@jorisvandenbossche
Copy link
Member

@mananpal1997 Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants