DOC: update the pandas.Series.str.split docstring #20282

mananpal1997 · 2018-03-11T15:08:00Z

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

PR title is "DOC: update the docstring"
The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
The html version looks good: python doc/make.py --single <your-function-or-method>
It has been proofread on language by another sprint participant

################################################################################
##################### Docstring (pandas.Series.str.split)  #####################
################################################################################

Split strings around given separator/delimiter.

Split each str in the caller's values by given
pattern, propagating NaN values. Equivalent to :meth:`str.split`.

Parameters
----------
pat : string, default None
    String or regular expression to split on.
    If `None`, split on whitespace.
n : int, default -1 (all)
    Vary dimensionality of output.

    * `None`, 0 and -1 will be interpreted as return all splits
expand : bool, default False
    Expand the split strings into separate columns.

    * If `True`, return DataFrame/MultiIndex expanding dimensionality.
    * If `False`, return Series/Index.

Returns
-------
Type matches caller unless `expand=True` (return type is `DataFrame`)
split : Series/Index or DataFrame/MultiIndex of objects

Notes
-----
If `expand` parameter is `True` and:
  - If n >= default splits, makes all splits
  - If n < default splits, makes first n splits only
  - Appends `None` for padding.

Examples
--------
>>> s = pd.Series(["this is good text", "but this is even better"])

By default, split will return an object of the same size
having lists containing the split elements

>>> s.str.split()
0           [this, is, good, text]
1    [but, this, is, even, better]
dtype: object
>>> s.str.split("random")
0          [this is good text]
1    [but this is even better]
dtype: object

When using `expand=True`, the split elements will
expand out into separate columns.

>>> s.str.split(expand=True)
      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
>>> s.str.split(" is ", expand=True)
          0            1
0      this    good text
1  but this  even better

Parameter `n` can be used to limit the number of columns in
expansion of output.

>>> s.str.split("is", n=1, expand=True)
        0                1
0      th     is good text
1  but th   is even better

If NaN is present, it is propagated throughout the columns
during the split.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(n=3, expand=True)
      0     1     2            3
0  this    is  good         text
1   but  this    is  even better
2   NaN   NaN   NaN          NaN

################################################################################
################################## Validation ##################################
################################################################################

Errors found:
	Errors in parameters section
		Parameter "n" description should finish with "."
	See Also section not found

WillAyd · 2018-03-11T18:31:04Z

pandas/core/strings.py

@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

 def str_split(arr, pat=None, n=None):
    """
+    Split strings around given separator/delimiter.
+
    Split each string (a la re.split) in the Series/Index by given


Can put reference to re module in backticks, so `re.split`

It's already in the extended summary.

Split each string (a la re.split) in the Series/Index by given pattern, propagating NA values. Equivalent to :meth:`str.split`.

Right I just mean actually surrounding the reference with backticks so it renders as inline code. So simply change any occurrence of re.split to `re.split`

WillAyd · 2018-03-11T18:31:33Z

pandas/core/strings.py

    Split each string (a la re.split) in the Series/Index by given
    pattern, propagating NA values. Equivalent to :meth:`str.split`.

    Parameters
    ----------
    pat : string, default None
-        String or regular expression to split on. If None, splits on whitespace
+        String or regular expression to split on. If None, split on whitespace.


@WillAyd small feedback: use either no backticks or double backticks for small code snippets; single backticks are for parameter names.
For something as short as None, the question is whether we see it as 'code', but I think we typically do when it refers to a value that is passed.

(we should probably make a better overview of this in the docstring guidelines)

WillAyd · 2018-03-11T18:33:50Z

pandas/core/strings.py

    n : int, default -1 (all)
-        None, 0 and -1 will be interpreted as return all splits
+        * None, 0 and -1 will be interpreted as return all splits.


Shouldn't need periods at the end of bullet points

WillAyd · 2018-03-11T18:33:58Z

pandas/core/strings.py

    n : int, default -1 (all)
-        None, 0 and -1 will be interpreted as return all splits
+        * None, 0 and -1 will be interpreted as return all splits.
+        * Vary output dimensionality if `expand` is True:


WillAyd · 2018-03-11T18:36:31Z

pandas/core/strings.py

@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

 def str_split(arr, pat=None, n=None):
    """
+    Split strings around given separator/delimiter.
+
    Split each string (a la re.split) in the Series/Index by given
    pattern, propagating NA values. Equivalent to :meth:`str.split`.


`NaN` instead of NA would be better

WillAyd · 2018-03-11T18:37:19Z

pandas/core/strings.py

    Returns
    -------
    split : Series/Index or DataFrame/MultiIndex of objects
+
+    Examples


The examples are good but I think the section can use a short sentence or two to introduce the examples and clue the reader in on what they should be looking at. Also, is it possible to show an example that deals with missing values?

WillAyd · 2018-03-11T18:39:48Z

pandas/core/strings.py

@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

 def str_split(arr, pat=None, n=None):
    """
+    Split strings around given separator/delimiter.
+
    Split each string (a la re.split) in the Series/Index by given
    pattern, propagating NA values. Equivalent to :meth:`str.split`.

    Parameters


General comment - rather than putting sub-bullets under the parameters, would it be clearer to move those into a dedicated Notes section? They do explain some of the implementation details, so I wonder whether they would be better served there.

That would be better. I'll update it.

WillAyd · 2018-03-11T18:40:45Z

pandas/core/strings.py

        * If False, return Series/Index.

-    return_type : deprecated, use `expand`
-
    Returns
    -------
    split : Series/Index or DataFrame/MultiIndex of objects


I find this return type rather confusing - how does the Index play a part? I think this could be clarified better

Yes, I was confused about that as well. Should I keep it in the doc?

For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be Series or Index with a description that says something like "Return type matches caller unless expand=True (always returns a DataFrame)"

WillAyd · 2018-03-11T20:02:26Z

pandas/core/strings.py

@@ -1095,24 +1095,48 @@ def str_pad(arr, width, side='left', fillchar=' '):

 def str_split(arr, pat=None, n=None):
    """
+    Split strings around given separator/delimiter.
+
    Split each string (a la re.split) in the Series/Index by given
    pattern, propagating NA values. Equivalent to :meth:`str.split`.


Add a reference to this in a See Also section

reference to re.split?

Sorry you can ignore this. I was thinking of linking to str.split since you mention it as equivalent here, but it's the same exact docstring as noted in other comments

WillAyd · 2018-03-11T20:12:56Z

pandas/core/strings.py

+        * None, 0 and -1 will be interpreted as return all splits.
+        * Vary output dimensionality if `expand` is True:
+            - If n >= default splits, return all splits.
+            - If n < default splits, makes first n splits only.
    expand : bool, default False


Looking at this in more detail, I see why you have this documented here in spite of the fact that str_split doesn't actually accept this parameter (its docstring is copied to split). Ref some of the work in #10085

@jreback is there any reason why we wouldn't want to deprecate items like str_split and move to private internal methods like _str_split that the front-end facing API uses instead? If so may be a logical follow up to this docstring update

@WillAyd the str_split methods are not considered public, it's just how it is implemented right now. I agree this one is a bit confusing, as some keywords are used in str_split and some only in the method, but generally in the file the docstrings are located at the functions (so str_split), although they are written as if they were documenting the method.

Feel free to open an issue if you have a proposal how this could be improved.

WillAyd · 2018-03-11T20:14:43Z

pandas/core/strings.py

+    Examples
+    --------
+    >>> s = pd.Series(["this is good text", "but this is even better"])
+    >>> s.str.split()


xref my other comment - kind of strange that the method here is actually str_split but we are showing how to use str.split. While technically the same, from an API perspective I feel like this docstring belongs under the split method and we should make this method private internal.

same is the case for str.rsplit

Yep thanks - might be a few other instances in the module as well. I don't want it to hold up what you've done here but could be a good follow up for you to clean up the actual functions

Should I work on it in this same pr, or should I make a separate issue for this and work on that?

Separate issue

WillAyd · 2018-03-11T20:14:52Z

pandas/core/strings.py

        * If False, return Series/Index.

-    return_type : deprecated, use `expand`
-
    Returns
    -------
    split : Series/Index or DataFrame/MultiIndex of objects


For returns make sure the first line is just the type (unless returning more than one, which isn't the case here). I think that would best be Series or Index with a description that says something like "Return type matches caller unless expand=True (always returns a DataFrame)"

mananpal1997

Does everything seem fine now?

WillAyd

Still a few minor things but it's getting there

WillAyd · 2018-03-11T21:12:44Z

pandas/core/strings.py

+    Split strings around given separator/delimiter.
+
+    Split each string in the Series/Index by given
+    pattern, propagating NaN values. Equivalent to :meth:`str.split`.

    Parameters
    ----------
    pat : string, default None


Use str instead of string

and ", default None" -> ", optional"

WillAyd · 2018-03-11T21:13:39Z

pandas/core/strings.py


    Parameters
    ----------
    pat : string, default None
-        String or regular expression to split on. If None, splits on whitespace
+        String or regular expression to split on.\


Hmm does this render the backslash in the output?

Oh OK good to know. That said, I don't think the backslash is required or part of the standard (see the **kwargs example in first instance of class Series in the sprint documentation https://python-sprints.github.io/pandas/guide/pandas_docstring.html)

Should be fine just to place on separate line and ensure proper indentation

WillAyd · 2018-03-11T21:15:49Z

pandas/core/strings.py

+      - If n < default splits, makes first n splits only
+      - Appends `None` for padding.
+
+    Examples


Might have missed mentioning this but while I think the examples are good you should add a sentence (or a few) to call out what users should be looking at with the examples. It would also be nice to show one or two with missing data to illustrate how NaN gets propagated

Like this?

>>> s.str.split("is", n=1, expand=True)
        0                1
0      th     is good text
1  but th   is even better

see notes about expand=True

Also, I wrote an example to show NaN propagation but None is being propagated instead of NaN.

>>> s = pd.Series(["this is good text", "but this is even better", np.nan])
>>> s.str.split(expand=True)
      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
2   NaN  None  None  None    None

Shouldn't output be?

      0     1     2     3       4
0  this    is  good  text    None
1   but  this    is  even  better
2   NaN   NaN   NaN   NaN     NaN

Put your comments before the example and just highlight what the user should look at. So for your first example say something like "By default, split will return an object of the same size containing lists to hold the split elements" and then introduce the second with something like "By contrast, when using expand=True the split elements will expand out into separate columns." Doesn't need to be exactly those words but something along those lines - make sense?

As far as your example is concerned, make sure you run everything on the master branch. My guess is you are using an older version of pandas as the fix to propagate NaN was released in v0.21.1 (see #18462)

ah, right.
thanks!
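
The propagation discussed in this thread can be checked with a short script. This is a minimal sketch, assuming a pandas release that includes the fix referenced above (v0.21.1 or later). It reuses the Series from the docstring example and tests cells with `isna()` rather than relying on the printed marker, since that marker may differ between versions.

```python
import numpy as np
import pandas as pd

# Same data as the docstring example, with a missing value appended.
s = pd.Series(["this is good text", "but this is even better", np.nan])

expanded = s.str.split(expand=True)

# The missing row should be missing in every column of the expanded frame.
nan_row_all_missing = bool(expanded.iloc[2].isna().all())

# Rows with fewer pieces than the widest row are padded on the right.
padded_cell_missing = bool(pd.isna(expanded.iloc[0, 4]))

print(expanded)
print(nan_row_all_missing, padded_cell_missing)
```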

WillAyd · 2018-03-11T21:18:54Z

pandas/core/strings.py

-    pattern, propagating NA values. Equivalent to :meth:`str.split`.
+    Split strings around given separator/delimiter.
+
+    Split each string in the Series/Index by given


I think generally writing Series/Index and DataFrame/MultiIndex is not very clear. I'd suggest saying "in the caller's values"

jorisvandenbossche

Nice improvements to the docstring! Added some more comments

jorisvandenbossche · 2018-03-12T10:52:41Z

pandas/core/strings.py

-        None, 0 and -1 will be interpreted as return all splits
+        Vary dimensionality of output.
+
+        * `None`, 0 and -1 will be interpreted as return all splits


I would not put this in a bullet point, as this breaks the flow (i.e. need to have a blank line for the list to render..). Lists with one bullet point are a bit strange anyhow

jorisvandenbossche · 2018-03-12T10:52:59Z

pandas/core/strings.py

    expand : bool, default False
-        * If True, return DataFrame/MultiIndex expanding dimensionality.
-        * If False, return Series/Index.
+        Expand the split strings into separate columns.


split -> splitted

jorisvandenbossche · 2018-03-12T10:53:24Z

pandas/core/strings.py


-    return_type : deprecated, use `expand`
+        * If `True`, return DataFrame/MultiIndex expanding dimensionality.
+        * If `False`, return Series/Index.


maybe add "containing lists of strings"
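
To illustrate the "containing lists of strings" wording, here is a small sketch using the Series from the docstring example. It only assumes the default `expand=False` behaviour described in this PR.

```python
import pandas as pd

s = pd.Series(["this is good text", "but this is even better"])

# With the default expand=False the result has the same length as the
# caller, and each element holds the pieces of one string.
result = s.str.split()

first = result.iloc[0]
print(type(result).__name__, len(result))
print(first)
```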

jorisvandenbossche · 2018-03-12T10:54:03Z

pandas/core/strings.py


    Returns
    -------
+    Type matches caller unless `expand=True` (return type is `DataFrame`)


return type for expand=True can also be MultiIndex I think

I think the wording around Index and MultiIndex is confusing here - how does this return a MultiIndex?

Ah, sorry @WillAyd, merged maybe a bit too quick

When you use this on an Index with expand=True, you get a MultiIndex (similar to how a Series becomes a DataFrame).

Maybe it would also be good to add an example of this.
@mananpal1997 Feel free to open a new PR to do this small follow-up change.

Ah OK makes sense. Yes I just find the text like "Series/Index" and "DataFrame/MultiIndex" wording to be rather confusing. Part of me wants to interpret those as being bound together, so that a Series is always indexed normally and a DataFrame is indexed with a MultiIndex. Obviously that's not the case, but I think that we could more clearly delineate.

@mananpal1997 an example would certainly help for your next contribution!

I'll add it 👍

@WillAyd @jorisvandenbossche
Any edit suggestions to this change before I push it?

Parameter ``expand=True`` returns DataFrame and MultiIndex objects for Series and Index objects respectively.

>>> type(s.str.split(expand=True))
<class 'pandas.core.frame.DataFrame'>
>>> i = pd.Index(["ba 100 001", "ba 101 002", "ba 102 003"])
>>> i.str.split(expand=True)
MultiIndex(levels=[['ba'], ['100', '101', '102'], ['001', '002', '003']],
           labels=[[0, 0, 0], [0, 1, 2], [0, 1, 2]])

You can ping us both in the new PR, so let's discuss further there.
But I think I would just show the actual result instead of the types; it will be clear from the result what the different types are.
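
For reference, the Series versus Index behaviour described in this thread can be reproduced as follows. This is a sketch, not the wording that ended up in the follow-up PR. It reuses the sample strings proposed above and prints the results themselves, so the exact repr will depend on the installed pandas version.

```python
import pandas as pd

values = ["ba 100 001", "ba 101 002", "ba 102 003"]
s = pd.Series(values)
i = pd.Index(values)

frame = s.str.split(expand=True)  # Series -> DataFrame
multi = i.str.split(expand=True)  # Index  -> MultiIndex

print(frame)
print(multi)
```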

jorisvandenbossche · 2018-03-12T10:55:00Z

pandas/core/strings.py

+    Split strings around given separator/delimiter.
+
+    Split each string in the Series/Index by given
+    pattern, propagating NaN values. Equivalent to :meth:`str.split`.

    Parameters
    ----------
    pat : string, default None


and ", default None" -> ", optional"

jorisvandenbossche · 2018-03-12T10:56:27Z

pandas/core/strings.py

+    1  but this  even better
+
+    Parameter `n` can be used to limit the number of columns in
+    expansion of output.


I would say here to "limit the number of splits". Of course that directly maps to the number of columns, but for n=1 you actually have two columns, which might be confusing
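
A quick check of the point about `n=1` giving two columns, using the call from the docstring example (a minimal sketch):

```python
import pandas as pd

s = pd.Series(["this is good text", "but this is even better"])

# n=1 asks for at most one split per string, which yields up to two pieces.
limited = s.str.split("is", n=1, expand=True)

print(limited)
print(limited.shape[1])
```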

jorisvandenbossche · 2018-03-12T10:57:42Z

pandas/core/strings.py

+    Notes
+    -----
+    If `expand` parameter is `True` and:
+      - If n >= default splits, makes all splits


I think this is also true when expand=False? (you just get fewer elements in the list)

my bad. I'll correct that
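
The same limit can be seen without `expand`. A small sketch, again using the docstring's Series and counting the pieces in each resulting list:

```python
import pandas as pd

s = pd.Series(["this is good text", "but this is even better"])

# expand is left at its default (False): the limit still applies,
# each list is simply shorter.
all_pieces = s.str.split()
two_pieces = s.str.split(n=1)

print(all_pieces.map(len).tolist())
print(two_pieces.map(len).tolist())
```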

codecov · 2018-03-12T11:57:00Z

Codecov Report

❗ No coverage uploaded for pull request base (master@74e6c78).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #20282   +/-   ##
=========================================
  Coverage          ?   91.73%           
=========================================
  Files             ?      150           
  Lines             ?    49168           
  Branches          ?        0           
=========================================
  Hits              ?    45102           
  Misses            ?     4066           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.11% <ø> (?)`
#single	`41.86% <ø> (?)`

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.32% <ø> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 74e6c78...da27e5f.

jorisvandenbossche · 2018-03-12T12:28:31Z

@mananpal1997 general comment for future reference: can you add commits when updating instead of squashing it into one commit? That makes it a bit easier to see what has changed after a review

jorisvandenbossche · 2018-03-12T12:51:16Z

pandas/core/strings.py


    Returns
    -------
+    Type matches caller unless `expand=True` (return type is `DataFrame` or
+    `MultiIndex`)


Can you put this one below the line below?
Like

split : Series/Index or DataFrame/MultiIndex
    Type matches caller unless `expand=True` (return type is `DataFrame` or
    `MultiIndex`)

jorisvandenbossche · 2018-03-12T14:51:03Z

@mananpal1997 I added a small commit with some edits regarding the usage of quotes (due to some lack of clarity in the guidelines)

jorisvandenbossche · 2018-03-12T14:55:13Z

pandas/core/strings.py

+    -----
+    - If n >= default splits, makes all splits
+    - If n < default splits, makes first n splits only
+    - Appends `None` for padding if ``expand=True``


One final comment: I find this list not fully clear. What is 'n' and what is 'default splits' ?

I suppose it details how n is handled depending on whether the number of found splits is bigger/smaller than the specified value for n?

Proposal (but not fully sure this is what you meant):

The handling of the `n` keyword depends on the number of found splits:

- If found splits > `n`, make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`,
  append `None` for padding up to `n` if ``expand=True``

right!
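
The padding case in the proposal above can be tried on a tiny Series. This is only a sketch with made-up sample strings. In this run the expanded frame is as wide as the row with the most pieces rather than `n` itself, so it is worth checking the output against the installed pandas version before relying on the "up to `n`" wording.

```python
import pandas as pd

# Hypothetical sample data: rows with different numbers of separators.
s = pd.Series(["a b c", "a b"])

# n is larger than the number of separators found in either row,
# so every possible split is made.
wide = s.str.split(n=5, expand=True)

# The shorter row is padded on the right with a missing value.
print(wide)
print(wide.shape)
```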

@jorisvandenbossche I wanted to ask about an issue I faced while setting up pandas.
I followed the steps in the environment setup guide. I tried it from scratch twice, and both times the HTML docs failed to build with an error that nbsphinx couldn't be loaded, so I had to install it manually with pip install nbsphinx.

Yes, nbsphinx is missing from the dev requirements; it is only included in the optional ones. Will open an issue about that.

pep8speaks · 2018-03-12T15:07:43Z

Hello @mananpal1997! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 12, 2018 at 15:14 Hours UTC

jorisvandenbossche · 2018-03-12T15:30:39Z

@mananpal1997 Thanks a lot!

mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 2f3094e to f97aea1 Compare March 11, 2018 15:34

WillAyd requested changes Mar 11, 2018


mananpal1997 closed this Mar 11, 2018

mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from f97aea1 to afa6c42 Compare March 11, 2018 20:19

mananpal1997 reopened this Mar 11, 2018

mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 1c79389 to 92f43d4 Compare March 11, 2018 21:08

mananpal1997 commented Mar 11, 2018


WillAyd requested changes Mar 11, 2018


mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 92f43d4 to 53713a6 Compare March 11, 2018 23:05

jorisvandenbossche added the Docs label Mar 12, 2018

jorisvandenbossche reviewed Mar 12, 2018


mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 53713a6 to 3ab9f24 Compare March 12, 2018 11:56

mananpal1997 closed this Mar 12, 2018

mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from 3ab9f24 to 0815c43 Compare March 12, 2018 12:14

updated doc for pandas.Series.str.split() method

9126c82

mananpal1997 reopened this Mar 12, 2018

jorisvandenbossche reviewed Mar 12, 2018


mananpal1997 and others added 2 commits March 12, 2018 18:41

updated doc for pandas.Series.str.split() method

2e13424

update backticks

0a1da96

jorisvandenbossche reviewed Mar 12, 2018


updated docstring for pandas.Series.str.split() method

da27e5f

mananpal1997 force-pushed the docstring_pandas.Series.str.split branch from b662388 to da27e5f Compare March 12, 2018 15:14

jorisvandenbossche approved these changes Mar 12, 2018


jorisvandenbossche merged commit 8f24748 into pandas-dev:master Mar 12, 2018

jorisvandenbossche added this to the 0.23.0 milestone Mar 12, 2018

mananpal1997 mentioned this pull request Mar 12, 2018

DOC: update the pandas.Series.str.split docstring #20307

Merged

4 tasks
