Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: update str.cat example #23723

Merged
merged 11 commits into from
Jan 4, 2019
Merged

Conversation

h-vetinari
Copy link
Contributor

There have been a couple of changes recently due to the whole docstring validation effort (e.g. #22838), which got rid of some warnings emitted by the sample code.

As it stands, the code samples don't really reflect the running text anymore, and with the changes from #22264, it makes sense to update the running text as well.

@h-vetinari
Copy link
Contributor Author

Just realised while double-checking the code locally that there's still a FutureWarning, which did not get caught by any test after the pd.concat(sort=...)-transition, because it was being shadowed by the FutureWarning for join is None.

In other words, the doc build here will have a warning (=fail?), until #23725 is merged.

@h-vetinari
Copy link
Contributor Author

Ok, the build didn't fail - it's just a flaky hypothesis failure.

@gfyoung gfyoung added Docs Strings String extension data type and string data labels Nov 16, 2018
u
s.str.cat([u.values,
u.index.astype(str).values], na_rep='-')
s2 = s.set_axis(['a', 'b', 'c', 'd'], inplace=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not obvious at all, just directly construct the Series will be much more clear

u.index.astype(str).values], na_rep='-')
s2 = s.set_axis(['a', 'b', 'c', 'd'], inplace=False)
s2
u2 = u.set_axis(['b', 'd', 'a', 'c'], inplace=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here


.. ipython:: python

v
s.str.cat([u, v], join='outer', na_rep='-')
s.str.cat([v, u, u.values], join='outer', na_rep='-')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of accessing .values would also be preferable to use the container here, assuming you build one to instantiate a Series on next update

@codecov
Copy link

codecov bot commented Nov 18, 2018

Codecov Report

Merging #23723 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23723   +/-   ##
=======================================
  Coverage   92.38%   92.38%           
=======================================
  Files         166      166           
  Lines       52490    52490           
=======================================
  Hits        48493    48493           
  Misses       3997     3997
Flag Coverage Δ
#multiple 90.81% <ø> (ø) ⬆️
#single 43.05% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6249355...3f8d4fd. Read the comment docs.

u
s.str.cat([u.values,
u.index.astype(str).values], na_rep='-')
s_values = np.array(['a', 'b', 'c', 'd'], dtype=object) # same as s.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove the comments here. Rather harmless with the example provided but I don't know if that comment will hold universally with all types (thinking EAs in particular) so don't want to give users an impression of that without context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I though this was clearly referencing the actual series s (as I wanted to motivate the variable name), rather than make a general statement about the relationship between np.array and .values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree, the comment is just confusing

@@ -306,21 +306,26 @@ The same alignment can be used when ``others`` is a ``DataFrame``:
Concatenating a Series and many objects into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All one-dimensional list-likes can be combined in a list-like container (including iterators, ``dict``-views, etc.):
Several items can be combined a list-like container (including iterators, ``dict``-views, etc.), which may contain ``Series``, ``Index`` and ``np.ndarray``.
Note that ``Index`` will align as well, so we change the indexes of ``s`` and ``u`` to strings for the purpose of this example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what this last statement about Index actually means. can you reword.

u
s.str.cat([u.values,
u.index.astype(str).values], na_rep='-')
s_values = np.array(['a', 'b', 'c', 'd'], dtype=object) # same as s.values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree, the comment is just confusing

@jreback
Copy link
Contributor

jreback commented Nov 20, 2018

this example still is a bit complicated. can you simplify at all.

@h-vetinari
Copy link
Contributor Author

I don't like the example too much either, especially initialising the ndarray directly. After all, what I want to do is change the index of s to be the same as its values (due to alignment, as I note in the running test). For this, it might make sense to wait for #22225, because then it would be very easy:

>>> s2 = s.set_index(s.values)
>>> s2
a    a
b    b
c    c
d    d
dtype: object
>>> u2 = u.set_index(u.values)
>>> u2
b    b
d    d
a    a
c    c
dtype: object
>>> idx = pd.Index(['d', 'c', 'b', 'a'])
>>> idx
Index(['d', 'c', 'b', 'a'], dtype='object')
>>> u_values = u.values
>>> u_values
array(['b', 'd', 'a', 'c'], dtype=object)
>>> s2.str.cat([u2, idx, u_values], join='left')
a    aaab
b    bbbd
c    ccca
d    dddc
dtype: object

@jreback
Copy link
Contributor

jreback commented Dec 9, 2018

can you merge master and will look

@h-vetinari
Copy link
Contributor Author

@jreback
Could you have a look at #24046? #22225 is essentially ready and would make things much easier here.

I'll merge in master later.

u.index.astype(str).values], na_rep='-')
s_values = np.array(['a', 'b', 'c', 'd'], dtype=object)
s2 = pd.Series(s_values, index=s_values)
s2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this in multiple blocks as its too much to complicated here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whole example needs to be simpler. maybe just leave the index as integers to avoid confusion, IOW focus less on the join in str.cat and more on the list-lke things that are going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing this example tries to show is that it works for Series, ndarray, and Index, and Index can only be aligned with a non-integer index. This whole example would be so so much easier with #22225. I've opened an issue for the DF discussion a while ago, and asked you to comment #24046.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have very limited availability over the holidays. Will consider removing the Index portion completely, but asking you to please take a look at #22225 / #24046.

u.index.astype(str).values], na_rep='-')
s_values = np.array(['a', 'b', 'c', 'd'], dtype=object)
s2 = pd.Series(s_values, index=s_values)
s2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing this example tries to show is that it works for Series, ndarray, and Index, and Index can only be aligned with a non-integer index. This whole example would be so so much easier with #22225. I've opened an issue for the DF discussion a while ago, and asked you to comment #24046.

@h-vetinari
Copy link
Contributor Author

@jreback
Incorporated your feedback (removed Index from example pending progress on #22225, as the extra complexity to align an index without having Series.set_index bloated the example).

It's all green, except for #24564

@h-vetinari
Copy link
Contributor Author

@WillAyd
Your review is outdated, would you care to comment (or at least remove the change request), please?

@@ -303,23 +303,23 @@ The same alignment can be used when ``others`` is a ``DataFrame``:
Concatenating a Series and many objects into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All one-dimensional list-likes can be combined in a list-like container (including iterators, ``dict``-views, etc.):
Several items can be combined a list-like container (including iterators, ``dict``-views, etc.), which may contain ``Series``, ``Index``, ``PandasArray`` and ``np.ndarray``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it was a typo to remove the word in here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks

@@ -303,23 +303,23 @@ The same alignment can be used when ``others`` is a ``DataFrame``:
Concatenating a Series and many objects into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All one-dimensional list-likes can be combined in a list-like container (including iterators, ``dict``-views, etc.):
Several items can be combined a list-like container (including iterators, ``dict``-views, etc.), which may contain ``Series``, ``Index``, ``PandasArray`` and ``np.ndarray``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also some nits around verbage, but I think it would be easier to keep the Series, Index, etc... mentions closer to "Several items"; as is I had to read a few times to truly understand what was meant after the word which

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe even just `Several items (ex: Series, Index, ...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I'm making an explicit list though, as only those types are allowed within list-likes


All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:
All elements without an index (e.g. ``PandasArray`` and ``np.ndarray``) within the passed list-like must match in length to the calling ``Series`` (or ``Index``),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might have missed this but what's the reason for bringing up PandasArray? Not really something the end user would be using directly (at least in current form)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PandasArray is very user-facing:

>>> s = pd.Series(['a', 'b' ,'c', 'd'])
>>> s.array
<PandasArray>
['a', 'b', 'c', 'd']
Length: 4, dtype: object

and the current example was recently changed to use .array instead of .values. I think this should be documented clearly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks. I have been somewhat on the sidelines for that conversation so I'll defer to @jreback specifically on this piece

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to mention PandasArray here, its not very interesting, nor relevant. Just say array-likes. and remove u.array, the u.to_numpy() is the corrent idiom here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
There is a big difference between u.array and u.to_numpy():

>>> s = pd.Series(['a', 'b', 'c', 'd'])
>>> s.array
<PandasArray>
['a', 'b', 'c', 'd']
Length: 4, dtype: object
>>> s.to_numpy()
array(['a', 'b', 'c', 'd'], dtype=object)

I'm guessing .array will eventually replace .values-usage (e.g. to get rid of the index for .str.cat), since it is by design better suited for pandas-internal dtypes, and so the distinction above is not just an irrelevant detail IMO.

I want to show here the explicitly allowed item types to pass into a list-like, which have to pass:

nxt = others.pop(0)
[...]
if not (isinstance(nxt, (Series, Index))
        or (isinstance(nxt, np.ndarray) and nxt.ndim == 1)):
    raise ValueError(...)  # currently just a DeprecationWarning

Long story short, I want to show a list-like containing an np.ndarray, a PandasArray (to reiterate, this example was already changed by @TomAugspurger to use .array instead of .values in #23623), and a Series. (Including Index would be nice-to-have, but too complicated absent #22225).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-vetinari .to_numpy() replaces .values; .array is user-accessible, but generally is not visible to by the user. its not necessary here and is just noise.

Copy link
Contributor Author

@h-vetinari h-vetinari Jan 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback

.to_numpy() replaces .values;

As I did in the last few commits...

.array is user-accessible, but generally is not visible to by the user. its not necessary here and is just noise.

Will remove, but pinging @TomAugspurger since he added the current u.array in this example in #23623 (although likely "just" for replacing u.values?).

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review; pushed new commit

@@ -303,23 +303,23 @@ The same alignment can be used when ``others`` is a ``DataFrame``:
Concatenating a Series and many objects into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All one-dimensional list-likes can be combined in a list-like container (including iterators, ``dict``-views, etc.):
Several items can be combined a list-like container (including iterators, ``dict``-views, etc.), which may contain ``Series``, ``Index``, ``PandasArray`` and ``np.ndarray``.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks

@@ -303,23 +303,23 @@ The same alignment can be used when ``others`` is a ``DataFrame``:
Concatenating a Series and many objects into a Series
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All one-dimensional list-likes can be combined in a list-like container (including iterators, ``dict``-views, etc.):
Several items can be combined a list-like container (including iterators, ``dict``-views, etc.), which may contain ``Series``, ``Index``, ``PandasArray`` and ``np.ndarray``.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I'm making an explicit list though, as only those types are allowed within list-likes


All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:
All elements without an index (e.g. ``PandasArray`` and ``np.ndarray``) within the passed list-like must match in length to the calling ``Series`` (or ``Index``),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PandasArray is very user-facing:

>>> s = pd.Series(['a', 'b' ,'c', 'd'])
>>> s.array
<PandasArray>
['a', 'b', 'c', 'd']
Length: 4, dtype: object

and the current example was recently changed to use .array instead of .values. I think this should be documented clearly.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise


All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:
All elements without an index (e.g. ``PandasArray`` and ``np.ndarray``) within the passed list-like must match in length to the calling ``Series`` (or ``Index``),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks. I have been somewhat on the sidelines for that conversation so I'll defer to @jreback specifically on this piece

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL


All elements must match in length to the calling ``Series`` (or ``Index``), except those having an index if ``join`` is not None:
All elements without an index (e.g. ``PandasArray`` and ``np.ndarray``) within the passed list-like must match in length to the calling ``Series`` (or ``Index``),
Copy link
Contributor Author

@h-vetinari h-vetinari Jan 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback

.to_numpy() replaces .values;

As I did in the last few commits...

.array is user-accessible, but generally is not visible to by the user. its not necessary here and is just noise.

Will remove, but pinging @TomAugspurger since he added the current u.array in this example in #23623 (although likely "just" for replacing u.values?).

@jreback jreback added this to the 0.24.0 milestone Jan 4, 2019
@jreback jreback merged commit 25c1c38 into pandas-dev:master Jan 4, 2019
@jreback
Copy link
Contributor

jreback commented Jan 4, 2019

thanks @h-vetinari

@h-vetinari h-vetinari deleted the str_cat_docs branch January 4, 2019 16:58
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants