
CLN/ERR: str.cat internals #22725

Merged
merged 9 commits into from
Oct 14, 2018

Conversation

h-vetinari
Contributor

@h-vetinari h-vetinari commented Sep 16, 2018

This is mainly a clean-up of internal methods for str.cat that I didn't want to touch within #20347.

As a side benefit of changing the implementation, this also solves #22721. Finally, I've also added a better message for TypeErrors (closes #22722)

closes #22721
closes #22722

  • tests modified / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Here's the ASV output. The original implementation of this PR (see first commit) used more higher-level pandas functions like fillna, dropna, etc. and was up to three times slower, so I tweaked it some more, and I actually believe that the last solution with interleave_sep is the most elegant anyway:

       before           after         ratio
     [37455764]       [4d1710f1]
         10.9±1ms         9.11±1ms    ~0.83  strings.Cat.time_cat(0, ',', '-', 0.0)
         9.55±1ms         10.9±0ms    ~1.15  strings.Cat.time_cat(0, ',', '-', 0.001)
         12.5±2ms         14.1±2ms    ~1.12  strings.Cat.time_cat(0, ',', '-', 0.15)
         9.94±1ms       9.23±0.7ms     0.93  strings.Cat.time_cat(0, ',', None, 0.0)
         14.3±2ms         8.68±1ms    ~0.61  strings.Cat.time_cat(0, ',', None, 0.001)
-        13.7±1ms       11.7±0.8ms     0.86  strings.Cat.time_cat(0, ',', None, 0.15)
       9.11±0.7ms         7.81±2ms    ~0.86  strings.Cat.time_cat(0, None, '-', 0.0)
       9.38±0.8ms         10.9±1ms    ~1.17  strings.Cat.time_cat(0, None, '-', 0.001)
         11.4±2ms         10.9±2ms     0.96  strings.Cat.time_cat(0, None, '-', 0.15)
         13.4±2ms       9.38±0.6ms    ~0.70  strings.Cat.time_cat(0, None, None, 0.0)
         9.23±2ms         11.7±1ms    ~1.27  strings.Cat.time_cat(0, None, None, 0.001)
         10.2±2ms         10.9±1ms     1.08  strings.Cat.time_cat(0, None, None, 0.15)
-        70.3±4ms         54.7±4ms     0.78  strings.Cat.time_cat(3, ',', '-', 0.0)
         62.5±4ms        46.9±20ms    ~0.75  strings.Cat.time_cat(3, ',', '-', 0.001)
        93.8±10ms        66.4±10ms    ~0.71  strings.Cat.time_cat(3, ',', '-', 0.15)
         62.5±4ms         62.5±4ms     1.00  strings.Cat.time_cat(3, ',', None, 0.0)
+        46.9±4ms         85.9±4ms     1.83  strings.Cat.time_cat(3, ',', None, 0.001)
         52.1±2ms         50.8±6ms     0.97  strings.Cat.time_cat(3, ',', None, 0.15)
         50.8±8ms         54.7±6ms     1.08  strings.Cat.time_cat(3, None, '-', 0.0)
         54.7±4ms         62.5±4ms    ~1.14  strings.Cat.time_cat(3, None, '-', 0.001)
         62.5±5ms         54.7±3ms    ~0.88  strings.Cat.time_cat(3, None, '-', 0.15)
         46.9±4ms         39.1±0ms    ~0.83  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±8ms         58.6±6ms    ~1.25  strings.Cat.time_cat(3, None, None, 0.001)
         46.9±9ms         54.7±4ms    ~1.17  strings.Cat.time_cat(3, None, None, 0.15)
                                                                  ^  ^     ^     ^
                                                                  |  |     |     |
                                                         other_cols  |   na_rep  |
                                                                     |           |
                                                                    sep        na_frac

There's a bunch of noise in there, but by and large, things don't look so bad IMO. Especially, when one excludes the not-so-common worst-case scenario of a very small but non-zero amount of NaNs (na_frac=0.001):

       before           after         ratio
     [37455764]       [4d1710f1]
         10.9±1ms         9.11±1ms    ~0.83  strings.Cat.time_cat(0, ',', '-', 0.0)
         12.5±2ms         14.1±2ms    ~1.12  strings.Cat.time_cat(0, ',', '-', 0.15)
         9.94±1ms       9.23±0.7ms     0.93  strings.Cat.time_cat(0, ',', None, 0.0)
-        13.7±1ms       11.7±0.8ms     0.86  strings.Cat.time_cat(0, ',', None, 0.15)
       9.11±0.7ms         7.81±2ms    ~0.86  strings.Cat.time_cat(0, None, '-', 0.0)
         11.4±2ms         10.9±2ms     0.96  strings.Cat.time_cat(0, None, '-', 0.15)
         13.4±2ms       9.38±0.6ms    ~0.70  strings.Cat.time_cat(0, None, None, 0.0)
         10.2±2ms         10.9±1ms     1.08  strings.Cat.time_cat(0, None, None, 0.15)
-        70.3±4ms         54.7±4ms     0.78  strings.Cat.time_cat(3, ',', '-', 0.0)
         62.5±4ms        46.9±20ms    ~0.75  strings.Cat.time_cat(3, ',', '-', 0.001)
        93.8±10ms        66.4±10ms    ~0.71  strings.Cat.time_cat(3, ',', '-', 0.15)
         62.5±4ms         62.5±4ms     1.00  strings.Cat.time_cat(3, ',', None, 0.0)
         52.1±2ms         50.8±6ms     0.97  strings.Cat.time_cat(3, ',', None, 0.15)
         50.8±8ms         54.7±6ms     1.08  strings.Cat.time_cat(3, None, '-', 0.0)
         62.5±5ms         54.7±3ms    ~0.88  strings.Cat.time_cat(3, None, '-', 0.15)
         46.9±4ms         39.1±0ms    ~0.83  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±9ms         54.7±4ms    ~1.17  strings.Cat.time_cat(3, None, None, 0.15)
                                                                  ^  ^     ^     ^
                                                                  |  |     |     |
                                                         other_cols  |   na_rep  |
                                                                     |           |
                                                                    sep        na_frac
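
The four arguments in each benchmark row correspond to an ASV parameter grid. The thread does not show the benchmark itself, so the following is only a sketch of what such a class might look like: the data generation, the size `N` and the helper `make_series` are assumptions made for illustration, not the code in pandas' `asv_bench`.

```python
import numpy as np
import pandas as pd


class Cat:
    # Parameter grid matching the tuples in the tables above.
    params = ([0, 3], [None, ","], [None, "-"], [0.0, 0.001, 0.15])
    param_names = ["other_cols", "sep", "na_rep", "na_frac"]
    N = 10_000  # assumed size, chosen only for this sketch

    def setup(self, other_cols, sep, na_rep, na_frac):
        rng = np.random.default_rng(0)

        def make_series():
            # Hypothetical helper: short strings with a fraction of NaNs.
            values = rng.integers(0, 1000, self.N).astype(str).astype(object)
            values[rng.random(self.N) < na_frac] = np.nan
            return pd.Series(values, dtype=object)

        self.s = make_series()
        if other_cols == 0:
            self.others = None  # concatenate the Series with itself only
        else:
            self.others = pd.DataFrame(
                {i: make_series() for i in range(other_cols)}
            )

    def time_cat(self, other_cols, sep, na_rep, na_frac):
        # The timed call: only sep and na_rep are passed to str.cat.
        self.s.str.cat(others=self.others, sep=sep, na_rep=na_rep)
```

With `other_cols=0` the call reduces the Series to a single string; with `other_cols=3` it concatenates row-wise across four columns, which is why the second half of each table is several times slower.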

@pep8speaks

pep8speaks commented Sep 16, 2018

Hello @h-vetinari! Thanks for updating the PR.

Comment last updated on September 17, 2018 at 09:16 Hours UTC

@gfyoung gfyoung added Strings String extension data type and string data Error Reporting Incorrect or improved errors from pandas Clean labels Sep 16, 2018
@gfyoung
Member

gfyoung commented Sep 16, 2018

@WillAyd @jreback : The conversations went a little stale in the original issues, and I'm not sure how well these changes align with what you guys were suggesting or saying in them.

@h-vetinari
Contributor Author

h-vetinari commented Sep 16, 2018

@gfyoung @WillAyd @jreback
I opened #22721 because the behaviour was changed through this refactor. As I said in the issue, I'm happy to disable it, but that would need adding in some check against binary data (because np.sum -- contrary to the previous implementation -- doesn't throw an error for binary data).

I do however stand by:

Beyond that, it's IMO the .str-accessor that should be enforcing the correct types, but since there's no dedicated string-dtype yet, the methods that do work for other object data (e.g. lists) are also used like that (e.g. people use .str.len() to get the different length of a Series of lists).

In other words, I'm +epsilon on allowing (consistent, reasonable) off-label use of the .str-accessor until there is a string dtype.

#22722 is equally something that I'm not going to fight for. I got some less than ideal error messages while testing, and decided this could/should be improved. If you disagree, it's easy to remove the offending lines.

@gfyoung
Member

gfyoung commented Sep 16, 2018

@h-vetinari : Not a problem. I was only reading through the issues and your PR and matching up what you were doing with what was being said. Just pinging to get their eyes on this.

@WillAyd
Member

WillAyd commented Sep 17, 2018

I'm still very much against setting the expectation that the .str accessor will work with byte objects.

Also, I generally don't think that a "cleanup" should be performed in the same PR that changes the expected functionality of the codebase. Would be much easier to stick to one thing at a time, i.e. a clean up that doesn't introduce or change any existing behavior

@h-vetinari
Contributor Author

h-vetinari commented Sep 17, 2018

@WillAyd

Would be much easier to stick to one thing at a time, i.e. a clean up that doesn't introduce or change any existing behavior

I reverted the (unrelated) fix for #22721, but the situation for #22722 is more delicate. The genesis was as follows:

  1. refactor
  2. see that behaviour changed
  3. (sorta) agree with new behaviour
  4. open issue so that changed behaviour in PR is explained

It turns out that the situation is even a bit more delicate, as currently on master, bytes can be successfully concatenated as long as sep is explicitly set:

(pandas-dev) C:\Users\[...]\pddev>python
Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.24.0.dev0+586.g8a1c8ad4b'
>>> s = pd.Series(np.array(list('abc'), 'S1').astype(object))
>>> t = pd.Series(np.array(list('def'), 'S1').astype(object))
>>> s.str.cat(t)
TypeError: sequence item 0: expected str instance, bytes found
>>> s.str.cat(t, sep=b'')
0    b'ad'
1    b'be'
2    b'cf'
dtype: object
>>> s.str.cat(t, sep=b',')
0    b'a,d'
1    b'b,e'
2    b'c,f'
dtype: object

The problem was that tests/frame/test_strings.test_method_on_bytes only tests sep=None.

This is the sort of thing I was talking about with the missing string dtype. Without one, there are legitimate off-label uses for the .str-accessor, and concatenating bytes already works like a charm (not to mention several other methods from .str). The only real change here then would be that sep=None does not automatically trigger the TypeError.

@h-vetinari h-vetinari force-pushed the cln_str_cat branch 2 times, most recently from a0975fd to f79c707 Compare September 17, 2018 09:16
@jreback
Contributor

jreback commented Sep 18, 2018

@h-vetinari pls pls pls 1 thing per PR. We do NOT handle bytes in .str; if you want to add tests and raise, pls do so, but we're not going to 'make it work better'. It is amazingly confusing and causes all sorts of errors. We probably don't have explicit checks on this (though I thought that we always infer that the strings must be string/unicode and never bytes).

Contributor

@jreback jreback left a comment


comments

@h-vetinari
Contributor Author

@jreback

We do NOT handle bytes in .str

Yes you (currently) do. Just try the code I posted above.

pls pls pls 1 thing per PR.

The whatsnew-note notwithstanding, this PR only changes the implementation (you'll see in the test that I've not changed anything substantial)

I understand that you don't want people using .str for byte data, but it works currently. The problem is that there's no good dtype distinction, and inspecting every element of a Series when calling .str would come with a big perf hit.
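
To make that cost concrete: one way to detect bytes is `pandas.api.types.infer_dtype`, which has to look at the elements of an object array rather than at the dtype alone. A minimal sketch, assuming a hypothetical helper `holds_bytes` that is not part of pandas or of this PR:

```python
import pandas as pd
from pandas.api.types import infer_dtype


def holds_bytes(values):
    """Hypothetical helper: True if the inferred element type is bytes."""
    # infer_dtype inspects the values; for object dtype there is no
    # cheaper way to tell str from bytes.
    return infer_dtype(values, skipna=True) == "bytes"


s = pd.Series([b"a", b"b", b"c"], dtype=object)
t = pd.Series(["a", "b", "c"], dtype=object)

print(holds_bytes(s), holds_bytes(t))
```

Both Series have the same `object` dtype, so only the element scan distinguishes them.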

@jreback
Contributor

jreback commented Sep 18, 2018

Yes you (currently) do. Just try the code I posted above.

It may happen to work. Instead of refactoring this as I said above, would prefer tests / and better error messages with bytes inputs.

@h-vetinari
Contributor Author

@jreback

It may happen to work. Instead of refactoring this as I said above, would prefer tests / and better error messages with bytes inputs.

The point is that this PR does not change the current behaviour and should stand on its merits, unrelated to the fact that you'd like to disallow .str on bytes.

@jreback
Contributor

jreback commented Sep 18, 2018

@h-vetinari you removed tests, so clearly you are changing things.

@h-vetinari
Contributor Author

h-vetinari commented Sep 18, 2018

@jreback

you removed tests, so clearly you are changing things.

I removed one test for the internal method that has been factored away. Furthermore, this removed test (test_cat) is exactly replicated in the test directly below (test_str_cat).

@codecov

codecov bot commented Sep 19, 2018

Codecov Report

Merging #22725 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #22725      +/-   ##
==========================================
- Coverage    92.2%   92.19%   -0.01%     
==========================================
  Files         169      169              
  Lines       50924    50900      -24     
==========================================
- Hits        46952    46928      -24     
  Misses       3972     3972
Flag Coverage Δ
#multiple 90.61% <100%> (-0.01%) ⬇️
#single 42.32% <6.45%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/strings.py 98.58% <100%> (-0.05%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8ce3d0...e58ec9d. Read the comment docs.

Contributor Author

@h-vetinari h-vetinari left a comment


@jreback

PTAL

two = np.array(['a', NA, 'b', 'd', 'foo', NA], dtype=np.object_)

# single array
result = strings.str_cat(one)
Contributor Author


@jreback

I removed one test for the internal method that has been factored away

Please have a look here - this is directly importing the internal method and testing it (not str.cat)

Contributor


ok

rgx = 'All arrays must be same length'
three = Series(['1', '2', '3'])

with tm.assert_raises_regex(ValueError, rgx):
Contributor Author


Furthermore, this removed test (test_cat) is exactly replicated in the test directly below (test_str_cat).

I can't mark lines that are not in the diff, but check out

result = s.str.cat()

I replicated the removed test (test_cat, which acted on strings.str_cat) as a test acting on str.cat within #20347.

@h-vetinari
Contributor Author

@jreback
While you're at it with the reviewing, please don't forget this one. :)

def str_cat(arr, others=None, sep=None, na_rep=None):
"""
def interleave_sep(all_cols, sep):
'''
Contributor


use triple-double quotes

def str_cat(arr, others=None, sep=None, na_rep=None):
"""
def interleave_sep(all_cols, sep):
'''
Contributor


all_cols -> list_of_columns

result = str_cat(data, others=others, sep=sep, na_rep=na_rep)
return self._wrap_result(result,
use_codes=(not self._is_categorical))
data = data.astype(object).values
Contributor


why is this astype needed?

Contributor Author


I used it because data may be categorical, and then values is not necessarily a numpy array. Changed to ensure_object which you mentioned below, hope this is better.

if na_rep is None:
return sep.join(data[~mask])
return sep.join(np.where(mask, na_rep, data))
return sep.join(data)
Contributor


can we do a single sep.join, and just have the branches mask the data as needed

Contributor Author


Done
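
A sketch of what a single `sep.join` with masking branches could look like, assuming `data` is a 1-D object array of strings and NaNs. The name `join_single_column` is hypothetical and this is not the code that was merged:

```python
import numpy as np
import pandas as pd


def join_single_column(data, sep=None, na_rep=None):
    """Hypothetical helper: join one column of strings into a single str."""
    data = np.asarray(data, dtype=object)
    sep = "" if sep is None else sep
    mask = pd.isna(data)
    if na_rep is None:
        # No replacement given: drop missing values before joining.
        data = data[~mask]
    else:
        # Replacement given: substitute it wherever a value is missing.
        data = np.where(mask, na_rep, data)
    return sep.join(data)  # the only join, shared by both branches


print(join_single_column(["a", np.nan, "b"], sep=","))
print(join_single_column(["a", np.nan, "b"], sep=",", na_rep="-"))
```

The branches only decide how the data is masked; when there are no missing values, both leave the data effectively unchanged.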

data, others = data.align(others, join=join)
others = [others[x] for x in others] # again list of Series

# str_cat discards index
res = str_cat(data, others=others, sep=sep, na_rep=na_rep)
all_cols = [x.astype(object).values for x in [data] + others]
Contributor


why do you need the astype, much prefer ensure_object generally

Contributor Author


Done

masks = np.array([isna(x) for x in all_cols])
union_mask = np.logical_or.reduce(masks, axis=0)

if na_rep is None and union_mask.any():
Contributor


comment on these cases

Contributor Author


Added comments
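
For readers following along, the case with `na_rep is None` and at least one missing value can be illustrated as follows. This is a sketch under the assumption that a row with any missing part should come out as NaN; the variable names mirror the diff context above, but the two-column setup is made up for the example:

```python
import numpy as np
import pandas as pd

all_cols = [
    np.array(["a", np.nan, "c"], dtype=object),
    np.array(["x", "y", np.nan], dtype=object),
]

# One boolean mask per column, then their row-wise union.
masks = np.array([pd.isna(col) for col in all_cols])
union_mask = np.logical_or.reduce(masks, axis=0)

result = np.empty(len(union_mask), dtype=object)
result[union_mask] = np.nan  # any missing part makes the whole row missing
not_masked = ~union_mask
# Only rows without missing values are actually concatenated.
result[not_masked] = all_cols[0][not_masked] + all_cols[1][not_masked]

print(result)
```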

result[not_masked] = np.sum(all_cols, axis=0)
elif na_rep is not None and union_mask.any():
# fill NaNs
all_cols = [np.where(masks[i], na_rep, all_cols[i])
Contributor


use zip(masks, all_cols)
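
The suggestion amounts to pairing each mask with its column instead of indexing both by position. A small sketch with made-up data (the columns and `na_rep` value are assumptions for illustration):

```python
import numpy as np
import pandas as pd

na_rep = "-"
all_cols = [
    np.array(["a", np.nan, "c"], dtype=object),
    np.array(["x", "y", np.nan], dtype=object),
]
masks = [pd.isna(col) for col in all_cols]

# Fill NaNs column by column; zip keeps each mask next to its column.
all_cols = [np.where(mask, na_rep, col) for mask, col in zip(masks, all_cols)]

print([col.tolist() for col in all_cols])
```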

return all_cols
result = [sep] * (2 * len(all_cols) - 1)
result[::2] = all_cols
return result
Contributor


I would simply do np.sum(result) here, no?

Contributor Author


OK, that's reasonable. Refactored the function as necessary
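
The interleaving idea can be sketched as below. Note the assumptions: the function name `cat_columns` is hypothetical, and the sketch adds the parts up with `functools.reduce` and `operator.add` instead of `np.sum`, to avoid depending on how a given NumPy version builds an array from a mixed list of arrays and strings.

```python
import operator
from functools import reduce

import numpy as np


def cat_columns(list_of_columns, sep):
    """Hypothetical helper: element-wise concatenation of object arrays."""
    if sep == "":
        parts = list(list_of_columns)
    else:
        # [col0, sep, col1, sep, col2, ...]: columns at even positions.
        parts = [sep] * (2 * len(list_of_columns) - 1)
        parts[::2] = list_of_columns
    # Object arrays add element-wise; a plain str is broadcast to every row.
    return reduce(operator.add, parts)


cols = [
    np.array(["a", "b"], dtype=object),
    np.array(["c", "d"], dtype=object),
    np.array(["e", "f"], dtype=object),
]
print(cat_columns(cols, ","))
```

Interleaving the separator once per gap avoids a separate code path for each combination of `sep` and number of columns.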


@@ -3136,7 +3089,7 @@ def test_method_on_bytes(self):
lhs = Series(np.array(list('abc'), 'S1').astype(object))
rhs = Series(np.array(list('def'), 'S1').astype(object))
if compat.PY3:
pytest.raises(TypeError, lhs.str.cat, rhs)
pytest.raises(TypeError, lhs.str.cat, rhs, sep=',')
Contributor


is this the bytes concat?

Contributor Author


Yes

Contributor Author

@h-vetinari h-vetinari left a comment


Thanks for review; pushed new commits


@h-vetinari
Contributor Author

@jreback

PTAL


@h-vetinari
Contributor Author

@WillAyd @jreback
Unfortunately, I have bad news. I started out in this PR with a very idiomatic solution (see the first couple commits), and it was just too slow.

Here's the ASV for the last commit:

All benchmarks:

       before           after         ratio
     [b28cf5aa]       [a97fe67e]
     <master>         <cln_str_cat>
         9.38±0ms         9.38±0ms     1.00  strings.Cat.time_cat(0, ',', '-', 0.0)
       9.38±0.8ms         10.9±0ms    ~1.17  strings.Cat.time_cat(0, ',', '-', 0.001)
         10.9±0ms       12.5±0.6ms    ~1.14  strings.Cat.time_cat(0, ',', '-', 0.15)
       9.38±0.6ms       8.59±0.8ms     0.92  strings.Cat.time_cat(0, ',', None, 0.0)
-      12.5±0.8ms         10.9±0ms     0.87  strings.Cat.time_cat(0, ',', None, 0.001)
-        14.1±0ms         10.9±0ms     0.78  strings.Cat.time_cat(0, ',', None, 0.15)
         9.38±0ms       9.38±0.8ms     1.00  strings.Cat.time_cat(0, None, '-', 0.0)
         9.38±0ms       9.38±0.8ms     1.00  strings.Cat.time_cat(0, None, '-', 0.001)
       10.9±0.8ms       12.5±0.8ms    ~1.14  strings.Cat.time_cat(0, None, '-', 0.15)
         10.9±2ms       7.81±0.8ms    ~0.71  strings.Cat.time_cat(0, None, None, 0.0)
         7.81±0ms       10.9±0.8ms    ~1.40  strings.Cat.time_cat(0, None, None, 0.001)
         10.9±2ms       10.9±0.8ms     1.00  strings.Cat.time_cat(0, None, None, 0.15)
         78.1±8ms         93.8±8ms    ~1.20  strings.Cat.time_cat(3, ',', '-', 0.0)
+        62.5±8ms          109±0ms     1.75  strings.Cat.time_cat(3, ',', '-', 0.001)
+        78.1±8ms          125±8ms     1.60  strings.Cat.time_cat(3, ',', '-', 0.15)
+        46.9±8ms         93.8±0ms     2.00  strings.Cat.time_cat(3, ',', None, 0.0)
         62.5±8ms          109±0ms    ~1.75  strings.Cat.time_cat(3, ',', None, 0.001)
+        46.9±8ms          102±8ms     2.17  strings.Cat.time_cat(3, ',', None, 0.15)
         62.5±0ms         78.1±6ms    ~1.25  strings.Cat.time_cat(3, None, '-', 0.0)
+        46.9±8ms         93.8±0ms     2.00  strings.Cat.time_cat(3, None, '-', 0.001)
+        62.5±8ms          125±6ms     2.00  strings.Cat.time_cat(3, None, '-', 0.15)
+        62.5±8ms         85.9±8ms     1.38  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±6ms         93.8±0ms    ~2.00  strings.Cat.time_cat(3, None, None, 0.001)
         46.9±0ms         93.8±0ms    ~2.00  strings.Cat.time_cat(3, None, None, 0.15)

So especially when others is not None (all the pd.concat and dealing with DataFrames) we lose perf.
As a comparison, here's the ASV before the idiomatic changes @WillAyd requested:

All benchmarks:

       before           after         ratio
     [b28cf5aa]       [0d3c6d21]
     <master>         <cln_str_cat~2>
         9.38±0ms         9.38±0ms     1.00  strings.Cat.time_cat(0, ',', '-', 0.0)
         9.38±0ms         10.9±0ms    ~1.17  strings.Cat.time_cat(0, ',', '-', 0.001)
         10.9±0ms         12.5±0ms    ~1.14  strings.Cat.time_cat(0, ',', '-', 0.15)
       9.38±0.6ms         9.38±0ms     1.00  strings.Cat.time_cat(0, ',', None, 0.0)
       14.1±0.8ms         10.9±0ms    ~0.78  strings.Cat.time_cat(0, ',', None, 0.001)
       14.1±0.6ms       11.7±0.8ms    ~0.83  strings.Cat.time_cat(0, ',', None, 0.15)
         9.38±0ms         9.38±0ms     1.00  strings.Cat.time_cat(0, None, '-', 0.0)
       9.38±0.6ms       10.9±0.6ms    ~1.17  strings.Cat.time_cat(0, None, '-', 0.001)
         10.9±0ms         12.5±0ms    ~1.14  strings.Cat.time_cat(0, None, '-', 0.15)
         10.9±0ms         7.81±2ms    ~0.71  strings.Cat.time_cat(0, None, None, 0.0)
         9.38±0ms       10.9±0.8ms    ~1.17  strings.Cat.time_cat(0, None, None, 0.001)
       10.9±0.8ms       11.7±0.8ms     1.07  strings.Cat.time_cat(0, None, None, 0.15)
         78.1±0ms         78.1±8ms     1.00  strings.Cat.time_cat(3, ',', '-', 0.0)
         70.3±8ms         78.1±0ms    ~1.11  strings.Cat.time_cat(3, ',', '-', 0.001)
         93.8±8ms         93.8±0ms     1.00  strings.Cat.time_cat(3, ',', '-', 0.15)
         46.9±6ms         78.1±0ms    ~1.67  strings.Cat.time_cat(3, ',', None, 0.0)
         46.9±8ms         78.1±0ms    ~1.67  strings.Cat.time_cat(3, ',', None, 0.001)
         46.9±6ms         62.5±8ms    ~1.33  strings.Cat.time_cat(3, ',', None, 0.15)
         54.7±8ms         54.7±8ms     1.00  strings.Cat.time_cat(3, None, '-', 0.0)
         46.9±8ms         62.5±8ms    ~1.33  strings.Cat.time_cat(3, None, '-', 0.001)
         62.5±0ms        78.1±10ms    ~1.25  strings.Cat.time_cat(3, None, '-', 0.15)
         62.5±6ms         46.9±8ms    ~0.75  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±6ms         62.5±0ms    ~1.33  strings.Cat.time_cat(3, None, None, 0.001)
         46.9±0ms         62.5±6ms    ~1.33  strings.Cat.time_cat(3, None, None, 0.15)

This isn't great, but not too bad IMO. Obviously it costs us to uselessly add in sep='' just to catch TypeErrors that should already be caught in the .str-accessor. I have something in mind there as well.
The direct comparison:

                        HEAD~2 vs. master  HEAD vs. master HEAD vs. HEAD~2 HvH2 increase
(3, ',', '-', 0.0)                   1.00             1.20            1.20       +20.00%
(3, ',', '-', 0.001)                 1.11             1.75            1.58       +57.66%
(3, ',', '-', 0.15)                  1.00             1.60            1.60       +60.00%
(3, ',', None, 0.0)                  1.67             2.00            1.20       +19.76%
(3, ',', None, 0.001)                1.67             1.75            1.05        +4.79%
(3, ',', None, 0.15)                 1.33             2.17            1.63       +63.16%
(3, None, '-', 0.0)                  1.00             1.25            1.25       +25.00%
(3, None, '-', 0.001)                1.33             2.00            1.50       +50.38%
(3, None, '-', 0.15)                 1.25             2.00            1.60       +60.00%
(3, None, None, 0.0)                 0.75             1.38            1.84       +84.00%
(3, None, None, 0.001)               1.33             2.00            1.50       +50.38%
(3, None, None, 0.15)                1.33             2.00            1.50       +50.38%
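
The two right-hand columns follow from the first two: "HEAD vs. HEAD~2" is the quotient of the two ratios against master, and the increase is that quotient minus one. A quick check on three rows, using the rounded values from the table (so the percentages inherit that rounding):

```python
# (HEAD~2 vs. master, HEAD vs. master), copied from the table above.
rows = {
    "(3, ',', '-', 0.001)": (1.11, 1.75),
    "(3, ',', None, 0.0)": (1.67, 2.00),
    "(3, None, None, 0.0)": (0.75, 1.38),
}


def relative_increase(head2_vs_master, head_vs_master):
    """Slow-down of HEAD relative to HEAD~2, in percent."""
    return (head_vs_master / head2_vs_master - 1) * 100


for params, (head2, head) in rows.items():
    print(f"{params:<24} {head / head2:.2f} {relative_increase(head2, head):+.2f}%")
```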

Finally, as a general warning about the results right after the run with SOME BENCHMARKS CHANGED SIGNIFICANTLY etc.: all benchmarks with a ~ in their ratio are falsely omitted from those results. I've opened airspeed-velocity/asv#752 for that.

@WillAyd
Member

WillAyd commented Oct 11, 2018

IIUC you are saying the last batch of changes requested are causing performance to suffer anywhere between 20-80%? I've been wrong before but at the same time I've never seen instances where applying a function via list comprehension would be significantly faster than applying to the entire frame. Would be helpful if you could profile and debug further

@h-vetinari
Contributor Author

IIUC you are saying the last batch of changes requested are causing performance to suffer anywhere between 20-80%? I've been wrong before but at the same time I've never seen instances where applying a function via list comprehension would be significantly faster than applying to the entire frame.

You do understand correctly. The list-comps themselves (i.e. not counting what goes on inside) are pretty fast, but the more important part is staying in numpy-land, and only going to pandas-land where absolutely necessary. You can check some of the earlier commits (and the ASVs at the top) yourself. In short: working on pandas objects like DataFrame, pd.concat is expensive compared to pure numpy.

Would be helpful if you could profile and debug further

I did it in the beginning of this PR already, with the above conclusion. Since IMO "Non-idiomatic" < PERF (by a large margin), this case is settled for me.

@h-vetinari
Contributor Author

h-vetinari commented Oct 11, 2018

I've never seen instances where applying a function via list comprehension would be significantly faster than applying to the entire frame.

It's not just that single comprehension either, we're concatenating more often (before it was just for the alignment), to always get a DataFrame.

Contributor Author

@h-vetinari h-vetinari left a comment


@WillAyd
Some further explanation in the diff of the last commit:
https://github.com/pandas-dev/pandas/pull/22725/commits/e58ec9dfa82a459d9b316b678b77d50fc4901e9e

# concatenate others into DataFrame; need to add keys for uniqueness in
# case of duplicate columns (for join is None, all indexes are already
# the same after _get_series_list, which forces alignment in this case)
others = concat(others, axis=1,
Contributor Author


@WillAyd, this is the main reason for the slow-down. For working with a DataFrame below (as you wished), we first need to create it with pd.concat (expensive). Before, we were only using pd.concat if the indices need to be aligned (which they don't in the benchmarks).

@h-vetinari
Contributor Author

@jreback @WillAyd
Can we sacrifice the idiomatic code for perf? Or how do we proceed here?

@WillAyd
Member

WillAyd commented Oct 12, 2018

pd.concat is not expensive. In fact here's a small comparison of the initial part of both code branches:

In [50]: sers = [pd.Series(np.arange(100_000)) for x in range(10)] 
 
In [57]: %%timeit  
    ...: all_cols = [ensure_object(x) for x in sers]                                                                     
38.2 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [60]: %%timeit  
    ...: df = pd.concat(sers, axis=1) 
    ...: all_cols_df = ensure_object(df)                                                                 
37.8 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I think the problem may be that ensure_object against a DataFrame returns an ndarray of ndarrays with a shape of (100_000, 10), whereas all_cols has a shape of (10, 100_000). It would be helpful if you could inspect this more closely.
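
The orientation difference can be reproduced on a small scale. This sketch uses the public `to_numpy(dtype=object)` as a stand-in for the internal `ensure_object` (an assumption made so the example only relies on public API), with 3 columns of 5 rows instead of 10 columns of 100_000 rows:

```python
import numpy as np
import pandas as pd

sers = [pd.Series(np.arange(5)) for _ in range(3)]

# List-comprehension branch: one 1-D array per column.
all_cols = [ser.to_numpy(dtype=object) for ser in sers]
stacked = np.asarray(all_cols, dtype=object)

# DataFrame branch: a single 2-D array, rows first.
frame_values = pd.concat(sers, axis=1).to_numpy(dtype=object)

print(stacked.shape)       # columns along axis 0
print(frame_values.shape)  # columns along axis 1
```

Iterating over `frame_values` therefore yields rows, not columns, so code that expects a list of columns would need the transpose.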

Unless one of the other devs objects, I would really prefer this to be idiomatic from a pandas perspective. And just to be clear on what that actually means, list comprehensions over 2D data are NOT idiomatic when operations can be performed against a DataFrame instead. We can always optimize operations with the latter but are limited in regards to the former.

@h-vetinari
Contributor Author

There's not much to go on. The last commit shows how little changed: https://github.com/pandas-dev/pandas/pull/22725/commits/e58ec9dfa82a459d9b316b678b77d50fc4901e9e (I formatted this as code to prevent GitHub from mangling the actual comparison URL).

  • cat_core doesn't have to np.split
  • one less pd.concat (because it's now only necessary for different indexes)
  • list comp instead of DataFrame operations

I honestly don't see where it's coming from if not the useless concatenating and then splitting again (because, for interleaving sep in cat_core, we need a list of Series anyway). Of course it'd be nicer to have idiomatic code (I tried that right off the bat, but it was 2-3x slower), but ultimately perf should dictate this. All of the cython code isn't pandas-idiomatic either. ;-)

@jreback
Contributor

jreback commented Oct 12, 2018

@h-vetinari code maintainability is actually the most important property of changes
pls use more idiomatic constructions as @WillAyd indicates

@jreback
Contributor

jreback commented Oct 14, 2018

looks ok to me; @WillAyd merge when satisfied

@WillAyd WillAyd merged commit f9d237b into pandas-dev:master Oct 14, 2018
@WillAyd
Member

WillAyd commented Oct 14, 2018

Thanks @h-vetinari !

Labels
Clean Error Reporting Incorrect or improved errors from pandas Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve TypeError message for str.cat str.cat not working with binary data on Python3
5 participants