-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN/ERR: str.cat internals #22725
CLN/ERR: str.cat internals #22725
Conversation
Hello @h-vetinari! Thanks for updating the PR.
Comment last updated on September 17, 2018 at 09:16 Hours UTC |
@gfyoung @WillAyd @jreback I do however stand by:
In other words, I'm +epsilon on allowing (consistent, reasonable) off-label use of the #22722 is equally something that I'm not going to fight for. I got some less than ideal error messages while testing, and decided this could/should be improved. If you disagree, it's easy to remove the offending lines. |
@h-vetinari : Not a problem. I was only reading through the issues and your PR and matching up what you were doing with what was being said. Just pinging to get their eyes on this. |
I'm still very much against setting the expectation that the Also generally don't think that a "cleanup" should be performed in a same PR that changes the expected functionality of the codebase. Would be much easier to stick to one thing at a time, i.e. a clean up that doesn't introduce or change any existing behavior |
5506842
to
87b815b
Compare
I reverted the (unrelated) fix for #22721, but the situation for #22722 is more delicate. The genesis was as follows:
Turns out, that the situation is even a bit more delicate, as currently on master, bytes can be successfully concatenated as long as
The problem was that This is the sort of thing I was talking about with the missing string dtype. Without one, there are legitimate off-label uses for the |
a0975fd
to
f79c707
Compare
@h-vetinari pls pls pls 1 thing per PR. We do NOT handle bytes in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments
Yes you (currently) do. Just try the code I posted above.
The whatsnew-note notwithstanding, this PR only changes the implementation (you'll see in the test that I've not changed anything substantial) I understand that you don't want people using |
It may happen to work. Instead of refactoring this as I said above, would prefer tests / and better error messages with bytes inputs. |
f79c707
to
fcd11b5
Compare
The point is that this PR does not change the current behaviour and should stand on its merits, unrelated to the fact that you'd like to disallow |
@h-vetinari you removed tests, so clearly you are changing things. |
I removed one test for the internal method that has been factored away. Furthermore, this removed test ( |
Codecov Report
@@ Coverage Diff @@
## master #22725 +/- ##
==========================================
- Coverage 92.2% 92.19% -0.01%
==========================================
Files 169 169
Lines 50924 50900 -24
==========================================
- Hits 46952 46928 -24
Misses 3972 3972
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
two = np.array(['a', NA, 'b', 'd', 'foo', NA], dtype=np.object_) | ||
|
||
# single array | ||
result = strings.str_cat(one) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed one test for the internal method that has been factored away
Please have a look here - this is directly importing the internal method and testing it (not str.cat
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
rgx = 'All arrays must be same length' | ||
three = Series(['1', '2', '3']) | ||
|
||
with tm.assert_raises_regex(ValueError, rgx): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, this removed test (
test_cat
) is exactly replicated in the test directly below (test_str_cat
).
I can't mark lines that are not in the diff, but check out
pandas/pandas/tests/test_strings.py
Line 119 in fcd11b5
result = s.str.cat() |
I replicated the removed test (acting on strings.test_cat
) as a test acting on str.cat
within #20347.
@jreback |
pandas/core/strings.py
Outdated
def str_cat(arr, others=None, sep=None, na_rep=None): | ||
""" | ||
def interleave_sep(all_cols, sep): | ||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use triple-double quotes
pandas/core/strings.py
Outdated
def str_cat(arr, others=None, sep=None, na_rep=None): | ||
""" | ||
def interleave_sep(all_cols, sep): | ||
''' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all_cols -> list_of_columns
pandas/core/strings.py
Outdated
result = str_cat(data, others=others, sep=sep, na_rep=na_rep) | ||
return self._wrap_result(result, | ||
use_codes=(not self._is_categorical)) | ||
data = data.astype(object).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this astype needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used it because data may be categorical, and then values is not necessarily a numpy array. Changed to ensure_object
which you mentioned below, hope this is better.
if na_rep is None: | ||
return sep.join(data[~mask]) | ||
return sep.join(np.where(mask, na_rep, data)) | ||
return sep.join(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do a single sep.join, and just have the branches mask the data as needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/strings.py
Outdated
data, others = data.align(others, join=join) | ||
others = [others[x] for x in others] # again list of Series | ||
|
||
# str_cat discards index | ||
res = str_cat(data, others=others, sep=sep, na_rep=na_rep) | ||
all_cols = [x.astype(object).values for x in [data] + others] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need the astype, much prefer ensure_object
generally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
masks = np.array([isna(x) for x in all_cols]) | ||
union_mask = np.logical_or.reduce(masks, axis=0) | ||
|
||
if na_rep is None and union_mask.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on these cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments
pandas/core/strings.py
Outdated
result[not_masked] = np.sum(all_cols, axis=0) | ||
elif na_rep is not None and union_mask.any(): | ||
# fill NaNs | ||
all_cols = [np.where(masks[i], na_rep, all_cols[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use zip(masks, all_cols)
pandas/core/strings.py
Outdated
return all_cols | ||
result = [sep] * (2 * len(all_cols) - 1) | ||
result[::2] = all_cols | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would simply do np.sum(result)
here, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's reasonable. Refactored the function as necessary
two = np.array(['a', NA, 'b', 'd', 'foo', NA], dtype=np.object_) | ||
|
||
# single array | ||
result = strings.str_cat(one) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
pandas/tests/test_strings.py
Outdated
@@ -3136,7 +3089,7 @@ def test_method_on_bytes(self): | |||
lhs = Series(np.array(list('abc'), 'S1').astype(object)) | |||
rhs = Series(np.array(list('def'), 'S1').astype(object)) | |||
if compat.PY3: | |||
pytest.raises(TypeError, lhs.str.cat, rhs) | |||
pytest.raises(TypeError, lhs.str.cat, rhs, sep=',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the bytes concat?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review; pushed new commits
pandas/core/strings.py
Outdated
return all_cols | ||
result = [sep] * (2 * len(all_cols) - 1) | ||
result[::2] = all_cols | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that's reasonable. Refactored the function as necessary
pandas/core/strings.py
Outdated
result = str_cat(data, others=others, sep=sep, na_rep=na_rep) | ||
return self._wrap_result(result, | ||
use_codes=(not self._is_categorical)) | ||
data = data.astype(object).values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used it because data may be categorical, and then values is not necessarily a numpy array. Changed to ensure_object
which you mentioned below, hope this is better.
if na_rep is None: | ||
return sep.join(data[~mask]) | ||
return sep.join(np.where(mask, na_rep, data)) | ||
return sep.join(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/strings.py
Outdated
data, others = data.align(others, join=join) | ||
others = [others[x] for x in others] # again list of Series | ||
|
||
# str_cat discards index | ||
res = str_cat(data, others=others, sep=sep, na_rep=na_rep) | ||
all_cols = [x.astype(object).values for x in [data] + others] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
masks = np.array([isna(x) for x in all_cols]) | ||
union_mask = np.logical_or.reduce(masks, axis=0) | ||
|
||
if na_rep is None and union_mask.any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments
pandas/tests/test_strings.py
Outdated
@@ -3136,7 +3089,7 @@ def test_method_on_bytes(self): | |||
lhs = Series(np.array(list('abc'), 'S1').astype(object)) | |||
rhs = Series(np.array(list('def'), 'S1').astype(object)) | |||
if compat.PY3: | |||
pytest.raises(TypeError, lhs.str.cat, rhs) | |||
pytest.raises(TypeError, lhs.str.cat, rhs, sep=',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
PTAL |
1 similar comment
PTAL |
@WillAyd @jreback Here's the ASV for the last commit:
So especially when
This isn't great, but not too bad IMO. Obviously it costs us to uselessly add in
Finally, as a general warning about the results right after the run with |
fad997f
to
a97fe67
Compare
IIUC you are saying the last batch of changes requested are causing performance to suffer anywhere between 20-80%? I've been wrong before but at the same time I've never seen instances where applying a function via list comprehension would be significantly faster than applying to the entire frame. Would be helpful if you could profile and debug further |
You do understand correctly. The list-comps themselves (i.e. not counting what goes on inside) are pretty fast, but the more important part is staying in numpy-land, and only going to pandas-land where absolutely necessary. You can check some of the earlier commits (and the ASVs at the top) yourself. In short: working on pandas objects like
I did it in the beginning of this PR already, with the above conclusion. Since IMO "Non-idiomatic" < PERF (by a large margin), this case is settled for me. |
It's not just that single comprehension either, we're concatenating more often (before it was just for the alignment), to always get a DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
Some further explanation in the diff of the last commit:
https://github.com/pandas-dev/pandas/pull/22725/commits/e58ec9dfa82a459d9b316b678b77d50fc4901e9e
pandas/core/strings.py
Outdated
# concatenate others into DataFrame; need to add keys for uniqueness in | ||
# case of duplicate columns (for join is None, all indexes are already | ||
# the same after _get_series_list, which forces alignment in this case) | ||
others = concat(others, axis=1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd, this is the main reason for the slow-down. For working with a DataFrame
below (as you wished), we first need to create it with pd.concat
(expensive). Before, we were only using pd.concat
if the indices need to be aligned (which they don't in the benchmarks).
In [50]: sers = [pd.Series(np.arange(100_000)) for x in range(10)]
In [57]: %%timeit
...: all_cols = [ensure_object(x) for x in sers]
38.2 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [60]: %%timeit
...: df = pd.concat(sers, axis=1)
...: all_cols_df = ensure_object(df)
37.8 ms ± 419 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) I think the problem may be that Unless one of the other devs objects, I would really prefer this to be idiomatic from a pandas perspective. And just to be clear on what that actually means, list comprehensions over 2D data are NOT idiomatic when operations can be performed against a DataFrame instead. We can always optimize operations with the latter but are limited in regards to the former. |
There's not much to go on - the last commit shows how little changed:
I honestly don't see where it's coming from if not the useless concatenating and then splitting again (because, for interleaving |
@h-vetinari code maintainablinity is actually the most important property of changes |
looks ok to me; @WillAyd merge when satisfied |
Thanks @h-vetinari ! |
This is mainly a clean-up of internal methods for
str.cat
that I didn't want to touch within #20347.As a side benefit of changing the implementation, this also solves #22721. Finally, I've also added a better message for TypeErrors (closes #22722)closes #22721closes #22722git diff upstream/master -u -- "*.py" | flake8 --diff
Here's the ASV output (the original implementation of this PR (see first commit) that used more higher-level pandas-functions like
fillna
,drop_na
, etc. was up to three times slower, so I tweaked it some more, and actually believe that the last solution withinterleave_sep
is the most elegant anyway):There's a bunch of noise in there, but by and large, things don't look so bad IMO. Especially, when one excludes the not-so-common worst-case scenario of a very small but non-zero amount of NaNs (
na_frac=0.001
):