
Possible performance regression in 9a42cbe85461c28417a5130bc80b035044c5575a #26639

Closed
TomAugspurger opened this issue Jun 3, 2019 · 17 comments
Labels
Blocker Blocking issue or pull request for an upcoming release Performance Memory or execution speed performance Strings String extension data type and string data
Milestone

Comments

@TomAugspurger
Contributor

Possible regression in 9a42cbe

ASV: http://pandas.pydata.org/speed/pandas/index.html#strings.Methods.time_normalize?commits=9a42cbe85461c28417a5130bc80b035044c5575a

Several benchmarks are ~20% slower. I'm not sure whether that was discussed in the issue.

cc @h-vetinari

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Strings String extension data type and string data labels Jun 3, 2019
@TomAugspurger TomAugspurger added this to the 0.25.0 milestone Jun 3, 2019
@h-vetinari
Contributor

@TomAugspurger
It was not discussed, but I'm not too surprised. The wrapper doesn't add much, but there's a cost for running infer_dtype on every call.

I should add that #23167 came about mainly because @jreback was adamant (in several PRs I had concerning .str.cat) that the .str accessor should absolutely NOT work for bytes, and if it happens to, that's an accident (e.g. here, incl. the next couple of comments). I myself have a more "live and let live" approach, so I don't mind if .str works with bytes.

Actually, I had to degrade the performance of .str.cat in #22725 due to some PY2/bytes tests, which is why I started working on #23167 (to restore it after forbidding bytes). Ironically, now that PY2 is gone and I'm trying to add back the perf-improvement (#26605), it received quite a cold welcome.

Coming back to the perf hit of #23167: I'm fine with reverting, if that's the consensus. From my reading, the perf hit is almost exclusively the cost of inferring dtypes (to be able to exclude some where deemed necessary); the question is whether people are happy with that trade-off.

One other option might be to cache the inferred dtype somewhere, but that would likely have to be on the Series itself, since the StringMethods-object gets reconstructed on every call.
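To make the caching idea concrete, here is a minimal sketch (not pandas internals; the attribute name and helper are hypothetical) of storing the inferred dtype on the Series itself so repeated accessor construction can skip re-inference:

```python
import pandas as pd
from pandas.api.types import infer_dtype

_CACHE_ATTR = "_cached_str_inferred_dtype"  # hypothetical attribute name

def cached_infer_dtype(series):
    # Sketch of the caching idea: store the inferred dtype on the Series
    # itself so repeated accessor construction can skip re-inference.
    # Caveat: a real implementation would need to invalidate this cache
    # whenever the underlying values change.
    cached = getattr(series, _CACHE_ATTR, None)
    if cached is None:
        cached = infer_dtype(series, skipna=True)
        object.__setattr__(series, _CACHE_ATTR, cached)
    return cached

s = pd.Series(["a", "b"])
print(cached_infer_dtype(s))  # 'string' (inferred, then cached)
print(cached_infer_dtype(s))  # second call returns the cached value
```

The invalidation caveat is exactly why caching on the Series is awkward: any mutation of the values would silently leave a stale cached dtype behind.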

@TomAugspurger
Contributor Author

This may be a blocker. Verifying that the type checking is the cause of the slowdown now.

@TomAugspurger
Contributor Author

As a basic test, I simply commented out the @forbid_nonstring_types(['bytes']) on StringMethods.normalize. The timings are the same. So perhaps it's not the validation.

@h-vetinari
Contributor

h-vetinari commented Jun 25, 2019

@TomAugspurger
The type checking (e.g. in @forbid_nonstring_types) is not expensive; it's the type inference in ._validate that's expensive, and that gets called on every StringMethods.__init__.

I guess it would be possible to shift the inference into the wrapper, and call that wrapper more sparingly.
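A rough illustration of the cost gap being discussed (a sketch, not pandas internals; the array size and repetition count are arbitrary):

```python
import timeit

import numpy as np
from pandas.api.types import infer_dtype

arr = np.array(["x"] * 100_000, dtype=object)

# Value-by-value inference has to scan the whole array ...
t_infer = timeit.timeit(lambda: infer_dtype(arr, skipna=True), number=20)

# ... while a plain dtype check is a constant-time attribute lookup.
t_dtype = timeit.timeit(lambda: arr.dtype.kind == "O", number=20)

print(infer_dtype(arr, skipna=True))  # 'string'
print(t_infer > t_dtype)              # True: inference is far costlier
```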

@jreback
Contributor

jreback commented Jun 26, 2019

@h-vetinari I asked you to update the benchmarks in #26605.
I actually doubt that PR has any effect, but the benchmarks will tell.

@h-vetinari
Contributor

h-vetinari commented Jun 26, 2019

@jreback
The improvement in #26605 is only for .str.cat (not all str-methods), and only for the case that sep=None/sep=''. Running the ASV is on my todo-list.

@TomAugspurger
One other solution would be to remove the expensive lib.infer_dtype inference and only exclude bytes when an array explicitly has a bytes dtype. This would obviously let bytes hiding behind an object dtype slip through, but that could be a more acceptable trade-off considering the perf impact, while still achieving some dtype enforcement (even more so considering that we have to let lib.infer_dtype(...) == 'mixed' pass as well).
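A sketch of what that cheaper check could look like (the helper name is hypothetical, not pandas API):

```python
import numpy as np

def has_explicit_bytes_dtype(values) -> bool:
    # Hypothetical cheap check: reject only arrays whose dtype is a
    # fixed-width bytes type (kind 'S'); bytes objects hidden inside an
    # object-dtype array are deliberately let through.
    return np.asarray(values).dtype.kind == "S"

print(has_explicit_bytes_dtype(np.array([b"abc"])))                # True
print(has_explicit_bytes_dtype(np.array([b"abc"], dtype=object)))  # False
print(has_explicit_bytes_dtype(np.array(["abc"])))                 # False
```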

@jorisvandenbossche jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Jun 28, 2019
@jorisvandenbossche
Member

I didn't look at #23167 in detail, but if we want to remove the call to infer_dtype, does that mean basically reverting the full PR, or are there other parts of the PR to keep?

and only exclude bytes when an array explicitly has a bytes-dtype

I don't think this is possible? We only have object dtype (when a bytes dtype is passed to Series, we convert it to object dtype)

@jorisvandenbossche
Member

I don't think this is possible? We only have object dtype (when a bytes dtype is passed in to Series, we convert it to object dtype)

Actually, we seem to keep the bytes dtype, never realised we do that. That might be a bug actually:

In [5]: np.array([b'dsf']) 
Out[5]: array([b'dsf'], dtype='|S3')

In [6]: pd.Series(np.array([b'dsf']))
Out[6]: 
0    b'dsf'
dtype: bytes24

@jorisvandenbossche
Member

Although, Tom, you were saying that undoing the infer_dtype doesn't improve the performance?

@h-vetinari
Contributor

@jorisvandenbossche: I didn't look at #23167 in detail, but if we want to remove the call to infer_dtype, does that mean basically reverting the full PR, or are there other parts of the PR to keep?

There are IMO parts to keep, like #23551, but I guess I could also re-add that separately. In any case, a reversion will probably create merge conflicts.

@jorisvandenbossche: Actually, we seem to keep the bytes dtype, never realised we do that. That might be a bug actually:

If the behaviour has been there long enough, it's now a "feature"... (similar to the behaviour that .str works with bytes, which we're trying to get rid of here, I guess...)

@jorisvandenbossche: Although, Tom, you were saying that undoing the infer_dtype doesn't improve the performance?

As I mentioned above, he didn't undo the infer_dtype, but just the type-check.

@h-vetinari
Contributor

IMO, the most reasonable solution would be to remove the infer_dtype call but keep the type-checking based on dtype only. But I don't mind if a reversion ends up being the consensus.

@jorisvandenbossche
Member

I don't think we currently want to tie ourselves to the bytes dtype. It also seems "relatively" recent; e.g., in 0.18.0 it was still converted to object dtype (but I didn't check other versions)

@h-vetinari
Contributor

I'd be happy to help with solving this; I just have to add that I'll be away for the next two weeks. In case you urgently need a solution, I won't be able to help before then, but I can have a look afterwards.

@jreback jreback modified the milestones: 0.25.0, 1.0 Jul 17, 2019
@TomAugspurger
Contributor Author

Maybe not worth worrying about at this point? For StringDtype we skip this validation since it isn't necessary.
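For reference, a minimal illustration of why StringDtype can skip the check (assumes pandas >= 1.0, where the extension string dtype exists): the dtype itself guarantees string (or missing) values, so the accessor doesn't need per-value inference.

```python
import pandas as pd

# With the extension string dtype, the dtype itself guarantees string
# (or missing) values, so accessor validation can be a cheap dtype check
# instead of a scan of every element.
s = pd.Series(["a", "b", None], dtype="string")
print(s.str.upper())  # 'A', 'B', <NA>
```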

@jbrockmendel
Member

Maybe not worth worrying about at this point?

I think this is likely correct. Not to mention #30202 should improve the perf on infer_dtype

@groutr
Contributor

groutr commented Dec 23, 2019

@jbrockmendel #30202 improves performance only for numpy dtypes when skipna=True.

@TomAugspurger
Contributor Author

I think we'll close this, since StringDtype avoids it.
