
Possible performance regression in 9a42cbe85461c28417a5130bc80b035044c5575a #26639

Closed
TomAugspurger opened this issue Jun 3, 2019 · 17 comments
Labels
Blocker Blocking issue or pull request for an upcoming release Performance Memory or execution speed performance Strings String extension data type and string data
Milestone

Comments

@TomAugspurger
Contributor

Possible regression in 9a42cbe

ASV: http://pandas.pydata.org/speed/pandas/index.html#strings.Methods.time_normalize?commits=9a42cbe85461c28417a5130bc80b035044c5575a

Several benchmarks are ~20% slower. I'm not sure whether that was discussed in the issue.

cc @h-vetinari

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Strings String extension data type and string data labels Jun 3, 2019
@TomAugspurger TomAugspurger added this to the 0.25.0 milestone Jun 3, 2019
@h-vetinari
Contributor

@TomAugspurger
It was not discussed, but I'm not too surprised. The wrapper doesn't add much, but there's a cost for running infer_dtype on every call.

I should add that #23167 came about mainly because @jreback was adamant (in several PRs I had concerning .str.cat) that the .str accessor should absolutely NOT work for bytes, and if it happens to, that's an accident (e.g. here, incl. the next couple of comments). I myself have a more "live and let live" approach, so I don't mind if .str works with bytes.

Actually, I had to degrade the performance of .str.cat in #22725 due to some PY2/bytes tests, which is why I started working on #23167 (to restore it after forbidding bytes). Ironically, now that PY2 is gone and I'm trying to add back the perf-improvement (#26605), it received quite a cold welcome.

Coming back to the perf hit of #23167: I'm fine with reverting, if that's the consensus. From my reading, the perf hit is almost exclusively the cost of inferring dtypes (to be able to exclude some where deemed necessary); the question is whether people are happy with that trade-off.

One other option might be to cache the inferred dtype somewhere, but that would likely have to be on the Series itself, since the StringMethods-object gets reconstructed on every call.
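To make the caching idea concrete, here is a minimal sketch (not pandas internals; the attribute name and helper are hypothetical) of storing the inferred dtype on the Series itself so repeated accessor construction can skip re-inference:

```python
import pandas as pd
from pandas.api.types import infer_dtype

_CACHE_ATTR = "_cached_str_inferred_dtype"  # hypothetical attribute name

def cached_infer_dtype(series):
    # Sketch of the caching idea: store the inferred dtype on the Series
    # itself so repeated accessor construction can skip re-inference.
    # Caveat: a real implementation would need to invalidate this cache
    # whenever the underlying values change.
    cached = getattr(series, _CACHE_ATTR, None)
    if cached is None:
        cached = infer_dtype(series, skipna=True)
        object.__setattr__(series, _CACHE_ATTR, cached)
    return cached

s = pd.Series(["a", "b"])
print(cached_infer_dtype(s))  # 'string' (inferred, then cached)
print(cached_infer_dtype(s))  # second call returns the cached value
```

The invalidation caveat is exactly why caching on the Series is awkward: any mutation of the values would silently leave a stale cached dtype behind.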

@TomAugspurger
Contributor Author

This may be a blocker. Verifying that the type checking is the cause of the slowdown now.

@TomAugspurger
Contributor Author

As a basic test, I simply commented out the @forbid_nonstring_types(['bytes']) on StringMethods.normalize. The timings are the same. So perhaps it's not the validation.

@h-vetinari
Contributor

h-vetinari commented Jun 25, 2019

@TomAugspurger
The type checking (e.g. in @forbid_nonstring_types) is not expensive; it's the type inference in ._validate that's expensive, and that gets called on every StringMethods.__init__.

I guess it would be possible to shift the inference into the wrapper, and call that wrapper more sparingly.
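A rough illustration of the cost gap being discussed (a sketch, not pandas internals; the array size and repetition count are arbitrary):

```python
import timeit

import numpy as np
from pandas.api.types import infer_dtype

arr = np.array(["x"] * 100_000, dtype=object)

# Value-by-value inference has to scan the whole array ...
t_infer = timeit.timeit(lambda: infer_dtype(arr, skipna=True), number=20)

# ... while a plain dtype check is a constant-time attribute lookup.
t_dtype = timeit.timeit(lambda: arr.dtype.kind == "O", number=20)

print(infer_dtype(arr, skipna=True))  # 'string'
print(t_infer > t_dtype)              # True: inference is far costlier
```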

@jreback
Contributor

jreback commented Jun 26, 2019

@h-vetinari I asked you to update the benchmarks in #26605.
I actually doubt that PR has any effect, but the benchmarks will tell.

@h-vetinari
Contributor

h-vetinari commented Jun 26, 2019

@jreback
The improvement in #26605 is only for .str.cat (not all str-methods), and only for the case that sep=None/sep=''. Running the ASV is on my todo-list.

@TomAugspurger
One other solution would be to remove the expensive lib.infer_dtype inference and only exclude bytes when an array explicitly has a bytes dtype. This would obviously let bytes hiding behind an object dtype slip through, but that could be a more acceptable trade-off considering the perf impact, while still achieving some dtype enforcement (even more so considering that we have to let lib.infer_dtype(...) == 'mixed' pass as well).
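A sketch of what that cheaper check could look like (the helper name is hypothetical, not pandas API):

```python
import numpy as np

def has_explicit_bytes_dtype(values) -> bool:
    # Hypothetical cheap check: reject only arrays whose dtype is a
    # fixed-width bytes type (kind 'S'); bytes objects hidden inside an
    # object-dtype array are deliberately let through.
    return np.asarray(values).dtype.kind == "S"

print(has_explicit_bytes_dtype(np.array([b"abc"])))                # True
print(has_explicit_bytes_dtype(np.array([b"abc"], dtype=object)))  # False
print(has_explicit_bytes_dtype(np.array(["abc"])))                 # False
```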

@jorisvandenbossche jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Jun 28, 2019
@jorisvandenbossche
Member

I didn't look at #23167 in detail, but if we want to remove the call to infer_dtype, does that mean basically reverting the full PR, or are there other parts of the PR to keep?

and only exclude bytes when an array explicitly has a bytes-dtype

I don't think this is possible? We only have object dtype (when a bytes dtype is passed to Series, we convert it to object dtype)

@jorisvandenbossche
Member

I don't think this is possible? We only have object dtype (when a bytes dtype is passed in to Series, we convert it to object dtype)

Actually, we seem to keep the bytes dtype, never realised we do that. That might be a bug actually:

In [5]: np.array([b'dsf']) 
Out[5]: array([b'dsf'], dtype='|S3')

In [6]: pd.Series(np.array([b'dsf']))
Out[6]: 
0    b'dsf'
dtype: bytes24

@jorisvandenbossche
Member

Although, Tom, you were saying that undoing the infer_dtype doesn't improve the performance?

@h-vetinari
Contributor

@jorisvandenbossche: I didn't look at #23167 in detail, but if we want to remove the call to infer_dtype, does that mean basically reverting the full PR, or are there other parts of the PR to keep?

There are IMO parts to keep, like #23551, but I guess I could also re-add that separately. In any case, a reversion will probably create merge conflicts.

@jorisvandenbossche: Actually, we seem to keep the bytes dtype, never realised we do that. That might be a bug actually:

If the behaviour has been there long enough, it's now a "feature"... (similar to the behaviour that .str works with bytes, which we're trying to get rid of here, I guess...)

@jorisvandenbossche: Although, Tom, you were saying that undoing the infer_dtype doesn't improve the performance?

As I mentioned above, he didn't undo the infer_dtype, but just the type-check.

@h-vetinari
Contributor

IMO, the most reasonable solution would be to remove the infer_dtype call but keep the type-checking based on dtype only. But I don't mind if a reversion ends up being the consensus.

@jorisvandenbossche
Member

I don't think we currently want to tie ourselves to the bytes dtype. It also seems "relatively" recent; e.g., in 0.18.0 it was still converted to object dtype (but I didn't check other versions)

@h-vetinari
Contributor

I'd be happy to help with solving this; I just have to add that I'll be away for the next two weeks. In case you urgently need a solution, I won't be able to help before then, but I can have a look afterwards.

@jreback jreback modified the milestones: 0.25.0, 1.0 Jul 17, 2019
@TomAugspurger
Contributor Author

Maybe not worth worrying about at this point? For StringDtype we skip this validation since it isn't necessary.
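For reference, a minimal illustration of why StringDtype can skip the check (assumes pandas >= 1.0, where the extension string dtype exists): the dtype itself guarantees string (or missing) values, so the accessor doesn't need per-value inference.

```python
import pandas as pd

# With the extension string dtype, the dtype itself guarantees string
# (or missing) values, so accessor validation can be a cheap dtype check
# instead of a scan of every element.
s = pd.Series(["a", "b", None], dtype="string")
print(s.str.upper())  # 'A', 'B', <NA>
```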

@jbrockmendel
Member

Maybe not worth worrying about at this point?

I think this is likely correct. Not to mention #30202 should improve the perf on infer_dtype

@groutr
Contributor

groutr commented Dec 23, 2019

@jbrockmendel #30202 improves performance only for numpy dtypes when skipna=True.

@TomAugspurger
Contributor Author

I think we'll close this, since StringDtype avoids it.
