PERF: tab completion with a large index #18587

jreback · 2017-12-01T11:37:14Z

If you have a very large index, _dir_additions (for tab completion) actually takes quite a bit of time.

What I would do: if the index has, say, fewer than 100 entries, use the current _dir_additions; otherwise return an empty list (it's essentially too big to use tab completion for anyhow). Can you make this change and add an asv benchmark for it? (It could be a separate PR as well.)
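A minimal sketch of the size-threshold idea above, written as a standalone function rather than as the actual pandas method. The name `dir_additions_by_size` and the `max_size` parameter are hypothetical, and `labels` stands in for the info axis (for example, a list of column names).

```python
def dir_additions_by_size(labels, max_size=100):
    """Return identifier-like labels, or nothing if there are too many.

    Hypothetical stand-in for the proposal above; not the pandas implementation.
    """
    labels = list(labels)
    if len(labels) >= max_size:
        # Too many labels for tab completion to be useful: offer none.
        return set()
    # Only strings that are valid Python identifiers can be completed as attributes.
    return {c for c in labels if isinstance(c, str) and c.isidentifier()}


small = ["price", "volume", "not valid", 42]
large = [f"col_{i}" for i in range(10_000)]

print(sorted(dir_additions_by_size(small)))  # ['price', 'volume']
print(dir_additions_by_size(large))          # set()
```

The threshold is on the length of the axis, so an axis with many repeated labels is also excluded.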


jreback · 2017-12-01T11:37:23Z

cc @jorisvandenbossche @TomAugspurger

BibMartin · 2017-12-01T13:29:55Z

One may have a very large index with few distinct values. I would suggest limiting the number of values returned rather than the size of the index. (It seems that the delay is due to the handling of the results rather than to the computation of dir.)
Something like:

additions = set([c for c in self._info_axis.get_level_values(0).unique()[:100]
                 if isinstance(c, string_types) and isidentifier(c)])

Anyway, I think I can address this issue in #16326; the topics are quite related.
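For comparison, a minimal sketch of the cap-on-distinct-values idea, again as a standalone function and not the pandas code. The name `dir_additions_capped` and the `limit` parameter are hypothetical; `labels` plays the role of level 0 of the info axis. As in the snippet above, the cap is applied before the identifier filter.

```python
from itertools import islice


def dir_additions_capped(labels, limit=100):
    """Offer at most the first `limit` distinct labels for completion.

    Hypothetical stand-in; not the pandas implementation.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    distinct = dict.fromkeys(labels)
    first = islice(distinct, limit)
    return {c for c in first if isinstance(c, str) and c.isidentifier()}


# A long axis with only three distinct labels still completes fully.
repeated = ["a", "b", "c"] * 100_000
print(sorted(dir_additions_capped(repeated)))  # ['a', 'b', 'c']

# A wide axis is truncated to the first 100 distinct labels.
wide = [f"col_{i}" for i in range(5_000)]
print(len(dir_additions_capped(wide)))  # 100
```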

TomAugspurger · 2017-12-01T19:55:32Z

Do we know why _dir_additions is slow for large objects?

jreback · 2017-12-02T15:52:00Z

you can use self._info_axis.unique(level=0) here as a generic way to do this.


BibMartin · 2017-12-05T17:22:22Z

@TomAugspurger

Do we know why _dir_additions is slow for large objects?

I don't know exactly, but the slowdown seems to come from the user interface (the front end): when I create a large Series (s = Series(index=tm.makeStringIndex(10000))) in a notebook or in the IPython console, dir(s) is fast (much less than 1 second), while tab completion is slow (several seconds).

@jreback

you can use self._info_axis.unique(level=0) here as a generic way to do this.

Yes thanks, that's an awesome new feature.


TomAugspurger · 2019-04-07T19:10:24Z

Was this fixed by #20834? Tab completion on the following seems quick:

In [21]: s = Series(index=tm.makeStringIndex(10000))

In [22]: s.<tab>

jamespreed · 2019-08-02T19:20:56Z

I would like to add to this issue. My team often works with data sets that have hundreds of columns. The reduction in the number of columns available for tab completion to 100 has been a hindrance. I am fine with capping the number for the sake of performance, but the choice of 100 seems arbitrary.
Currently I work around this by editing the generics.py file in the pandas/core directory.

Suggestion:

Increase the cap on _dir_additions to 1000.

Analysis

I ran the following benchmarks of tab-completion timings using %timeit in IPython on two separate laptops. In both cases, the benchmarks were run with pandas 0.25.0, first with the install as-is, and again after modifying generics.py to remove the slice in the set comprehension at line 5199.

Laptop 1

Intel Core i7-9750H at 2.6 GHz
Windows 10, 1903
Pandas 0.25.0

n_cols : benchmark for tab-completion
     1 : 134 µs ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     3 : 143 µs ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     7 : 133 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    10 : 132 µs ± 398 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    30 : 144 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    70 : 133 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   100 : 692 µs ± 865 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 684 µs ± 874 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 792 µs ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 681 µs ± 870 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 687 µs ± 879 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  7000 : 686 µs ± 875 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
 10000 : 761 µs ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 30000 : 698 µs ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
 70000 : 692 µs ± 881 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 : 679 µs ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
300000 : 961 µs ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas 0.25.0, modified generics.py

n_cols : benchmark for tab-completion
     1 : 139 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     3 : 132 µs ± 863 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     7 : 153 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    10 : 690 µs ± 847 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 745 µs ± 935 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 623 µs ± 775 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 676 µs ± 863 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 1.61 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 2.08 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 2.59 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 30.3 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  7000 : 57.6 ms ± 750 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
 10000 : 81 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
 30000 : 244 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 70000 : 584 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 : 845 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
300000 : 2.54 s ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Laptop 2

Intel Core i3-3110M at 2.4 GHz
Windows 10, 1903
Pandas 0.25.0

n_cols : benchmark for tab-completion
     1 : 3.6 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
     3 : 1.25 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     7 : 1.29 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    10 : 1.27 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 1.42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 1.66 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 1.83 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 1.94 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 1.92 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 1.89 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 1.93 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  7000 : 2.35 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 10000 : 1.99 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 30000 : 1.82 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 70000 : 1.83 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 : 2.03 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
300000 : 1.86 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas 0.25.0, modified generics.py

n_cols : benchmark for tab-completion
     1 : 1.22 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     3 : 1.21 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     7 : 1.24 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    10 : 1.24 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 1.37 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 1.65 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 1.83 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 3.07 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 5.58 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 26.9 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  3000 : 75.7 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  7000 : 206 ms ± 48.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 10000 : 264 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 30000 : 728 ms ± 6.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 70000 : 1.72 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 : 2.5 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
300000 : 7.43 s ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Even on my 10-year-old laptop, tab completion with 1000 columns takes under 30 ms, which is still very responsive.
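As a rough cross-check of the figures above, the per-column cost can be computed from the "modified generics.py" timings at 10000 columns and up, where the run-to-run spread is small. This is only arithmetic on the reported numbers; the helper name `per_column_us` is made up for illustration, and treating the cost as linear in the number of columns is an approximation.

```python
# Mean timings copied from the "modified generics.py" tables above (seconds).
laptop_1 = {10_000: 81e-3, 30_000: 244e-3, 70_000: 584e-3,
            100_000: 845e-3, 300_000: 2.54}
laptop_2 = {10_000: 264e-3, 30_000: 728e-3, 70_000: 1.72,
            100_000: 2.5, 300_000: 7.43}


def per_column_us(timings):
    """Microseconds per column for each measured width."""
    return {n: t / n * 1e6 for n, t in timings.items()}


for name, timings in [("Laptop 1", laptop_1), ("Laptop 2", laptop_2)]:
    costs = per_column_us(timings)
    print(name, {n: round(c, 1) for n, c in costs.items()})
```

The cost comes out at roughly 8 to 8.5 µs per column on Laptop 1 and roughly 24 to 26.5 µs per column on Laptop 2. At 1000 columns that is about 25 ms on Laptop 2, in line with the measured 26.9 ms.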

jamespreed · 2019-08-02T19:22:33Z

Additionally, it may be worth issuing a warning to the user when dir is called under the condition that _dir_additions is dropping attributes.

It goes against the philosophy of Python to not let the user know, in my opinion.
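A minimal sketch of what such a warning could look like, using the standard-library `warnings` module. The function name `dir_additions_with_warning`, the `limit` parameter, and the message text are all hypothetical; this is not how pandas behaves.

```python
import warnings
from itertools import islice


def dir_additions_with_warning(labels, limit=100):
    """Cap completion candidates and warn when labels are left out.

    Hypothetical stand-in for the suggestion above; not the pandas implementation.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    distinct = dict.fromkeys(labels)
    if len(distinct) > limit:
        warnings.warn(
            f"Only the first {limit} of {len(distinct)} labels are offered "
            "for tab completion.",
            UserWarning,
            stacklevel=2,
        )
    first = islice(distinct, limit)
    return {c for c in first if isinstance(c, str) and c.isidentifier()}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    names = dir_additions_with_warning([f"col_{i}" for i in range(250)])

print(len(names))   # 100
print(len(caught))  # 1
```

Whether a warning on each completion request would be too noisy in an interactive session is a separate design question.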

jreback added the Output-Formatting and Performance labels Dec 1, 2017

jreback added this to the 0.21.1 milestone Dec 1, 2017

jreback mentioned this issue Dec 1, 2017

ENH: _dir_additions returns also the first level of a MultiIndex #16326

Merged

3 tasks

jreback modified the milestones: 0.21.1, Next Major Release Dec 2, 2017

BibMartin pushed a commit to BibMartin/pandas that referenced this issue Dec 5, 2017

DOC: Update whatsnew about NDFrame._dir_additions enhancements (pandas-dev#16326, pandas-dev#18587)

33ace7b

BibMartin pushed a commit to BibMartin/pandas that referenced this issue Dec 6, 2017

DOC: Update whatsnew about NDFrame._dir_additions enhancements (pandas-dev#16326, pandas-dev#18587)

308ea2b

BibMartin pushed a commit to BibMartin/pandas that referenced this issue Dec 6, 2017

DOC: Update whatsnew about NDFrame._dir_additions enhancements (pandas-dev#16326, pandas-dev#18587)

46eb051

BibMartin pushed a commit to BibMartin/pandas that referenced this issue Dec 8, 2017

DOC: Update whatsnew about NDFrame._dir_additions enhancements (pandas-dev#16326, pandas-dev#18587)

1724c72

BibMartin pushed a commit to BibMartin/pandas that referenced this issue Dec 8, 2017

DOC: Update whatsnew about NDFrame._dir_additions enhancements (pandas-dev#16326, pandas-dev#18587)

edb184a

bthyreau mentioned this issue Jul 10, 2020

ENH: raise limit for completion number of columns and warn beyond #35207

Closed

arw2019 mentioned this issue Nov 22, 2020

ENH: raise column number limit for user-completion or add warning #37996

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
