PERF: tab completion with a large index #18587

Open
jreback opened this issue Dec 1, 2017 · 8 comments
Labels
Output-Formatting __repr__ of pandas objects, to_string Performance Memory or execution speed performance

Comments

@jreback
Contributor

jreback commented Dec 1, 2017

from #16326 (comment)

If you have a very large index, _dir_additions (for tab completion) actually takes quite a bit of time.

So what I would do is: if the index has, say, < 100 entries, use the current _dir_additions; otherwise return an empty list! (It's essentially too big to use tab completion for anyhow.) Can you make this change and add an asv for this? (Could be a separate PR as well.)
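A minimal sketch of the threshold idea above, using a stand-in class rather than the real pandas internals. The class name `CappedFrame`, the attribute `MAX_LABELS_FOR_COMPLETION`, and the way labels are stored are all assumptions made for illustration; only the "fewer than 100, otherwise nothing" rule comes from the comment.

```python
class CappedFrame:
    """Toy stand-in for a pandas object; NOT the real NDFrame."""

    # Hypothetical threshold, taken from the "say < 100" in the comment.
    MAX_LABELS_FOR_COMPLETION = 100

    def __init__(self, labels):
        self._labels = list(labels)

    def _dir_additions(self):
        # Too many labels: offer none of them for completion.
        if len(self._labels) >= self.MAX_LABELS_FOR_COMPLETION:
            return set()
        # Only string labels that are valid identifiers can be completed.
        return {c for c in self._labels if isinstance(c, str) and c.isidentifier()}

    def __dir__(self):
        # Regular attributes plus the label-based additions.
        return sorted(set(super().__dir__()) | self._dir_additions())


small = CappedFrame(["price", "volume", "not an identifier"])
large = CappedFrame(f"col_{i}" for i in range(1000))
```

With this sketch, `dir(small)` includes `price` and `volume`, while `dir(large)` contains none of the `col_*` labels.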

@jreback jreback added Output-Formatting __repr__ of pandas objects, to_string Performance Memory or execution speed performance labels Dec 1, 2017
@jreback jreback added this to the 0.21.1 milestone Dec 1, 2017
@jreback
Contributor Author

jreback commented Dec 1, 2017

@BibMartin
Contributor

One may have a very large index with few distinct values. I would suggest limiting the number of values returned rather than the size of the index. (It seems that the delay is due to the handling of the results rather than the computation of dir.)
Something like:

additions = set([c for c in self._info_axis.get_level_values(0).unique()[:100]
                 if isinstance(c, string_types) and isidentifier(c)])

Anyway, I think I can address this issue in #16326; the topics are quite related.
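The difference from a size check can be illustrated without pandas. The sketch below is a plain-Python analogue of the snippet above, not the pandas implementation: `capped_dir_additions` and its `limit` parameter are hypothetical names, and `dict.fromkeys` stands in for `.unique()` on the info axis.

```python
from itertools import islice


def capped_dir_additions(labels, limit=100):
    """Completion candidates drawn from the first `limit` distinct labels."""
    # dict.fromkeys drops duplicates while keeping first-seen order.
    distinct = dict.fromkeys(labels)
    # Cap first, then filter, mirroring the order used in the snippet above.
    first = islice(distinct, limit)
    return {c for c in first if isinstance(c, str) and c.isidentifier()}


# A long index with only three distinct values keeps all of them.
repeated = ["a", "b", "c"] * 1000
# A wide index with 500 distinct values is cut down to the first 100.
wide = [f"col_{i}" for i in range(500)]
```

Because the cap is applied before the identifier filter, the result can hold fewer than `limit` names when some of the first `limit` labels are not valid identifiers.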

@TomAugspurger
Contributor

TomAugspurger commented Dec 1, 2017

Do we know why _dir_additions is slow for large objects?

@jreback
Contributor Author

jreback commented Dec 2, 2017

you can use self._info_axis.unique(level=0) here as a generic way to do this.
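For reference, a small example of what `unique(level=0)` does on a flat index and on a MultiIndex. It assumes a pandas version in which `Index.unique` accepts the `level` argument; the sample labels are made up for illustration.

```python
import pandas as pd

# Flat index: level=0 is the only level, so this is an ordinary unique().
flat = pd.Index(["x", "y", "x"])
flat_unique = flat.unique(level=0)

# MultiIndex: level=0 selects the outermost level and deduplicates it,
# without spelling out get_level_values(0).unique().
multi = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]])
multi_unique = multi.unique(level=0)

print(list(flat_unique))
print(list(multi_unique))
```

The same call works for both index types, which is what makes it usable as a generic replacement for `get_level_values(0).unique()`.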

@jreback jreback modified the milestones: 0.21.1, Next Major Release Dec 2, 2017
@BibMartin
Contributor

@TomAugspurger

Do we know why _dir_additions is slow for large objects?

I don't know exactly, but the slowdown seems to come from the user interface: when I create a large Series (s = Series(index=tm.makeStringIndex(10000))) in a notebook or in an IPython console, dir(s) is fast (much less than 1 sec) while asking for tab completion is slow (several seconds).

@jreback

you can use self._info_axis.unique(level=0) here as a generic way to do this.

Yes thanks, that's an awesome new feature.

@TomAugspurger
Contributor

Was this fixed by #20834? Tab completion on the following seems quick

In [21]: s = Series(index=tm.makeStringIndex(10000))

In [22]: s.<tab>

@jamespreed

jamespreed commented Aug 2, 2019

I would like to add to this issue. My team often works with data sets that have hundreds of columns. The reduction in the number of columns available for tab-completion to 100 has been a hindrance. I am fine with capping the number for the sake of performance; it is just that the choice of 100 seems arbitrary.
Currently I work around this by editing the generic.py file in the pandas/core directory.

Suggestion:

Increase the cap on _dir_additions to 1000.

Analysis

I ran the following benchmarks on tab-completion timings using %timeit in IPython on two separate laptops. In both cases, the benchmarks were run with pandas 0.25.0, first with the install as-is, and again after modifying generic.py to remove the slice in the set comprehension at line 5199.
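The exact statement that was timed is not shown. As a rough way to get comparable numbers, the sketch below times plain `dir()` on empty DataFrames of increasing width with the standard `timeit` module. This is an assumption about the setup, not the original benchmark: `time_dir_by_width` and the `col_{i}` column names are hypothetical, and `dir()` leaves out whatever the IPython completer does on top.

```python
import timeit

import pandas as pd


def time_dir_by_width(widths, number=20):
    """Return {n_cols: mean seconds per dir() call} for empty frames."""
    results = {}
    for n_cols in widths:
        # Empty frame whose columns are all valid identifiers.
        df = pd.DataFrame(columns=[f"col_{i}" for i in range(n_cols)])
        total = timeit.timeit(lambda: dir(df), number=number)
        results[n_cols] = total / number
    return results


if __name__ == "__main__":
    for n_cols, seconds in time_dir_by_width([1, 10, 100, 1000, 10000]).items():
        print(f"{n_cols:>6} : {seconds * 1e6:.0f} µs per call")
```

On an unmodified install with the cap in place, the timings would be expected to stay roughly flat as the width grows; the absolute values depend on the machine and the pandas version.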

Laptop 1

  • Intel Core i7-9750H at 2.6 GHz
  • Windows 10, 1903

Pandas 0.25.0

n_cols : benchmark for tab-completion
     1 : 134 µs ± 2.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     3 : 143 µs ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     7 : 133 µs ± 893 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    10 : 132 µs ± 398 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    30 : 144 µs ± 9.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    70 : 133 µs ± 695 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   100 : 692 µs ± 865 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 684 µs ± 874 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 792 µs ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 681 µs ± 870 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 687 µs ± 879 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  7000 : 686 µs ± 875 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
 10000 : 761 µs ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 30000 : 698 µs ± 901 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
 70000 : 692 µs ± 881 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 : 679 µs ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
300000 : 961 µs ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas 0.25.0, modified generic.py

n_cols : benchmark for tab-completion
     1 : 139 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     3 : 132 µs ± 863 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
     7 : 153 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    10 : 690 µs ± 847 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 745 µs ± 935 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 623 µs ± 775 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 676 µs ± 863 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 1.61 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 2.08 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 2.59 ms ± 3.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 30.3 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  7000 : 57.6 ms ± 750 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
 10000 : 81 ms ± 538 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
 30000 : 244 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 70000 : 584 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 : 845 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
300000 : 2.54 s ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Laptop 2

  • Intel Core i3-3110M at 2.4 GHz
  • Windows 10, 1903

Pandas 0.25.0

n_cols : benchmark for tab-completion
     1 : 3.6 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
     3 : 1.25 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     7 : 1.29 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    10 : 1.27 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 1.42 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 1.66 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 1.83 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 1.94 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 1.92 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 1.89 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  3000 : 1.93 ms ± 2.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  7000 : 2.35 ms ± 3.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 10000 : 1.99 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 30000 : 1.82 ms ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
 70000 : 1.83 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000 : 2.03 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
300000 : 1.86 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas 0.25.0, modified generic.py

n_cols : benchmark for tab-completion
     1 : 1.22 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     3 : 1.21 ms ± 1.46 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
     7 : 1.24 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    10 : 1.24 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    30 : 1.37 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    70 : 1.65 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   100 : 1.83 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   300 : 3.07 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
   700 : 5.58 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
  1000 : 26.9 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  3000 : 75.7 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  7000 : 206 ms ± 48.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 10000 : 264 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 30000 : 728 ms ± 6.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 70000 : 1.72 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
100000 : 2.5 s ± 26.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
300000 : 7.43 s ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Even on my 10-year-old laptop, the time for tab-completion with 1000 columns is under 30 ms. Still very responsive.

@jamespreed

Additionally, it may be worth issuing a warning to the user when dir is called under the condition that _dir_additions is dropping attributes.

It goes against the philosophy of Python to not let the user know, in my opinion.
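One way such a warning could look, sketched on a stand-in class with the standard `warnings` module. Nothing here is pandas code: `WarningFrame`, `CAP`, and the message text are hypothetical, and this sketch filters for valid identifiers before applying the cap.

```python
import warnings


class WarningFrame:
    """Toy stand-in for a pandas object; NOT the real NDFrame."""

    CAP = 100  # hypothetical cap, mirroring the value discussed in this thread

    def __init__(self, labels):
        self._labels = list(labels)

    def _dir_additions(self):
        # Distinct string labels that are valid identifiers, in first-seen order.
        candidates = [
            c for c in dict.fromkeys(self._labels)
            if isinstance(c, str) and c.isidentifier()
        ]
        if len(candidates) > self.CAP:
            # Tell the user that some labels are being left out.
            warnings.warn(
                f"{len(candidates) - self.CAP} labels omitted from dir(); "
                f"only the first {self.CAP} are listed",
                UserWarning,
            )
        return set(candidates[: self.CAP])

    def __dir__(self):
        return sorted(set(super().__dir__()) | self._dir_additions())
```

A trade-off to weigh: completers call `dir()` often, so a warning raised on every call could get noisy unless Python's warning filters are left to deduplicate it.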
