
REF: Special case NumericIndex._append_same_dtype() #17307

Merged
merged 2 commits into from
Oct 28, 2017

Conversation

toobaz
Member

@toobaz toobaz commented Aug 22, 2017

  • tests passed
  • passes `git diff master -u -- "*.py" | flake8 --diff`

Simple patch which results in a modest speedup (~8%) of the following:

In [2]: idxes = [pd.Int64Index(range(i)) for i in range(1, 10)]*200
In [3]: %timeit idxes[0].append(idxes[1:])

I don't know whether this is worth the extra code... just inquiring so that I can finalize #16236.
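The fast path being special-cased can be sketched roughly as follows (a hypothetical simplification, not the actual patch, which modifies `NumericIndex._concat_same_dtype` in pandas/core/indexes/numeric.py): when all inputs share one numeric dtype, concatenate the underlying ndarrays directly instead of routing through object dtype.

```python
import numpy as np
import pandas as pd

def concat_same_dtype_sketch(indexes, name=None):
    # All inputs share the same numeric dtype, so we can concatenate the
    # underlying ndarrays directly rather than converting to object dtype.
    arrays = [idx.values for idx in indexes]
    return pd.Index(np.concatenate(arrays), name=name)

# Mirrors the benchmark above (Int64Index is spelled via dtype= on newer pandas).
idxes = [pd.Index(range(i), dtype="int64") for i in range(1, 10)] * 200
result = concat_same_dtype_sketch(idxes)
```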

@toobaz toobaz mentioned this pull request Aug 22, 2017
2 tasks
@toobaz toobaz force-pushed the numericindex_append branch 2 times, most recently from 42acf07 to b6c0f19 Compare August 22, 2017 09:16
@codecov

codecov bot commented Aug 22, 2017

Codecov Report

Merging #17307 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #17307      +/-   ##
==========================================
- Coverage   91.03%   91.01%   -0.02%     
==========================================
  Files         162      162              
  Lines       49567    49572       +5     
==========================================
- Hits        45121    45117       -4     
- Misses       4446     4455       +9
Flag Coverage Δ
#multiple 88.79% <100%> (ø) ⬆️
#single 40.25% <100%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/numeric.py 97.22% <100%> (+0.03%) ⬆️
pandas/core/dtypes/concat.py 98.29% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f00159...b6c0f19.

@codecov

codecov bot commented Aug 22, 2017

Codecov Report

Merging #17307 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #17307      +/-   ##
==========================================
- Coverage   91.24%   91.23%   -0.02%     
==========================================
  Files         163      163              
  Lines       50168    50176       +8     
==========================================
  Hits        45777    45777              
- Misses       4391     4399       +8
Flag Coverage Δ
#multiple 89.04% <100%> (ø) ⬆️
#single 40.28% <80%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/numeric.py 97.27% <100%> (+0.03%) ⬆️
pandas/core/dtypes/concat.py 99.14% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.5% <0%> (+0.09%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4489389...31c8a52.

@jreback
Contributor

jreback commented Aug 22, 2017

looks fine
can u add some asv for concat of Index
ones that hit your various paths would be good
IOW concat RangeIndex, IntIndex and Index cases would be great

@toobaz
Member Author

toobaz commented Aug 22, 2017

can u add some asv for concat of Index

pd.concat() doesn't accept Index inputs, so I guess you mean Index.append(): but then I think my previous PR already addressed all the cases you mention. I'm pretty sure at least it touches the code paths I added in #16213, in #16236 and in this PR.

@jorisvandenbossche
Member

@toobaz indeed append, but question was about asv benchmark, not tests

@toobaz
Member Author

toobaz commented Aug 22, 2017

@toobaz indeed append, but question was about asv benchmark, not tests

Aha, OK, I had no idea what "asv" meant. Looking into it.

@jreback jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 22, 2017
@toobaz toobaz force-pushed the numericindex_append branch from 47fa09f to 1b00189 Compare August 22, 2017 18:28
@toobaz
Member Author

toobaz commented Aug 23, 2017

I guess the benchmarking part is ready... but I wasn't able to test it (by the way, can you confirm that the content of this page is obsolete?). Is it automatically run anywhere?

@TomAugspurger
Contributor

TomAugspurger commented Aug 23, 2017

@toobaz yeah, that page is out of date. New docs are here. I'm tempted to just remove that wiki page. Any objections?

Results are run nightly and posted to http://pandas.pydata.org/speed/pandas/ (still setting this up)

@jorisvandenbossche
Member

I removed the content of the page and put a link to the contributing docs for now.

Member

@jorisvandenbossche jorisvandenbossche left a comment


small comment about the location of the benchmarks

from .pandas_vb_common import *


class Int64Indexing(object):
Member

Can you put this one in index_object.py? And maybe call the class IndexOps or something (not 'Indexing', as append has nothing to do with indexing)

def setup(self):
idx = Index(range(10))
self.ridx = [idx] * 10
self.iidx = [idx.astye(int)] * 10
Contributor

astype

class Int64Indexing(object):
goal_time = 0.2

def setup(self):
Contributor

make this bigger like 1000

Member

@toobaz to see what size is best (without being able to run asv), just do the timing yourself with %timeit, and set the size so that it takes something in the range of a few milliseconds to few tens of milliseconds
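The sizing advice above can be followed with a quick standalone timing loop (a sketch using the stdlib `timeit`; the choice of N = 1000 here is illustrative):

```python
import timeit
import pandas as pd

# Pick N so that one append of N small indexes lands in the range of a
# few milliseconds to a few tens of milliseconds, as suggested.
N = 1000
idx = pd.Index(range(10))
iidx = [idx.astype("int64")] * N

# Average over a handful of runs rather than relying on a single call.
elapsed = timeit.timeit(lambda: iidx[0].append(iidx[1:]), number=10) / 10
print(f"append of {N} indexes: {elapsed * 1e3:.2f} ms")
```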

@toobaz toobaz force-pushed the numericindex_append branch 2 times, most recently from bcc85e8 to 3827072 Compare August 23, 2017 16:06
@toobaz
Copy link
Member Author

toobaz commented Aug 23, 2017

OK, the benchmarks now take ~5 ms each.

idx = Index(range(10))
self.ridx = [idx] * N
self.iidx = [idx.astype(int)] * N
self.oidx = [idx.astype(str)] * N
Contributor

can you post these results (commit before your recent fix of range index concat) compared to this commit

@toobaz toobaz force-pushed the numericindex_append branch from 3827072 to 6324b86 Compare August 24, 2017 07:01
@pep8speaks

pep8speaks commented Aug 24, 2017

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 28, 2017 at 06:48 Hours UTC

@toobaz
Member Author

toobaz commented Aug 24, 2017

can you post these results (commit before your recent fix of range index concat) compared to this commit

You mean here?

(notice I changed the benchmarks so the RangeIndex special-casing is actually tested - indexes are consecutive)

Before special-casing RangeIndex.append() (with 94ef7b6 ):

In [19]: %timeit ridx[0].append(ridx[1:])
100 loops, best of 3: 10.3 ms per loop

In [20]: %timeit iidx[0].append(iidx[1:])
100 loops, best of 3: 9.5 ms per loop

In [21]: %timeit oidx[0].append(oidx[1:])
100 loops, best of 3: 10.6 ms per loop

Before special-casing NumericIndex._concat_same_dtype() (with 2f00159 ):

In [6]: %timeit ridx[0].append(ridx[1:])
100 loops, best of 3: 6.15 ms per loop

In [7]: %timeit iidx[0].append(iidx[1:])
100 loops, best of 3: 10.9 ms per loop

In [8]: %timeit oidx[0].append(oidx[1:])
100 loops, best of 3: 11.7 ms per loop

Now (with 6324b86 ):

In [6]: %timeit ridx[0].append(ridx[1:])
100 loops, best of 3: 6.08 ms per loop

In [7]: %timeit iidx[0].append(iidx[1:])
100 loops, best of 3: 9.71 ms per loop

In [8]: %timeit oidx[0].append(oidx[1:])
100 loops, best of 3: 11.7 ms per loop

@toobaz toobaz force-pushed the numericindex_append branch from 6324b86 to b941cce Compare August 24, 2017 09:47
goal_time = 0.2

def setup(self):
N = 1000
Contributor

make this 10000
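Putting the review comments together, the benchmark class presumably ends up looking something like this (a hypothetical reconstruction, not the exact committed code): placed in index_object.py, named IndexOps, with N = 10000, timing the public Index.append() API, and using consecutive ranges so the RangeIndex special case is actually exercised.

```python
import pandas as pd

class IndexOps(object):
    goal_time = 0.2

    def setup(self):
        N = 10000
        # Consecutive ranges, so RangeIndex.append() can fuse them
        # into a single RangeIndex instead of falling back.
        self.ridx = [pd.RangeIndex(i * 10, (i + 1) * 10) for i in range(N)]
        self.iidx = [idx.astype("int64") for idx in self.ridx]
        self.oidx = [idx.astype(str) for idx in self.ridx]

    def time_concat_range(self):
        self.ridx[0].append(self.ridx[1:])

    def time_concat_int(self):
        self.iidx[0].append(self.iidx[1:])

    def time_concat_obj(self):
        self.oidx[0].append(self.oidx[1:])
```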

@jreback
Contributor

jreback commented Aug 24, 2017

@toobaz I wanted to see the asv results for these benchmarks:
`asv continuous master thisbranch --bench concat` (you might want to name these index_concat actually)

and I don't care about the internal benchmarks (a `_*` function), rather the public API `.append()`

@toobaz toobaz force-pushed the numericindex_append branch from b941cce to c45d2a5 Compare August 24, 2017 10:28
@toobaz
Member Author

toobaz commented Aug 24, 2017

@toobaz I wanted to see the asv results for these benchmarks:
`asv continuous master thisbranch --bench concat` (you might want to name these index_concat actually)

OK, when I find the time to set up asv.

and I don't care about the internal benchmarks (a _*) function, rather the public API .append()

Not following you here... indeed I'm testing only public methods.

@jreback
Contributor

jreback commented Aug 25, 2017

asv_bench/benchmarks/index_concat.py is included but no changes.

@jorisvandenbossche
Member

asv_bench/benchmarks/index_concat.py is included but no changes.

that's a leftover that should be removed, but the actual benchmarks are in the other file and time append

@toobaz toobaz force-pushed the numericindex_append branch from c45d2a5 to fb05c39 Compare August 25, 2017 22:12
@toobaz
Member Author

toobaz commented Aug 25, 2017

asv_bench/benchmarks/index_concat.py is included but no changes.

oops fixed, sorry

@toobaz toobaz force-pushed the numericindex_append branch 4 times, most recently from 11b3172 to 9f69db2 Compare August 26, 2017 10:37
@toobaz
Member Author

toobaz commented Aug 26, 2017

@jreback @jorisvandenbossche
Does anyone have a hint about what's happening in pandas.tests.indexes.test_category.TestCategoricalIndex.test_reindexing? It fails randomly on Travis (on other branches too) and I really can't understand what could cause the error.

@toobaz toobaz force-pushed the numericindex_append branch from 9f69db2 to 795305b Compare August 26, 2017 15:12
@jreback
Contributor

jreback commented Aug 26, 2017

#17323

@toobaz toobaz force-pushed the numericindex_append branch from 795305b to 3123fcf Compare October 24, 2017 11:27
@toobaz
Member Author

toobaz commented Oct 24, 2017

Rebased and ran asv:

· Running 16 total benchmarks (2 commits * 1 environments * 8 benchmarks)
[  0.00%] · For pandas commit hash 3123fcff:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.......................................................
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  6.25%] ··· Running categoricals.Categoricals.time_concat                                                                                                                          4.67±0.07ms
[ 12.50%] ··· Running index_object.IndexOps.time_concat_int                                                                                                                           116±0.01ms
[ 18.75%] ··· Running index_object.IndexOps.time_concat_obj                                                                                                                            148±0.6ms
[ 25.00%] ··· Running index_object.IndexOps.time_concat_range                                                                                                                        67.2±0.06ms
[ 31.25%] ··· Running join_merge.Concat.time_concat_empty_frames1                                                                                                                        207±1μs
[ 37.50%] ··· Running join_merge.Concat.time_concat_empty_frames2                                                                                                                        211±2μs
[ 43.75%] ··· Running join_merge.Concat.time_concat_series_axis1                                                                                                                         111±1ms
[ 50.00%] ··· Running join_merge.Concat.time_concat_small_frames                                                                                                                      38.1±0.1ms
[ 50.00%] · For pandas commit hash e1dabf37:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 56.25%] ··· Running categoricals.Categoricals.time_concat                                                                                                                          4.88±0.05ms
[ 62.50%] ··· Running index_object.IndexOps.time_concat_int                                                                                                                            131±0.2ms
[ 68.75%] ··· Running index_object.IndexOps.time_concat_obj                                                                                                                            144±0.7ms
[ 75.00%] ··· Running index_object.IndexOps.time_concat_range                                                                                                                         69.6±0.7ms
[ 81.25%] ··· Running join_merge.Concat.time_concat_empty_frames1                                                                                                                        208±2μs
[ 87.50%] ··· Running join_merge.Concat.time_concat_empty_frames2                                                                                                                        206±1μs
[ 93.75%] ··· Running join_merge.Concat.time_concat_series_axis1                                                                                                                         110±1ms
[100.00%] ··· Running join_merge.Concat.time_concat_small_frames                                                                                                                      38.2±0.2ms

       before           after         ratio
     [e1dabf37]       [3123fcff]
-       131±0.2ms       116±0.01ms     0.88  index_object.IndexOps.time_concat_int

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jreback
Contributor

jreback commented Oct 24, 2017

IIRC, the purpose of this PR was to make the code more consistent, with a small perf bump as a result?

@toobaz
Member Author

toobaz commented Oct 24, 2017

The patch avoids an int -> object -> int roundtrip by not relying on _concat_index_asobject, hence yes, resulting in the modest speedup.
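The roundtrip being avoided can be illustrated in isolation (a sketch with plain NumPy arrays, not the actual pandas internals):

```python
import numpy as np

parts = [np.arange(1000, dtype="int64") for _ in range(100)]

# What going through _concat_index_asobject implies:
# int64 -> object -> int64.
roundtrip = np.concatenate([p.astype(object) for p in parts]).astype("int64")

# What the special case does instead: stay in int64 throughout.
direct = np.concatenate(parts)

# Both routes give the same result; the direct one just skips
# the per-element boxing and unboxing.
assert roundtrip.dtype == direct.dtype == np.dtype("int64")
assert (roundtrip == direct).all()
```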

@jreback jreback added this to the 0.22.0 milestone Oct 28, 2017
@jreback
Contributor

jreback commented Oct 28, 2017

lgtm. pls rebase just to make sure. ping on green.

@toobaz toobaz force-pushed the numericindex_append branch from 3123fcf to 31c8a52 Compare October 28, 2017 06:48
@toobaz
Member Author

toobaz commented Oct 28, 2017

@jreback ping

@jreback jreback merged commit 46826ad into pandas-dev:master Oct 28, 2017
@jreback
Contributor

jreback commented Oct 28, 2017

thanks!

@toobaz toobaz deleted the numericindex_append branch October 28, 2017 17:06
peterpanmj pushed a commit to peterpanmj/pandas that referenced this pull request Oct 31, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
5 participants