Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Cythonize Groupby Rank #19481

Merged
merged 35 commits into from
Feb 10, 2018
Merged

PERF: Cythonize Groupby Rank #19481

merged 35 commits into from
Feb 10, 2018

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Jan 31, 2018

This is not complete but I wanted to submit for review on the direction. In particular, I wanted to know if my way of passing the named rank arguments back to the Cython layer makes sense, or if we'd rather bypass using kwargs and call that function directly from the GroupBy instance method (similar to how shift does it).

Right now this only increments values ascending, doesn't handle tiebreakers, nor does it allow for the return of a percentage. I also plan on adding some test cases to cover the arguments as I can't find them in the pandas.tests.groupby package. More to come but again wanted feedback before going too far in

@pep8speaks
Copy link

pep8speaks commented Jan 31, 2018

Hello @WillAyd! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 09, 2018 at 18:37 Hours UTC

ndarray[{{c_type}}, ndim=2] values,
ndarray[int64_t] labels,
bint is_datetimelike, **kwargs):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok. don't pass kwargs generally around (esp in cython). we want to specify the args directly. this may make the call slightly more tricky in cython_transform (e.g. you may need to use a partial or lambda to hold the additional args)

@jreback jreback added Groupby Performance Memory or execution speed performance labels Feb 1, 2018
@WillAyd
Copy link
Member Author

WillAyd commented Feb 5, 2018

Still not complete but sharing in interim for any code review as this change is large. At a minimum I need to go back and use something besides kwargs to pass the appropriate values back

out[_as[j], 0] = i - grp_start + 1
elif tiebreak == TIEBREAK_FIRST:
for j in range(i - dups + 1, i + 1):
if ascending:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could arguably also be done using the TIEBREAK_FIRST_DESCENDING flag, but I figured it was worth just using TIEBREAK_FIRST and adding a conditional given the former is what is being passed

grp_na_count += 1
out[_as[i], 0] = np.nan
else:
if tiebreak == TIEBREAK_AVERAGE:
Copy link
Member Author

@WillAyd WillAyd Feb 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looping mechanism isn't very efficient because it overwrites "duplicate" values continually in a loop. Given the benchmarks were still significantly faster I left as is and was planning to open up a separate change to optimize further, but could review as part of this if you feel this loop mechanism is not acceptable

dups += 1
sum_ranks += i - grp_start + 1

if keep_na and masked_vals[_as[i]] == nan_fill_val:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comparison to the nan_fill_val gets a little tricky because it obfuscates np.nan with np.inf (or whatever fill value is being used). That said, that is a pre-existing limitation which I opened #19538 to address

ascending = kwargs['ascending']
pct = kwargs['pct']
keep_na = kwargs['na_option'] == 'keep'
N, K = (<object> values).shape
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to my limited understanding I am not using the K value that gets extracted here, as I couldn't figure out under what circumstance K was ever not equal to 0. Can you advise how that works or what to look at to help my comprehension?

@WillAyd
Copy link
Member Author

WillAyd commented Feb 5, 2018

Benchmarks provided below for reference

Before     after       ratio
  [f391cbfe] [27cb6b57]
+  139.69μs   535.43μs      3.83  groupby.GroupByMethods.time_method('float', 'cummin')
+  142.06μs   522.73μs      3.68  groupby.GroupByMethods.time_method('float', 'cummax')
+  145.02μs   527.11μs      3.63  groupby.GroupByMethods.time_method('int', 'cummin')
+  162.18μs   574.63μs      3.54  groupby.GroupByMethods.time_method('float', 'cumsum')
+  145.86μs   512.28μs      3.51  groupby.GroupByMethods.time_method('int', 'cummax')
+  159.64μs   533.92μs      3.34  groupby.GroupByMethods.time_method('int', 'cumsum')
+  311.50μs   728.59μs      2.34  groupby.GroupByMethods.time_method('float', 'cumprod')
+  315.28μs   650.14μs      2.06  groupby.GroupByMethods.time_method('int', 'cumprod')
+  872.00μs     1.24ms      1.42  groupby.GroupByMethods.time_method('int', 'sem')
+  901.00μs     1.21ms      1.34  groupby.GroupByMethods.time_method('float', 'sem')
+  161.36μs   203.44μs      1.26  groupby.GroupByMethods.time_method('int', 'min')
+  200.87ms   235.93ms      1.17  groupby.GroupByMethods.time_method('float', 'all')
+   64.93μs    75.73μs      1.17  groupby.GroupByMethods.time_method('int', 'size')
+  299.42μs   345.95μs      1.16  groupby.GroupByMethods.time_method('int', 'std')
+  201.29ms   231.93ms      1.15  groupby.GroupByMethods.time_method('float', 'any')
+  579.50ms   659.22ms      1.14  groupby.GroupByMethods.time_method('int', 'pct_change')
+     2.84s      3.19s      1.12  groupby.GroupByMethods.time_method('float', 'describe')
+  192.43μs   215.57μs      1.12  groupby.GroupByMethods.time_method('int', 'cumcount')
+  319.90μs   356.71μs      1.12  groupby.GroupByMethods.time_method('int', 'median')
+  156.03μs   173.91μs      1.11  groupby.GroupByMethods.time_method('float', 'min')
+  121.26μs   134.29μs      1.11  groupby.GroupByMethods.time_method('int', 'shift')
+  173.37μs   191.73μs      1.11  groupby.GroupByMethods.time_method('float', 'var')
-  193.48ms   795.71μs      0.00  groupby.GroupByMethods.time_method('int', 'rank')
-  301.72ms   787.03μs      0.00  groupby.GroupByMethods.time_method('float', 'rank')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

])
def test_rank_args(self, grps, vals, ties_method, ascending, pct, exp):
if ties_method == 'first' and vals[0] == 'bar':
pytest.xfail("See GH 19482")
Copy link
Member Author

@WillAyd WillAyd Feb 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this would actually fail but as noted in #19482 the current behavior is not ideal, so I didn't bother to emulate that. I figure this is better left to being fixed to raise in a separate change

@codecov
Copy link

codecov bot commented Feb 6, 2018

Codecov Report

Merging #19481 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19481      +/-   ##
==========================================
+ Coverage   91.59%   91.62%   +0.02%     
==========================================
  Files         150      150              
  Lines       48795    48803       +8     
==========================================
+ Hits        44696    44715      +19     
+ Misses       4099     4088      -11
Flag Coverage Δ
#multiple 89.99% <100%> (+0.02%) ⬆️
#single 41.72% <33.33%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.2% <100%> (+0.07%) ⬆️
pandas/util/testing.py 83.64% <0%> (-0.21%) ⬇️
pandas/plotting/_converter.py 66.95% <0%> (+1.73%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a214915...aa4578d. Read the comment docs.

@@ -11,3 +11,11 @@ cdef inline Py_ssize_t swap(numeric *a, numeric *b) nogil:
a[0] = b[0]
b[0] = t
return 0

cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might be able to make these an enum

ndarray[object, ndim=2] values,
ndarray[int64_t] labels,
bint is_datetimelike, object ties_method,
bint ascending, bint pct, object na_option):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally could incorporate this in the templated one. though I would just raise on object dtype. seems odd to support this (as its a very specific encoding, meaning lexical).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply raising on an object type would be easy. That said, wouldn't that then make this inconsistent with the nth / first / last functions?

It seems strange to me if we end up allowing users to pick the nth object from a group of objects but do not allow them to rank those same objects

Copy link
Member Author

@WillAyd WillAyd Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can ignore this comment. Was thinking at the time that nth had to do with the value of an item but it instead has to do with the position. I'll push another PR raising for objects. Will have to communicate as a breaking change in case anyone had previously been using that behavior

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback just scoping this out I think this is a fairly significant change. The main problem I see is that the current object ranking implementation is used by Categoricals as well. I imagine we'd have to then update it so that ordered Categoricals can use rank but non-ordered Categoricals cannot. Additionally would have to update the code across Series, Frame and GroupBy objects to ensure those handle consistently.

I think that strays a little too far from the original goal here of optimizing performance and would rather handle in a separate change. Let me know if you disagree otherwise happy to open that one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely. let's keep this to one item for this PR (you can open a separate issue for other)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #19560 for this issue. Could use some further discussion.

Otherwise if this round of tests pass I plan to update whatsnew and wrap this one up. If there's anything outstanding please let me know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to raise on groupby_object_rank rather than implement it here. let's keep that for another discussion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simply not define it at all. This will be caught at a higher level (IOW it will try to compose the function name it won't exist), I think the error is ok.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have benchmarks for groupby-rank in asv? if not can you add, post results anyhow

ndarray[object, ndim=2] values,
ndarray[int64_t] labels,
bint is_datetimelike, object ties_method,
bint ascending, bint pct, object na_option):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to raise on groupby_object_rank rather than implement it here. let's keep that for another discussion.

@@ -16,14 +16,20 @@ from numpy cimport (ndarray,
from libc.stdlib cimport malloc, free

from util cimport numeric, get_nat
from algos cimport swap
from algos import take_2d_axis1_float64_float64, groupsort_indexer
from algos cimport (swap, TiebreakEnumType, TIEBREAK_AVERAGE, TIEBREAK_MIN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe import directly as the name of the enum?

can you also share this name with the non-grouping rank functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this back in d09268b but all of the Py27 builds failed as a result with the error below (py3 was fine). Tried digging up info but couldn't find anything - any chance you've seen this before?

cythoning pandas/_libs/groupby.pyx to pandas/_libs/groupby.c

Error compiling Cython file:
------------------------------------------------------------
...

        if keep_na and (values[_as[i], 0] != values[_as[i], 0]):
            grp_na_count += 1
            out[_as[i], 0] = np.nan
        else:
            if tiebreak == TiebreakEnumType.TIEBREAK_AVERAGE:
                                          ^
------------------------------------------------------------

pandas/_libs/groupby.pyx:178:43: Compiler crash in AnalyseExpressionsTransform

ModuleNode.body = StatListNode(groupby.pyx:4:0)
StatListNode.stats[16] = StatListNode(groupby.pyx:125:0)
StatListNode.stats[0] = CompilerDirectivesNode(groupby.pyx:125:0)
CompilerDirectivesNode.body = StatListNode(groupby.pyx:125:0)
StatListNode.stats[0] = DefNode(groupby.pyx:125:0,
    doc = u'\n    Only transforms on axis=0\n    ',
    modifiers = [...]/0,
    name = u'group_rank_object',
    num_required_args = 8,
    py_wrapper_required = True,
    reqd_kw_flags_cname = '0',
    used = True)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:130:4)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:170:4)
File 'Nodes.py', line 6181, in analyse_expressions: ForInStatNode(groupby.pyx:170:4)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:171:8)
File 'Nodes.py', line 5842, in analyse_expressions: IfStatNode(groupby.pyx:174:8)
File 'Nodes.py', line 430, in analyse_expressions: StatListNode(groupby.pyx:178:12)
File 'Nodes.py', line 5840, in analyse_expressions: IfStatNode(groupby.pyx:178:12)
File 'Nodes.py', line 5885, in analyse_expressions: IfClauseNode(groupby.pyx:178:15)
File 'ExprNodes.py', line 541, in analyse_temp_boolean_expression: PrimaryCmpNode(groupby.pyx:178:24,
    operator = u'==',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 11893, in analyse_types: PrimaryCmpNode(groupby.pyx:178:24,
    operator = u'==',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 6329, in analyse_types: AttributeNode(groupby.pyx:178:43,
    attribute = u'TIEBREAK_AVERAGE',
    initialized_check = True,
    is_attribute = 1,
    needs_none_check = True,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 6395, in analyse_as_type_attribute: AttributeNode(groupby.pyx:178:43,
    attribute = u'TIEBREAK_AVERAGE',
    initialized_check = True,
    is_attribute = 1,
    needs_none_check = True,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 6438, in as_name_node: AttributeNode(groupby.pyx:178:43,
    attribute = u'TIEBREAK_AVERAGE',
    initialized_check = True,
    is_attribute = 1,
    needs_none_check = True,
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 1906, in analyse_rvalue_entry: NameNode(groupby.pyx:178:43,
    cf_maybe_null = True,
    is_name = True,
    name = u'TIEBREAK_AVERAGE',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 1939, in analyse_entry: NameNode(groupby.pyx:178:43,
    cf_maybe_null = True,
    is_name = True,
    name = u'TIEBREAK_AVERAGE',
    result_is_used = True,
    use_managed_ref = True)
File 'ExprNodes.py', line 1953, in check_identifier_kind: NameNode(groupby.pyx:178:43,
    cf_maybe_null = True,
    is_name = True,
    name = u'TIEBREAK_AVERAGE',
    result_is_used = True,
    use_managed_ref = True)

Compiler crash traceback from this point on:
  File "/Users/williamayd/miniconda3/envs/pandas_dev2/lib/python2.7/site-packages/Cython/Compiler/ExprNodes.py", line 1953, in check_identifier_kind
    if entry.is_type and entry.type.is_extension_type:
AttributeError: 'NoneType' object has no attribute 'is_type'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats annoying. havne't seen that

@@ -2159,6 +2172,15 @@ def get_group_levels(self):
# ------------------------------------------------------------
# Aggregation functions

def _group_rank_wrapper(func, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would encode this in _cython_functions directly

@@ -2314,10 +2341,13 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1):
else:
raise

if is_numeric:
out_dtype = '%s%d' % (values.dtype.kind, values.dtype.itemsize)
if how == 'rank':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should prob have a better way of doing this :<

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch I see here is that unlike other groupby transformations (where the dtype of the result object typically matches the dtype of the groups' values and gets cast as necessary after transformation) the dtype of the result object for rank needs to be a float prior to calling any group_rank operation. Otherwise, things like TIEBREAK_AVERAGE will not work when ranking say a group full of ints.

The generic calls to algos have it a little bit easier because they don't pass an object in to be modified, using instead the return value which algos builds internally as a like-sized array of floats. Unless there's something basic I don't understand with Cython, I don't think there's a way to upcast the dtype of the return object that gets passed into group_rank from say an int to a float, so the only option I can think of there would be to break out group_rank from other transformations and have group_rank return a new result object instead of modifying the provided one in place.

I'd argue the effort and maintainability of doing that would far outweigh any benefit from cleaning up this conditional, but curious if you have other thoughts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is similar in a way to the pre-transformations we do for datetimelikes. What I meant is that the registry of _cython_functions should have this logic itself (which is already based on the function name).

e.g. maybe in _cython_functions you get back a tuple rather than a scalar / lamba function, which includes the required dtype (or None). Or _cython_functions actually should be a class which is dispatched to, to call the functions themselves. IOW something like

class CythonTransformFunction:
    def __init__(self, obj):
        self.obj = obj

    def cummax(self):
         return group_cummax(self.obj.values)

    def rank(self):
        return group_cummax(self.obj.values.astype('floatt64'))

(and really should have more logic pushed to this class, e.g. pre-and post-convert dtype things. And should have a Transformation and an Aggregation version. This would all of the specific logic to be pushed here, rather than lookup in dicts and such. Further we could prob move this code out of the main groupby.py class into a sub-module.

don't have to do this here, but groupby for sure needs cleanup like this.

@WillAyd
Copy link
Member Author

WillAyd commented Feb 7, 2018

I posted AVSs for a previous commit. Trying to repost for latest but they aren't working any more. I think 3597de0 broke the ASVs by removing rolling_median and potentially other functions, which now yields the following trying to run benchmarks:

File "/Users/williamayd/Git/pandas/asv_bench/benchmarks/gil.py", line 3, in <module>
  from pandas import (DataFrame, Series, rolling_median, rolling_mean,
ImportError: cannot import name 'rolling_median'

Need to debug a little more but will probably have to open a separate issue to fix and get those working again

EDIT: looks like #19236 is supposed to fix the ASV issue. Can repost results after that gets merged

@@ -16,14 +16,20 @@ from numpy cimport (ndarray,
from libc.stdlib cimport malloc, free

from util cimport numeric, get_nat
from algos cimport swap
from algos import take_2d_axis1_float64_float64, groupsort_indexer
from algos cimport (swap, TiebreakEnumType, TIEBREAK_AVERAGE, TIEBREAK_MIN,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats annoying. havne't seen that

ndarray[object, ndim=2] values,
ndarray[int64_t] labels,
bint is_datetimelike, object ties_method,
bint ascending, bint pct, object na_option):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simply not define it at all. This will be caught at a higher level (IOW it will try to compose the function name it won't exist), I think the error is ok.

bint is_datetimelike, object ties_method,
bint ascending, bint pct, object na_option):
"""
Only transforms on axis=0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a more full doc-string

for j in range(i - dups + 1, i + 1):
if ascending:
out[_as[j], 0] = j + 1 - grp_start
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any comments on the impl would be helpful (to future readers)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice comments!

@Appender(_doc_template)
def rank(self, method='average', ascending=True, na_option='keep',
pct=False, axis=0):
"""Rank within each group"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this doc-string look ok?

@@ -2314,10 +2341,13 @@ def _cython_operation(self, kind, values, how, axis, min_count=-1):
else:
raise

if is_numeric:
out_dtype = '%s%d' % (values.dtype.kind, values.dtype.itemsize)
if how == 'rank':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is similar in a way to the pre-transformations we do for datetimelikes. What I meant is that the registry of _cython_functions should have this logic itself (which is already based on the function name).

e.g. maybe in _cython_functions you get back a tuple rather than a scalar / lamba function, which includes the required dtype (or None). Or _cython_functions actually should be a class which is dispatched to, to call the functions themselves. IOW something like

class CythonTransformFunction:
    def __init__(self, obj):
        self.obj = obj

    def cummax(self):
         return group_cummax(self.obj.values)

    def rank(self):
        return group_cummax(self.obj.values.astype('floatt64'))

(and really should have more logic pushed to this class, e.g. pre-and post-convert dtype things. And should have a Transformation and an Aggregation version. This would all of the specific logic to be pushed here, rather than lookup in dicts and such. Further we could prob move this code out of the main groupby.py class into a sub-module.

don't have to do this here, but groupby for sure needs cleanup like this.

@WillAyd
Copy link
Member Author

WillAyd commented Feb 8, 2018

Latest ASVs posted below

       before           after         ratio
     [b8351277]       [0141747d]
+         145±2μs          479±2μs     3.29  groupby.GroupByMethods.time_method('int', 'cummax')
+       144±0.8μs          469±6μs     3.26  groupby.GroupByMethods.time_method('int', 'cummin')
+         148±2μs          479±8μs     3.24  groupby.GroupByMethods.time_method('float', 'cummax')
+       160±0.9μs         493±10μs     3.08  groupby.GroupByMethods.time_method('float', 'cumsum')
+         155±6μs         478±20μs     3.08  groupby.GroupByMethods.time_method('float', 'cummin')
+         169±2μs          484±1μs     2.87  groupby.GroupByMethods.time_method('int', 'cumsum')
+         557±3μs       1.14±0.2ms     2.05  groupby.GroupByMethods.time_method('int', 'sem')
+         308±5μs          626±2μs     2.03  groupby.GroupByMethods.time_method('int', 'cumprod')
+         353±8μs          636±5μs     1.80  groupby.GroupByMethods.time_method('float', 'cumprod')
+       149±0.7μs          192±1μs     1.29  groupby.GroupByMethods.time_method('float', 'min')
+         300±2μs         380±10μs     1.27  groupby.GroupByMethods.time_method('float', 'head')
+         312±7μs          385±2μs     1.24  groupby.GroupByMethods.time_method('int', 'std')
+         149±1μs          180±8μs     1.21  groupby.GroupByMethods.time_method('float', 'last')
+         335±2μs          403±7μs     1.20  groupby.GroupByMethods.time_method('float', 'median')
+       353±0.9μs         420±10μs     1.19  groupby.GroupByMethods.time_method('int', 'nunique')
+         484±5μs         570±20μs     1.18  groupby.GroupByMethods.time_method('float', 'sem')
+           914ms            1.04s     1.14  groupby.GroupByMethods.time_method('float', 'mad')
+           889ms            1.00s     1.13  groupby.GroupByMethods.time_method('float', 'pct_change')
+           2.93s            3.28s     1.12  groupby.GroupByMethods.time_method('float', 'describe')
+         214±4ms         237±10ms     1.11  groupby.GroupByMethods.time_method('int', 'skew')
-           1.98s            1.77s     0.89  groupby.GroupByMethods.time_method('int', 'describe')
-         178±5μs        157±0.7μs     0.89  groupby.GroupByMethods.time_method('int', 'first')
-       187±0.7ms         772±20μs     0.00  groupby.GroupByMethods.time_method('int', 'rank')
-       293±0.9ms         793±40μs     0.00  groupby.GroupByMethods.time_method('float', 'rank')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might also close #11759 (as you added the full signature). if so can you add a test?


cdef int64_t iNaT = get_nat()

cdef double NaN = <double> np.NaN
cdef double nan = NaN

cdef extern from "numpy/npy_math.h" nogil:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need these? you can just use np.isnan no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to do the nan check inside the nogil block hence the need for it, but your comment makes me realize that I already am creating a mask array that I can reference instead of doing that check again within the nogil.

for j in range(i - dups + 1, i + 1):
if ascending:
out[_as[j], 0] = j + 1 - grp_start
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice comments!

# also be sure to reset any of the items helping to calculate dups
if i == N - 1 or labels[_as[i]] != labels[_as[i+1]]:
if pct:
for j in range(grp_start, i + 1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pct here is confusing. just fill another variable with the group sizes as you go then do the division at the very end.

@WillAyd
Copy link
Member Author

WillAyd commented Feb 9, 2018

Missed your note on #11759 before latest commit. With the example provided there, this fix would now be raising TypeError: 'NoneType' object is not callable because there is no group_rank_object implementation - is that what you are looking for and if so do you not feel that is covered already by test_rank_object_raises method that was added?

@jreback jreback merged commit c1068d9 into pandas-dev:master Feb 10, 2018
@jreback
Copy link
Contributor

jreback commented Feb 10, 2018

thanks! ok ideally have a better message for unsupported groupbys, but can address then with a followup.

@WillAyd WillAyd deleted the rank-perf branch February 10, 2018 18:52
@jreback
Copy link
Contributor

jreback commented Feb 13, 2018

https://travis-ci.org/MacPython/pandas-wheels/jobs/340767780 is a 32-bit build which is failing

sample error

/venv/lib/python3.6/site-packages/pandas/tests/groupby/test_groupby.py:1932: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:1804: in rank
    na_option=na_option, pct=pct, axis=axis)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:1006: in _cython_transform
    **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2427: in transform
    return self._cython_operation('transform', values, how, axis, **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2387: in _cython_operation
    **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2461: in _transform
    transform_func(result, values, comp_ids, is_datetimelike, **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2276: in wrapper
    return f(afunc, *args, **kwargs)
/venv/lib/python3.6/site-packages/pandas/core/groupby.py:2228: in <lambda>
    kwargs.get('na_option', 'keep')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
>   ???
E   ValueError: Buffer dtype mismatch, expected 'int64_t' but got 'int'
pandas/_libs/groupby_helper.pxi:1250: ValueError

not sure exactly what is happening, if you can have a look. note that it is almost impossible to actually test this w/o merging. these are built once a day. once we have a fix which passes regular travis then can go from there. thanks.

@WillAyd
Copy link
Member Author

WillAyd commented Feb 13, 2018

EDIT: Ignore below comment - didn't look closely enough at traceback you provided where rank is the offending function

Judging by the line number I think this may be a side effect of #19635. Previously there was the following condition preceding that:

  {{if name == 'int64'}}		
  if val != {{nan_val}}:		
  {{else}}		
  if val == val and val != {{nan_val}}:
  {{endif}}

Which I simplified to

  if val == val and val != {{nan_val}}:

I can't imagine why that would make a difference but it is suspicious given location of the error. Want me to add that condition back in and submit it referencing this PR?

@jreback
Copy link
Contributor

jreback commented Feb 13, 2018

yep thanks

@jorisvandenbossche
Copy link
Member

@WillAyd not sure if it is related to this PR, but on AppVeyor, there are a huge amount of numpy warnings ("DeprecationWarning: numpy boolean subtract, the - operator, is deprecated, use the bitwise_xor, the ^ operator, or the logical_xor function instead.") related to rank tests (so many warnings you even don't see the actual errors anymore).

See https://ci.appveyor.com/project/pandas-dev/pandas/build/1.0.13483/job/vt1rpmyub41gf6aa

@jorisvandenbossche
Copy link
Member

From a quick manual appveyor bisect: introduced by #20091

@WillAyd
Copy link
Member Author

WillAyd commented Apr 19, 2018

cc @peterpanmj

@WillAyd
Copy link
Member Author

WillAyd commented Apr 19, 2018

I think this warning is coming from np.diff specifically at the below line (link goes to master):

https://github.com/numpy/numpy/blob/a9e9343ecf5a6564eb09f1d7078c407db2ff7d9e/numpy/lib/function_base.py#L1172

It looks like in 1.13 this was actually performing subtraction regardless of type, so perhaps something that just slipped through on that release?

@peterpanmj
Copy link
Contributor

Yes, this is a result of np.diff in 1.13. I use np.dfiff in #20091. Actually something like ^ operator can be used instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: cythonize groupby-rank
5 participants