PERF: avoid object conversion in fillna(method=pad|backfill) for masked arrays #39953

Merged: 11 commits, Mar 5, 2021

simonjayhawkins
Member

[ 75.00%] ··· frame_methods.Fillna.time_frame_fillna                                                                         ok
[ 75.00%] ··· ========= ======== ============ ============= ========== ============= =============
              --                                               dtype                              
              ------------------ -----------------------------------------------------------------
               inplace   method    float64       float32      object       Int64        Float64   
              ========= ======== ============ ============= ========== ============= =============
                 True     pad     3.44±0.3ms    2.64±0.2ms   307±10ms    3.97±0.2ms    3.72±0.1ms 
                 True    bfill    3.23±0.2ms    2.89±0.2ms   275±8ms     2.34±0.1ms   2.45±0.09ms 
                False     pad     3.62±0.1ms   2.06±0.01ms   181±3ms    4.23±0.07ms   4.16±0.07ms 
                False    bfill    3.59±0.2ms   2.06±0.02ms   186±6ms    4.35±0.09ms   4.18±0.06ms 
              ========= ======== ============ ============= ========== ============= =============
       before           after         ratio
     [2e5c28fc]       [d3ae0cf1]
     <master>         <BaseMaskedArray.fillna>
-     2.28±0.06ms      2.06±0.02ms     0.91  frame_methods.Fillna.time_frame_fillna(False, 'bfill', 'float32')
-      15.5±0.2ms      2.45±0.09ms     0.16  frame_methods.Fillna.time_frame_fillna(True, 'bfill', 'Float64')
-      16.5±0.2ms       2.34±0.1ms     0.14  frame_methods.Fillna.time_frame_fillna(True, 'bfill', 'Int64')
-      85.2±0.9ms      4.18±0.06ms     0.05  frame_methods.Fillna.time_frame_fillna(False, 'bfill', 'Float64')
-      86.6±0.6ms      4.16±0.07ms     0.05  frame_methods.Fillna.time_frame_fillna(False, 'pad', 'Float64')
-      90.9±0.8ms      4.35±0.09ms     0.05  frame_methods.Fillna.time_frame_fillna(False, 'bfill', 'Int64')
-      92.4±0.4ms      4.23±0.07ms     0.05  frame_methods.Fillna.time_frame_fillna(False, 'pad', 'Int64')
-        85.3±1ms       3.72±0.1ms     0.04  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'Float64')
-        92.5±2ms       3.97±0.2ms     0.04  frame_methods.Fillna.time_frame_fillna(True, 'pad', 'Int64')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
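
For readers unfamiliar with the asv comparison layout: the ratio column is the "after" timing divided by the "before" timing, so smaller is better. A minimal sketch of that arithmetic, using three rows copied from the table above (the helper name `ratio_and_speedup` is made up for illustration; asv works from unrounded timings, so recomputing from the printed values can differ in the last digit for other rows):

```python
# (before_ms, after_ms) copied from the comparison table above.
ROWS = {
    "(True, 'bfill', 'Float64')": (15.5, 2.45),
    "(False, 'bfill', 'Float64')": (85.2, 4.18),
    "(True, 'pad', 'Int64')": (92.5, 3.97),
}


def ratio_and_speedup(before_ms, after_ms):
    """Return (after/before rounded to 2 places, before/after rounded to 1 place)."""
    return round(after_ms / before_ms, 2), round(before_ms / after_ms, 1)


if __name__ == "__main__":
    for params, (before_ms, after_ms) in ROWS.items():
        ratio, speedup = ratio_and_speedup(before_ms, after_ms)
        print(f"{params}: ratio={ratio}, roughly {speedup}x faster")
```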

simonjayhawkins added the labels Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Performance (memory or execution speed performance), and NA - MaskedArrays (related to pd.NA and nullable extension arrays) on Feb 21, 2021
        if method is not None:
            func = missing.get_fill_func(method)
            new_values, new_mask = func(
                self._data.copy(),
Member Author

ExtensionBlock.interpolate creates another copy (L1774). If we always copy in EA.fillna, then that copy shouldn't be needed.
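
The copy argument can be sketched as follows. This is a plain-Python illustration, not the pandas implementation: `MaskedSketch`, `_pad_inplace`, and `fillna_pad` are hypothetical names, and lists stand in for the underlying arrays. The idea is that if the array's own fill method always copies before filling in place, a caller does not need to make a second defensive copy.

```python
def _pad_inplace(values, mask):
    """Forward-fill masked slots in place (hypothetical helper, no limit support)."""
    have_last = False
    last = None
    for i in range(len(values)):
        if not mask[i]:
            last, have_last = values[i], True
        elif have_last:
            values[i] = last
            mask[i] = False


class MaskedSketch:
    """Toy stand-in for a masked array: a data list plus a boolean mask list."""

    def __init__(self, data, mask):
        self._data = data
        self._mask = mask

    def fillna_pad(self):
        # The single copy happens here, so callers can use the result freely
        # without copying again, and the original is never mutated.
        new_values = self._data.copy()
        new_mask = self._mask.copy()
        _pad_inplace(new_values, new_mask)
        return MaskedSketch(new_values, new_mask)


if __name__ == "__main__":
    arr = MaskedSketch([1, 0, 3], [False, True, False])
    filled = arr.fillna_pad()
    print(filled._data, filled._mask)  # filled copy
    print(arr._data, arr._mask)        # original left untouched
```

Because the copy is unconditional, a fill that changes nothing still returns a new object, which is the behaviour the `test_fillna_no_op_returns_copy` test name suggests is expected.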

@@ -623,6 +623,38 @@ def pad_inplace(algos_t[:] values, const uint8_t[:] mask, limit=None):
val = values[i]


@cython.boundscheck(False)
Contributor

Rather than adding a new function, can we simply always use the mask (for all dtypes)?

Member Author

Probably. The 1D versions of pad and backfill are only used by EA.fillna and Series.replace.

Some of the EAs override fillna and use 2D algorithms. NDArrayBackedExtensionArray uses the 1D versions but should be using a 2D version (not an issue, since it is only called column-wise, but it would fail if called directly on a 2D EA). The other EAs that use this convert to object, so they are not performant anyway.

Not sure about Series.replace; I will look some more.

Contributor

We don't support any 2D EA at the moment, so we can leave those out for now.

Member Author

Done. If we are always updating and returning the mask, maybe we should return the same dtype as the original mask, or only allow a boolean mask in Python space and always return a boolean mask.
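
To make the "always update and return the mask" idea concrete, here is a rough pure-Python sketch of a 1D pad that takes values plus a mask and hands back both. It is not the Cython code in this PR: `pad_1d` is a made-up name, lists replace typed memoryviews, and coercing the returned mask to `bool` is just one of the two options floated in this thread.

```python
from typing import List, Optional, Tuple


def pad_1d(
    values: List[int], mask: List[bool], limit: Optional[int] = None
) -> Tuple[List[int], List[bool]]:
    """Forward-fill masked slots; return new values and the updated boolean mask."""
    out_values = list(values)           # work on copies, leave the inputs alone
    out_mask = [bool(m) for m in mask]  # always hand back a boolean mask
    have_last = False
    last = None
    run = 0  # masked slots filled since the last valid value
    for i, missing in enumerate(mask):
        if not missing:
            last, have_last, run = out_values[i], True, 0
        elif have_last and (limit is None or run < limit):
            out_values[i] = last
            out_mask[i] = False  # the slot is no longer missing
            run += 1
    return out_values, out_mask


if __name__ == "__main__":
    vals, msk = pad_1d([1, 0, 0, 0, 5, 0], [False, True, True, True, False, True], limit=2)
    print(vals)
    print(msk)
```

Returning the mask alongside the values means the caller never has to recompute which slots are still missing, for example leading gaps or gaps longer than `limit`.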

@jreback jreback added this to the 1.3 milestone Feb 22, 2021
Contributor

jreback commented Feb 22, 2021

Looks good. Please add a whatsnew note. You had a comment about multiple copies; you can try to handle that here or open a new issue. Ping when ready.

@jreback jreback merged commit 3888a3f into pandas-dev:master Mar 5, 2021
Contributor

jreback commented Mar 5, 2021

thanks @simonjayhawkins very nice

@jorisvandenbossche
Member

This was failing CI (now also on master):

=================================== FAILURES ===================================
________________ TestMissing.test_fillna_no_op_returns_copy[0] _________________
[gw1] linux -- Python 3.8.8 /home/vsts/miniconda3/envs/pandas-dev/bin/python

self = <pandas.tests.extension.test_sparse.TestMissing object at 0x7f0590718cd0>
data = [7, 33, 0, 95, 38, 0, 30, 26, 0, 47, 6, 0, 66, 4, 0, 57, 94, 0, 19, 23, 0, 48, 3, 0, 75, 74, 0, 51, 74, 0, 13, 73, 0, ...66, 67, 69, 70, 72, 73, 75,
       76, 78, 79, 81, 82, 84, 85, 87, 88, 90, 91, 93, 94, 96, 97, 99],
      dtype=int32)

request = <FixtureRequest for <Function test_fillna_no_op_returns_copy[0]>>

    def test_fillna_no_op_returns_copy(self, data, request):
        if np.isnan(data.fill_value):
            request.node.add_marker(
                pytest.mark.xfail(reason="returns array with different fill value")
            )
>       super().test_fillna_no_op_returns_copy(data)

pandas/tests/extension/test_sparse.py:229: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/extension/base/missing.py:80: in test_fillna_no_op_returns_copy
    result = data.fillna(method="backfill")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = [7, 33, 0, 95, 38, 0, 30, 26, 0, 47, 6, 0, 66, 4, 0, 57, 94, 0, 19, 23, 0, 48, 3, 0, 75, 74, 0, 51, 74, 0, 13, 73, 0, ...66, 67, 69, 70, 72, 73, 75,
       76, 78, 79, 81, 82, 84, 85, 87, 88, 90, 91, 93, 94, 96, 97, 99],
      dtype=int32)

value = None, method = 'backfill', limit = None

    def fillna(self, value=None, method=None, limit=None):
        """
        Fill missing values with `value`.
>           warnings.warn(msg, PerformanceWarning)
E           pandas.errors.PerformanceWarning: fillna with 'method' requires high memory usage.

pandas/core/arrays/sparse/array.py:672: PerformanceWarning
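
The traceback shows the warning surfacing as an exception, which is what happens when a test run escalates warnings to errors (assumed here; the CI configuration is not shown in this thread). A small stdlib-only sketch of that mechanism follows. `PerformanceWarning` is redefined locally as a stand-in for `pandas.errors.PerformanceWarning`, and `fillna_with_method`, `call_strict`, and `call_tolerant` are hypothetical names.

```python
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for pandas.errors.PerformanceWarning so the sketch runs without pandas."""


def fillna_with_method():
    """Pretend fill that warns about memory use, like the sparse path in the traceback."""
    warnings.warn("fillna with 'method' requires high memory usage.", PerformanceWarning)
    return "filled"


def call_strict():
    """With the 'error' filter active, the warning is raised as an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            return fillna_with_method()
        except PerformanceWarning as exc:
            return f"raised: {exc}"


def call_tolerant():
    """Record the warning instead, so the call completes and the warning can be asserted on."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = fillna_with_method()
    return result, [w.category.__name__ for w in caught]


if __name__ == "__main__":
    print(call_strict())
    print(call_tolerant())
```

A test that exercises a code path known to warn would typically either assert the warning explicitly or filter it, rather than let a strict filter turn it into a failure.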

4 participants