
PERF: SparseDataFrame._init_dict uses intermediary dict, not DataFrame #16883

Merged · 9 commits · Jul 17, 2017
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.21.0.txt
@@ -135,6 +135,7 @@ Removal of prior version deprecations/changes
Performance Improvements
~~~~~~~~~~~~~~~~~~~~~~~~

+- Fixed performance of instantiating :class:`SparseDataFrame` (:issue:`16773`)
Member:
Maybe I'm being pedantic, but you can't really "fix" performance, only improve it. 😄



.. _whatsnew_0210.bug_fixes:
25 changes: 16 additions & 9 deletions pandas/core/sparse/frame.py
@@ -12,7 +12,10 @@

from pandas.core.dtypes.missing import isnull, notnull
from pandas.core.dtypes.cast import maybe_upcast, find_common_type
-from pandas.core.dtypes.common import _ensure_platform_int, is_scipy_sparse
+from pandas.core.dtypes.common import (
+    _ensure_platform_int, is_scipy_sparse,
+    is_float,
+)

from pandas.core.common import _try_sort
from pandas.compat.numpy import function as nv
@@ -143,7 +146,7 @@ def _init_dict(self, data, index, columns, dtype=None):
        sp_maker = lambda x: SparseArray(x, kind=self._default_kind,
                                         fill_value=self._default_fill_value,
                                         copy=True, dtype=dtype)
-       sdict = DataFrame()
+       sdict = {}
        for k, v in compat.iteritems(data):
            if isinstance(v, Series):
                # Force alignment, no copy necessary
@@ -159,15 +162,12 @@ def _init_dict(self, data, index, columns, dtype=None):
                    v = [v.get(i, nan) for i in index]

                v = sp_maker(v)
-           sdict[k] = v
+           sdict[_nan_to_np_nan(k)] = v

        # TODO: figure out how to handle this case, all nan's?
        # add in any other columns we want to have (completeness)
-       nan_vec = np.empty(len(index))
-       nan_vec.fill(nan)
-       for c in columns:
-           if c not in sdict:
-               sdict[c] = sp_maker(nan_vec)
+       nan_arr = sp_maker(np.full(len(index), np.nan))
+       sdict.update((c, nan_arr) for c in columns if c not in sdict)

        return to_manager(sdict, columns, index)
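
The gist of the performance fix is visible above: columns are accumulated in a plain dict and handed to the block manager once, rather than assigned into an intermediary DataFrame, where every `sdict[k] = v` insertion pays for index alignment. A rough timing sketch of the same pattern, using a dense DataFrame since SparseDataFrame was removed in pandas 1.0 (the names here are illustrative, not from this PR):

```python
import numpy as np
import pandas as pd
from timeit import timeit

index = pd.RangeIndex(1000)
data = {k: np.random.randn(len(index)) for k in range(100)}

def via_dataframe():
    # Old pattern: grow a DataFrame column by column; each assignment
    # goes through __setitem__ and realigns against the existing index.
    acc = pd.DataFrame(index=index)
    for k, v in data.items():
        acc[k] = pd.Series(v, index=index)
    return acc

def via_dict():
    # New pattern: accumulate columns in a plain dict, construct once.
    acc = {k: pd.Series(v, index=index) for k, v in data.items()}
    return pd.DataFrame(acc, index=index)

print("DataFrame accumulator:", timeit(via_dataframe, number=20))
print("dict accumulator:     ", timeit(via_dict, number=20))
```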

@@ -846,6 +846,13 @@ def applymap(self, func):
        return self.apply(lambda x: lmap(func, x))


+def _nan_to_np_nan(value):
+    """Normalize nan values to singleton np.NaN object so that when NaNs are
+    used as dict keys, getitem works.
+    """
+    return np.nan if is_float(value) and np.isnan(value) else value
Contributor:

use isnull

Contributor (author):

I had wrongly assumed raw numpy was, as is often the case, faster.

Contributor:

no, why do you need this at all?

Contributor (author, @kernc, Jul 12, 2017):

Because one of the tests (test_nan_columnname, #8822) apparently sets a float('nan') (which is not np.nan) as a valid column name, and {float('nan'): 'foo'}[np.nan] raises a KeyError. I didn't know what else to do. DataFrame apparently handles all the nan cases specially; it seemed easiest to force a singleton for the dict.
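
A minimal sketch of the failure mode described here (illustrative, not code from the PR): Python dict lookup checks identity before equality, so the np.nan singleton can always be found again, while a distinct nan object cannot.

```python
import numpy as np

d = {float('nan'): 'foo'}   # key is a nan object distinct from np.nan
try:
    d[np.nan]               # nan != nan, and it is a different object
except KeyError:
    print("KeyError, as described above")

d = {np.nan: 'foo'}
print(d[np.nan])            # 'foo': dict lookup short-circuits on identity
```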

Contributor:

have a look thru how DataFrame handles this in the init_dict routines

don't want to be reinventing the wheel here

Contributor (author, @kernc, Jul 12, 2017):

> how DataFrame handles this

Incorrectly in the sense that:

>>> df = pd.DataFrame({np.nan: [1, 2]})
>>> df[np.nan]   # Arguably expectedly, nan matches nan
0    1
1    2
Name: nan, dtype: int64

>>> df = pd.DataFrame({np.nan: [1, 2], 2: [2, 3]}, columns=[np.nan, 2])
>>> df   # nan from dict didn't match the nan from ensured Float64Index
  NaN    2.0
0  NaN     2
1  NaN     3

NaNs are tricky because, in general, nan != nan. But I guess, more often than not, this leads to confusion when nan is expected to be a single, catch-all bin.

How can I improve on the above singleton approach?
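
As context for the second example above, a hedged illustration (not from the PR) of why equality-based label matching can never pair two nan labels:

```python
import numpy as np

# No nan compares equal to any nan, not even to itself:
assert np.nan != np.nan
assert not (float('nan') == np.nan)

# So an alignment step that matches labels with == treats a requested nan
# column as missing and fills it with NaN, unless nan is special-cased
# (e.g. via an identity or isnull-style check).
```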

Contributor:

Can you point to the test that is failing? I don't want to address this in this PR; it is non-trivial and needs to be common code. OK with xfailing those tests (and making an issue).

Contributor (author):

The existing failing test for sparse is pandas.tests.sparse.test_frame.TestSparseDataFrame.test_nan_columnname.
Issue #8822 has some discussion. NaNs in indexes are supported and have valid uses (e.g. #3729).
So xfailing that test in this PR ...
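
For reference, the scenario that test exercises is roughly the following (a sketch assuming the pre-1.0 API, where .to_sparse() and SparseDataFrame still exist; the exact test body may differ):

```python
import numpy as np
import pandas as pd

# A float nan column label has to survive the dict round-trip inside
# SparseDataFrame._init_dict without to_manager raising a KeyError.
dense = pd.DataFrame(pd.Series(1.0, index=[0]), columns=[np.nan])
sparse = dense.to_sparse()          # goes through SparseDataFrame._init_dict
assert np.isnan(sparse.columns[0])
```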



def to_manager(sdf, columns, index):
    """ create and return the block manager from a dataframe of series,
    columns, index
@@ -855,7 +862,7 @@ def to_manager(sdf, columns, index):
    axes = [_ensure_index(columns), _ensure_index(index)]

    return create_block_manager_from_arrays(
-        [sdf[c] for c in columns], columns, axes)
+        [sdf[_nan_to_np_nan(c)] for c in columns], columns, axes)


def stack_sparse_frame(frame):