-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix make_sparse mask generation #17574
BUG: Fix make_sparse mask generation #17574
Conversation
e4d23c0
to
501ba6f
Compare
501ba6f
to
b64c123
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a whatsnew entry for this ; use the PR number as issue
pandas/core/sparse/array.py
Outdated
mask.append(e != fill_value) | ||
else: | ||
mask.append(True) | ||
mask = np.array(mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very awkward
you can do this by starting with an array of True
the using .flat to set only the portion you need and do that with a list comprehension
pandas/tests/sparse/test_array.py
Outdated
@@ -61,6 +61,12 @@ def test_constructor_object_dtype(self): | |||
assert arr.dtype == np.object | |||
assert arr.fill_value == 'A' | |||
|
|||
data = [False, 0, 100.0, 0.0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issues number as a comment
Hello @Licht-T! Thanks for updating the PR.
Comment last updated on September 27, 2017 at 11:39 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #17574 +/- ##
==========================================
- Coverage 91.22% 91.2% -0.02%
==========================================
Files 163 163
Lines 49625 49632 +7
==========================================
- Hits 45270 45268 -2
- Misses 4355 4364 +9
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17574 +/- ##
==========================================
+ Coverage 91.24% 91.25% +<.01%
==========================================
Files 163 163
Lines 49762 49764 +2
==========================================
+ Hits 45406 45412 +6
+ Misses 4356 4352 -4
Continue to review full report at Codecov.
|
05bbd54
to
7727c5a
Compare
@jreback Thanks for your review.
|
pandas/core/sparse/array.py
Outdated
mask = np.ones(len(arr), dtype=np.bool) | ||
fv_type = type(fill_value) | ||
|
||
itr = (type(x) is fv_type for x in arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is still quite cumbersome can you simplify
@jreback The method is now simplified. |
pandas/core/sparse/array.py
Outdated
@@ -789,7 +790,15 @@ def make_sparse(arr, kind='block', fill_value=None): | |||
if is_string_dtype(arr): | |||
arr = arr.astype(object) | |||
|
|||
mask = arr != fill_value | |||
if is_object_dtype(arr.dtype): | |||
mask = np.ones(len(arr), dtype=np.bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok pls add a comment on what/why this is happening.
no general numpy routine to do this? (I don't think we have a pandas one), but have a look around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Added comments. I haven't found yet. I think there is no numpy general method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some asvs for sparse? (if you prefer another issue that's ok too). Need to benchmarks things like this.
what do benchmarks show? |
08f5a22
to
a48f957
Compare
@jreback Here is the results of tests. [asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor 21:39:21 ☁ fix-make_spase-mask-generation ☂ 𝝙 ⚡ ✭ 𝝙
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[ 0.00%] · For pandas commit hash a48f9572:
[ 0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 8.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent 39.3±2ms
[ 16.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent 7.67±1ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent 39.6±0.4ms
[ 33.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent 6.69±0.3ms
[ 41.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent 556±20ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent 486±6ms
[ 50.00%] · For pandas commit hash 5279a172:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...................................................................
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 58.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent 37.3±3ms
[ 66.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent 6.70±0.3ms
[ 75.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent 38.0±1ms
[ 83.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent 6.16±0.1ms
[ 91.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent 53.8±2ms
[100.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent 32.4±0.5ms
before after ratio
[5279a172] [a48f9572]
+ 32.4±0.5ms 486±6ms 15.01 sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent
+ 53.8±2ms 556±20ms 10.34 sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. |
@Licht-T yes that is slow. Sure if you can do that in cython would be great. its only a single case so shouldn't be much code. not sure how much speedup to get though as you can't really type anything. |
also, would actually be ok with NOT allowing a not-nan fill_value for |
@jreback Here is the Cython result. [asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor 23:04:10 ☁ fix-make_spase-mask-generation-new ☂ ⚡ ✭
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 12 total benchmarks (2 commits * 1 environments * 6 benchmarks)
[ 0.00%] · For pandas commit hash ae079f69:
[ 0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..................................................................
[ 0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 8.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent 36.9±0.6ms
[ 16.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent 6.91±0.4ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent 38.5±2ms
[ 33.33%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent 5.88±0.05ms
[ 41.67%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_10percent 85.9±2ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_1percent 52.7±0.6ms |
This is rough reimplement and only checking whether this passes tests. It might be a little more faster. |
@jreback Here is the final result. [asv_bench] asv continuous -f 1.1 pandas-dev/master HEAD -b sparse.sparse_array_constructor 0:35:38 ☁ fix-make_spase-mask-generation ☂ 𝝙 ⚡ ✭ 𝝙
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
· Running 16 total benchmarks (2 commits * 1 environments * 8 benchmarks)
[ 0.00%] · For pandas commit hash 35890726:
[ 0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 6.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent 40.7±1ms
[ 12.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent 7.34±0.1ms
[ 18.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent 37.3±0.6ms
[ 25.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent 6.34±0.09ms
[ 31.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_10percent 78.2±1ms
[ 37.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_1percent 44.0±0.6ms
[ 43.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent 87.0±2ms
[ 50.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent 51.6±0.5ms
[ 50.00%] · For pandas commit hash 5279a172:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 56.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_10percent 37.2±2ms
[ 62.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_float64_1percent 7.07±0.3ms
[ 68.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent 46.8±3ms
[ 75.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_int64_1percent 6.66±0.05ms
[ 81.25%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_10percent 82.8±1ms
[ 87.50%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_nan_fill_value_1percent 46.9±1ms
[ 93.75%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent 55.0±2ms
[100.00%] ··· Running sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent 34.4±0.9ms
before after ratio
[5279a172] [35890726]
+ 55.0±2ms 87.0±2ms 1.58 sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_10percent
+ 34.4±0.9ms 51.6±0.5ms 1.50 sparse.sparse_array_constructor.time_sparse_array_constructor_object_non_nan_fill_value_1percent
- 46.8±3ms 37.3±0.6ms 0.80 sparse.sparse_array_constructor.time_sparse_array_constructor_int64_10percent
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. |
ok this looks good |
lgtm ping on green. |
thanks @Licht-T keep em coming! |
git diff upstream/master -u -- "*.py" | flake8 --diff
This is the part of #17386.
The mask generation sequence in
make_sparse
treated two data, which have same bits, as same if array isdtype=object
.This bug made
SparseArray([False, 0, 100.0, 0.0], fill_value=False, dtype=np.object)
is evaluated as[False, False, 100.0, False]