-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix wrong SparseBlock initialization #17386
BUG: Fix wrong SparseBlock initialization #17386
Conversation
Codecov Report
@@ Coverage Diff @@
## master #17386 +/- ##
==========================================
- Coverage 91.03% 90.99% -0.05%
==========================================
Files 163 163
Lines 49580 49586 +6
==========================================
- Hits 45137 45119 -18
- Misses 4443 4467 +24
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #17386 +/- ##
==========================================
- Coverage 91.43% 91.4% -0.03%
==========================================
Files 163 163
Lines 50091 50148 +57
==========================================
+ Hits 45800 45838 +38
- Misses 4291 4310 +19
Continue to review full report at Codecov.
|
needs tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
w/o tests I don't even know what you are trying to do. Further we don't use if/then like this at all, instead dispatch per-block. but pls add tests first.
ea077b3
to
785bbcf
Compare
@jreback Thanks for your comment. I've just added test codes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first need to get the tests correct
pandas/core/internals.py
Outdated
@@ -527,7 +535,7 @@ def f(m, v, i): | |||
|
|||
return self.split_and_operate(None, f, False) | |||
|
|||
def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs): | |||
def astype(self, dtype, copy=True, errors='raise', values=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls don't change things like this, which completely changes semantics
pandas/core/internals.py
Outdated
@@ -442,12 +443,19 @@ def make_a_block(nv, ref_loc): | |||
nv = _block_shape(nv, ndim=self.ndim) | |||
except (AttributeError, NotImplementedError): | |||
pass | |||
block = self.make_block(values=nv, | |||
placement=ref_loc, fastpath=True) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this pattern at all, nv should be a sparse array at this point; the inference should just work. is that not the case?
pandas/core/internals.py
Outdated
@@ -1562,6 +1581,10 @@ def _nanpercentile(values, q, axis, **kw): | |||
result = self._try_coerce_result(result) | |||
if is_scalar(result): | |||
return ax, self.make_block_scalar(result) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this should be a sparse array, if its not then you need to implement this on the SparseBlock itself (and you can certainly call the super method). I don't like type checking inside a block.
pandas/core/internals.py
Outdated
@@ -2653,7 +2676,7 @@ def sp_index(self): | |||
def kind(self): | |||
return self.values.kind | |||
|
|||
def _astype(self, dtype, copy=False, raise_on_error=True, values=None, | |||
def _astype(self, dtype, copy=True, raise_on_error=True, values=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not do this
pandas/core/sparse/frame.py
Outdated
@@ -321,8 +321,8 @@ def _apply_columns(self, func): | |||
data=new_data, index=self.index, columns=self.columns, | |||
default_fill_value=self.default_fill_value).__finalize__(self) | |||
|
|||
def astype(self, dtype): | |||
return self._apply_columns(lambda x: x.astype(dtype)) | |||
def astype(self, dtype, copy=True, errors='raise'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this match the super signature?
pandas/tests/sparse/test_frame.py
Outdated
@@ -1385,3 +1385,62 @@ def test_numpy_func_call(self): | |||
'std', 'min', 'max'] | |||
for func in funcs: | |||
getattr(np, func)(self.frame) | |||
|
|||
def test_where(self): | |||
data = [[1, 2], [3, 4]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment
pandas/tests/sparse/test_frame.py
Outdated
sparse_df = SparseDataFrame(data) | ||
result = sparse_df.where(sparse_df >= 2) | ||
|
||
dense_df = DataFrame(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so these return dense frames????
pandas/tests/sparse/test_frame.py
Outdated
q = 0.1 | ||
|
||
sparse_df = SparseDataFrame(data) | ||
result = sparse_df.quantile(q) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are very odd tests, why are you returning dense frames?
pandas/tests/sparse/test_series.py
Outdated
result = sparse.where(sparse >= 2) | ||
|
||
dense = Series(data) | ||
expected = dense.where(dense >= 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same questions as above
@jreback Thanks for your review! I've found the initial solution is not enough through fixing tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are changing lots of things. can you do this step by step rather that all at once. you are touching some pretty gnarly code.
@@ -978,7 +984,7 @@ def f(m, v, i): | |||
|
|||
return [self.make_block(new_values, fastpath=True)] | |||
|
|||
def coerce_to_target_dtype(self, other): | |||
def coerce_to_target_dtype(self, other, copy=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I dont think this actually helps, why did you add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SparseBlock
astype(dtype, copy=False)
makes reinterpret cast, so I override coerce_to_target_dtype
and set copy=True
in SparseBlock
class.
203a8f9#diff-e705e723b2d6e7c0e2a0443f80916abfR2639
**kwargs) | ||
|
||
dtype = self.values.sp_values.dtype | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty complex, pls simplify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added these codes because coerce_to_target_dtype
does not work well in SparseBlock. This is only checking the type information of SparseArray and these procedure is also implemented in IntBlock
, etc. I cannot figure out how to simplify. Any suggestion for simplifying?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not against this theorectially, but the implementation is fragile here. there are functions for validation already in the sparse classes.
@jreback Okay. I'll split the commit. |
2d94fba
to
1c0613b
Compare
@jreback The big commit is now split. If you have any questions, feel free to ask. |
1c0613b
to
d0bf226
Compare
Rebased to change commit logs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again you are doing too much in this PR. I require much simpler PR's that are built on top of each other. IOW pls break this apart. You can add tests for everything in a single PR (the first) and simply xfail tests that don't work. That will give you a basis to build on, and simplify understandbility of the reviews / PR / code. Adding paths via if/thens is not a good way forward here.
pandas/core/internals.py
Outdated
return isinstance(element, dtype) | ||
else: | ||
element_dtype = infer_dtype_from(element, pandas_dtype=True)[0] | ||
return isinstance(element, dtype) or dtype == element_dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh?
@@ -995,7 +1001,7 @@ def coerce_to_target_dtype(self, other): | |||
|
|||
if self.is_bool or is_object_dtype(dtype) or is_bool_dtype(dtype): | |||
# we don't upcast to bool | |||
return self.astype(object) | |||
return self.astype(object, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is ok, though numpy ignores the copy=
flag when dtype is object
so no point in passing it if dtype is object
|
||
raise AssertionError("possible recursion in " | ||
"coerce_to_target_dtype: {} {}".format( | ||
self, other)) | ||
|
||
try: | ||
return self.astype(dtype) | ||
return self.astype(dtype, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here i guess is ok
@@ -1382,6 +1388,11 @@ def where(self, other, cond, align=True, raise_on_error=True, | |||
if hasattr(other, 'reindex_axis'): | |||
other = other.values | |||
|
|||
if is_scalar(other) or is_list_like(other): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh?
@@ -1394,6 +1405,9 @@ def where(self, other, cond, align=True, raise_on_error=True, | |||
if not hasattr(cond, 'shape'): | |||
raise ValueError("where must have a condition that is ndarray " | |||
"like") | |||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh? if you find yourself adding an if/then pretty much anywhere then you are doing it wrong. better to add the method to Sparse and call super; sometimes the super method may need a bit of refactor to make it more general.
@@ -1440,7 +1454,12 @@ def func(cond, values, other): | |||
if try_cast: | |||
result = self._try_cast_result(result) | |||
|
|||
return self.make_block(result) | |||
if isinstance(result, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this is just completely confusing to do and makes the code way more complex. find a better way
@@ -1713,6 +1733,7 @@ class FloatBlock(FloatOrComplexBlock): | |||
is_float = True | |||
_downcast_dtype = 'int64' | |||
|
|||
@classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is such a huge change, what is the purpose?
**kwargs) | ||
|
||
dtype = self.values.sp_values.dtype | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not against this theorectially, but the implementation is fragile here. there are functions for validation already in the sparse classes.
self._can_hold_na = False | ||
|
||
def _can_hold_element(self, element): | ||
""" require the same dtype as ourselves """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this is so complex and adds so much techincal debt.
@@ -2769,9 +2858,15 @@ def sparse_reindex(self, new_index): | |||
return self.make_block_same_class(values, sparse_index=new_index, | |||
placement=self.mgr_locs) | |||
|
|||
def _try_coerce_result(self, result): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is the right idea, though you shoul dgenerally have a function that intercepts a ndarray and creates a SparseArray; it should be called for most sparse methods.
@Licht-T I believe you have all of the xfail tests in. pls rebase and update. |
@jreback Yeah! Nothing stands in my way! |
BUG: Fix wrong SparseBlock initialization in quantile method BUG: Fix make_spase mask generation not to cast when dtype is object BUG: Add SparseArray.all method BUG: Add copy parameter to prevent reinterpret cast of sparse Revert and fix astype parameters BUG: Create SparseBlock.__init__ to set type information of SparseArray BUG: Override SparseBlock._can_hold_element Revert changes in Block.whare BUG: Override SparseBlock.make_block with fill_value argument BUG: Set fill_value and ndim parameter in make_block when generating SparseBlock from result BUG: Override SparseBlock._try_coerce_result to make result flatten and sparse BUG: Change form _can_hold_na to _can_hold_element for supporting non NA fill value BUG: Fix 1D check statement SparseDataFrame.where passes (1, n)-shape SparseBlock, but actual values is n-length SparseArray BUG: Adjust cond shape to SparseBlock SparseDataFrame.where passes (1, n)-shape SparseBlock and condition block to Block.where, but it compares n-length SparseArray held by the SparseBlock and (1, n)-shape condition block. BUG: Override SparseDataFrame.where method to set _default_fill_value
d0bf226
to
6b36d55
Compare
Now rebased. |
@Licht-T can you see if you can simplify this, there seem to be lots of if/then cases. I think you might be able to define / override some sparse methods in internals to avoid this. |
@Licht-T can you update? |
closing as stale, though we'd definitely take a fixed up / rebased version updated for comments |
closes AttributeError: 'IntBlock'/'FloatBlock'/etc. object has no attribute 'sp_index' #17198
tests added / passed
Passed the same tests which the current master branch pass.
passes
git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry