ENH: 2D support for MaskedArray #38992
Conversation
looks fine to me.
pandas/core/arrays/masked.py
Outdated
@@ -80,6 +80,8 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):

     # The value used to fill '_data' to avoid upcasting
     _internal_fill_value: Scalar
     _data: np.ndarray
     _mask: np.ndarray[Any, bool]
np.ndarray in numpy 1.20 is not generic, so although mypy is happy with type parameters on Any, this will raise errors when we transition to numpy types.
Removed the [Any, bool] part of this. Is there an approximate calendar for the transition to numpy types?
I think these are in 1.20 (release is blocked on an arrow update atm)
is there an approximate calendar for the transition to numpy types?
I updated to 1.20.0rc2 locally and there were no changes to the mypy errors. (I've not yet checked the commit history to see if there were any changes.)
Once 1.20 is released and we pick it up on CI, we will see the errors in #36092. (I'll merge master and make a start on updating now.)
I assume that we will pin numpy in CI while we discuss how to sort out the errors (once we know what the status is with the released numpy).
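For context, a sketch of what the annotations might look like after the transition, assuming a numpy version that ships numpy.typing.NDArray (added after 1.20); this is my illustration, not code from the PR:

```python
import numpy as np
import numpy.typing as npt

# Hypothetical annotations for the attributes discussed above, using
# numpy.typing.NDArray instead of subscripting np.ndarray directly.
_data: npt.NDArray[np.int64]
_mask: npt.NDArray[np.bool_]

def invert_mask(mask: npt.NDArray[np.bool_]) -> npt.NDArray[np.bool_]:
    # Runtime behavior is unchanged; the annotation only helps mypy.
    return ~mask

print(invert_mask(np.array([True, False, True])))  # [False  True False]
```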
Such a significant architectural change shouldn't be merged without prior discussion.
Sure, what are your concerns?
The ExtensionArrays have been 1D from the start (for around 3 years now), and the same goes for MaskedArray specifically. So if Brock proposes to change that, then I think it is up to the proposer to come up with arguments for changing this. There has been discussion before about 1D vs 2D extension arrays, for sure (although I don't think any of that discussion resulted in a clear decision to merge a PR like this). But specifically for MaskedArray, we didn't have any discussion about this, AFAIK. I think that at least requires some discussion about whether we want this or not?

The masked arrays have been explicitly designed to be 1D, of course also because currently ExtensionArrays in general are 1D, but in addition there are several ideas for future improvements (#30435): exploring a bitmask instead of a boolean mask (#31293), optionally using arrow under the hood like we are doing for the string array, more efficient zero-copy conversion with arrow, nested data types (e.g. #35176). Those are all not impossible with 2D arrays, but IMO will be much easier with 1D arrays, and that thus requires some consideration.

It's also not fully clear to me where this PR is going. At the moment it is adding quite some complexity for something that is not yet used. What's the plan for how to actually use it in pandas? Do we want to give the masked arrays 2D capabilities (so this ability can be used for certain operations), but keep storing them as 1D in DataFrames? Or do we want to change the ExtensionBlock to a consolidated 2D Block? But only for masked arrays, or for all ExtensionArrays? What about arrays that cannot easily be 2D (e.g. nested arrays)? What's the idea for externally defined ExtensionArrays? ...
The reason to do this is roughly the same reason why we're moving forward with ArrayManager: so that we can see if actually using this is something we want to do longer-term.
gentle ping, plenty more tests where these came from
OK, I am +1 on merging this. I agree with @jbrockmendel's reasoning here. We don't really know where we are ultimately going to go, e.g. ArrayManager or a simplified BlockManager. We need more support & performance testing to see. Sure, I'd like to see a unified approach, but we have advocates for both and would rather not inhibit experimentation.
For the simplified non-consolidating BlockManager, I started with a description of arguments for it, we had an extensive discussion about it, with several people expressing their interest for it, and with the main question mark being performance. At which point we need a proof of concept to test things. As far as I know, we have had no such discussion about 2D masked arrays.
We've had the same discussion about 2D EAs repeatedly.
@jorisvandenbossche do you have actual concrete objections to merging this? We are allowing ArrayManager on an experimental basis, I don't see how this is any different.
I think my longer comment above (#38992 (comment)) already includes some concrete concerns. Reformulating them:
(having a call about this might help resolve some of those discussion points?)
I'm tired of repeating myself. At the sprint in 2019 we (including Wes) agreed to move forward with 2D EA support for experimentation. Since then the only thing we've learned is that you are more willing to repeat the same arguments over and over again than I am, and everyone else makes the entirely reasonable decision to tune it out.
I would dearly like to see that, which I see as part and parcel of the fix-many-xfails effort that I brought up on last week's call. But I don't see any effort towards making them happen, or any reason why they are mutually exclusive with this experimentation.
gentle ping; this would simplify a corner case in #33036
OK, tests are failing. I suppose it makes sense to support both 1D and 2D kernels on things. It does lead to some code duplication, but performance can be great if we don't need to operate column-by-column all the time. However, the codebase is mostly 1D currently, with some efforts to add 2D kernels. Does this concur with your thinking?
Fixed
Was this part of the comment supposed to go in #42841? Will answer there.
@jbrockmendel needs rebase
@@ -115,6 +117,9 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):

     # The value used to fill '_data' to avoid upcasting
     _internal_fill_value: Scalar
     _data: np.ndarray
add a comment about these
lgtm. great to see how to proceed.
        skipna: bool = True,
        axis: Optional[int] = None,
    ):
        return _minmax(np.max, values=values, mask=mask, skipna=skipna, axis=axis)
are there docstrings here? if so can you update (can be a followup as well)
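As a rough sketch of what an axis-aware masked reduction like the _minmax helper above does, here is a hypothetical stand-in (masked_max is my own illustration, not the pandas implementation): masked slots are replaced with the smallest representable value so they can never win the max, and the axis keyword makes the same kernel work for both 1D and 2D inputs.

```python
import numpy as np

def masked_max(values, mask, *, skipna=True, axis=None):
    # Hypothetical sketch: max over unmasked entries, 1D or 2D via `axis`.
    if not skipna and mask.any():
        raise ValueError("NA present with skipna=False")
    # Masked slots get the smallest representable value so they never
    # win the comparison.
    filled = np.where(mask, np.iinfo(values.dtype).min, values)
    return filled.max(axis=axis)

vals = np.array([[1, 9], [5, 2]], dtype=np.int64)
msk = np.array([[False, True], [False, False]])
print(masked_max(vals, msk, axis=0))  # column-wise: [5 2]
print(masked_max(vals, msk, axis=1))  # row-wise: [1 5]
```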
@@ -194,9 +199,23 @@ def test_reductions_2d_axis0(self, data, method, request):
     if method in ["sum", "prod"] and data.dtype.kind in ["i", "u"]:
         # FIXME: kludge
         if data.dtype.kind == "i":
             dtype = pd.Int64Dtype()
             if is_platform_windows() or not IS64:
oh yeah followup with this to make nicer
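For context on the platform check above (my addition, not part of the PR): the result dtype of an integer sum follows numpy's default integer, which is 32-bit on Windows and on 32-bit builds, hence the platform-dependent expected dtype in the test.

```python
import numpy as np

# np.int_ is the platform default integer: 32-bit on Windows / 32-bit
# builds, 64-bit on most Linux/macOS builds.
print(np.dtype(np.int_).itemsize)

# Summing a small integer dtype upcasts to the default integer, which
# is why the expected dtype in the test above is platform-dependent.
res = np.array([1, 2, 3], dtype=np.int8).sum()
print(res, res.dtype)
```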
certainly fine for testing things out. thanks @jbrockmendel a couple of followups
            continue
        fill_count += 1
        values[j, i] = val
        mask[j, i] = False
mask should be previous mask... see pad_inplace #39953
import numpy as np
import pandas as pd
dtype = pd.Int64Dtype()
data_missing = pd.array([pd.NA, 1], dtype=dtype)
arr = data_missing.repeat(4).reshape(4, 2)
result = arr.fillna(method="pad")
print(result)
expected = data_missing.fillna(method="pad").repeat(4).reshape(4, 2)
print(expected)
result:
<IntegerArray>
[
[<NA>, <NA>],
[1, 1],
[1, 1],
[1, 1]
]
Shape: (4, 2), dtype: Int64

expected:
<IntegerArray>
[
[<NA>, <NA>],
[<NA>, <NA>],
[1, 1],
[1, 1]
]
Shape: (4, 2), dtype: Int64
so fixing this on my numba branch results in...
@numba.njit
def _pad_2d_inplace(values, mask, limit=None):
if values.shape[1]:
K, N = values.shape
if limit is None:
for j in range(K):
val, prev_mask = values[j, 0], mask[j, 0]
for i in range(N):
if mask[j, i]:
values[j, i], mask[j, i] = val, prev_mask
else:
val, prev_mask = values[j, i], mask[j, i]
else:
for j in range(K):
fill_count = 0
val, prev_mask = values[j, 0], mask[j, 0]
for i in range(N):
if mask[j, i]:
if fill_count >= limit:
continue
fill_count += 1
values[j, i], mask[j, i] = val, prev_mask
else:
fill_count = 0
val, prev_mask = values[j, i], mask[j, i]
I have some duplication here but a perf improvement for the common case of no limit; the duplication can probably be mitigated by reshaping a 1D array and removing the 1D version, pad_inplace.
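A plain-Python paraphrase of the fill logic above (my own sketch, not the numba or cython version) shows why carrying prev_mask matters: leading NAs have prev_mask=True, so filled cells copied from them stay masked, which is exactly what the buggy version got wrong.

```python
import numpy as np

def pad_2d_inplace(values, mask, limit=None):
    # Sketch of row-wise forward fill that propagates the previous
    # mask, mirroring the _pad_2d_inplace logic discussed above.
    K, N = values.shape
    lim = N if limit is None else limit
    for j in range(K):
        fill_count = 0
        val, prev_mask = values[j, 0], mask[j, 0]
        for i in range(1, N):
            if mask[j, i]:
                if fill_count >= lim:
                    continue
                fill_count += 1
                # Filled cells inherit the mask of the value they copy,
                # so fills from a leading NA remain masked.
                values[j, i], mask[j, i] = val, prev_mask
            else:
                fill_count = 0
                val, prev_mask = values[j, i], mask[j, i]

vals = np.array([[0, 1, 0, 0]], dtype=np.int64)
msk = np.array([[True, False, True, True]])
pad_2d_inplace(vals, msk)
print(vals)  # [[0 1 1 1]]
print(msk)   # [[ True False False False]] -- leading NA stays masked
```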
I might look into a variant using 2 loops: the first to find the first non-missing value, and the second to fill without tracking the previous mask; then we could just do mask[j, i] = False.
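That two-loop variant could look something like this (a hypothetical sketch with names of my own choosing): the first loop skips past leading NAs, after which every fill copies an unmasked value, so the mask can simply be cleared.

```python
import numpy as np

def pad_row_two_pass(values, mask, j):
    # Sketch of the two-loop idea for one row j (no limit handling).
    N = values.shape[1]
    # Pass 1: find the first unmasked position; leading NAs stay masked.
    start = 0
    while start < N and mask[j, start]:
        start += 1
    if start == N:
        return  # entire row is masked, nothing to fill
    # Pass 2: forward fill; every fill source is unmasked, so filled
    # cells can simply be unmasked.
    val = values[j, start]
    for i in range(start + 1, N):
        if mask[j, i]:
            values[j, i] = val
            mask[j, i] = False
        else:
            val = values[j, i]

vals = np.array([[0, 1, 0, 0]], dtype=np.int64)
msk = np.array([[True, False, True, True]])
pad_row_two_pass(vals, msk, 0)
print(vals)  # [[0 1 1 1]]
print(msk)   # [[ True False False False]]
```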
One more thing while in the neighborhood: I think "for i in range(N)" should probably be "for i in range(1, N)"?
@@ -656,10 +656,11 @@ def pad_2d_inplace(numeric_object_t[:, :] values, const uint8_t[:, :] mask, limi
     val = values[j, 0]
     for i in range(N):
         if mask[j, i]:
-            if fill_count >= lim:
+            if fill_count >= lim or i == 0:
see #38992 (comment)
@simonjayhawkins addressing your post-merge comments has been on my todo list for a while, looks likely to fall off. Do they merit their own Issue?
This doesn't in any way use the 2D support, but opens up the option of incrementally fleshing out the tests.