
Should SparseArray.astype be dense or Sparse #23125

Closed
TomAugspurger opened this issue Oct 13, 2018 · 16 comments
Labels
API Design · Blocker (Blocking issue or pull request for an upcoming release) · Dtype Conversions (Unexpected or buggy dtype conversions) · Sparse (Sparse Data Type)
Milestone: 0.24.0

Comments

@TomAugspurger
Contributor

Right now SparseArray.astype(numpy_dtype) is sparse:

In [6]: a = pd.SparseArray([0, 1, 0, 1])

In [7]: a.astype(np.dtype('float'))
Out[7]:
[0, 1.0, 0, 1.0]
Fill: 0
IntIndex
Indices: array([1, 3], dtype=int32)

This is potentially confusing. I did it to match the behavior of SparseSeries, but we may not want that.

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 13, 2018
@TomAugspurger TomAugspurger added the Dtype Conversions, API Design, Sparse, and Blocker labels Oct 13, 2018
@TomAugspurger
Contributor Author

This also leads to confusing behavior like

In [10]: a.astype(bool)
Out[10]:
[0, True, 0, True]
Fill: 0
IntIndex
Indices: array([1, 3], dtype=int32)

since .astype(bool) is interpreted as .astype(SparseDtype(bool, fill_value=0))
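To make that interpretation concrete, here's a minimal sketch of the wrapping using the SparseDtype constructor (note pd.SparseArray is spelled pd.arrays.SparseArray in later pandas versions):

```python
import numpy as np
import pandas as pd

# What `.astype(bool)` is interpreted as: the plain dtype is wrapped in
# a SparseDtype that keeps the existing fill_value (0 here) unchanged,
# rather than casting the fill_value to the new dtype.
dtype = pd.SparseDtype(bool, fill_value=0)
print(dtype.subtype)     # bool
print(dtype.fill_value)  # 0
```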

@jorisvandenbossche
Member

I think this is a regression compared to the released version. Compare the same example on the released version:

In [18]: a = pd.SparseArray([0, 1, 0, 1])

In [20]: a.astype(bool)
Out[20]: 
[False, True, False, True]
Fill: False
IntIndex
Indices: array([1, 3], dtype=int32)

When astype-ing with a numpy dtype, I think we should simply "astype" the scalar fill_value as well.
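Roughly like this (a sketch; the helper name is hypothetical, not pandas API):

```python
import numpy as np
import pandas as pd

# Proposed rule: when astype-ing to a plain numpy dtype, cast the
# fill_value through the same dtype instead of keeping the old scalar.
def cast_sparse_dtype(sparse_dtype, new_subtype):
    new_subtype = np.dtype(new_subtype)
    # Cast the scalar fill_value alongside the values.
    new_fill = new_subtype.type(sparse_dtype.fill_value)
    return pd.SparseDtype(new_subtype, fill_value=new_fill)

old = pd.SparseDtype(np.int64, fill_value=0)
new = cast_sparse_dtype(old, bool)
print(new.fill_value)  # False, not 0
```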

@TomAugspurger
Contributor Author

TomAugspurger commented Oct 23, 2018

So we have a few choices

  1. Change SparseArray.astype to be dense for non-SparseDtype dtype
  2. Keep SparseArray.astype always sparse, but change Series[sparse].astype to be dense for non-SparseDtype dtypes.

Option 2 can be done and is backwards compatible. It means a special case in Block.astype, and means Series[sparse].astype isn't equivalent to SparseArray.astype.

Option 1 is backwards incompatible; I think it would be hard to do with a deprecation cycle.

@jorisvandenbossche
Member

Isn't there also option 3: keep both SparseArray.astype and Series[sparse].astype always sparse?

@TomAugspurger
Contributor Author

TomAugspurger commented Oct 23, 2018

Yes, I was assuming we didn't want Series[sparse].astype(numpy_type) to be sparse though. Do we?

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 23, 2018

That's in any case what it is doing now (and in 0.23), so not doing that would also be a backwards incompatible change.

On the one hand it is convenient to keep it sparse if you quickly want to change the sparse subtype (without needing to spell out a SparseDtype), but on the other hand it might be strange that the result of Series.astype(dtype).dtype != dtype.

@jorisvandenbossche
Member

But didn't we discuss that on the SparseArray PR? I seem to remember that you and @kernc argued for keeping it sparse?

@TomAugspurger
Contributor Author

Hmm, I don't recall right now.

Not having Series.astype(dtype).dtype equal to dtype is causing a few problems in our internal routines. I suspect we'll need to add more small changes like 428f230 to special-case sparse, which would be unfortunate.

@jorisvandenbossche
Member

The topic is touched here: #22325 (comment)

@jorisvandenbossche
Member

Not having Series.astype(dtype).dtype equal to dtype is causing a few problems in our internal routines.

How do you mean?

Because that is what is already happening on master:

In [29]: a = pd.SparseArray([0, 1, 0, 1])

In [30]: pd.Series(a).astype(np.float64)
Out[30]: 
0    0.0
1    1.0
2    0.0
3    1.0
dtype: Sparse[float64, 0]

@TomAugspurger
Contributor Author

In the case of 428f230, it's a method that previously converted to an ndarray, which we want to avoid. Perhaps we've already caught most of those and it won't be a big deal to support...

@TomAugspurger
Contributor Author

Do people (@kernc) have thoughts on whether Series[sparse].astype(numpy_type) should be sparse or not? Right now, I'm leaning towards keeping it sparse (the behavior of master and 0.23).

Closing this in a few days if there aren't any further comments.

@kernc
Contributor

kernc commented Nov 8, 2018

I'd foremost prefer to avoid shaping a decision that may in the future turn out to be a bad idea. 😄 As a user, I've only ever used astype() to convert values to str to use the Series.str accessor, to convert to int to use the values as indexes (since numpy now warns/raises when floats are used as indexes), or to convert floats to shorter floats (float16) for performance.

I understand series sparsity to be orthogonal to its underlying dtype. The dtype is the type of the values; sparsity is the way they are laid out in memory. Sparsity can be toggled back and forth losslessly, while dtype conversions cannot always be reversed.

Admittedly, I'm no longer as interested in the topic and not regularly working with sparse data, but as a potential user, I'd be annoyed if Series(scipy_coo_matrix).astype(int) blew my RAM and I instead had to use something as verbose as .astype('Sparse[int, 0]'). I'd prefer instead .astype(int, fill_value=0) (docs say kwargs are passed to some constructor?) and .astype(int).to_dense() to densify.

But I agree with @jorisvandenbossche's comment that fill_value should be adapted to be compatible with set dtype.

Perhaps get input from some other sparse users.
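As an aside, the explicit (if verbose) spelling available today can at least avoid the string form by building the SparseDtype directly. A sketch, using the pd.arrays.SparseArray spelling:

```python
import numpy as np
import pandas as pd

# Explicitly request a sparse result, spelling out subtype and
# fill_value, instead of passing the 'Sparse[int, 0]' string.
s = pd.Series(pd.arrays.SparseArray([0, 1, 0, 1]))
out = s.astype(pd.SparseDtype(np.int64, fill_value=0))
print(out.dtype)  # Sparse[int64, 0]
```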

@kernc
Contributor

kernc commented Nov 8, 2018

A side question, certainly a little late at that, what blows up if SparseDtype were a subclass of np.dtype (and np.dtype(float) == SparseDtype(float, fill_value=any) (as long as fill_value type is forced compatible, i.e. a float))?

@TomAugspurger
Contributor Author

Great, I think we're in agreement then. Series[sparse].astype(dtype) will always be sparse.

But I agree with @jorisvandenbossche's comment that fill_value should be adapted to be compatible with set dtype.

I have a WIP branch that I hope to push tomorrow doing this.

what blows up if SparseDtype were a subclass of np.dtype (and np.dtype(float) == SparseDtype(float, fill_value=any) (as long as fill_value type is forced compatible, i.e. a float))?

You can't subclass np.dtype, at least not from Python, and NumPy's dtype equality isn't the friendliest (IIRC, it tries to convert the other operand to a NumPy dtype, but raises a TypeError if that fails, instead of returning NotImplemented).
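For illustration, a small check of the equality behavior (a sketch; the exact behavior has varied across NumPy versions, with recent releases returning False rather than raising):

```python
import numpy as np

# Comparing a dtype against something that cannot be converted to a
# dtype: recent NumPy returns False instead of raising TypeError.
print(np.dtype('float64') == 'not-a-dtype')
print(np.dtype('float64') == np.dtype(float))
```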

@jorisvandenbossche
Member

Long term it might be nice to be able to make a distinction between astype('int') and astype(np.int64): the first could be interpreted by pandas and keep the sparsity, while the second could ensure the output actually has dtype np.int64.
But that might also be confusing...
