
Should SparseArray.astype be dense or Sparse #23125

Closed
TomAugspurger opened this issue Oct 13, 2018 · 16 comments
Labels
API Design · Blocker (Blocking issue or pull request for an upcoming release) · Dtype Conversions (Unexpected or buggy dtype conversions) · Sparse (Sparse Data Type)
Milestone: 0.24.0

Comments

@TomAugspurger
Contributor

Right now SparseArray.astype(numpy_dtype) is sparse:

In [6]: a = pd.SparseArray([0, 1, 0, 1])

In [7]: a.astype(np.dtype('float'))
Out[7]:
[0, 1.0, 0, 1.0]
Fill: 0
IntIndex
Indices: array([1, 3], dtype=int32)

This is potentially confusing. I did it to match the behavior of SparseSeries, but we may not want that.

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 13, 2018
@TomAugspurger TomAugspurger added the Dtype Conversions, API Design, Sparse, and Blocker labels Oct 13, 2018
@TomAugspurger
Contributor Author

This also leads to confusing behavior like

In [10]: a.astype(bool)
Out[10]:
[0, True, 0, True]
Fill: 0
IntIndex
Indices: array([1, 3], dtype=int32)

since .astype(bool) is interpreted as .astype(SparseDtype(bool, fill_value=0))
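To make that interpretation concrete, here's a minimal sketch of the wrapping using the SparseDtype constructor (note pd.SparseArray is spelled pd.arrays.SparseArray in later pandas versions):

```python
import numpy as np
import pandas as pd

# What `.astype(bool)` is interpreted as: the plain dtype is wrapped in
# a SparseDtype that keeps the existing fill_value (0 here) unchanged,
# rather than casting the fill_value to the new dtype.
dtype = pd.SparseDtype(bool, fill_value=0)
print(dtype.subtype)     # bool
print(dtype.fill_value)  # 0
```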

@jorisvandenbossche
Member

I think this is a regression compared to the released version. Compare the same example on the released version:

In [18]: a = pd.SparseArray([0, 1, 0, 1])

In [20]: a.astype(bool)
Out[20]: 
[False, True, False, True]
Fill: False
IntIndex
Indices: array([1, 3], dtype=int32)

When astype-ing with a numpy dtype, I think we should simply "astype" the scalar fill_value as well.
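Roughly like this (a sketch; the helper name is hypothetical, not pandas API):

```python
import numpy as np
import pandas as pd

# Proposed rule: when astype-ing to a plain numpy dtype, cast the
# fill_value through the same dtype instead of keeping the old scalar.
def cast_sparse_dtype(sparse_dtype, new_subtype):
    new_subtype = np.dtype(new_subtype)
    # Cast the scalar fill_value alongside the values.
    new_fill = new_subtype.type(sparse_dtype.fill_value)
    return pd.SparseDtype(new_subtype, fill_value=new_fill)

old = pd.SparseDtype(np.int64, fill_value=0)
new = cast_sparse_dtype(old, bool)
print(new.fill_value)  # False, not 0
```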

@TomAugspurger
Contributor Author

TomAugspurger commented Oct 23, 2018

So we have a few choices

  1. Change SparseArray.astype to be dense for non-SparseDtype dtype
  2. Keep SparseArray.astype always sparse, but change Series[sparse].astype to be dense for non-SparseDtype dtypes.

Option 2 can be done and is backwards compatible. It means a special case in Block.astype, and means Series[sparse].astype isn't equivalent to SparseArray.astype.

Option 1 is backwards incompatible; I think it would be hard to do with a deprecation cycle.

@jorisvandenbossche
Member

Isn't there also option 3: keep both SparseArray.astype and Series[sparse].astype always sparse?

@TomAugspurger
Contributor Author

TomAugspurger commented Oct 23, 2018

Yes, I was assuming we didn't want Series[sparse].astype(numpy_type) to be sparse though. Do we?

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 23, 2018

That's in any case what it is doing now (and in 0.23), so not doing that would also be a backwards incompatible change.

On the one hand it is convenient to keep it sparse if you quickly want to change the sparse subtype (without needing to spell out a SparseDtype), but on the other hand it might be strange that the result of Series.astype(dtype).dtype != dtype.

@jorisvandenbossche
Member

But didn't we discuss that on the SparseArray PR? I seem to remember that you and @kernc argued for keeping it sparse?

@TomAugspurger
Contributor Author

Hmm, I don't recall right now.

Not having Series.astype(dtype).dtype equal to dtype is causing a few problems in our internal routines. I suspect we'll need to add more small changes like 428f230 to special-case sparse, which would be unfortunate.

@jorisvandenbossche
Member

The topic is touched here: #22325 (comment)

@jorisvandenbossche
Member

Not having Series.astype(dtype).dtype equal to dtype is causing a few problems in our internal routines.

How do you mean?

Because that is what is already happening on master:

In [29]: a = pd.SparseArray([0, 1, 0, 1])

In [30]: pd.Series(a).astype(np.float64)
Out[30]: 
0    0.0
1    1.0
2    0.0
3    1.0
dtype: Sparse[float64, 0]

@TomAugspurger
Contributor Author

In the case of 428f230, it's a method that previously converted to an ndarray, which we want to avoid. Perhaps we've already caught most of those and it won't be a big deal to support...

@TomAugspurger
Contributor Author

Do people (@kernc) have thoughts on whether Series[sparse].astype(numpy_type) should be sparse or not? Right now, I'm leaning towards keeping it sparse (the behavior of master and 0.23).

Closing this in a few days if there aren't any further comments.

@kernc
Contributor

kernc commented Nov 8, 2018

I'd foremost prefer to avoid shaping a decision that may in the future turn out to be a bad idea. 😄 As a user, I've only ever used astype() to convert values to str to use the Series.str accessor, to convert to int to use the values as indexes (since numpy now warns/raises when floats are used as indexes), or to convert floats to shorter floats (float16) for performance.

I understand series sparsity to be orthogonal to its underlying dtype. The dtype is the type of the values; sparsity is the way they are laid out in memory. Sparsity can be toggled back and forth losslessly, while dtype conversions cannot always be reversed.

Admittedly, I'm no longer as interested in the topic and not regularly working with sparse data, but as a potential user, I'd be annoyed if Series(scipy_coo_matrix).astype(int) blew my RAM and I instead had to use something as verbose as .astype('Sparse[int, 0]'). I'd prefer instead .astype(int, fill_value=0) (docs say kwargs are passed to some constructor?) and .astype(int).to_dense() to densify.

But I agree with @jorisvandenbossche's comment that fill_value should be adapted to be compatible with set dtype.

Perhaps get input from some other sparse users.
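As an aside, the explicit (if verbose) spelling available today can at least avoid the string form by building the SparseDtype directly. A sketch, using the pd.arrays.SparseArray spelling:

```python
import numpy as np
import pandas as pd

# Explicitly request a sparse result, spelling out subtype and
# fill_value, instead of passing the 'Sparse[int, 0]' string.
s = pd.Series(pd.arrays.SparseArray([0, 1, 0, 1]))
out = s.astype(pd.SparseDtype(np.int64, fill_value=0))
print(out.dtype)  # Sparse[int64, 0]
```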

@kernc
Contributor

kernc commented Nov 8, 2018

A side question, certainly a little late at that, what blows up if SparseDtype were a subclass of np.dtype (and np.dtype(float) == SparseDtype(float, fill_value=any) (as long as fill_value type is forced compatible, i.e. a float))?

@TomAugspurger
Contributor Author

Great, I think we're in agreement then. Series[sparse].astype(dtype) will always be sparse.

But I agree with @jorisvandenbossche's comment that fill_value should be adapted to be compatible with set dtype.

I have a WIP branch that I hope to push tomorrow doing this.

what blows up if SparseDtype were a subclass of np.dtype (and np.dtype(float) == SparseDtype(float, fill_value=any) (as long as fill_value type is forced compatible, i.e. a float))?

You can't subclass np.dtype, at least not from Python, and NumPy's dtype equality isn't the friendliest (IIRC, it tries to convert the other operand to a NumPy dtype, but raises a TypeError if that fails, instead of returning NotImplemented).
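For illustration, a small check of the equality behavior (a sketch; the exact behavior has varied across NumPy versions, with recent releases returning False rather than raising):

```python
import numpy as np

# Comparing a dtype against something that cannot be converted to a
# dtype: recent NumPy returns False instead of raising TypeError.
print(np.dtype('float64') == 'not-a-dtype')
print(np.dtype('float64') == np.dtype(float))
```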

@jorisvandenbossche
Member

Long term it might be nice to be able to make a distinction between astype('int') and astype(np.int64): the first could be interpreted by pandas and keep the sparsity, while the second could ensure the output actually has dtype np.int64.
But that might also be confusing...
