
[FEA] Add a .values property to convert to a GPU array #1824

Closed
mrocklin opened this issue May 22, 2019 · 13 comments · Fixed by #2373
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@mrocklin
Collaborator

What is the canonical way to convert a cudf dataframe into an array-like object?

There is

  • as_gpu_matrix, which seems to be the best choice today? This returns a Numba device array
  • to_gpu_matrix, which doesn't seem to do anything
  • as_matrix, which seems to return a Numpy array

First, these should maybe use the name array rather than matrix: NumPy matrix objects are a different and somewhat unpleasant thing. Also, note this:

In [19]: df.to_pandas().as_matrix()
/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/ipython:1: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  #!/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/python
Out[19]:
array([[1, 4],
       [2, 5],
       [3, 6]])

I most often see people use .values, a property which returns a homogeneously typed numpy array. I wonder if we might instead return a cupy array (or numba device array if that's preferred). If we choose something sufficiently numpy-like (like cupy) then things will just work with dask dataframe.

See also https://github.com/rapidsai/dask-cudf/issues/259
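
For concreteness, here is a minimal sketch of what such a property could do, assuming cupy.asarray can consume objects that expose __cuda_array_interface__ (true in recent CuPy versions) and reusing the existing to_gpu_array() method; series_values is just an illustrative name, not a proposed API:

import cupy as cp

import cudf


def series_values(s):
    # to_gpu_array() gives a Numba device array; cupy.asarray can wrap it
    # via __cuda_array_interface__ without copying back to the host.
    return cp.asarray(s.to_gpu_array())


s = cudf.Series([1, 2, 3])
arr = series_values(s)
print(type(arr))  # <class 'cupy.ndarray'>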

@mrocklin mrocklin added feature request New feature or request Needs Triage Need team to review and classify labels May 22, 2019
@beckernick
Member

beckernick commented May 22, 2019

This would be a valuable property, as it's fairly common to call this in pandas to grab the underlying numpy array.

One question quickly jumps out: what will happen with dataframes including non-numeric data? Should this fail? Should it only return the numeric values? Currently, as_gpu_matrix requires the data to be numeric (and of the same type), as it returns a Numba device ndarray. We can fairly easily solve the type-alignment issue (.values in pandas resolves this by upcasting the numeric types, as @mrocklin mentioned), but the other question is less clear.

@mrocklin
Collaborator Author

In pandas you would use the to_records method in that case. I believe that in pandas both .as_matrix() and .values are supposed to return a homogeneously typed array. If you give it heterogeneously typed data then you get back an object-dtype array.
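
For reference, this is the pandas behavior being described here (plain pandas, no cuDF involved):

import pandas as pd

# Mixed numeric dtypes upcast to a common numeric dtype...
df_num = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(df_num.values.dtype)  # float64

# ...while genuinely heterogeneous data falls back to object dtype.
df_mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df_mixed.values.dtype)  # object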

@beckernick
Member

Absolutely. I didn't phrase my question as clearly as I should have. We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays. My question is how to best handle that when thinking about something like .values.

@mrocklin
Collaborator Author

We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays.

Oh I see, yeah that's a much bigger problem it seems. I don't have any answers for you there.

My question is how to best handle that when thinking about something like .values.

Well, for the .values API in particular this concern doesn't come up. The contract is to promise a homogeneously typed array. We would probably upcast or raise an error?

@beckernick
Member

beckernick commented May 22, 2019

I would think raising an informative error if we cannot upcast (such as when some columns are non-numeric) is probably the right solution, as I think the .values contract may be a bit more specific: a homogeneously typed array that contains all of your data.
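
A minimal sketch of that "upcast or raise" rule, leaning on NumPy's dtype promotion; find_common_dtype and its error message are illustrative only, not cuDF API:

import numpy as np


def find_common_dtype(dtypes):
    # Refuse to build a homogeneous array when any column is non-numeric.
    if any(not np.issubdtype(dt, np.number) for dt in dtypes):
        raise TypeError(
            "Cannot build a homogeneously typed array from non-numeric "
            "columns; convert those columns or use to_pandas() instead."
        )
    # Otherwise let NumPy's promotion rules pick the common dtype.
    return np.result_type(*dtypes)


print(find_common_dtype([np.dtype("int64"), np.dtype("float32")]))  # float64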

@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 28, 2019
@kkraus14
Collaborator

kkraus14 commented May 28, 2019

@mrocklin @beckernick typically, calling .values against a pandas Series/DataFrame gives its underlying data by reference. How important would it be to return by reference versus by copy? And given that the user expects a dense buffer, what should we do with nulls?

@beckernick
Member

beckernick commented May 30, 2019

@thomcom, @kkraus14 that's a good question. Is it correct that we can return the underlying buffer by reference easily if there are no nulls, but we'd need to make a copy to return a buffer with nulls? (Please do correct me if I'm wrong on that one.)

For interoperability in the ecosystem, we likely don't want the following behavior to occur when someone wants to convert to CuPy (or anything else):

import cudf
import cupy
import numpy as np
from numba import cuda

s = cudf.Series([1.0, np.nan, 3, 4])
print(s)
0    1.0
1       
2    3.0
3    4.0
dtype: float64

print(cuda.as_cuda_array(s.data.mem).copy_to_host())
[1. 0. 3. 4.]

print(cupy.asarray(s.data.mem))
[1. 0. 3. 4.]

print(cudf.Series(s.data.mem))
0    1.0
1    0.0
2    3.0
3    4.0
dtype: float64

print(cudf.Series(s.to_gpu_array())) # the dense buffer without the null
0    1.0
1    3.0
2    4.0
dtype: float64

This could lead to a lot of silently incorrect, unexpected downstream behavior for users. I think users expect to pass around data with nulls (or NaNs) without those values getting left out or filled in as 0.

print(cudf.Series(cuda.as_cuda_array(cuda.to_device([1.2, np.nan, 3]))))
0    1.2
1       
2    3.0
dtype: float64

If copying can fulfill that contract, I think it's a good choice. I think allowing people to leverage and interoperate with the growing ecosystem is worth a copy in the short term.
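
A small sketch of the strict behavior argued for here, assuming the Series exposes null_count and to_gpu_array(); the helper name to_cupy_strict is made up for illustration:

import cupy as cp


def to_cupy_strict(s):
    # Refuse to silently drop nulls or fill them with zeros.
    if s.null_count > 0:
        raise ValueError(
            "Series contains nulls; call fillna() or dropna() before "
            "converting to a dense CuPy array."
        )
    # With no nulls, the dense buffer and the logical values agree;
    # a copy here is acceptable to keep the contract simple.
    return cp.asarray(s.to_gpu_array())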

@beckernick
Member

@jakirkham as well for visibility

@kkraus14
Collaborator

Reopening this since we currently return a numpy array as opposed to a GPU array. Will collect further thoughts on this in a bit.

@mrocklin
Collaborator Author

To give a first test for .values, I think that the following would be helpful:

import numpy
import pytest

import cudf


def test_cupy_values():
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3])

    assert isinstance(s.values, cupy.ndarray)

    numpy.testing.assert_array_equal(
        s.values.get(),
        s.to_pandas().values
    )

If one wanted to extend this, we might try the following (both are sketched below):

  • parametrizing around dtype with @pytest.mark.parametrize("dtype", [float, int, "float32"])
  • checking that calling .values on a column with missing values, strings, or categoricals raises an informative TypeError or NotImplementedError
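
A sketch of both extensions, assuming .values behaves as proposed above and that unsupported dtypes raise rather than convert silently; the test names and the exact exception types are guesses, not committed behavior:

import numpy
import pytest

import cudf


@pytest.mark.parametrize("dtype", [float, int, "float32"])
def test_cupy_values_dtypes(dtype):
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3], dtype=dtype)

    assert isinstance(s.values, cupy.ndarray)
    numpy.testing.assert_array_equal(s.values.get(), s.to_pandas().values)


def test_values_unsupported_dtype_raises():
    pytest.importorskip("cupy")
    s = cudf.Series(["a", "b", "c"])

    # Strings cannot be represented in a CuPy array, so this should raise
    # an informative error instead of converting silently.
    with pytest.raises((TypeError, NotImplementedError)):
        s.values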

@jakirkham
Member

Just noticed this after someone pointed it out to me: the pandas docs for .values say the following.

Warning We recommend using DataFrame.to_numpy() instead.

ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values
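
For reference, the two are interchangeable for a plain numeric frame, but to_numpy() is the entry point the pandas docs now recommend:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Same array either way; .to_numpy() is simply the documented spelling.
print(df.to_numpy())
print(df.values)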

@brandon-b-miller
Contributor

Closed by #2655.

@jakirkham
Member

Thanks Brandon! 😄
