
[FEA] Add a .values property to convert to a GPU array #1824

Closed
mrocklin opened this issue May 22, 2019 · 13 comments · Fixed by #2373
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@mrocklin
Collaborator

What is the canonical way to convert a cudf dataframe into an array-like object?

There is

  • as_gpu_matrix, which seems to be the best choice today? This returns a Numba device array
  • to_gpu_matrix, which doesn't seem to do anything
  • as_matrix, which seems to return a Numpy array

First, these should maybe use the name array rather than matrix: NumPy matrix objects are a different and somewhat unpleasant thing. Also, note this:

In [19]: df.to_pandas().as_matrix()
/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/ipython:1: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  #!/home/nfs/mrocklin/miniconda/envs/cudf-nightly/bin/python
Out[19]:
array([[1, 4],
       [2, 5],
       [3, 6]])

I most often see people use .values, a property which returns a homogeneously typed numpy array. I wonder if we might instead return a cupy array (or numba device array if that's preferred). If we choose something sufficiently numpy-like (like cupy) then things will just work with dask dataframe.

See also https://github.com/rapidsai/dask-cudf/issues/259
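
For concreteness, here is a minimal sketch of what such a property could do, assuming cupy.asarray can consume objects that expose __cuda_array_interface__ (true in recent CuPy versions) and reusing the existing to_gpu_array() method; series_values is just an illustrative name, not a proposed API:

import cupy as cp

import cudf


def series_values(s):
    # to_gpu_array() gives a Numba device array; cupy.asarray can wrap it
    # via __cuda_array_interface__ without copying back to the host.
    return cp.asarray(s.to_gpu_array())


s = cudf.Series([1, 2, 3])
arr = series_values(s)
print(type(arr))  # <class 'cupy.ndarray'>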

@mrocklin mrocklin added feature request New feature or request Needs Triage Need team to review and classify labels May 22, 2019
@beckernick
Member

beckernick commented May 22, 2019

This would be a valuable property, as it's fairly common to call this in pandas to grab the underlying numpy array.

One question quickly jumps out: what will happen with dataframes including non-numeric data? Should this fail? Should it only return the numeric values? Currently, as_gpu_matrix requires the data to be numeric (and of the same type), as it returns a Numba device ndarray. We can fairly easily solve the type-alignment issue (.values in pandas resolves this by upcasting the numeric types, as @mrocklin mentioned), but the other question is less clear.

@mrocklin
Collaborator Author

In pandas you would use the to_records method in that case. I believe that in pandas both .as_matrix() and .values are supposed to return a homogeneously typed array. If you give it heterogeneously typed data then you get back an object-dtype array.
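
For reference, this is the pandas behavior being described here (plain pandas, no cuDF involved):

import pandas as pd

# Mixed numeric dtypes upcast to a common numeric dtype...
df_num = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
print(df_num.values.dtype)  # float64

# ...while genuinely heterogeneous data falls back to object dtype.
df_mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df_mixed.values.dtype)  # object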

@beckernick
Member

Absolutely. I didn't phrase my question as clearly as I should have. We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays. My question is how to best handle that when thinking about something like .values.

@mrocklin
Collaborator Author

We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays.

Oh I see, yeah that's a much bigger problem it seems. I don't have any answers for you there.

My question is how to best handle that when thinking about something like .values.

Well, for the .values API in particular this concern doesn't come up. The contract is to promise a homogeneously typed array. We would probably upcast or raise an error?

@beckernick
Member

beckernick commented May 22, 2019

I would think raising an informative error if we cannot upcast (such as when some columns are non-numeric) is probably the right solution, as I think the .values contract may be a bit more specific: a homogeneously typed array that contains all of your data.
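
A minimal sketch of that "upcast or raise" rule, leaning on NumPy's dtype promotion; find_common_dtype and its error message are illustrative only, not cuDF API:

import numpy as np


def find_common_dtype(dtypes):
    # Refuse to build a homogeneous array when any column is non-numeric.
    if any(not np.issubdtype(dt, np.number) for dt in dtypes):
        raise TypeError(
            "Cannot build a homogeneously typed array from non-numeric "
            "columns; convert those columns or use to_pandas() instead."
        )
    # Otherwise let NumPy's promotion rules pick the common dtype.
    return np.result_type(*dtypes)


print(find_common_dtype([np.dtype("int64"), np.dtype("float32")]))  # float64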

@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 28, 2019
@kkraus14
Collaborator

kkraus14 commented May 28, 2019

@mrocklin @beckernick typically, calling .values against a pandas Series/DataFrame gives its underlying data by reference. How important would it be to return by reference versus by copy? And given that the user expects a dense buffer, what should we do with nulls?

@beckernick
Member

beckernick commented May 30, 2019

@thomcom, @kkraus14 that's a good question. Is it correct that we can return the underlying buffer by reference easily if there are no nulls, but we'd need to make a copy to return a buffer with nulls? (Please do correct me if I'm wrong on that one.)

For interoperability in the ecosystem, we likely don't want the following behavior to occur when someone wants to convert to CuPy (or anything else):

import cudf
import cupy
import numpy as np
from numba import cuda

s = cudf.Series([1.0, np.nan, 3, 4])
print(s)
0    1.0
1       
2    3.0
3    4.0
dtype: float64

print(cuda.as_cuda_array(s.data.mem).copy_to_host())
[1. 0. 3. 4.]

print(cupy.asarray(s.data.mem))
[1. 0. 3. 4.]

print(cudf.Series(s.data.mem))
0    1.0
1    0.0
2    3.0
3    4.0
dtype: float64

print(cudf.Series(s.to_gpu_array())) # the dense buffer without the null
0    1.0
1    3.0
2    4.0
dtype: float64

This could lead to a lot of silently incorrect, unexpected downstream behavior for users. I think users expect to pass around data with nulls (or NaNs) without those values getting left out or filled in as 0.

print(cudf.Series(cuda.as_cuda_array(cuda.to_device([1.2, np.nan, 3]))))
0    1.2
1       
2    3.0
dtype: float64

If copying can fulfill that contract, I think it's a good choice. I think allowing people to leverage and interoperate with the growing ecosystem is worth a copy in the short term.
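
A small sketch of the strict behavior argued for here, assuming the Series exposes null_count and to_gpu_array(); the helper name to_cupy_strict is made up for illustration:

import cupy as cp


def to_cupy_strict(s):
    # Refuse to silently drop nulls or fill them with zeros.
    if s.null_count > 0:
        raise ValueError(
            "Series contains nulls; call fillna() or dropna() before "
            "converting to a dense CuPy array."
        )
    # With no nulls, the dense buffer and the logical values agree;
    # a copy here is acceptable to keep the contract simple.
    return cp.asarray(s.to_gpu_array())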

@beckernick
Member

@jakirkham as well for visibility

@kkraus14
Collaborator

Reopening this since we currently return a numpy array as opposed to a GPU array. Will collect further thoughts on this in a bit.

@mrocklin
Collaborator Author

To give a first test for .values, I think that the following would be helpful:

import numpy
import pytest

import cudf


def test_cupy_values():
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3])

    assert isinstance(s.values, cupy.ndarray)

    numpy.testing.assert_array_equal(
        s.values.get(),
        s.to_pandas().values
    )

If one wanted to extend this, we might try the following (both are sketched below):

  • parametrizing around dtype with @pytest.mark.parametrize("dtype", [float, int, "float32"])
  • checking that calling .values on a column with missing values, strings, or categoricals raises an informative TypeError or NotImplementedError
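
A sketch of both extensions, assuming .values behaves as proposed above and that unsupported dtypes raise rather than convert silently; the test names and the exact exception types are guesses, not committed behavior:

import numpy
import pytest

import cudf


@pytest.mark.parametrize("dtype", [float, int, "float32"])
def test_cupy_values_dtypes(dtype):
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3], dtype=dtype)

    assert isinstance(s.values, cupy.ndarray)
    numpy.testing.assert_array_equal(s.values.get(), s.to_pandas().values)


def test_values_unsupported_dtype_raises():
    pytest.importorskip("cupy")
    s = cudf.Series(["a", "b", "c"])

    # Strings cannot be represented in a CuPy array, so this should raise
    # an informative error instead of converting silently.
    with pytest.raises((TypeError, NotImplementedError)):
        s.values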

@jakirkham
Member

Just noticed this after someone pointed it out to me: the pandas docs for .values say the following.

Warning We recommend using DataFrame.to_numpy() instead.

ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values
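
For reference, the two are interchangeable for a plain numeric frame, but to_numpy() is the entry point the pandas docs now recommend:

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Same array either way; .to_numpy() is simply the documented spelling.
print(df.to_numpy())
print(df.values)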

@brandon-b-miller
Contributor

Closed by #2655.

@jakirkham
Member

Thanks Brandon! 😄
