[FEA] Add a .values property to convert to a GPU array #1824
Comments
This would be a valuable property, as it's fairly common to call this in pandas to grab the underlying numpy array. One question quickly jumps out: what will happen with dataframes that include non-numeric data? Should this fail? Should it only return the numeric values? Currently, |
In pandas you would use the |
Absolutely. I didn't phrase my question as clearly as I should have. We currently have no way of representing non-numeric data in either numba device arrays or cupy arrays. My question is how to best handle that when thinking about something like |
Oh I see, yeah that's a much bigger problem it seems. I don't have any answers for you there.
Well, for the |
I would think raising an informative error if we cannot upcast (such as when some columns are non-numeric) is probably the right solution, as I think the |
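A minimal sketch of the "upcast or raise an informative error" idea, using NumPy dtypes only. The helper name `common_numeric_dtype` and the dict-of-dtypes input are hypothetical, chosen for illustration; this is not cudf's implementation.

```python
import numpy as np


def common_numeric_dtype(column_dtypes):
    """Return the dtype all columns can be upcast to, or raise TypeError.

    `column_dtypes` maps column name -> dtype (hypothetical input format).
    """
    if not column_dtypes:
        raise ValueError("no columns to convert")
    dtypes = {name: np.dtype(dt) for name, dt in column_dtypes.items()}
    # Kinds: b = bool, i = signed int, u = unsigned int, f = float.
    non_numeric = [name for name, dt in dtypes.items() if dt.kind not in "biuf"]
    if non_numeric:
        raise TypeError(
            "cannot convert to a single homogeneous array; "
            f"non-numeric columns: {non_numeric}"
        )
    # NumPy's promotion rules pick the smallest dtype that holds every column.
    return np.result_type(*dtypes.values())


print(common_numeric_dtype({"a": "int32", "b": "float64"}))  # float64
```

Naming the offending columns in the message tells the user which ones to drop or cast before converting.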
@mrocklin @beckernick typically, calling |
@thomcom, @kkraus14 that's a good question. Is it correct that we can return the underlying buffer by reference easily if there are no nulls, but we'd need to make a copy to return a buffer with nulls? (Please do correct me if I'm wrong on that one.) For interoperability in the ecosystem, we likely don't want the following behavior to occur when someone wants to convert to CuPy (or anything else):
s = cudf.Series([1.0, np.nan, 3, 4])
print(s)
0 1.0
1
2 3.0
3 4.0
dtype: float64
print(cuda.as_cuda_array(s.data.mem).copy_to_host())
[1. 0. 3. 4.]
print(cupy.asarray(s.data.mem))
[1. 0. 3. 4.]
print(cudf.Series(s.data.mem))
0 1.0
1 0.0
2 3.0
3 4.0
dtype: float64
print(cudf.Series(s.to_gpu_array())) # the dense buffer without the null
0 1.0
1 3.0
2 4.0
dtype: float64
This could lead to a bunch of unexpected downstream behavior for users that is silently incorrect. I think users expect to pass around data with nulls (or NaNs) that don't get left out or filled in as
print(cudf.Series(cuda.as_cuda_array(cuda.to_device([1.2, np.nan, 3]))))
0 1.2
1
2 3.0
dtype: float64
If copying can fulfill that contract, I think it's a good choice. I think allowing people to leverage and interoperate the growing ecosystem is worth a copy in the short term. |
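The copy-based contract discussed above can be sketched with plain NumPy standing in for device memory. The names `data`, `valid`, and `to_dense_with_nan` are illustrative assumptions, not cudf API. The idea is that a null slot in the raw buffer holds an arbitrary value, so the conversion copies the buffer and writes NaN where the validity mask is false.

```python
import numpy as np

# Stand-ins for a column's raw buffer and its validity mask (assumed layout).
data = np.array([1.0, 0.0, 3.0, 4.0])        # slot 1 holds an arbitrary fill value
valid = np.array([True, False, True, True])  # False marks a null entry


def to_dense_with_nan(data, valid):
    """Copy the buffer and put NaN in every null slot, keeping the length."""
    out = data.astype(np.float64, copy=True)  # never mutate the source buffer
    out[~valid] = np.nan
    return out


result = to_dense_with_nan(data, valid)
print(result)
```

The result keeps one element per row, so positions still line up with the original series, at the cost of one copy whenever nulls are present.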
@jakirkham as well for visibility |
Reopening this since we currently return a numpy array as opposed to a GPU array. Will collect further thoughts on this in a bit. |
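To give a first test for
import cudf
import numpy
import pytest

def test_cupy_values():
    # Skip when cupy is not installed rather than failing the suite.
    cupy = pytest.importorskip("cupy")
    s = cudf.Series([1, 2, 3])
    assert isinstance(s.values, cupy.ndarray)
    # .get() copies the cupy array to host so it can be compared with pandas.
    numpy.testing.assert_array_equal(
        s.values.get(),
        s.to_pandas().values
    )
If one wanted to extend this we might try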
|
Just noticed this after someone pointed this out to me, in the Pandas docs for
|
Closed by #2655. |
Thanks Brandon! 😄 |
What is the canonical way to convert a cudf dataframe into an array-like object?
There is
First, these should maybe replace the name matrix with array. NumPy matrix objects are different and somewhat unpleasant. Also, note this:
I most often see people use .values, a property which returns a homogeneously typed numpy array. I wonder if we might instead return a cupy array (or numba device array if that's preferred). If we choose something sufficiently numpy-like (like cupy) then things will just work with dask dataframe.
See also https://github.com/rapidsai/dask-cudf/issues/259