[BUG] can't merge dask-cudf dataframes when index is non-numeric? #12773

Closed
wence- opened this issue Feb 14, 2023 · 8 comments · Fixed by #14529
Labels
2 - In Progress Currently a work in progress bug Something isn't working dask Dask issue Python Affects Python cuDF API.

Comments

@wence-
Contributor

wence- commented Feb 14, 2023

Describe the bug

Consider

import cudf
import dask_cudf as dd

df1 = cudf.DataFrame({"a": ["a", "b"], "b": [1, 2]})
df2 = cudf.DataFrame({"a": ["a", "c"], "b": [2, 3]})

ddf1 = dd.from_cudf(df1, npartitions=2).set_index("a")
ddf2 = dd.from_cudf(df2, npartitions=2).set_index("a")

union = ddf1.merge(ddf2, left_index=True, right_index=True, how="left").compute()

This produces

File ~/cudf/python/cudf/cudf/core/column/string.py:5475, in StringColumn.as_numerical_column(self, dtype, **kwargs)
   5473 elif out_dtype.kind == "f":
   5474     if not libstrings.is_float(string_col).all():
-> 5475         raise ValueError(
   5476             "Could not convert strings to float "
   5477             "type due to presence of non-floating values."
   5478         )
   5480 result_col = _str_to_numeric_typecast_functions[out_dtype](string_col)
   5481 return result_col

ValueError: Could not convert strings to float type due to presence of
non-floating values.

This should work. I think something strange is going on: effectively,
the error occurs because dask is trying to do

df3 = df1.set_index("a")

df4 = df3.assign(foo=cudf.Series([1, 2]))

And this fails because the assignment column doesn't have the same
index as the dataframe.
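
For comparison, here is a minimal sketch of the same assignment using pandas instead of cudf (an assumption made so that it runs without a GPU). pandas aligns the assigned Series on the index, so the mismatched labels produce missing values rather than an error.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["a", "b"], "b": [1, 2]})
df3 = df1.set_index("a")

# The Series carries the default integer index (0, 1), which shares no labels
# with df3's string index ("a", "b"), so index alignment fills foo with NaN.
df4 = df3.assign(foo=pd.Series([1, 2]))
print(df4)
```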

Whose bug is this? Everything "works" with pandas-backed dataframes.
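
As a reference point for the expected result, this sketch runs the same left merge on the index with plain pandas. It drops the dask layer entirely, which is a simplification and not the exact pandas-backed dask run mentioned above.

```python
import pandas as pd

df1 = pd.DataFrame({"a": ["a", "b"], "b": [1, 2]}).set_index("a")
df2 = pd.DataFrame({"a": ["a", "c"], "b": [2, 3]}).set_index("a")

# Left merge on the string index: every row of df1 is kept, and the
# overlapping column "b" is split into "b_x" (left) and "b_y" (right).
result = df1.merge(df2, left_index=True, right_index=True, how="left")
print(result)
```

The row for "b" has no match in df2, so its b_y value is missing.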

wence- added the bug and Needs Triage labels on Feb 14, 2023
@daxiongshu

I'm curious about this bug. Thank you for raising the issue.

@wence-
Contributor Author

wence- commented Feb 15, 2023

Looks like at least one problem is that when .set_index() is called on a column of str dtype, the divisions are not computed.
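
As a rough illustration of what is missing here: dask divisions are the minimum index value of each partition followed by the maximum of the last partition, and dask uses them to tell which index values live in which partition. The helper `compute_divisions` below is hypothetical and written in plain Python. It is not the dask or dask_cudf implementation.

```python
def compute_divisions(partitions):
    """Return dask-style divisions for a list of sorted, non-empty partitions.

    Hypothetical helper for illustration only.
    """
    # One lower bound per partition, then the upper bound of the last one.
    mins = [part[0] for part in partitions]
    return tuple(mins) + (partitions[-1][-1],)


# Index values of ddf1 from the report, split into two partitions.
partitions = [["a"], ["b"]]
divisions = compute_divisions(partitions)
print(divisions)  # ('a', 'b', 'b')
```

When divisions are unknown, dask stores a tuple of None values of the same length instead.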

@rjzamora
Member

It isn't too difficult to get set_index to update the divisions correctly in dask_cudf. However, I'm still running into a separate issue where cudf doesn't support loc indexing of a range where one end of the range is not physically present in the index. For example:

import pandas as pd, cudf

pdf = pd.DataFrame({"a": ["a", "b", "c"], "b": range(3)}).set_index("a")
gdf = cudf.from_pandas(pdf)

pdf.loc["a":"d"]  # Works fine
gdf.loc["a":"d"]  # Fails
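
The pandas half of this comparison can be checked on its own. The sketch below covers only the pandas behaviour, since the cudf line needs a GPU and is left out.

```python
import pandas as pd

pdf = pd.DataFrame({"a": ["a", "b", "c"], "b": range(3)}).set_index("a")

# "d" is not a label in the index, but the index is sorted, so pandas
# resolves the slice bounds by sort order and returns every row from "a" on.
result = pdf.loc["a":"d"]
print(result)
```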

@rjzamora
Member

@wence- do you happen to know off-hand if this issue is already covered by something in #12793? If not, I can submit a new issue and link it.

@wence-
Contributor Author

wence- commented Feb 17, 2023

I think it is not, please add to the list!

@wence-
Contributor Author

wence- commented Feb 23, 2023

However, I'm still running into a separate issue where cudf doesn't support loc indexing of a range where one end of the range is not physically present in the index.

This is #12833.

GregoryKimball added the 2 - In Progress, Python, and dask labels and removed the Needs Triage label on Jun 6, 2023
@wence-
Contributor Author

wence- commented Jun 30, 2023

It isn't too difficult to get set_index to update the divisions correctly in dask_cudf. However, I'm still running into a separate issue where cudf doesn't support loc indexing of a range where one end of the range is not physically present in the index. For example:

import pandas as pd, cudf

pdf = pd.DataFrame({"a": ["a", "b", "c"], "b": range(3)}).set_index("a")
gdf = cudf.from_pandas(pdf)

pdf.loc["a":"d"]  # Works fine
gdf.loc["a":"d"]  # Fails

This aspect should now be fixed. Can we revisit the original issue?

@wence-
Contributor Author

wence- commented Nov 22, 2023

This looks like it is now fixed; we should probably add a test...

@vyasr vyasr mentioned this issue Nov 29, 2023
3 tasks
raydouglass pushed a commit that referenced this issue Dec 8, 2023
Closes #12773.

Authors:
   - Vyas Ramasubramani (https://github.com/vyasr)
   - Lawrence Mitchell (https://github.com/wence-)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - Lawrence Mitchell (https://github.com/wence-)
karthikeyann pushed a commit to karthikeyann/cudf that referenced this issue Dec 12, 2023