Describe the bug
When trying to merge two dataframes on a string column cast to category dtype while using dask_cudf with RMM, the column is hashed as int32 and the hashes are not consistent, so the merge, which succeeds, results in an empty dataframe.

This happens on the 0.13 3/17 nightlies. cudf works. dask_cudf without RMM works. I am having trouble replicating this with a smaller dataset.
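For context, a tiny self-contained illustration (not the failing case, which needs the full dataset) of what a cudf string column cast to category looks like; under the hood the strings become integer codes into a category set:

import cudf

# A small string Series cast to category dtype, analogous to the seller_name column.
s = cudf.Series(["AMTRUST BANK", "ALLY BANK", "AMTRUST BANK"]).astype("category")
print(s)            # prints the string categories
print(s.cat.codes)  # the underlying integer codes backing the column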
Steps/Code to reproduce bug
This code (sketched below) will set up the environment, start a client with RMM, read the data using dask, print out the results, where you will see that the strings have been hashed, and then attempt a merge on the hashed column, which results in a 0 row dataframe when it should be a 115492 row dataframe.
Please follow the directions here: https://docs.rapids.ai/datasets/mortgage-data
and download the data from: http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz
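A minimal sketch of the reproducer described above, loosely based on the RAPIDS mortgage notebook. The file paths, the acquisition column list, the names lookup table, and the RMM pool setup are assumptions here (the exact RMM initialization call and whether read_csv accepts a dtype dict vary between releases), so treat it as an outline of the setup rather than the notebook's exact code.

import rmm
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU, then enable an RMM pool allocator on every worker.
cluster = LocalCUDACluster()
client = Client(cluster)

def setup_rmm_pool():
    rmm.reinitialize(pool_allocator=True)

client.run(setup_rmm_pool)

# Acquisition files from mortgage_2000.tgz: pipe-delimited, no header row.
# Column names follow the mortgage notebook; seller_name is read as category.
acq_cols = [
    "loan_id", "orig_channel", "seller_name", "orig_interest_rate", "orig_upb",
    "orig_loan_term", "orig_date", "first_pay_date", "orig_ltv", "orig_cltv",
    "num_borrowers", "dti", "borrower_credit_score", "first_home_buyer",
    "loan_purpose", "property_type", "num_units", "occupancy_status",
    "property_state", "zip", "mortgage_insurance_percent", "product_type",
    "coborrow_credit_score", "mortgage_insurance_type",
    "relocation_mortgage_indicator",
]
acq = dask_cudf.read_csv(
    "mortgage_2000/acq/Acquisition_2000*.txt",
    delimiter="|",
    names=acq_cols,
    dtype={"seller_name": "category"},
)

# Seller-name lookup table, also keyed on a category column.
names = dask_cudf.read_csv(
    "mortgage_2000/names.csv",
    delimiter="|",
    names=["seller_name", "new"],
    dtype={"seller_name": "category", "new": "str"},
    skiprows=1,
)

# With the RMM pool enabled, both columns print as inconsistent int32 hashes
# rather than the seller names ...
print(acq["seller_name"].head())
print(names["seller_name"].head())

# ... and the merge on the category column comes back empty instead of
# producing the expected 115492 rows.
merged = acq.merge(names, on=["seller_name"])
print(len(merged))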
Output: 0 rows × 26 columns. You'll notice that the data is hashed, but there is no consistency in the hashing, so there is nothing to merge together.
Expected behavior
The output should look like it does if we change the seller_name column dtype to str, or if we use cudf to read the dataset:

0    AMTRUST BANK
1    BANK OF AMERICA, N.A.
2    BISHOPS GATE RESIDENTIAL MORTGAGE TRUST
3    CITIMORTGAGE, INC.
4    FIRST TENNESSEE BANK NATIONAL ASSOCIATION
Name: seller_name, dtype: category

0    ACADEMY MORTGAGE CORPORATION
1    ALLY BANK
2    AMERIHOME MORTGAGE COMPANY|LLC
3    AMERISAVE MORTGAGE CORPORATION
4    AMTRUST BANK
Name: seller_name, Length: 79, dtype: category

And then when you merge, you will get 115492 rows × 26 columns.
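For comparison, a sketch of the variant that behaves correctly: reading a single acquisition file with plain cudf. The path, the column list, and the dtype-dict argument are the same assumptions as in the reproducer sketch above.

import cudf

# Same assumed column list as in the reproducer sketch above.
acq_cols = [
    "loan_id", "orig_channel", "seller_name", "orig_interest_rate", "orig_upb",
    "orig_loan_term", "orig_date", "first_pay_date", "orig_ltv", "orig_cltv",
    "num_borrowers", "dti", "borrower_credit_score", "first_home_buyer",
    "loan_purpose", "property_type", "num_units", "occupancy_status",
    "property_state", "zip", "mortgage_insurance_percent", "product_type",
    "coborrow_credit_score", "mortgage_insurance_type",
    "relocation_mortgage_indicator",
]

# Reading one quarter directly with cudf (no dask, no RMM pool) shows the
# actual seller names instead of int32 hashes.
acq = cudf.read_csv(
    "mortgage_2000/acq/Acquisition_2000Q1.txt",
    delimiter="|",
    names=acq_cols,
    dtype={"seller_name": "category"},
)
print(acq["seller_name"].head())

# Likewise, keeping dask_cudf but passing dtype={"seller_name": "str"} in the
# reproducer above avoids the bad hashing, per the behavior described here.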
Environment overview (please complete the following information)
Environment location: [Bare-metal]
Method of cuDF install: [conda]
Environment details
Additional context
This affects the completion of the mortgage notebook for the 0.13 release. While working through this, I found #4565. It is related, but not the same issue.
@pentschev
@randerzander
@rnyak