[BUG]dask_cudf: read_csv seems to compute only first partition #9719
Comments
Indeed, when playing with partitions, it appears that after partition 0, all records have wrong column contents: `ent.partitions.fn(1).head()` The columns are not set to the right values.
After investigating a bit more, I've found that many lines use a quote character (`"`) for fields that have a comma in their value. So if you can help... Thanks a lot.
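To illustrate the quoting described above, here is a minimal sketch using Python's standard `csv` module (not cudf). It shows how a field containing a comma stays intact when it is wrapped in the quote character. The sample rows are made up for illustration and are not taken from the actual file.

```python
import csv
import io

# Hypothetical sample: the second line has a quoted field containing a comma.
sample = 'siret,name,city\n1,"PIZZA, PASTA & CO",PARIS\n2,LA BONNE PIZZA,LAON\n'

# With quotechar='"' (the default), the quoted comma is not treated as a separator.
rows = list(csv.reader(io.StringIO(sample), delimiter=",", quotechar='"'))

# A naive split on commas breaks the quoted field into two pieces.
naive = sample.splitlines()[1].split(",")

print(rows[1])      # three fields, as intended
print(len(naive))   # one field too many
```

Any reader that handles the quote character differently from one partition to the next would produce shifted columns on exactly these lines, which is why the quoting is worth checking first.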
Do you see similar behavior if you read the file with |
I cannot check with cudf alone, as my GPU has only 8 GB of RAM and the data does not fit into that memory. With an older version of the file the result is good, so it may be a file issue, but this file gives good results with dask.dataframe, so I am still in doubt.
I've rented a GPU with enough memory and yes, it works fine with the same file when using cudf.read_csv directly in place of dask_cudf.read_csv. So apparently it is a dask_cudf issue. Thanks in advance for helping.
I will try to reproduce this soon, but I suspect the problem may be related to a |
I had a chance to investigate and the problem is indeed the |
Thanks a lot. I've looked at the commit for #9618 but did not understand why it causes the issue, so I'm not sure I can help write the code to solve it.
No worries! I expect that #9796 will resolve your error. However, I'll be happy to iterate if this is not the case.
Closes #9719

`dask_cudf.read_csv` currently fails when both `usecols` and `dtype` are specified. This PR is a simple fix. In the near future, the `_internal_read_csv` implementation should also be modified to produce a `Blockwise` HLG Layer, but I will leave that for a separate PR.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9796
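The thread does not spell out the mechanism, so the following is only a sketch of one way a reader that splits a file into chunks can return correct values for the first partition and wrong ones afterwards when a column subset is requested. It uses Python's standard `csv` module with made-up data and a hypothetical `read_chunk` helper. It is not the dask_cudf implementation, nor the change made in #9796.

```python
import csv
import io

# Made-up data: three columns, of which only two are wanted.
text = "id,name,city\n1,A,PARIS\n2,B,LAON\n3,C,LYON\n"
usecols = ["id", "city"]

lines = text.splitlines(keepends=True)
header = next(csv.reader([lines[0]]))   # ['id', 'name', 'city']
first_chunk = "".join(lines[:2])        # contains the header row
later_chunk = "".join(lines[2:])        # no header row


def read_chunk(chunk, names, has_header):
    """Hypothetical helper: parse one chunk, keeping only columns in usecols."""
    reader = csv.reader(io.StringIO(chunk))
    if has_header:
        next(reader)  # skip the header row; names are supplied explicitly
    return [
        {k: v for k, v in zip(names, row) if k in usecols}
        for row in reader
    ]


# The first chunk carries its own header, so it is labelled correctly.
first = read_chunk(first_chunk, header, has_header=True)

# Correct: label a header-less chunk with the full list of column names.
good = read_chunk(later_chunk, header, has_header=False)

# Mistake: label a header-less chunk with only the selected names,
# so "city" is paired with the value from the "name" column.
bad = read_chunk(later_chunk, usecols, has_header=False)

print(first)
print(good)
print(bad)
```

In this sketch the first chunk looks fine while later chunks silently carry values from the wrong column, which resembles the symptom reported in this issue.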
I closed this issue with #9796, but feel free to reopen if the problem persists.
Dear all,
Thanks for creating cudf and all the dask work around it that makes it so great!
I'm trying to read an 8 GB CSV file and do a few computations using dask_cudf and LocalCUDACluster.
But I'm getting strange results: it seems that only the first partition is used for the computation.
The file I'm reading can be downloaded from here:
http://data.cquest.org/geo_sirene/last/etablissements_actifs.csv.gz
Then, here is my code. The goal is to find pizza near my location in France, a fun exercise.
The result returns shops that are not located in Paris at all but in département 02, not really the same place :(
If I use the same file with dask.dataframe, I get the right results.
It seems that only the first partition is used, but I'm not sure.
Below are the versions of my libraries: