[BUG]dask_cudf: read_csv seems to compute only first partition #9719

Closed
MordicusEtCubitus opened this issue Nov 17, 2021 · 9 comments · Fixed by #9796
Labels: bug, dask, Python, question

Comments

@MordicusEtCubitus

Hello,
Thanks for creating cudf and all the Dask work around it that makes it so great!

I'm trying to read an 8 GB CSV file and do a few computations using dask_cudf and LocalCUDACluster.
But I'm getting strange results: it seems that only the first partition is used for the computation.
The file I'm reading can be downloaded from here:
http://data.cquest.org/geo_sirene/last/etablissements_actifs.csv.gz

Then, here is my code. The goal is to find pizza near my location in France, a fun exercise.

from dask_cuda import LocalCUDACluster  # I've got 2 GTX 1080 Ti GPUs
from dask.distributed import Client, progress
import dask_cudf as dc

# Start one Dask worker per local GPU and connect a client to it
cluster = LocalCUDACluster()
client = Client(cluster)

# Read only the four columns needed, with explicit dtypes
ent = dc.read_csv(
    "data/etablissements_actifs.csv",
    sep=",",
    dtype={"l1_normalisee": str, "geo_adresse": str, "latitude": float, "longitude": float},
    usecols=["l1_normalisee", "latitude", "longitude", "geo_adresse"],
)

import geocoder
pos = geocoder.osm("Grande arche de la défense")

pizz = ent.dropna()
pizz = pizz[ pizz.l1_normalisee.str.contains("PIZZ") ]
pizz['distance'] = ((pizz.latitude-pos.lat)**2+(pizz.longitude-pos.lng)**2)**0.5
top10 = pizz.nsmallest(10, "distance")

r = top10.compute()
r

The result contains shops that are not located in Paris at all but in département 02, which is not really the same place :(

If I use the same file with dask.dataframe, I get the right results.

It seems that only the first partition is used, but I'm not sure.
Below are the versions of my libraries:

print(dask_cudf.__version__)
print(dask.__version__)
print(dask_cuda.__version__)

21.10.01
2021.09.1
0+unknown  # strange
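
A version string such as 0+unknown is usually a build-time fallback rather than the real release number, so the module attribute alone may not tell the whole story. As a minimal sketch, the installed package metadata can be queried with the standard library instead. The distribution names below are assumptions (they can differ between conda and pip installs), and `installed_version` is a helper named here for illustration only.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumed distribution names; adjust them to match the local environment.
for name in ("dask-cudf", "dask", "dask-cuda"):
    print(name, installed_version(name))
```

If the metadata reports a proper version while `__version__` does not, the package itself is probably fine and only the embedded version string is off.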
@MordicusEtCubitus MordicusEtCubitus added Needs Triage Need team to review and classify bug Something isn't working labels Nov 17, 2021
@MordicusEtCubitus
Author

MordicusEtCubitus commented Nov 17, 2021

Indeed, when playing with partitions, it appears that in every partition after partition 0, the records have wrong column contents:

ent.partitions[1].head()

The columns are not populated with the right values.
It may be an issue with the CSV file, but this does not happen with dask.dataframe. I'm checking the CSV file on my side.

l1_normalisee latitude longitude geo_adresse
423034149 25.0
524459773 15.0 LO CRICCHIO FRANCK ESPACE VERT
391360112 18.0 LA CONDAMINE VILLAGE
398483081 12.0 CHEZ MR ABADIANE BT 6
880173638 11.0 LES MURIERS BAT C9

After investigating a bit more, I've found that many lines use the quote character " for fields that contain a comma in their value.
But even if I read the file with quotechar='"', I still have the same issue.
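
To illustrate why quoted fields matter, here is a small standard-library sketch comparing a naive comma split with a quote-aware parser. The sample row and the `siren` column name are made up for the example. Note that `"` is already the default quote character in Python's `csv` module (and, as far as I know, in the pandas and cudf readers too), which would be consistent with setting it explicitly making no difference.

```python
import csv
import io

# Hypothetical sample: the second field contains a comma, so it is quoted.
sample = (
    "siren,l1_normalisee,latitude,longitude\n"
    '123456789,"PIZZA ROMA, CHEZ LUIGI",48.89,2.24\n'
)

# Naive splitting ignores quotes and cuts the quoted field in two.
naive = [line.split(",") for line in sample.splitlines()]

# A quote-aware parser keeps the quoted field intact.
parsed = list(csv.reader(io.StringIO(sample), delimiter=",", quotechar='"'))

print(len(naive[1]))   # 5 fields: the quoted value was split
print(len(parsed[1]))  # 4 fields: matches the header
print(parsed[1][1])    # PIZZA ROMA, CHEZ LUIGI
```

Since the quote-aware parse gives the expected field count, quoting alone does not explain shifted columns that only appear after partition 0.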

So if you can help...

Thanks a lot

@beckernick
Member

Do you see similar behavior if you read the file with cudf.read_csv? Or only if you use dask_cudf.read_csv?

@beckernick beckernick added Python Affects Python cuDF API. question Further information is requested and removed Needs Triage Need team to review and classify labels Nov 19, 2021
@MordicusEtCubitus
Author

I cannot check with cudf alone, as my GPU has only 8 GB of RAM and the data does not fit into that memory.

But with an older version of the file the result is good, so it may be a file issue; however, this file gives good results with dask.dataframe.

So I'm still in doubt.

@MordicusEtCubitus
Author

MordicusEtCubitus commented Nov 25, 2021

I've rented a GPU with enough memory and yes, it works fine with the same file when using cudf.read_csv directly in place of dask_cudf.read_csv.

So apparently it is a dask_cudf issue.

Thanks in advance for helping

@rjzamora rjzamora added the dask Dask issue label Nov 29, 2021
@rjzamora
Member

rjzamora commented Nov 29, 2021

I will try to reproduce this soon, but I suspect the problem may be related to a usecols bug that was already fixed in #9618

@rjzamora
Member

I will try to reproduce this soon, but I suspect the problem may be related to a usecols bug that was already fixed in #9618

I had a chance to investigate and the problem is indeed the usecols bug that was fixed in #9618. Unfortunately, that fix does not resolve the issue when dtype is also specified. Therefore, we will need further changes before this issue can be closed - I plan to work on a solution today or tomorrow.
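
For readers wondering how a usecols problem could produce "first partition fine, later partitions shifted", here is a standard-library simulation of one plausible mechanism. It is not the dask_cudf implementation. It assumes that partitions after the first are parsed without a header row and are labelled with column names supplied by the caller; the data, the `siren` column, and the `parse_block` helper are all hypothetical.

```python
import csv
import io

# Hypothetical file: four columns, of which only two are requested.
header_line = "siren,l1_normalisee,latitude,longitude"
later_partition = "222222222,PIZZA DUE,48.2,2.2\n"  # a block with no header row
usecols = ["l1_normalisee", "latitude"]

all_names = next(csv.reader([header_line]))


def parse_block(block, names, usecols):
    """Parse header-less CSV text, label fields with `names`, keep `usecols`."""
    rows = []
    for fields in csv.reader(io.StringIO(block)):
        record = dict(zip(names, fields))
        rows.append({col: record[col] for col in usecols})
    return rows


# Correct: the block is labelled with the full header, then columns are selected.
good = parse_block(later_partition, all_names, usecols)

# Faulty: only the selected names are supplied, so they bind to the first fields.
bad = parse_block(later_partition, usecols, usecols)

print(good)  # [{'l1_normalisee': 'PIZZA DUE', 'latitude': '48.2'}]
print(bad)   # [{'l1_normalisee': '222222222', 'latitude': 'PIZZA DUE'}]
```

In the faulty case the name column holds an identifier-like number, which resembles the output shown earlier for partition 1, though the real cause should be confirmed against the linked pull requests.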

@rjzamora rjzamora self-assigned this Nov 29, 2021
@MordicusEtCubitus
Author

Thanks a lot. I've looked at the commit for #9618 but did not understand why this is the issue, so not sure I can help writing the code to solve it

@rjzamora
Member

so not sure I can help writing the code to solve it

No worries! I expect that #9796 will resolve your error. However, I'll be happy to iterate if this is not the case.

rapids-bot bot pushed a commit that referenced this issue Nov 30, 2021
Closes #9719

`dask_cudf.read_csv` currently fails when both `usecols` and `dtype` are specified. This PR is a simple fix. In the near future, the `_internal_read_csv` implementation should also be modified to produce a `Blockwise` HLG Layer, but I will leave that for a separate PR.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9796
@rjzamora
Member

I closed this issue with #9796, but feel free to reopen if the problem persists.

3 participants