[BUG]dask_cudf: read_csv seems to compute only first partition #9719

MordicusEtCubitus · 2021-11-17T22:51:55Z

Hi all,
Thanks for creating cudf and all the dask work around it that makes it so great!

I'm trying to read an 8 GB CSV file and run a few computations using dask_cudf and LocalCUDACluster,
but I'm getting strange results: it seems that only the first partition is used for the computation.
The file I'm reading can be downloaded from here:
http://data.cquest.org/geo_sirene/last/etablissements_actifs.csv.gz

Here is my code. The goal is to find pizza places near my location in France, a fun exercise:

import dask_cudf as dc
import geocoder
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# LocalCUDACluster starts one worker per GPU (I've got 2 GTX 1080 Ti GPUs)
cluster = LocalCUDACluster()
client = Client(cluster)

# Read only the four columns needed, with explicit dtypes
ent = dc.read_csv(
    "data/etablissements_actifs.csv",
    sep=",",
    dtype={"l1_normalisee": str, "geo_adresse": str, "latitude": float, "longitude": float},
    usecols=["l1_normalisee", "latitude", "longitude", "geo_adresse"],
)

# Reference point: geocode the Grande Arche de la Défense via OpenStreetMap
pos = geocoder.osm("Grande arche de la défense")

# Keep complete rows whose name contains "PIZZ", then rank by distance (in degrees)
pizz = ent.dropna()
pizz = pizz[pizz.l1_normalisee.str.contains("PIZZ")]
pizz["distance"] = ((pizz.latitude - pos.lat) ** 2 + (pizz.longitude - pos.lng) ** 2) ** 0.5
top10 = pizz.nsmallest(10, "distance")

r = top10.compute()
r

The result contains shops that are not located near Paris at all but in département 02, which is not really the same place :(

If I use the same file with dask.dataframe, I get the right results.

It seems that only the first partition is used, but I'm not sure.
Below are the versions of my libraries:

print(dask_cudf.__version__)
print(dask.__version__)
print(dask_cuda.__version__)

21.10.01
2021.09.1
0+unknown  # strange

MordicusEtCubitus · 2021-11-17T23:09:18Z

Indeed, when playing with partitions, it appears that in every partition after partition 0, the records have the wrong content in their columns:

ent.partitions[1].head()

The columns are not set to the right values.
Maybe it is an issue with the CSV file, but this is not the case with dask.dataframe. I'm checking the CSV file on my side.

l1_normalisee	latitude	geo_adresse
423034149	25.0
524459773	15.0	LO CRICCHIO FRANCK ESPACE VERT
391360112	18.0	LA CONDAMINE VILLAGE
398483081	12.0	CHEZ MR ABADIANE BT 6
880173638	11.0	LES MURIERS BAT C9
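
Rather than eyeballing `head()`, one way to confirm that a partition is misaligned is to run a small sanity check on it. The sketch below uses pandas on made-up rows that mimic the output above, so it runs without a GPU. `digit_only_fraction` is a hypothetical helper, not part of any library. The same function could in principle be passed to `map_partitions` on a dask DataFrame, assuming the `.str` methods used here behave the same way on that backend.

```python
import pandas as pd


def digit_only_fraction(df: pd.DataFrame, column: str = "l1_normalisee") -> float:
    """Return the fraction of non-null values in `column` that are purely digits.

    A business-name column should almost never be all digits, so a value
    close to 1.0 hints that an ID column has slipped into its place.
    """
    values = df[column].dropna().astype(str)
    if len(values) == 0:
        return 0.0
    return float(values.str.isdigit().mean())


# Made-up rows: a healthy partition and one that mimics the output above.
good = pd.DataFrame({"l1_normalisee": ["PIZZA ROMA", "BOULANGERIE PAUL"],
                     "latitude": [48.89, 48.85]})
bad = pd.DataFrame({"l1_normalisee": ["423034149", "524459773"],
                    "latitude": [25.0, 15.0]})

print(digit_only_fraction(good))  # 0.0
print(digit_only_fraction(bad))   # 1.0
```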

After investigating a bit more, I've found that many lines use a quote character (") around fields that have a comma in their value.
But even if I read the file with quotechar='"', I still have the same issue.
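
For reference, the double quote is already the default quote character in Python's `csv` module and in `pandas.read_csv`. If cudf follows the same default (an assumption here, not checked against its source), passing `quotechar='"'` explicitly would not be expected to change anything. A minimal stdlib sketch with a made-up row shows what quote-aware parsing does compared with a naive split:

```python
import csv
import io

# Made-up row: the address field contains a comma, so it is quoted.
line = '123456789,"12, RUE DE LA PAIX",48.8690,2.3310\n'

# Quote-aware parsing keeps the address as a single field.
parsed = next(csv.reader(io.StringIO(line)))  # quotechar defaults to '"'
print(len(parsed), parsed[1])  # 4 12, RUE DE LA PAIX

# A naive split on commas breaks the address into two fields.
naive = line.strip().split(",")
print(len(naive))  # 5
```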

So if you can help...

Thanks a lot

beckernick · 2021-11-19T19:54:24Z

Do you see similar behavior if you read the file with cudf.read_csv? Or only if you use dask_cudf.read_csv?

MordicusEtCubitus · 2021-11-19T21:49:13Z

I cannot check with cudf alone, as my GPU has only 8 GB of RAM and the data does not fit into this memory.

But with an older version of the file the result is good, so it may be a file issue; still, this file gives good results with dask.dataframe.

So I'm still in doubt.

MordicusEtCubitus · 2021-11-25T21:23:49Z

I've rented a GPU with enough memory and yes, it works fine with the same file when using cudf.read_csv directly in place of dask_cudf.read_csv.

So apparently it is a dask_cudf issue.

Thanks in advance for helping

rjzamora · 2021-11-29T19:34:02Z

I will try to reproduce this soon, but I suspect the problem may be related to a usecols bug that was already fixed in #9618

rjzamora · 2021-11-29T21:28:28Z

> I will try to reproduce this soon, but I suspect the problem may be related to a usecols bug that was already fixed in #9618

I had a chance to investigate and the problem is indeed the usecols bug that was fixed in #9618. Unfortunately, that fix does not resolve the issue when dtype is also specified. Therefore, we will need further changes before this issue can be closed - I plan to work on a solution today or tomorrow.
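
To make the symptom concrete, here is a plain-Python sketch of one way column selection can go wrong when a file is read in chunks: only the first chunk carries the header line, so later chunks have to be labelled with the full list of column names before any column selection is applied. This is an illustration under that assumption, not the dask_cudf implementation and not the code changed in #9618 or #9796. `read_chunk`, the column list, and the sample row are all hypothetical.

```python
import csv
import io

ALL_COLUMNS = ["siren", "l1_normalisee", "latitude", "geo_adresse"]
USECOLS = ["l1_normalisee", "latitude", "geo_adresse"]
DTYPES = {"l1_normalisee": str, "latitude": float, "geo_adresse": str}


def read_chunk(text, names, usecols, dtypes):
    """Parse a headerless CSV chunk, keep `usecols`, and cast with `dtypes`.

    `names` labels the fields positionally, in the order they appear.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        record = dict(zip(names, fields))
        rows.append({col: dtypes[col](record[col]) for col in usecols})
    return rows


# A later chunk of the file: no header line, made-up values.
chunk = '524459773,PIZZA ROMA,48.89,"1, PLACE DE LA DEFENSE"\n'

# Correct: label every field in the chunk, then select the wanted columns.
print(read_chunk(chunk, ALL_COLUMNS, USECOLS, DTYPES))

# Faulty: label the fields with only the selected names. Every value shifts
# one column to the left, so the ID lands in the name column. Values are
# kept as text here because casting "PIZZA ROMA" to float would raise.
AS_TEXT = {col: str for col in USECOLS}
shifted = read_chunk(chunk, USECOLS, USECOLS, AS_TEXT)
print(shifted)
```

In the faulty call the name column holds the ID, much like the misaligned `head()` output reported earlier in this thread.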

MordicusEtCubitus · 2021-11-29T22:16:04Z

Thanks a lot. I've looked at the commit for #9618 but did not understand why this is the issue, so not sure I can help writing the code to solve it

rjzamora · 2021-11-29T22:51:36Z

> so not sure I can help writing the code to solve it

No worries! I expect that #9796 will resolve your error. However, I'll be happy to iterate if this is not the case.

Closes #9719

`dask_cudf.read_csv` currently fails when both `usecols` and `dtype` are specified. This PR is a simple fix. In the near future, the `_internal_read_csv` implementation should also be modified to produce a `Blockwise` HLG Layer, but I will leave that for a separate PR.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9796

rjzamora · 2021-11-30T16:42:17Z

I closed this issue with #9796, but feel free to reopen if the problem persists.

MordicusEtCubitus added the Needs Triage and bug labels Nov 17, 2021

beckernick added the Python and question labels and removed the Needs Triage label Nov 19, 2021

rjzamora added the dask label Nov 29, 2021

rjzamora self-assigned this Nov 29, 2021

rjzamora mentioned this issue Nov 29, 2021 in "Fix dtype-argument bug in dask_cudf read_csv" #9796 (merged)

rapids-bot bot closed this as completed in #9796 Nov 30, 2021
