multi-gb joins result in hangs #102
While working with the GHCN weather dataset, I ran into a "ValueError: All series must be of same type" which only occurs when working with > 1ish GB of data. Will file that in a separate issue.
The issue below occurs both when setting up with LocalCudaCluster (omitted for simplicity) and when using dask_cudf directly, without a cluster.
While trying to boil it down to a simpler repro, I run into hanging/restarting kernels:
From Jupyter logs:
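For context, here is a minimal sketch of the shape of the failing workload, assuming two CSVs joined on a shared 'key' column; the right-hand file name is an assumption, and API details vary by version:

import dask_cudf

# Hypothetical shape of the multi-GB join; file and column names assumed.
left = dask_cudf.read_csv('left.csv')
right = dask_cudf.read_csv('right.csv')
joined = left.merge(right, on=['key'])
joined.head()  # forces computation, which hangs or crashes as described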
Comments
I get a segfault when I try this. Here is the traceback from gdb:
cc @jrhemstad for triage |
@mrocklin that looks like an error in the csv_reader:
|
What is the right way to dig into this deeper? Try to get a reproducer with just cudf? |
Note that when using cudf directly (snippet is included in the repro), the same join works fine. If the trace @jrhemstad's referring to is in the CSV reader, might it be related to the use of byte_range? @mjsamoht fyi |
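For reference, byte-range chunking of a CSV works roughly like the sketch below; this is an illustration, not dask_cudf's actual code. Note that 268435456 bytes, the failing range length reported later in the thread, is exactly 256 MiB, consistent with a 256 MiB chunk size.

import os

# Rough illustration: split the file into fixed-size byte ranges, one per
# partition; the reader is expected to extend each range to whole rows.
chunksize = 256 * 2**20  # 256 MiB == 268435456 bytes
size = os.path.getsize('left.csv')
byte_ranges = [(offset, min(chunksize, size - offset))
               for offset in range(0, size, chunksize)]
# each partition then runs roughly:
#     cudf.read_csv('left.csv', byte_range=byte_ranges[i])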
I’m afraid I’m pretty clueless when it comes to the csv_reader. You’d need to ping someone working on cuIO directly.
|
Probably related to byte_range, given that the call is otherwise the same. @mrocklin can you post the arguments of the failing cudf.read_csv call (within dgd.read_csv)? I cannot repro the issue locally otherwise. |
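One hedged way to capture those arguments, assuming everything runs in a single process (no distributed cluster, so the patch applies in-process), is to wrap cudf.read_csv before running the dask_cudf code:

import functools

import cudf

_orig_read_csv = cudf.read_csv

@functools.wraps(_orig_read_csv)
def _logging_read_csv(*args, **kwargs):
    # Log every call so the exact failing arguments are visible.
    print('cudf.read_csv called with:', args, kwargs)
    return _orig_read_csv(*args, **kwargs)

cudf.read_csv = _logging_read_csv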
Yeah, I'm working on getting a reproducible failure now |
Small update. This has nothing to do with joins or with the later partitions. I can replicate it with the following:

import dask_cudf

left = dask_cudf.read_csv('left.csv').partitions[:1]
left.compute()

Still trying to reduce it down so that it doesn't include dask. |
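As a hedged aside, one way to see what that single partition actually runs (dask internals vary by version) is to dump the task graph:

# Continuing from the snippet above: the cudf.read_csv task and its
# byte_range argument should show up among the graph's values.
for key, task in dict(left.dask).items():
    print(key, task)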
Yeah, this will do it. Here is @randerzander's prep code from above:

import pandas as pd
import numpy as np

nelem = 100000000

# generate 2.1 and 1.1 gb files, takes about 4 minutes
df_0 = pd.DataFrame({'key': np.arange(0, nelem), 'zeros': np.zeros(nelem)})
df_0.to_csv('left.csv')

And then here is failing cudf code to segfault:

import cudf
cudf.read_csv('left.csv', byte_range=(0, 268435456))

Interestingly, it works fine with a byte_range that is one byte shorter or one byte longer:

In [9]: cudf.read_csv('left.csv', byte_range=(0, 268435455))
Out[9]: <cudf.DataFrame ncols=3 nrows=13211713 >

In [10]: cudf.read_csv('left.csv', byte_range=(0, 268435457))
Out[10]: <cudf.DataFrame ncols=3 nrows=13211713 >

In [11]: cudf.read_csv('left.csv', byte_range=(0, 268435456))
Segmentation fault

If it's ok, at this point I'm going to hand investigation off to you all. |
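A hedged way to scan for failing lengths without the segfault killing the scanning session is to run each attempt in a subprocess:

import subprocess
import sys

# Each byte_range attempt runs in its own interpreter, so a segfault in
# one attempt only kills that child process, not the scan itself.
for length in range(268435450, 268435463):
    snippet = f"import cudf; cudf.read_csv('left.csv', byte_range=(0, {length}))"
    result = subprocess.run([sys.executable, '-c', snippet])
    status = 'ok' if result.returncode == 0 else f'crashed (returncode {result.returncode})'
    print(length, status)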
Thanks, @mrocklin. I'll look into this today or tomorrow. |
Got a repro with the steps posted above. After some experimentation, it turns out that the issue repros with any test where the byte range ends exactly at the end of a page (4096B). Note that 268435456 = 65536 × 4096, so the failing range above ends exactly on a page boundary, while 268435455 and 268435457 do not. |
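That observation is easy to verify against the numbers above, and it suggests a minimal page-aligned repro; the small-file construction below is a hypothetical sketch, not from the original report:

# The failing endpoint is the only page-aligned one of the three:
for end in (268435455, 268435456, 268435457):
    print(end, end % 4096)  # -> 4095, 0, 1

# Hypothetical minimal repro: a byte range ending exactly on a 4096B
# page boundary should trigger the crash on an unfixed build.
import cudf
import numpy as np
import pandas as pd

pd.DataFrame({'key': np.arange(10000)}).to_csv('small.csv')  # > 4096 bytes
cudf.read_csv('small.csv', byte_range=(0, 4096))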
@randerzander @mrocklin Can you please merge rapidsai/cudf#1044 and check if it fixes the issue? Verified the fix locally. |
Thanks, @vuule. Using your PR, the problem looks resolved. Will close this issue once the PR merges into cuDF. |
@randerzander The PR has been merged into branch-0.6 |
Thanks, @vuule ! |