This repository has been archived by the owner on Jul 16, 2019. It is now read-only.

multi-gb joins result in hangs #102

Closed
randerzander opened this issue Feb 22, 2019 · 15 comments

Comments

@randerzander
Contributor

randerzander commented Feb 22, 2019

While working with the GHCN weather dataset, I ran into a "ValueError: All series must be of same type", which only occurs when working with more than ~1 GB of data. I will file that in a separate issue.

The below issue occurs both when setting up a LocalCudaCluster (omitted for simplicity) and when using dask_cudf directly, without a cluster.

While trying to boil it down to a simpler repro, I ran into hanging/restarting kernels:

import pandas as pd
import numpy as np
import dask_cudf as dgd
import cudf

nelem = 100000000

# generate a 2.1 GB and a 1.1 GB file; takes about 4 minutes
df_0 = pd.DataFrame({'key': range(0, nelem), 'zeros': np.zeros(nelem)})
df_0.to_csv('left.csv')
df_1 = pd.DataFrame({'key': range(0, int(nelem/2)), 'ones': np.ones(int(nelem/2))})
df_1.to_csv('right.csv')

# runs fast, no issue
left = cudf.read_csv('left.csv')
right = cudf.read_csv('right.csv')
joined = left.merge(right, on=['key'], how='outer')
joined.head().to_pandas()

# hangs, restarts Jupyter kernels
left = dgd.read_csv('left.csv')
right = dgd.read_csv('right.csv')
joined = left.merge(right, on=['key'], how='outer')
joined.head().to_pandas()
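For context on why the dask_cudf path behaves differently from plain cudf: dask_cudf reads a large CSV by splitting it into byte ranges and issuing one cudf.read_csv(..., byte_range=(offset, size)) call per partition. A minimal sketch of that partitioning arithmetic (illustrative only; the helper name and default chunk size are assumptions, not dask_cudf's actual code):

```python
# Illustrative sketch (hypothetical helper, not dask_cudf internals):
# split a file of file_size bytes into (offset, size) byte ranges,
# one per partition.
def byte_ranges(file_size, chunk_size=256 * 2**20):
    """Yield (offset, size) pairs covering file_size bytes."""
    offset = 0
    while offset < file_size:
        yield (offset, min(chunk_size, file_size - offset))
        offset += chunk_size

# A ~2.1 GB file splits into 256 MiB chunks plus a short tail:
parts = list(byte_ranges(2_100_000_000))
```

Each of those ranges becomes a separate cudf.read_csv call, which is why a bug tied to a specific byte_range surfaces only under dask_cudf.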

From Jupyter logs:

KernelRestarter: restarting kernel (1/5), keep random ports
kernel 27543bfb-967e-4a50-b77a-665ec0443502 restarted
kernel 27543bfb-967e-4a50-b77a-665ec0443502 restarted
@mrocklin
Collaborator

I get a segfault when I try this. Here is the traceback from gdb:

#0  0x00007ffff793db7a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffe14efaaf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffe136258f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffe136335c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffe127c48e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fffe127c7f6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fffe13cd025 in cuMemcpy () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fffe2133892 in ?? () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#8  0x00007fffe2113216 in ?? () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#9  0x00007fffe2139318 in cudaMemcpy () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#10 0x00007fffe3da8cfc in launch_storeRecordStart(char const*, unsigned long, raw_csv_*) ()
   from /home/nfs/mrocklin/miniconda/envs/cudf/lib/libcudf.so
#11 0x00007fffe3dad29b in read_csv () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/libcudf.so
#12 0x00007ffff7e0c630 in ffi_call_unix64 () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/python3.7/lib-dynload/../../libffi.so.6
#13 0x00007ffff7e0bfed in ffi_call () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/python3.7/lib-dynload/../../libffi.so.6
#14 0x00007fffe6a5d0c4 in cdata_call ()
   from /home/nfs/mrocklin/miniconda/envs/cudf/lib/python3.7/site-packages/_cffi_backend.cpython-37m-x86_64-linux-gnu.so
#15 0x000055555567c38e in PyObject_Call () at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Objects/call.c:245
#16 0x000055555572dfdd in do_call_core (kwdict=0x0, callargs=0x7fffd1af24a8, func=0x7ffff7ef1738)
    at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Python/ceval.c:4631
#17 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Python/ceval.c:3191
#18 0x000055555566a929 in _PyEval_EvalCodeWithName ()
    at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Python/ceval.c:3930
#19 0x00005555556bbad7 in _PyFunction_FastCallKeywords ()
    at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Objects/call.c:433

cc @jrhemstad for triage

@jrhemstad
Contributor

@mrocklin that looks like an error in the csv_reader:

#11 0x00007fffe3dad29b in read_csv () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/libcudf.so

@mrocklin
Collaborator

What is the right way to dig into this deeper? Try to get a reproducer with just cudf?

@randerzander
Contributor Author

randerzander commented Feb 22, 2019

Note that when using cudf directly (snippet is included in the repro), the same join works fine.

If the trace @jrhemstad is referring to is in the CSV reader, might it be related to the use of byte_range? @mjsamoht FYI

@jrhemstad
Contributor

jrhemstad commented Feb 22, 2019 via email

@vuule

vuule commented Feb 22, 2019

Probably related to byte_range, given that the call is otherwise the same. @mrocklin, can you post the arguments of the failing cudf.read_csv call (within dgd.read_csv)? I cannot repro the issue locally otherwise.

@mrocklin
Collaborator

Yeah, I'm working on getting a reproducible failure now.

@mrocklin
Collaborator

Small update. This has nothing to do with joins or with the later partitions. I can replicate it with the following:

import dask_cudf

left = dask_cudf.read_csv('left.csv').partitions[:1]
left.compute()

Still trying to reduce it so that it doesn't include dask.

@mrocklin
Collaborator

Yeah, this will do it.

Here is @randerzander's prep code from above:

import pandas as pd
import numpy as np
nelem = 100000000

# generate a 2.1 GB file; takes about 4 minutes
df_0 = pd.DataFrame({'key': np.arange(0, nelem), 'zeros': np.zeros(nelem)})
df_0.to_csv('left.csv')

And here is cudf code that segfaults:

import cudf

cudf.read_csv('left.csv', byte_range=(0, 268435456))

Interestingly, it works fine when the byte_range length is one byte less or one byte more:

In [9]: cudf.read_csv('left.csv', byte_range=(0, 268435455))
Out[9]: <cudf.DataFrame ncols=3 nrows=13211713 >

In [10]: cudf.read_csv('left.csv', byte_range=(0, 268435457))
Out[10]: <cudf.DataFrame ncols=3 nrows=13211713 >

In [11]: cudf.read_csv('left.csv', byte_range=(0, 268435456))
Segmentation fault

If it's OK, at this point I'm going to hand investigation off to you all.

@vuule

vuule commented Feb 25, 2019

Thanks, @mrocklin. I'll look into this today or tomorrow.

@vuule

vuule commented Feb 25, 2019

Got a repro with the steps posted above. After some experimentation, it turns out that the issue reproduces in any test where the byte range ends exactly at the end of a page (4096 B).
However, the issue is intermittent with small byte_range sizes.
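For reference, the byte_range length from the failing call above fits this explanation: it is exactly 256 MiB, a multiple of the 4096-byte page size, while the neighboring lengths that worked are not. A quick sanity check, nothing more:

```python
# 268435456 bytes is exactly 256 MiB and a multiple of the 4096-byte
# page size; the lengths one byte less/more (which worked) are not.
failing_len = 268_435_456
assert failing_len == 256 * 2**20
assert failing_len % 4096 == 0
assert (failing_len - 1) % 4096 != 0
assert (failing_len + 1) % 4096 != 0
```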

@vuule

vuule commented Feb 25, 2019

@randerzander @mrocklin Can you please merge rapidsai/cudf#1044 and check if it fixes the issue? I verified the fix locally.

@randerzander
Contributor Author

randerzander commented Feb 26, 2019

Thanks, @vuule. Using your PR, the problem looks resolved.

I will close this issue once the PR merges into cuDF.

@vuule

vuule commented Feb 27, 2019

@randerzander The PR has been merged into branch-0.6

@randerzander
Contributor Author

Thanks, @vuule !
