This repository has been archived by the owner on Jul 16, 2019. It is now read-only.

multi-gb joins result in hangs #102

Closed
randerzander opened this issue Feb 22, 2019 · 15 comments

Comments

@randerzander
Contributor

randerzander commented Feb 22, 2019

While working with the GHCN weather dataset, I ran into a "ValueError: All series must be of same type", which only occurs when working with more than ~1 GB of data. I will file that in a separate issue.

The below issue occurs both when setting up a LocalCudaCluster (omitted for simplicity) and when using dask_cudf directly, without a cluster.

While trying to boil it down to a simpler repro, I ran into hanging/restarting kernels:

import pandas as pd
import numpy as np
import dask_cudf as dgd
import cudf

nelem = 100000000

# generate a 2.1 GB and a 1.1 GB file; takes about 4 minutes
df_0 = pd.DataFrame({'key': range(0, nelem), 'zeros': np.zeros(nelem)})
df_0.to_csv('left.csv')
df_1 = pd.DataFrame({'key': range(0, int(nelem/2)), 'ones': np.ones(int(nelem/2))})
df_1.to_csv('right.csv')

# runs fast, no issue
left = cudf.read_csv('left.csv')
right = cudf.read_csv('right.csv')
joined = left.merge(right, on=['key'], how='outer')
joined.head().to_pandas()

# hangs, restarts Jupyter kernels
left = dgd.read_csv('left.csv')
right = dgd.read_csv('right.csv')
joined = left.merge(right, on=['key'], how='outer')
joined.head().to_pandas()
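For context on why the dask_cudf path behaves differently from plain cudf: dask_cudf reads a large CSV by splitting it into byte ranges and issuing one cudf.read_csv(..., byte_range=(offset, size)) call per partition. A minimal sketch of that partitioning arithmetic (illustrative only; the helper name and default chunk size are assumptions, not dask_cudf's actual code):

```python
# Illustrative sketch (hypothetical helper, not dask_cudf internals):
# split a file of file_size bytes into (offset, size) byte ranges,
# one per partition.
def byte_ranges(file_size, chunk_size=256 * 2**20):
    """Yield (offset, size) pairs covering file_size bytes."""
    offset = 0
    while offset < file_size:
        yield (offset, min(chunk_size, file_size - offset))
        offset += chunk_size

# A ~2.1 GB file splits into 256 MiB chunks plus a short tail:
parts = list(byte_ranges(2_100_000_000))
```

Each of those ranges becomes a separate cudf.read_csv call, which is why a bug tied to a specific byte_range surfaces only under dask_cudf.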

From Jupyter logs:

KernelRestarter: restarting kernel (1/5), keep random ports
kernel 27543bfb-967e-4a50-b77a-665ec0443502 restarted
kernel 27543bfb-967e-4a50-b77a-665ec0443502 restarted
@mrocklin
Collaborator

I get a segfault when I try this. Here is the traceback from gdb:

#0  0x00007ffff793db7a in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffe14efaaf in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007fffe136258f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007fffe136335c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fffe127c48e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fffe127c7f6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fffe13cd025 in cuMemcpy () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fffe2133892 in ?? () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#8  0x00007fffe2113216 in ?? () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#9  0x00007fffe2139318 in cudaMemcpy () from /usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2
#10 0x00007fffe3da8cfc in launch_storeRecordStart(char const*, unsigned long, raw_csv_*) ()
   from /home/nfs/mrocklin/miniconda/envs/cudf/lib/libcudf.so
#11 0x00007fffe3dad29b in read_csv () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/libcudf.so
#12 0x00007ffff7e0c630 in ffi_call_unix64 () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/python3.7/lib-dynload/../../libffi.so.6
#13 0x00007ffff7e0bfed in ffi_call () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/python3.7/lib-dynload/../../libffi.so.6
#14 0x00007fffe6a5d0c4 in cdata_call ()
   from /home/nfs/mrocklin/miniconda/envs/cudf/lib/python3.7/site-packages/_cffi_backend.cpython-37m-x86_64-linux-gnu.so
#15 0x000055555567c38e in PyObject_Call () at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Objects/call.c:245
#16 0x000055555572dfdd in do_call_core (kwdict=0x0, callargs=0x7fffd1af24a8, func=0x7ffff7ef1738)
    at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Python/ceval.c:4631
#17 _PyEval_EvalFrameDefault () at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Python/ceval.c:3191
#18 0x000055555566a929 in _PyEval_EvalCodeWithName ()
    at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Python/ceval.c:3930
#19 0x00005555556bbad7 in _PyFunction_FastCallKeywords ()
    at /home/conda/feedstock_root/build_artifacts/python_1550451629915/work/Objects/call.c:433

cc @jrhemstad for triage

@jrhemstad
Contributor

@mrocklin that looks like an error in the csv_reader:

#11 0x00007fffe3dad29b in read_csv () from /home/nfs/mrocklin/miniconda/envs/cudf/lib/libcudf.so

@mrocklin
Collaborator

What is the right way to dig into this deeper? Try to get a reproducer with just cudf?

@randerzander
Contributor Author

randerzander commented Feb 22, 2019

Note that when using cudf directly (snippet is included in the repro), the same join works fine.

If the trace @jrhemstad is referring to is in the CSV reader, might it be related to the use of byte_range? @mjsamoht FYI

@jrhemstad
Contributor

jrhemstad commented Feb 22, 2019 via email

@vuule

vuule commented Feb 22, 2019

Probably related to byte_range, given that the call is otherwise the same. @mrocklin, can you post the arguments of the failing cudf.read_csv call (within dgd.read_csv)? I cannot repro the issue locally otherwise.

@mrocklin
Collaborator

Yeah, I'm working on getting a reproducible failure now.

@mrocklin
Collaborator

Small update. This has nothing to do with joins or with the later partitions. I can replicate it with the following:

import dask_cudf

left = dask_cudf.read_csv('left.csv').partitions[:1]
left.compute()

Still trying to reduce it so that it doesn't include dask.

@mrocklin
Collaborator

Yeah, this will do it.

Here is @randerzander's prep code from above:

import pandas as pd
import numpy as np
nelem = 100000000

# generate a 2.1 GB file; takes about 4 minutes
df_0 = pd.DataFrame({'key': np.arange(0, nelem), 'zeros': np.zeros(nelem)})
df_0.to_csv('left.csv')

And here is cudf code that segfaults:

import cudf

cudf.read_csv('left.csv', byte_range=(0, 268435456))

Interestingly, it works fine when the byte_range length is one byte less or one byte more:

In [9]: cudf.read_csv('left.csv', byte_range=(0, 268435455))
Out[9]: <cudf.DataFrame ncols=3 nrows=13211713 >

In [10]: cudf.read_csv('left.csv', byte_range=(0, 268435457))
Out[10]: <cudf.DataFrame ncols=3 nrows=13211713 >

In [11]: cudf.read_csv('left.csv', byte_range=(0, 268435456))
Segmentation fault

If it's OK, at this point I'm going to hand investigation off to you all.

@vuule

vuule commented Feb 25, 2019

Thanks, @mrocklin. I'll look into this today or tomorrow.

@vuule

vuule commented Feb 25, 2019

Got a repro with the steps posted above. After some experimentation, it turns out that the issue reproduces in any test where the byte range ends exactly at the end of a page (4096 B).
However, the issue is intermittent with small byte_range sizes.
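For reference, the byte_range length from the failing call above fits this explanation: it is exactly 256 MiB, a multiple of the 4096-byte page size, while the neighboring lengths that worked are not. A quick sanity check, nothing more:

```python
# 268435456 bytes is exactly 256 MiB and a multiple of the 4096-byte
# page size; the lengths one byte less/more (which worked) are not.
failing_len = 268_435_456
assert failing_len == 256 * 2**20
assert failing_len % 4096 == 0
assert (failing_len - 1) % 4096 != 0
assert (failing_len + 1) % 4096 != 0
```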

@vuule

vuule commented Feb 25, 2019

@randerzander @mrocklin Can you please merge rapidsai/cudf#1044 and check if it fixes the issue? I verified the fix locally.

@randerzander
Contributor Author

randerzander commented Feb 26, 2019

Thanks, @vuule. Using your PR, the problem looks resolved.

I will close this issue once the PR merges into cuDF.

@vuule

vuule commented Feb 27, 2019

@randerzander The PR has been merged into branch-0.6

@randerzander
Contributor Author

Thanks, @vuule !
