
[BUG] cudaErrorIllegalAddress an illegal memory access was encountered #563

Closed
JustPlay opened this issue Sep 18, 2020 · 26 comments
Labels: ? - Needs Triage (need team to review and classify), bug (something isn't working)

Comments

@JustPlay

JustPlay commented Sep 18, 2020

Describe the bug
When running rapids-0.2 with cudf-0.16 and rmm-0.16, I encountered the following RMM error:
[screenshots of the RMM error output]

I'm using
RMM with commit-id: f591436
cuDF with commit-id: 32e6c1d9369f9a5bfe81958bf1a4a81af51bb59e

cuDF and RMM were built using https://github.com/rapidsai/cudf/blob/branch-0.16/java/ci/build-in-docker.sh, but with PTDS turned on:

export WORKSPACE=$(pwd)
export PARALLEL_LEVEL=4
export SKIP_JAVA_TESTS=true
export BUILD_CPP_TESTS=OFF
export OUT=out
export ENABLE_PTDS=ON

set -e
gcc --version

SIGN_FILE=$1
#Set absolute path for OUT_PATH
OUT_PATH=$WORKSPACE/$OUT

# set on Jenkins parameter
if [ -z $RMM_VERSION ]
then
RMM_VERSION=`git describe --tags | grep -o -E '([0-9]+\.[0-9]+)'`
fi
echo "RMM_VERSION: $RMM_VERSION, SIGN_FILE: $SIGN_FILE, SKIP_JAVA_TESTS: $SKIP_JAVA_TESTS,\
 BUILD_CPP_TESTS: $BUILD_CPP_TESTS, ENABLED_PTDS: $ENABLE_PTDS, OUT_PATH: $OUT_PATH"

INSTALL_PREFIX=/usr/local/rapids
export GIT_COMMITTER_NAME="ci"
export GIT_COMMITTER_EMAIL="[email protected]"
export CUDACXX=/usr/local/cuda/bin/nvcc
export RMM_ROOT=$INSTALL_PREFIX
export DLPACK_ROOT=$INSTALL_PREFIX
export LIBCUDF_KERNEL_CACHE_PATH=/rapids

rapids_dir=$(pwd)

[[ ! -d ${rapids_dir} ]] && mkdir -p ${rapids_dir}

cd ${rapids_dir}
# git clone --recurse-submodules https://github.com/rapidsai/rmm.git -b branch-$RMM_VERSION
# git clone --recurse-submodules https://github.com/rapidsai/dlpack.git -b cudf

###### Build rmm/dlpack ######
mkdir -p ${rapids_dir}/rmm/build
cd ${rapids_dir}/rmm/build
echo "RMM SHA: `git rev-parse HEAD`"
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DBUILD_TESTS=$BUILD_CPP_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS
make -j install

# Install spdlog headers from RMM build
(cd ${rapids_dir}/rmm/build/_deps/spdlog-src && find include/spdlog | cpio -pmdv $INSTALL_PREFIX)

mkdir -p ${rapids_dir}/dlpack/build
cd ${rapids_dir}/dlpack/build
echo "DLPACK SHA: `git rev-parse HEAD`"
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DBUILD_TESTS=$BUILD_CPP_TESTS
make -j install

###### Build libcudf ######
# rm -rf $WORKSPACE/cpp/build
mkdir -p $WORKSPACE/cpp/build
cd $WORKSPACE/cpp/build
cmake .. -DUSE_NVTX=OFF -DARROW_STATIC_LIB=ON -DBoost_USE_STATIC_LIBS=ON -DBUILD_TESTS=$BUILD_CPP_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS
make -j
make install DESTDIR=$INSTALL_PREFIX

###### Build cudf jar ######
BUILD_ARG="-Dmaven.repo.local=$WORKSPACE/.m2 -DskipTests=$SKIP_JAVA_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS"
if [ "$SIGN_FILE" == true ]; then
    # Build javadoc and sources only when SIGN_FILE is true
    BUILD_ARG="$BUILD_ARG -Prelease"
fi

cd $WORKSPACE/java
/opt/toolchain/maven-3.6.3/bin/mvn -B clean package $BUILD_ARG

###### Stash Jar files ######
rm -rf $OUT_PATH
mkdir -p $OUT_PATH
cp -f target/*.jar $OUT_PATH
@JustPlay added the "? - Needs Triage" and "bug" labels Sep 18, 2020
@harrism
Member

harrism commented Sep 20, 2020

@JustPlay this won't be possible to triage unless you can provide a minimal reproducer.

@rongou @jlowe if this is a Spark repro I will need your help.

@rongou
Contributor

rongou commented Sep 21, 2020

I think I've seen similar errors, but haven't been able to reproduce it reliably.

@jrhemstad
Contributor

jrhemstad commented Sep 21, 2020

The fact that these errors are picked up in RMM does not mean that they originate in RMM. RMM could just be picking up latent, asynchronous CUDA errors originating somewhere else.
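For illustration, here is a minimal, self-contained sketch (not taken from the Spark job; the kernel and names are invented) of how an asynchronous CUDA error surfaces far from its origin, for example inside an allocator's synchronize call:

```cpp
// A kernel with an out-of-bounds write poisons the context asynchronously;
// the error is only reported by a *later*, unrelated call such as the
// cudaStreamSynchronize() a memory resource performs internally.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void wild_write(int* p) {
  // Deliberate bug: store far outside the allocation.
  p[1ull << 40] = 42;
}

int main() {
  int* d = nullptr;
  cudaMalloc(&d, sizeof(int));

  wild_write<<<1, 1>>>(d);
  // The launch itself usually reports success; the fault happens asynchronously.
  printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

  // The illegal access is reported here, at a later synchronization point,
  // even though this call did nothing wrong itself.
  printf("at sync: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
  return 0;
}
```

Running such a case under cuda-memcheck points at the offending kernel rather than at the call that happened to report the error.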

@rongou
Contributor

rongou commented Sep 21, 2020

Hmm, seems like this can reliably produce a test failure when built with -DPER_THREAD_DEFAULT_STREAM=ON. Not sure if this is the cause:

for i in {1..100}; do ./gtests/DEVICE_SCALAR_TEST || break; done

@jrhemstad
Contributor

> Hmm, seems like this can reliably produce a test failure when built with -DPER_THREAD_DEFAULT_STREAM=ON. Not sure if this is the cause:
>
> for i in {1..100}; do ./gtests/DEVICE_SCALAR_TEST || break; done

Aha. I see what the problem is. I'll have a PR with the fix in a moment.

@jrhemstad
Contributor

This should resolve it: #569

@jrhemstad
Contributor

Also see #570 re: asynchrony of set_value.
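(For background on why the asynchrony of a set_value-style call matters, here is a generic sketch in plain CUDA, not RMM's actual device_scalar implementation: with a pinned host source the copy is truly asynchronous, so reusing or freeing the source before the copy has executed is a race.)

```cpp
#include <cuda_runtime.h>

// Generic sketch (not RMM's device_scalar): with an asynchronous "set value",
// the host source must stay valid, and must not be overwritten, until the
// enqueued copy has actually executed on the stream.
void set_then_reuse(int* d_scalar, cudaStream_t stream) {
  int* h_value = nullptr;
  cudaMallocHost(&h_value, sizeof(int));  // pinned => the copy is truly async

  *h_value = 1;
  cudaMemcpyAsync(d_scalar, h_value, sizeof(int),
                  cudaMemcpyHostToDevice, stream);

  // *h_value = 2;          // WRONG here: races with the still-pending copy
  // cudaFreeHost(h_value); // WRONG here: frees the source while in flight

  cudaStreamSynchronize(stream);  // order host reuse/free after the copy
  *h_value = 2;
  cudaFreeHost(h_value);
}
```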

@JustPlay
Author

JustPlay commented Sep 22, 2020

#569

[screenshots of the errors]

Two kinds of "illegal" errors: one is cudaErrorIllegalAddress, the other is cudaErrorMisalignedAddress (on the same code lines).

And TPC-DS queries 14a and 14b have a very high probability of triggering this bug.

@jrhemstad

@rongou
Contributor

rongou commented Sep 23, 2020

@jrhemstad I'm still seeing test failures after syncing to head and rebuilding:

$ for i in {1..100}; do ./gtests/DEVICE_SCALAR_TEST || break; done
...
...
[ RUN      ] DeviceScalarTest/0.InitialValue
/home/rou/src/rmm/tests/device_scalar_tests.cpp:60: Failure
Expected equality of these values:
  this->value
    Which is: '\x80' (-128)
  scalar.value()
    Which is: '\0'
[  FAILED  ] DeviceScalarTest/0.InitialValue, where TypeParam = signed char (0 ms)

@harrism
Member

harrism commented Sep 23, 2020

Seems we need to add a DEVICE_SCALAR_PTDS_TEST...

@rongou
Contributor

rongou commented Sep 24, 2020

Looked at DEVICE_SCALAR_TEST a bit today. It seems the problem is we are creating a stream and destroying it in every test, somehow causing an issue when PTDS is enabled. Probably not related to the errors seen in Spark jobs.
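(Sketch of the create/destroy-a-stream-per-test pattern being described, not the actual RMM test code. Two properties make this pattern fragile next to a pooling allocator and PTDS: cudaStreamDestroy returns without waiting for work already enqueued on the stream, and the driver may hand the same cudaStream_t value back from the next cudaStreamCreate, so any state keyed on the raw handle, such as a per-stream free list, can conflate the old and the new stream.)

```cpp
#include <cuda_runtime.h>
#include <gtest/gtest.h>

// Illustrative stream-per-test fixture (not the RMM tests).
struct StreamPerTest : public ::testing::Test {
  cudaStream_t stream{};

  void SetUp() override { ASSERT_EQ(cudaSuccess, cudaStreamCreate(&stream)); }

  void TearDown() override {
    // Drain the stream before destroying it (and before returning any pooled
    // memory used on it), so a recycled handle never carries work in flight.
    ASSERT_EQ(cudaSuccess, cudaStreamSynchronize(stream));
    ASSERT_EQ(cudaSuccess, cudaStreamDestroy(stream));
  }
};

TEST_F(StreamPerTest, Example) {
  int* p = nullptr;
  ASSERT_EQ(cudaSuccess, cudaMalloc(&p, sizeof(int)));
  ASSERT_EQ(cudaSuccess, cudaMemsetAsync(p, 0, sizeof(int), stream));
  ASSERT_EQ(cudaSuccess, cudaStreamSynchronize(stream));
  ASSERT_EQ(cudaSuccess, cudaFree(p));
}
```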

@JustPlay
Author

> Looked at DEVICE_SCALAR_TEST a bit today. It seems the problem is we are creating a stream and destroying it in every test, somehow causing an issue when PTDS is enabled. Probably not related to the errors seen in Spark jobs.

TPC-DS queries 9, 14a, 14b, 99, 88, 35, 58, 82, 46, 87, 38, 70, 48, 59, 11, 61, 24b, 78, 50, 8, and 25 are the most affected. @rongou

@JustPlay
Author

JustPlay commented Sep 28, 2020

> Looked at DEVICE_SCALAR_TEST a bit today. It seems the problem is we are creating a stream and destroying it in every test, somehow causing an issue when PTDS is enabled. Probably not related to the errors seen in Spark jobs.

Maybe it is due to some unknown issue in pool_memory_resource; arena does not seem to have this problem (I will do more TPC-DS tests and update if I find something new). @jrhemstad

@JustPlay
Author

JustPlay commented Sep 29, 2020

CUDA error at: /home/0.alpha/rapids.pkgs/20200917/cudf-0.16.memory_resource/cudf.dev.arena/rapids/include/rmm/mr/device/detail/arena.hpp:623: cudaErrorIllegalAddress an illegal memory access was encountered

[screenshots of the error output]

@rongou I found a cudaErrorIllegalAddress when using arena (TPC-DS 64 and TPC-DS 46).

@harrism
Member

harrism commented Sep 29, 2020

@jrhemstad seems #569 did not resolve this?

@JustPlay
Author

> seems #569 did not resolve this?

Based on my TPC-DS tests using RMM (pool_memory_resource) at 6cc497a, there still exist many cudaErrorIllegalAddress errors.

BTW: arena works much better, with just one cudaErrorIllegalAddress error (see the screenshots above).

@harrism
Member

harrism commented Sep 29, 2020

Building RMM debug and running under cuda-memcheck could provide deeper insight (a full call stack inside RMM).

@rongou
Contributor

rongou commented Sep 29, 2020

It's likely there are race conditions in libcudf and/or the spark-rapids plugin. RMM is only surfacing these issues when synchronizing the stream or waiting for events.
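(To make the kind of race concrete, here is an invented sketch, not code from libcudf or spark-rapids. With PTDS, each host thread gets its own default stream, so the producer and consumer below could simply be two threads using "their" stream. If the cudaStreamWaitEvent call is dropped, the consumer races with the producer; when the racing data is, say, offsets, or a buffer a pool has already recycled, downstream kernels can compute wild addresses, and the error is only reported at whatever synchronization happens next, often inside the memory resource.)

```cpp
#include <cuda_runtime.h>

__global__ void fill(int* p, int v, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] = v;
}

// Producer writes the buffer on its stream and records an event when done.
void producer(int* buf, size_t n, cudaStream_t s_prod, cudaEvent_t done) {
  fill<<<static_cast<unsigned>((n + 255) / 256), 256, 0, s_prod>>>(buf, 7, n);
  cudaEventRecord(done, s_prod);
}

// Consumer orders its stream after the producer's work before touching the
// buffer. Removing the wait is exactly the kind of latent race that only
// shows up later as a CUDA error at an unrelated synchronization point.
void consumer(int* buf, size_t n, cudaStream_t s_cons, cudaEvent_t done) {
  cudaStreamWaitEvent(s_cons, done, 0);
  fill<<<static_cast<unsigned>((n + 255) / 256), 256, 0, s_cons>>>(buf, 8, n);
}
```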

@harrism
Member

harrism commented Sep 30, 2020

So are you no longer seeing the DEVICE_SCALAR_TEST failures?

@rongou
Contributor

rongou commented Sep 30, 2020

I am, but as I said, that seems to be caused by creating and destroying streams in every test. Maybe the CUDA driver is reusing some data structure for streams that's causing the problem. In any case, it seems like a different issue from the cudaErrorIllegalAddress errors.

@harrism
Member

harrism commented Sep 30, 2020

I don't understand why you get these failures and nobody else in the team does (and CI doesn't). CI is currently only testing PTDS mode. (#555)

@JustPlay
Author

> So are you no longer seeing the DEVICE_SCALAR_TEST failures?

There are still errors in device_scalar.hpp (pool_memory_resource) when testing TPC-DS (I have not run DEVICE_SCALAR_TEST).

[screenshot of the error]

@harrism
Member

harrism commented Oct 1, 2020

@JustPlay you've made that abundantly clear. :) If @rongou is correct and the problem is not in RMM then we should probably open an issue on the spark-rapids repo.

@rongou
Contributor

rongou commented Oct 18, 2020

@JustPlay are you still seeing these errors? I ran all the TPC-DS queries in a loop several times and haven't seen any memory errors. I'm using the latest arena_memory_resource and running both legacy and per-thread default streams.

@JustPlay
Author

JustPlay commented Oct 22, 2020

> @JustPlay are you still seeing these errors? I ran all the TPC-DS queries in a loop several times and haven't seen any memory errors. I'm using the latest arena_memory_resource and running both legacy and per-thread default streams.

Sorry, I have not tested rapids recently; I will report if I find anything.

Thanks

@harrism
Member

harrism commented Oct 22, 2020

Closing for now. Please reopen if you run into the errors again.

harrism closed this as completed Oct 22, 2020