
[BUG] cudaErrorIllegalAddress an illegal memory access was encountered #563

Closed
JustPlay opened this issue Sep 18, 2020 · 26 comments
Labels: ? - Needs Triage (need team to review and classify), bug (something isn't working)

Comments

@JustPlay

JustPlay commented Sep 18, 2020

Describe the bug
When running rapids-0.2 with cudf-0.16 and rmm-0.16, I encountered the following RMM error:
[screenshots of the RMM error output]

I'm using
RMM with commit-id: f591436
cuDF with commit-id: 32e6c1d9369f9a5bfe81958bf1a4a81af51bb59e

cuDF and RMM were built using https://github.com/rapidsai/cudf/blob/branch-0.16/java/ci/build-in-docker.sh, but with PTDS turned on:

export WORKSPACE=$(pwd)
export PARALLEL_LEVEL=4
export SKIP_JAVA_TESTS=true
export BUILD_CPP_TESTS=OFF
export OUT=out
export ENABLE_PTDS=ON

set -e
gcc --version

SIGN_FILE=$1
#Set absolute path for OUT_PATH
OUT_PATH=$WORKSPACE/$OUT

# set on Jenkins parameter
if [ -z $RMM_VERSION ]
then
RMM_VERSION=`git describe --tags | grep -o -E '([0-9]+\.[0-9]+)'`
fi
echo "RMM_VERSION: $RMM_VERSION, SIGN_FILE: $SIGN_FILE, SKIP_JAVA_TESTS: $SKIP_JAVA_TESTS,\
 BUILD_CPP_TESTS: $BUILD_CPP_TESTS, ENABLED_PTDS: $ENABLE_PTDS, OUT_PATH: $OUT_PATH"

INSTALL_PREFIX=/usr/local/rapids
export GIT_COMMITTER_NAME="ci"
export GIT_COMMITTER_EMAIL="[email protected]"
export CUDACXX=/usr/local/cuda/bin/nvcc
export RMM_ROOT=$INSTALL_PREFIX
export DLPACK_ROOT=$INSTALL_PREFIX
export LIBCUDF_KERNEL_CACHE_PATH=/rapids

rapids_dir=$(pwd)

[[ ! -d ${rapids_dir} ]] && mkdir -p ${rapids_dir}

cd ${rapids_dir}
# git clone --recurse-submodules https://github.com/rapidsai/rmm.git -b branch-$RMM_VERSION
# git clone --recurse-submodules https://github.com/rapidsai/dlpack.git -b cudf

###### Build rmm/dlpack ######
mkdir -p ${rapids_dir}/rmm/build
cd ${rapids_dir}/rmm/build
echo "RMM SHA: `git rev-parse HEAD`"
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DBUILD_TESTS=$BUILD_CPP_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS
make -j install

# Install spdlog headers from RMM build
(cd ${rapids_dir}/rmm/build/_deps/spdlog-src && find include/spdlog | cpio -pmdv $INSTALL_PREFIX)

mkdir -p ${rapids_dir}/dlpack/build
cd ${rapids_dir}/dlpack/build
echo "DLPACK SHA: `git rev-parse HEAD`"
cmake .. -DCMAKE_INSTALL_PREFIX=$INSTALL_PREFIX -DBUILD_TESTS=$BUILD_CPP_TESTS
make -j install

###### Build libcudf ######
# rm -rf $WORKSPACE/cpp/build
mkdir -p $WORKSPACE/cpp/build
cd $WORKSPACE/cpp/build
cmake .. -DUSE_NVTX=OFF -DARROW_STATIC_LIB=ON -DBoost_USE_STATIC_LIBS=ON -DBUILD_TESTS=$BUILD_CPP_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS
make -j
make install DESTDIR=$INSTALL_PREFIX

###### Build cudf jar ######
BUILD_ARG="-Dmaven.repo.local=$WORKSPACE/.m2 -DskipTests=$SKIP_JAVA_TESTS -DPER_THREAD_DEFAULT_STREAM=$ENABLE_PTDS"
if [ "$SIGN_FILE" == true ]; then
    # Build javadoc and sources only when SIGN_FILE is true
    BUILD_ARG="$BUILD_ARG -Prelease"
fi

cd $WORKSPACE/java
/opt/toolchain/maven-3.6.3/bin/mvn -B clean package $BUILD_ARG

###### Stash Jar files ######
rm -rf $OUT_PATH
mkdir -p $OUT_PATH
cp -f target/*.jar $OUT_PATH
@JustPlay added the "? - Needs Triage" and "bug" labels Sep 18, 2020
@harrism
Member

harrism commented Sep 20, 2020

@JustPlay this won't be possible to triage unless you can provide a minimal reproducer.

@rongou @jlowe if this is a Spark repro I will need your help.

@rongou
Contributor

rongou commented Sep 21, 2020

I think I've seen similar errors, but haven't been able to reproduce it reliably.

@jrhemstad
Contributor

jrhemstad commented Sep 21, 2020

The fact that these errors are picked up in RMM does not mean that they originate in RMM. RMM could just be picking up latent, asynchronous CUDA errors originating somewhere else.
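For illustration, here is a minimal, self-contained sketch (not taken from the Spark job; the kernel and names are invented) of how an asynchronous CUDA error surfaces far from its origin, for example inside an allocator's synchronize call:

```cpp
// A kernel with an out-of-bounds write poisons the context asynchronously;
// the error is only reported by a *later*, unrelated call such as the
// cudaStreamSynchronize() a memory resource performs internally.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void wild_write(int* p) {
  // Deliberate bug: store far outside the allocation.
  p[1ull << 40] = 42;
}

int main() {
  int* d = nullptr;
  cudaMalloc(&d, sizeof(int));

  wild_write<<<1, 1>>>(d);
  // The launch itself usually reports success; the fault happens asynchronously.
  printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

  // The illegal access is reported here, at a later synchronization point,
  // even though this call did nothing wrong itself.
  printf("at sync: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
  return 0;
}
```

Running such a case under cuda-memcheck points at the offending kernel rather than at the call that happened to report the error.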

@rongou
Contributor

rongou commented Sep 21, 2020

Hmm, seems like this can reliably produce a test failure when built with -DPER_THREAD_DEFAULT_STREAM=ON. Not sure if this is the cause:

for i in {1..100}; do ./gtests/DEVICE_SCALAR_TEST || break; done

@jrhemstad
Contributor

> Hmm, seems like this can reliably produce a test failure when built with -DPER_THREAD_DEFAULT_STREAM=ON. Not sure if this is the cause:
>
> for i in {1..100}; do ./gtests/DEVICE_SCALAR_TEST || break; done

Aha. I see what the problem is. I'll have a PR with the fix in a moment.

@jrhemstad
Contributor

This should resolve it: #569

@jrhemstad
Contributor

Also see #570 re: asynchrony of set_value.
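(For background on why the asynchrony of a set_value-style call matters, here is a generic sketch in plain CUDA, not RMM's actual device_scalar implementation: with a pinned host source the copy is truly asynchronous, so reusing or freeing the source before the copy has executed is a race.)

```cpp
#include <cuda_runtime.h>

// Generic sketch (not RMM's device_scalar): with an asynchronous "set value",
// the host source must stay valid, and must not be overwritten, until the
// enqueued copy has actually executed on the stream.
void set_then_reuse(int* d_scalar, cudaStream_t stream) {
  int* h_value = nullptr;
  cudaMallocHost(&h_value, sizeof(int));  // pinned => the copy is truly async

  *h_value = 1;
  cudaMemcpyAsync(d_scalar, h_value, sizeof(int),
                  cudaMemcpyHostToDevice, stream);

  // *h_value = 2;          // WRONG here: races with the still-pending copy
  // cudaFreeHost(h_value); // WRONG here: frees the source while in flight

  cudaStreamSynchronize(stream);  // order host reuse/free after the copy
  *h_value = 2;
  cudaFreeHost(h_value);
}
```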

@JustPlay
Author

JustPlay commented Sep 22, 2020

#569

[screenshots of the errors]

Two kinds of "illegal" errors: one is cudaErrorIllegalAddress, the other is cudaErrorMisalignedAddress (on the same code lines).

And TPC-DS queries 14a and 14b have a very high probability of triggering this bug.

@jrhemstad

@rongou
Contributor

rongou commented Sep 23, 2020

@jrhemstad I'm still seeing test failures after syncing to head and rebuilding:

$ for i in {1..100}; do ./gtests/DEVICE_SCALAR_TEST || break; done
...
...
[ RUN      ] DeviceScalarTest/0.InitialValue
/home/rou/src/rmm/tests/device_scalar_tests.cpp:60: Failure
Expected equality of these values:
  this->value
    Which is: '\x80' (-128)
  scalar.value()
    Which is: '\0'
[  FAILED  ] DeviceScalarTest/0.InitialValue, where TypeParam = signed char (0 ms)

@harrism
Member

harrism commented Sep 23, 2020

Seems we need to add a DEVICE_SCALAR_PTDS_TEST...

@rongou
Contributor

rongou commented Sep 24, 2020

Looked at DEVICE_SCALAR_TEST a bit today. It seems the problem is we are creating a stream and destroying it in every test, somehow causing an issue when PTDS is enabled. Probably not related to the errors seen in Spark jobs.
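(Sketch of the create/destroy-a-stream-per-test pattern being described, not the actual RMM test code. Two properties make this pattern fragile next to a pooling allocator and PTDS: cudaStreamDestroy returns without waiting for work already enqueued on the stream, and the driver may hand the same cudaStream_t value back from the next cudaStreamCreate, so any state keyed on the raw handle, such as a per-stream free list, can conflate the old and the new stream.)

```cpp
#include <cuda_runtime.h>
#include <gtest/gtest.h>

// Illustrative stream-per-test fixture (not the RMM tests).
struct StreamPerTest : public ::testing::Test {
  cudaStream_t stream{};

  void SetUp() override { ASSERT_EQ(cudaSuccess, cudaStreamCreate(&stream)); }

  void TearDown() override {
    // Drain the stream before destroying it (and before returning any pooled
    // memory used on it), so a recycled handle never carries work in flight.
    ASSERT_EQ(cudaSuccess, cudaStreamSynchronize(stream));
    ASSERT_EQ(cudaSuccess, cudaStreamDestroy(stream));
  }
};

TEST_F(StreamPerTest, Example) {
  int* p = nullptr;
  ASSERT_EQ(cudaSuccess, cudaMalloc(&p, sizeof(int)));
  ASSERT_EQ(cudaSuccess, cudaMemsetAsync(p, 0, sizeof(int), stream));
  ASSERT_EQ(cudaSuccess, cudaStreamSynchronize(stream));
  ASSERT_EQ(cudaSuccess, cudaFree(p));
}
```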

@JustPlay
Author

> Looked at DEVICE_SCALAR_TEST a bit today. It seems the problem is we are creating a stream and destroying it in every test, somehow causing an issue when PTDS is enabled. Probably not related to the errors seen in Spark jobs.

TPC-DS queries 9, 14a, 14b, 99, 88, 35, 58, 82, 46, 87, 38, 70, 48, 59, 11, 61, 24b, 78, 50, 8, and 25 are the most affected. @rongou

@JustPlay
Author

JustPlay commented Sep 28, 2020

> Looked at DEVICE_SCALAR_TEST a bit today. It seems the problem is we are creating a stream and destroying it in every test, somehow causing an issue when PTDS is enabled. Probably not related to the errors seen in Spark jobs.

Maybe it is due to some unknown issue in pool_memory_resource; arena does not seem to have this problem (I will do more TPC-DS tests and update if I find something new). @jrhemstad

@JustPlay
Author

JustPlay commented Sep 29, 2020

CUDA error at: /home/0.alpha/rapids.pkgs/20200917/cudf-0.16.memory_resource/cudf.dev.arena/rapids/include/rmm/mr/device/detail/arena.hpp:623: cudaErrorIllegalAddress an illegal memory access was encountered

[screenshots of the error output]

@rongou I found a cudaErrorIllegalAddress when using arena (TPC-DS 64 and TPC-DS 46).

@harrism
Member

harrism commented Sep 29, 2020

@jrhemstad seems #569 did not resolve this?

@JustPlay
Author

> seems #569 did not resolve this?

Based on my TPC-DS tests using RMM (pool_memory_resource) at 6cc497a, there still exist many cudaErrorIllegalAddress errors.

BTW: arena works much better, with just one cudaErrorIllegalAddress error (see the screenshots above).

@harrism
Member

harrism commented Sep 29, 2020

Building RMM debug and running under cuda-memcheck could provide deeper insight (a full call stack inside RMM).

@rongou
Contributor

rongou commented Sep 29, 2020

It's likely there are race conditions in libcudf and/or the spark-rapids plugin. RMM is only surfacing these issues when synchronizing the stream or waiting for events.
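(To make the kind of race concrete, here is an invented sketch, not code from libcudf or spark-rapids. With PTDS, each host thread gets its own default stream, so the producer and consumer below could simply be two threads using "their" stream. If the cudaStreamWaitEvent call is dropped, the consumer races with the producer; when the racing data is, say, offsets, or a buffer a pool has already recycled, downstream kernels can compute wild addresses, and the error is only reported at whatever synchronization happens next, often inside the memory resource.)

```cpp
#include <cuda_runtime.h>

__global__ void fill(int* p, int v, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] = v;
}

// Producer writes the buffer on its stream and records an event when done.
void producer(int* buf, size_t n, cudaStream_t s_prod, cudaEvent_t done) {
  fill<<<static_cast<unsigned>((n + 255) / 256), 256, 0, s_prod>>>(buf, 7, n);
  cudaEventRecord(done, s_prod);
}

// Consumer orders its stream after the producer's work before touching the
// buffer. Removing the wait is exactly the kind of latent race that only
// shows up later as a CUDA error at an unrelated synchronization point.
void consumer(int* buf, size_t n, cudaStream_t s_cons, cudaEvent_t done) {
  cudaStreamWaitEvent(s_cons, done, 0);
  fill<<<static_cast<unsigned>((n + 255) / 256), 256, 0, s_cons>>>(buf, 8, n);
}
```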

@harrism
Member

harrism commented Sep 30, 2020

So are you no longer seeing the DEVICE_SCALAR_TEST failures?

@rongou
Contributor

rongou commented Sep 30, 2020

I am, but as I said, that seems to be caused by creating and destroying streams in every test. Maybe the CUDA driver is reusing some data structure for streams that's causing the problem. In any case, it seems like a different issue from the cudaErrorIllegalAddress errors.

@harrism
Member

harrism commented Sep 30, 2020

I don't understand why you get these failures and nobody else in the team does (and CI doesn't). CI is currently only testing PTDS mode. (#555)

@JustPlay
Author

> So are you no longer seeing the DEVICE_SCALAR_TEST failures?

There are still errors in device_scalar.hpp (pool_memory_resource) when testing TPC-DS (I have not run DEVICE_SCALAR_TEST).

[screenshot of the error]

@harrism
Member

harrism commented Oct 1, 2020

@JustPlay you've made that abundantly clear. :) If @rongou is correct and the problem is not in RMM then we should probably open an issue on the spark-rapids repo.

@rongou
Contributor

rongou commented Oct 18, 2020

@JustPlay are you still seeing these errors? I ran all the TPC-DS queries in a loop several times and haven't seen any memory errors. I'm using the latest arena_memory_resource and running both legacy and per-thread default streams.

@JustPlay
Author

JustPlay commented Oct 22, 2020

> @JustPlay are you still seeing these errors? I ran all the TPC-DS queries in a loop several times and haven't seen any memory errors. I'm using the latest arena_memory_resource and running both legacy and per-thread default streams.

Sorry, I have not tested rapids recently; I will report if I find anything.

Thanks

@harrism
Member

harrism commented Oct 22, 2020

Closing for now. Please reopen if you run into the errors again.

harrism closed this as completed Oct 22, 2020