This repository has been archived by the owner on May 3, 2024. It is now read-only.

Multi-GPU training problem. #11

Closed
ginsongsong opened this issue Jul 20, 2017 · 17 comments

Comments

@ginsongsong

ginsongsong commented Jul 20, 2017

Issue summary

I successfully trained the bvlc_alexnet and bvlc_googlenet models on a single MI25 GPU.
When I changed the number of training GPUs from 1 to all, Caffe printed the messages below.
CPU memory:256GB swap:16GB
db:imagenet lmdb
batchsize:64
bvlc_alexnet:

I0719 10:51:50.941951 2540 solver.cpp:279] Solving AlexNet
I0719 10:51:50.941956 2540 solver.cpp:280] Learning Rate Policy: step
I0719 10:51:50.955250 2540 solver.cpp:337] Iteration 0, Testing net (#0)
I0719 10:54:02.507711 2540 solver.cpp:404] Test net output #0: accuracy = 0.00109375
I0719 10:54:02.508229 2540 solver.cpp:404] Test net output #1: loss = 6.91062 (* 1 = 6.91062 loss)
Memory access fault by GPU node-2 on address 0x422ea6b000. Reason: Page not present or supervisor privilege.
*** Aborted at 1500432842 (unix time) try "date -d @1500432842" if you are using GNU date ***
PC: @ 0x7f64489dc428 gsignal
*** SIGABRT (@0x9ec) received by PID 2540 (TID 0x7f642c526700) from PID 2540; stack trace: ***
@ 0x7f644ddd0390 (unknown)
@ 0x7f64489dc428 gsignal
@ 0x7f64489de02a abort
@ 0x7f644d9401c9 (unknown)
@ 0x7f644d9464e5 (unknown)
@ 0x7f644d91e9d7 (unknown)
@ 0x7f644ddc66ba start_thread
@ 0x7f6448aae3dd clone
@ 0x0 (unknown)

db:imagenet lmdb
batchsize:32
bvlc_googlenet:

I0719 00:12:28.380522 7405 solver.cpp:279] Solving GoogleNet
I0719 00:12:28.380544 7405 solver.cpp:280] Learning Rate Policy: step
Memory access fault by GPU node-2 on address 0x42309ba000. Reason: Page not present or supervisor privilege.
*** Aborted at 1500394348 (unix time) try "date -d @1500394348" if you are using GNU date ***
PC: @ 0x7f4078d7a428 gsignal
*** SIGABRT (@0x1CED) received by PID 7405 (TID 0x7f405c8c4700) from PID 7405; stack trace: ***
@ 0x7f407e16e390 (unknown)
@ 0x7f4078d7a428 gsignal
@ 0x7f4078d7c02a abort
@ 0x7f407dcde1c9 (unknown)
@ 0x7f407dce44e5 (unknown)
@ 0x7f407dcbc9d7 (unknown)
@ 0x7f407e1646ba start_thread
@ 0x7f4078e4c3dd clone
@ 0x0 (unknown)

Steps to reproduce

Using the latest ROCm from debian packages.

My caffe configuration:

USE_CUDNN := 0
USE_MIOPEN := 1
USE_LMDB := 1
BLAS := open
BLAS_INCLUDE := /opt/openBlas/include
BLAS_LIB := /opt/openBlas/lib

Your system configuration

Operating system: Ubuntu 16.04.2 LTS with 4.9.0-kfd-compute-rocm-rel-1.6-77
Compiler: GCC v5.4.0, HCC clang 5.0
CUDA version (if applicable): not applicable
CUDNN version (if applicable): not applicable
BLAS: OpenBlas
Python or MATLAB version (for pycaffe and matcaffe respectively): not applicable

@parallelo
Contributor

Hi @ginsongsong,

Thank you for reporting this. Our understanding was that multi-GPU was working okay. That being said, we will try to reproduce this specific issue.

How many GPUs did you test with?

Best,

Jeff

@ginsongsong
Author

ginsongsong commented Jul 21, 2017

Hi @parallelo
I used two MI25 GPUs to train the BVLC Caffe models.
The cifar10_quick example does run on two MI25 GPUs.

Maybe MIOpen can't yet reduce the memory footprint the way cuDNN does.
My ResNet-50 model runs with batchsize=6 on a P100-PCIE 16GB GPU, but on the MI25 16GB GPU I can't run it with any batch size in hipCaffe.
resnet.zip

I used rocm-smi to set the GPU clock and the GPU memory clock to their top levels.
The following was captured from rocm-smi during single-GPU AlexNet training:

GPU  DID   Temp   AvgPwr  SCLK     MCLK    Fan   Perf    OverDrive  ECC
1    6860  69.0c  177.0W  1500Mhz  945Mhz  0.0%  manual  0%         N/A
2    6860  52.0c  68.0W   1500Mhz  945Mhz  0.0%  manual  0%         N/A

But when training AlexNet on two MI25 GPUs, the clocks drop to the base clock level:

GPU  DID   Temp   AvgPwr  SCLK     MCLK    Fan   Perf    OverDrive  ECC
1    6860  36.0c  66.0W   852Mhz   167Mhz  0.0%  manual  0%         N/A
2    6860  38.0c  68.0W   825Mhz   167Mhz  0.0%  manual  0%         N/A
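
To compare the two captures above more easily, the clock columns can be pulled out of the table text. This is only a sketch: `summarize_clocks` is a hypothetical helper, and it assumes the column order shown above (GPU, DID, Temp, AvgPwr, SCLK, MCLK, ...), which may differ in other rocm-smi versions.

```shell
#!/usr/bin/env bash

# Read a rocm-smi style table on stdin and print the clocks for each GPU row.
# Assumption: columns are whitespace separated in the order shown in this
# thread, so SCLK is field 5 and MCLK is field 6. The header row is skipped
# because its first field is not a number.
summarize_clocks() {
    awk '$1 ~ /^[0-9]+$/ { print "GPU " $1 ": SCLK=" $5 " MCLK=" $6 }'
}

# Possible usage while a training job is running:
#   rocm-smi | summarize_clocks
```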

@parallelo
Contributor

Hi @ginsongsong,

Thanks for the extra details. Initially, let's try to focus on multi-GPU AlexNet, and then we can move from there.

Can you please provide these further details?

  • Can you confirm that all of your components are from ROCm 1.6? (both kfd and user-level components)
  • Are you building any libs from source? (e.g. MIOpen, rocBLAS, etc)
  • Please try dropping the AlexNet batch size down, and see if this changes the situation.
  • Please set both of these environment variables, re-run, and report the full hipCaffe run log:
    • export HIP_TRACE_API=1
    • export HCC_SERIALIZE_KERNEL=1
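
The two variables in the list above can also be applied to a single run instead of being exported into the whole shell session. A minimal sketch, assuming bash: `run_with_trace` and the log file name are hypothetical, and only the two variable names come from this thread.

```shell
#!/usr/bin/env bash

# Run a command with HIP_TRACE_API and HCC_SERIALIZE_KERNEL set for that
# command only, and keep a copy of stdout and stderr in a log file.
# Usage: run_with_trace <log_file> <command> [args...]
run_with_trace() {
    local log_file="$1"
    shift
    env HIP_TRACE_API=1 HCC_SERIALIZE_KERNEL=1 "$@" 2>&1 | tee "${log_file}"
}

# Possible usage from the hipCaffe checkout (solver path is an example):
#   run_with_trace ./hipCaffe_trace.log \
#       ./build/tools/caffe train --solver=./models/bvlc_alexnet/solver.prototxt --gpu 0,1
```

Scoping the variables to one command keeps later runs in the same terminal free of the tracing overhead.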

Also, note that we'll have another ROCm point release coming pretty soon to test.

Best,

Jeff

PS - I'll be out of town until Monday evening, but afterwards I'll be able to focus on this specific issue.

@ginsongsong
Author

ginsongsong commented Jul 24, 2017

Hi @parallelo ,

Thanks for your kind reply.

For kfd information :
root@AMD:/home/gin/hipCaffe# uname -a
Linux AMD 4.9.0-kfd-compute-rocm-rel-1.6-77 #1 SMP Wed Jun 28 07:30:27 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux

All of the ROCm libs were installed from the Debian packages.
cxlactivitylogger is already the newest version (5.1.6386).
hcc is already the newest version (1.0.17262).
miopen-hip is already the newest version (1.0.0).
miopengemm is already the newest version (1.0.1).
rocblas is already the newest version (0.5.2.0).
rocm is already the newest version (1.6.77).
rocm-libs is already the newest version (1.6.77).
rocm-opencl is already the newest version (1.2.0-1424893).
rocm-opencl-dev is already the newest version (1.2.0-1424893).
rocm-profiler is already the newest version (5.1.6386).
rocm-utils is already the newest version (1.0.0).

I saw a lot of HIP API error messages from the hipPointerGetAttributes function.
Maybe the P2P path can't get a device pointer over PCIe?
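
One way to put a number on "a lot" is to count the log lines that mention the function in the single-GPU and multi-GPU logs. This sketch makes no assumption about the layout of the HIP trace lines; it only matches the function name as a fixed string. `count_api_mentions` and the log file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash

# Print how many lines of a log file mention the given API name.
# -F treats the name as a literal string, -c prints the matching line count.
# Usage: count_api_mentions <api_name> <log_file>
count_api_mentions() {
    local api_name="$1" log_file="$2"
    grep -cF -- "${api_name}" "${log_file}"
}

# Possible usage with locally saved copies of the two logs:
#   count_api_mentions hipPointerGetAttributes ./single_gpu.log
#   count_api_mentions hipPointerGetAttributes ./multi_gpu.log
```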

For the full hipCaffe result log:
Single GPU:
http://122.147.187.124/resultMI25/Result-MI25x1_PCIe16_17_07_24_15_04_1000_ITER_BS64_ALEXNET.txt
Multi-GPU:
http://122.147.187.124/resultMI25/Result-MI25x2_PCIe16_17_07_24_15_11_1000_ITER_BS64_ALEXNET.txt

MI25 lspci log
lspci_MI25.txt

Thanks for your help.

@parallelo
Contributor

Hi again @ginsongsong,

I just tried ROCm 1.6.1 with the internal MIOpen repo built from source. Multi-GPU AlexNet and GoogleNet ran without error.

There's expected to be an update soon to the public MIOpen repo, and you'll need those changes.

To build MIOpen from source, please follow these instructions:

# Install rocm-cmake (needed by miopen)
cd ~
git clone https://github.com/RadeonOpenCompute/rocm-cmake.git
cd rocm-cmake
mkdir build && cd build && cmake .. && make -j$(nproc) && make -j$(nproc) package
sudo dpkg -i ./rocm-cmake*.deb

# Install MIOpen 
cd ~
git clone https://github.com/ROCmSoftwarePlatform/MIOpen.git
cd MIOpen
mkdir -p build && cd build && \
    CXX=$HCC_HOME/bin/hcc cmake -DHIP_OC_COMPILER=/opt/rocm/bin/clang-ocl -DCMAKE_PREFIX_PATH="$HCC_HOME;$HIP_PATH" -DOPENCL_INCLUDE_DIRS="$OPENCL_ROOT/include" ..
make -j$(nproc) && make package -j$(nproc)
sudo dpkg -i ./MIOpen*.deb

Then, set this environment variable (as a temp workaround):

export HCC_UNPINNED_COPY_MODE=2

For AlexNet, try something like this:

cd $CAFFE_ROOT

# Params to be set by the user
gpuids="0,1"
batchsize_per_gpu=128
iterations=500
model_path=./models/bvlc_alexnet

# Update the train_val prototxt's batch size
train_val_prototxt=${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
cp ${model_path}/train_val.prototxt ${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
sed -i "s|batch_size: 256|batch_size: ${batchsize_per_gpu}|g" ./${train_val_prototxt}

# Update the solver prototxt's max_iter and train_val prototxt path
solver_prototxt=${model_path}/solver_short.prototxt
cp ${model_path}/solver.prototxt ${solver_prototxt}
sed -i "s|max_iter: 10000000|max_iter: ${iterations}|g" ${solver_prototxt}
sed -i "s|${model_path}/train_val.prototxt|${train_val_prototxt}|g" ${solver_prototxt}

# Run on ImageNet data
ngpus=$(( 1 + $(grep -o "," <<< "${gpuids}" | wc -l) ))
train_log=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}.log
train_log_sec=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}_sec.log
./build/tools/caffe train --solver=${solver_prototxt} --gpu ${gpuids} 2>&1 | tee ${train_log}

For GoogleNet, try something like this:

cd $CAFFE_ROOT

# Params to be set by the user
gpuids="0,1"
batchsize_per_gpu=16
iterations=500
model_path=./models/bvlc_googlenet

# Update the train_val prototxt's batch size
train_val_prototxt=${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
cp ${model_path}/train_val.prototxt ${model_path}/train_val_batch${batchsize_per_gpu}.prototxt
sed -i "s|batch_size: 32|batch_size: ${batchsize_per_gpu}|g" ./${train_val_prototxt}

# Update the solver prototxt's max_iter and train_val prototxt path
solver_prototxt=${model_path}/solver_short.prototxt
cp ${model_path}/solver.prototxt ${solver_prototxt}
sed -i "s|max_iter: 10000000|max_iter: ${iterations}|g" ${solver_prototxt}
sed -i "s|${model_path}/train_val.prototxt|${train_val_prototxt}|g" ${solver_prototxt}

# Run on ImageNet data
ngpus=$(( 1 + $(grep -o "," <<< "${gpuids}" | wc -l) ))
train_log=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}.log
train_log_sec=./hipCaffe_nGPUs${ngpus}_batchsizePerGpu${batchsize_per_gpu}_sec.log
./build/tools/caffe train --solver=${solver_prototxt} --gpu ${gpuids} 2>&1 | tee ${train_log}
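
The AlexNet and GoogleNet recipes above differ only in the model path and the batch sizes, so the prototxt preparation can be factored into one function. The following is a sketch, not the method used above: `prepare_short_run` and its argument order are hypothetical, and it deliberately rewrites the whole `net:` and `max_iter:` lines instead of matching specific old values, on the assumption that the solver file contains one of each at the start of a line. It also assumes GNU sed for `-i`.

```shell
#!/usr/bin/env bash

# Create modified copies of a model's train_val and solver prototxt files
# and print the path of the new solver file.
# Usage: prepare_short_run <model_path> <old_batch> <new_batch> <iterations>
prepare_short_run() {
    local model_path="$1" old_batch="$2" new_batch="$3" iterations="$4"
    local train_val="${model_path}/train_val_batch${new_batch}.prototxt"
    local solver="${model_path}/solver_short.prototxt"

    # Copy the network definition and change the training batch size.
    cp "${model_path}/train_val.prototxt" "${train_val}"
    sed -i "s|batch_size: ${old_batch}|batch_size: ${new_batch}|g" "${train_val}"

    # Copy the solver, shorten the run, and point it at the new network file.
    cp "${model_path}/solver.prototxt" "${solver}"
    sed -i "s|^max_iter: .*|max_iter: ${iterations}|" "${solver}"
    sed -i "s|^net: .*|net: \"${train_val}\"|" "${solver}"

    echo "${solver}"
}

# Possible usage from $CAFFE_ROOT:
#   solver=$(prepare_short_run ./models/bvlc_googlenet 32 16 500)
#   ./build/tools/caffe train --solver="${solver}" --gpu 0,1 2>&1 | tee ./train.log
```

Replacing whole lines means the helper keeps working if a model's solver ships with a different `max_iter` or writes the `net:` path without a leading `./`.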

Hopefully this will help. Either way, let us know how it goes, and we'll get it figured out.

Best,

Jeff

@ginsongsong
Author

Thanks @parallelo, the problem is solved.
I removed hipCaffe and cloned it again.
After following your steps to rebuild hipCaffe, multi-GPU AlexNet and GoogleNet both work on my MI25s.
The attachments below are the results captured from hipCaffe.

bvlc_Alexnet iteration=1000
Result-MI25x2_PCIe16_17_07_31_10_50_10000ITER_BS64_ALEXNET.txt

bvlc_Googlenet iteration=1000
Result-MI25x2_PCIe16_17_07_31_11_13_10000ITER_BS32_GOOGLENET.txt

Thank you for your kind assistance.

@jamilbk

jamilbk commented Sep 26, 2017

FWIW I'm having this issue on a fresh install of Ubuntu 16.04.3 and the ROCm 1.6.3 stack which was performed yesterday. Following the HipCaffe Quickstart, single-GPU training worked flawlessly with all the examples given. But running with the --gpu "0,1" flag caused the same issue @ginsongsong had above. Running with any combination of gpus besides 0 caused the issue for me.

Following @parallelo's advice, I hit an issue while building MIOpen, seems I'm missing OpenSSL::Crypto. Still trying to figure out which package provides that on Ubuntu 16.04. Here's the crash log:

jamil@fridge:~/code/MIOpen/build (master%=) % make -j$(nproc) && make package -j$(nproc)
[  6%] Built target addkernels
[  8%] Linking CXX shared library ../lib/libMIOpen.so
ld: cannot find -lOpenSSL::Crypto
/opt/rocm/hcc-1.0/bin/hcc(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamE+0x2a)[0x1674f1a]
/opt/rocm/hcc-1.0/bin/hcc(_ZN4llvm3sys17RunSignalHandlersEv+0x3e)[0x1672fbe]
/opt/rocm/hcc-1.0/bin/hcc[0x167310c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fd93835b390]
[0x7fd93878ba10]
Stack dump:
0.	Program arguments: /opt/rocm/hcc-1.0/bin/hcc -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libMIOpen.so.1 -o ../lib/libMIOpen.so.1 CMakeFiles/MIOpen.dir/convolution.cpp.o CMakeFiles/MIOpen.dir/convolution_api.cpp.o CMakeFiles/MIOpen.dir/convolution_fft.cpp.o CMakeFiles/MIOpen.dir/errors.cpp.o CMakeFiles/MIOpen.dir/load_file.cpp.o CMakeFiles/MIOpen.dir/pooling_api.cpp.o CMakeFiles/MIOpen.dir/kernel_warnings.cpp.o CMakeFiles/MIOpen.dir/logger.cpp.o CMakeFiles/MIOpen.dir/lrn_api.cpp.o CMakeFiles/MIOpen.dir/activ_api.cpp.o CMakeFiles/MIOpen.dir/handle_api.cpp.o CMakeFiles/MIOpen.dir/softmax_api.cpp.o CMakeFiles/MIOpen.dir/batch_norm.cpp.o CMakeFiles/MIOpen.dir/batch_norm_api.cpp.o CMakeFiles/MIOpen.dir/tensor.cpp.o CMakeFiles/MIOpen.dir/tensor_api.cpp.o CMakeFiles/MIOpen.dir/tmp_dir.cpp.o CMakeFiles/MIOpen.dir/binary_cache.cpp.o CMakeFiles/MIOpen.dir/md5.cpp.o CMakeFiles/MIOpen.dir/activ.cpp.o CMakeFiles/MIOpen.dir/kernel_cache.cpp.o CMakeFiles/MIOpen.dir/lrn.cpp.o CMakeFiles/MIOpen.dir/mlo_dir_conv.cpp.o CMakeFiles/MIOpen.dir/ocl/activ_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/batchnormocl.cpp.o CMakeFiles/MIOpen.dir/ocl/convolutionocl.cpp.o CMakeFiles/MIOpen.dir/ocl/convolutionocl_fft.cpp.o CMakeFiles/MIOpen.dir/ocl/lrn_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/mloNeuron.cpp.o CMakeFiles/MIOpen.dir/ocl/mloNorm.cpp.o CMakeFiles/MIOpen.dir/ocl/mloPooling.cpp.o CMakeFiles/MIOpen.dir/ocl/pooling_ocl.cpp.o CMakeFiles/MIOpen.dir/ocl/tensorocl.cpp.o CMakeFiles/MIOpen.dir/ocl/softmaxocl.cpp.o CMakeFiles/MIOpen.dir/ocl/utilocl.cpp.o CMakeFiles/MIOpen.dir/ocl/gcn_asm_utils.cpp.o CMakeFiles/MIOpen.dir/pooling.cpp.o CMakeFiles/MIOpen.dir/__/db.cpp.o CMakeFiles/MIOpen.dir/__/kernel.cpp.o CMakeFiles/MIOpen.dir/gemm.cpp.o CMakeFiles/MIOpen.dir/gemm_geometry.cpp.o CMakeFiles/MIOpen.dir/hip/hiperrors.cpp.o CMakeFiles/MIOpen.dir/hip/handlehip.cpp.o CMakeFiles/MIOpen.dir/hipoc/hipoc_kernel.cpp.o CMakeFiles/MIOpen.dir/hipoc/hipoc_program.cpp.o -lstdc++ -amdgpu-target=gfx803 -amdgpu-target=gfx900 
-Wno-unused-command-line-argument /opt/rocm/hip/lib/libhip_hcc.so /opt/rocm/hcc-1.0/lib/libhc_am.so /opt/rocm/miopengemm/lib/libmiopengemm.so -lOpenSSL::Crypto -lboost_filesystem -lboost_system -hc -L /opt/rocm/hcc-1.0/lib -Wl,-rpath /opt/rocm/hcc-1.0/lib -Wl,--whole-archive /opt/rocm/hcc-1.0/lib/libmcwamp.a -lunwind -Wl,--no-whole-archive -ldl -lm /opt/rocm/lib/libhsa-runtime64.so -lpthread /opt/rocm/opencl/lib/x86_64/libOpenCL.so -Wl,-rpath,/opt/rocm/hip/lib:/opt/rocm/hcc-1.0/lib:/opt/rocm/miopengemm/lib:/opt/rocm/lib:/opt/rocm/opencl/lib/x86_64: 
Error running link command: Segmentation fault
src/CMakeFiles/MIOpen.dir/build.make:1295: recipe for target 'lib/libMIOpen.so.1' failed
make[2]: *** [lib/libMIOpen.so.1] Error 1
CMakeFiles/Makefile2:424: recipe for target 'src/CMakeFiles/MIOpen.dir/all' failed
make[1]: *** [src/CMakeFiles/MIOpen.dir/all] Error 2
Makefile:149: recipe for target 'all' failed
make: *** [all] Error 2

System details:

  • Threadripper 1950x
  • Gigabyte Gaming 7 x399
  • 4x Vega FE

Figured I would post this here as another data point. I'm hoping these kinks get worked out soon... itching to use fp16 packed math for training my models!

@dagamayank
Contributor

I hit an issue while building MIOpen, seems I'm missing OpenSSL::Crypto. Still trying to figure out which package provides that on Ubuntu 16.04.

@jamilbk Can you try installing libssl-dev and check again? sudo apt-get install libssl-dev.

@jamilbk

jamilbk commented Sep 26, 2017

Yeah I had installed libssl-dev because of a previous OpenSSL error. Then I ran into this issue which I fixed by symlinking opensslconf.h into /usr/lib/openssl and then I hit this error.

Perhaps I need to symlink the OpenSSL::Crypto library file as well? This all seems to be caused by libssl-dev being installed into the 64-bit specific dirs and MIOpen not looking there?

Apologies for my lacking Linux linker skills!

@dagamayank
Contributor

Please paste the output of dpkg -l | grep libssl

@jamilbk

jamilbk commented Sep 26, 2017

jamil@fridge:~/tmp/pmbw-0.6.2 % dpkg -l | grep libssl
ii  libssl-dev:amd64                                   1.0.2g-1ubuntu4.8                                amd64        Secure Sockets Layer toolkit - development files
ii  libssl-doc                                         1.0.2g-1ubuntu4.8                                all          Secure Sockets Layer toolkit - development documentation
ii  libssl1.0.0:amd64                                  1.0.2g-1ubuntu4.8                                amd64        Secure Sockets Layer toolkit - shared libraries
ii  libsslcommon2:amd64                                0.16-9ubuntu2                                    amd64        enterprise messaging system - common SSL libraries
ii  libsslcommon2-dev:amd64                            0.16-9ubuntu2                                    amd64        enterprise messaging system - common SSL development files

I had installed libsslcommon2 on the off chance it was related somehow.

@dagamayank
Contributor

@pfultz2 do you have any insights?

@pfultz2

pfultz2 commented Sep 29, 2017

@jamilbk Did you try removing your build directory and starting again?

@pfultz2

pfultz2 commented Sep 29, 2017

Perhaps I need to symlink the OpenSSL::Crypto library file as well?

No, this is an imported target in cmake that is defined by find_package(OpenSSL). The fact that it shows up as -lOpenSSL::Crypto means it was not defined because it was not found the first time running cmake. Clearing out the build directory, and re-running cmake should fix it.
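
The symptom described here can be checked mechanically: a `-l` flag containing `::` in a saved build log is a strong hint that an imported target name was passed through as a plain library name. A minimal sketch; `find_unresolved_targets` and the log file name are hypothetical.

```shell
#!/usr/bin/env bash

# Read build output on stdin and print each distinct "-lName::Target" flag.
# Prints nothing if no such flag is present.
find_unresolved_targets() {
    grep -oE -- '-l[A-Za-z0-9_]+::[A-Za-z0-9_]+' | sort -u
}

# Possible usage:
#   make 2>&1 | tee ./make.log
#   find_unresolved_targets < ./make.log
# If anything is printed, remove the build directory and re-run cmake:
#   cd .. && rm -rf build && mkdir -p build && cd build
```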

@jamilbk

jamilbk commented Oct 1, 2017

@pfultz2 Apologies for the delay. I've removed the build dir and retried parallelo's commands for building MIOpen with the same result -- 'openssl/opensslconf.h' file not found -- though it exists at /usr/include/x86_64-linux-gnu/openssl/opensslconf.h. I have rocm, libssl-dev, and all the packages installed listed on the hipCaffe quickstart. Here's the full log:

jamil@fridge:~/dl/MIOpen (master=) % gst
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
jamil@fridge:~/dl/MIOpen (master=) % mkdir -p build && cd build
jamil@fridge:~/dl/MIOpen/build (master=) % CXX=`which hcc` cmake -DHIP_OC_COMPILER=/opt/rocm/bin/clang-ocl -DCMAKE_PREFIX_PATH="/opt/rocm/hcc;/opt/rocm/hip" -DOPENCL_INCLUDE_DIRS="/opt/rocm/opencl/include" ..
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is Clang 5.0.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- hip compiler: /opt/rocm/bin/clang-ocl
-- HIP backend selected.
-- AMDGCN assembler: /opt/rocm/opencl/bin/x86_64/clang
-- Build with miopengemm
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libssl.so;/usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.0.2g")
-- Boost version: 1.58.0
-- Found the following Boost libraries:
--   filesystem
--   system
-- Clang tidy not found
-- Clang tidy checks: *,-cert-err60-cpp,-cert-msc30-c,-cert-msc50-cpp,-clang-analyzer-alpha.core.CastToStruct,-clang-analyzer-optin.performance.Padding,-clang-diagnostic-deprecated-declarations,-clang-diagnostic-extern-c-compat,-cppcoreguidelines-pro-bounds-array-to-pointer-decay,-cppcoreguidelines-pro-bounds-constant-array-index,-cppcoreguidelines-pro-bounds-pointer-arithmetic,-cppcoreguidelines-pro-type-member-init,-cppcoreguidelines-pro-type-reinterpret-cast,-cppcoreguidelines-pro-type-union-access,-cppcoreguidelines-pro-type-vararg,-cppcoreguidelines-special-member-functions,-google-explicit-constructor,-google-readability-braces-around-statements,-google-readability-todo,-google-runtime-int,-google-runtime-references,-hicpp-explicit-conversions,-hicpp-special-member-functions,-hicpp-use-equals-default,-hicpp-use-override,-llvm-header-guard,-llvm-include-order,-misc-macro-parentheses,-misc-misplaced-const,-misc-misplaced-widening-cast,-modernize-loop-convert,-modernize-pass-by-value,-modernize-use-default-member-init,-modernize-use-emplace,-modernize-use-equals-default,-modernize-use-transparent-functors,-performance-unnecessary-value-param,-readability-braces-around-statements,-readability-else-after-return,-readability-implicit-bool-cast,-readability-misleading-indentation,-readability-named-parameter,-modernize-use-override,-readability-non-const-parameter
-- Could NOT find LATEX (missing:  LATEX_COMPILER)
Latex builder not found. Latex builder is required only for building the PDF documentation for MIOpen and is not necessary for building the library, or any other components. To build PDF documentation run make in /home/jamil/dl/MIOpen/doc/pdf, once a latex builder is installed.
-- MIOpen_VERSION= 1.1.1
-- CMAKE_BUILD_TYPE= Release
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Configuring done
WARNING: Target "MIOpenDriver" has EXCLUDE_FROM_ALL set and will not be built by default but an install rule has been provided for it.  CMake does not define behavior for this case.
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    OPENCL_INCLUDE_DIRS


-- Build files have been written to: /home/jamil/dl/MIOpen/build
jamil@fridge:~/dl/MIOpen/build (master%=) % make -j$(nproc) && make package -j$(nproc)
Scanning dependencies of target addkernels
[  2%] Building CXX object addkernels/CMakeFiles/addkernels.dir/include_inliner.cpp.o
[  4%] Building CXX object addkernels/CMakeFiles/addkernels.dir/addkernels.cpp.o
[  6%] Linking CXX executable ../bin/addkernels
[  6%] Built target addkernels
[  8%] Inlining MIOpen kernels
Scanning dependencies of target MIOpen
[ 10%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution_fft.cpp.o
[ 12%] Building CXX object src/CMakeFiles/MIOpen.dir/errors.cpp.o
[ 14%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution_api.cpp.o
[ 18%] Building CXX object src/CMakeFiles/MIOpen.dir/load_file.cpp.o
[ 18%] Building CXX object src/CMakeFiles/MIOpen.dir/logger.cpp.o
[ 20%] Building CXX object src/CMakeFiles/MIOpen.dir/activ_api.cpp.o
[ 22%] Building CXX object src/CMakeFiles/MIOpen.dir/lrn_api.cpp.o
[ 26%] Building CXX object src/CMakeFiles/MIOpen.dir/convolution.cpp.o
[ 26%] Building CXX object src/CMakeFiles/MIOpen.dir/pooling_api.cpp.o
[ 28%] Building CXX object src/CMakeFiles/MIOpen.dir/batch_norm.cpp.o
[ 30%] Building CXX object src/CMakeFiles/MIOpen.dir/softmax_api.cpp.o
[ 34%] Building CXX object src/CMakeFiles/MIOpen.dir/batch_norm_api.cpp.o
[ 34%] Building CXX object src/CMakeFiles/MIOpen.dir/handle_api.cpp.o
[ 38%] Building CXX object src/CMakeFiles/MIOpen.dir/kernel_warnings.cpp.o
[ 38%] Building CXX object src/CMakeFiles/MIOpen.dir/tmp_dir.cpp.o
[ 40%] Building CXX object src/CMakeFiles/MIOpen.dir/tensor.cpp.o
[ 42%] Building CXX object src/CMakeFiles/MIOpen.dir/mlo_dir_conv.cpp.o
[ 44%] Building CXX object src/CMakeFiles/MIOpen.dir/tensor_api.cpp.o
[ 46%] Building CXX object src/CMakeFiles/MIOpen.dir/binary_cache.cpp.o
[ 48%] Building CXX object src/CMakeFiles/MIOpen.dir/kernel_cache.cpp.o
[ 52%] Building CXX object src/CMakeFiles/MIOpen.dir/md5.cpp.o
[ 52%] Building CXX object src/CMakeFiles/MIOpen.dir/activ.cpp.o
[ 54%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/activ_ocl.cpp.o
[ 56%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/batchnormocl.cpp.o
[ 58%] Building CXX object src/CMakeFiles/MIOpen.dir/lrn.cpp.o
[ 60%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloNeuron.cpp.o
[ 62%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloNorm.cpp.o
[ 64%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/pooling_ocl.cpp.o
[ 66%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/mloPooling.cpp.o
[ 68%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/lrn_ocl.cpp.o
[ 70%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/convolutionocl.cpp.o
[ 72%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/convolutionocl_fft.cpp.o
In file included from /home/jamil/dl/MIOpen/src/md5.cpp:2:
In file included from /usr/include/openssl/md5.h:62:
/usr/include/openssl/e_os2.h:56:10: fatal error: 'openssl/opensslconf.h' file not found
#include <openssl/opensslconf.h>
         ^~~~~~~~~~~~~~~~~~~~~~~
[ 74%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/tensorocl.cpp.o
[ 76%] Building CXX object src/CMakeFiles/MIOpen.dir/ocl/softmaxocl.cpp.o
1 error generated.
src/CMakeFiles/MIOpen.dir/build.make:543: recipe for target 'src/CMakeFiles/MIOpen.dir/md5.cpp.o' failed
make[2]: *** [src/CMakeFiles/MIOpen.dir/md5.cpp.o] Error 1
make[2]: *** Deleting file 'src/CMakeFiles/MIOpen.dir/md5.cpp.o'
make[2]: *** Waiting for unfinished jobs....

@dagamayank
Contributor

@jamilbk It seems you installed libssl using apt-get. Please follow the instructions in the README on how to fix the above error. Review this particular section:
"An example cmake step can be:

OpenSSL installed using apt-get on Ubuntu v16? Yes.

CXX=/opt/rocm/hcc/bin/hcc cmake -DMIOPEN_BACKEND=HIP -DCMAKE_PREFIX_PATH="/opt/rocm/hcc;/opt/rocm/hip" -DCMAKE_CXX_FLAGS="-isystem /usr/include/x86_64-linux-gnu/" ..
"
You need to add -DCMAKE_CXX_FLAGS="-isystem /usr/include/x86_64-linux-gnu/" to your cmake step.
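
To confirm which directory needs to go after `-isystem` on a given machine, the header can be located first. This is a sketch under assumptions: `openssl_include_dir` is a hypothetical helper, it relies on GNU find (`-quit`), and it returns the first match only.

```shell
#!/usr/bin/env bash

# Print the directory that makes <openssl/opensslconf.h> resolvable, searching
# under the given root (default: /usr/include). Returns 1 if not found.
openssl_include_dir() {
    local root="${1:-/usr/include}"
    local header
    header="$(find "${root}" -path '*/openssl/opensslconf.h' -print -quit 2>/dev/null)"
    [ -n "${header}" ] || return 1
    # Drop the trailing "/openssl/opensslconf.h" to get the include directory.
    echo "${header%/openssl/opensslconf.h}"
}

# Possible usage:
#   openssl_include_dir
# On the Ubuntu 16.04 system in this thread the expected answer would be
# /usr/include/x86_64-linux-gnu, the path used in the flag above.
```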

@jamilbk

jamilbk commented Oct 3, 2017

Thanks @dagamayank, that fixed the compile. Now I'm back to the original problem noted in this thread ("Page not present or supervisor privilege"), but I suspect it's because I need to set the batch size as @parallelo pointed out and download some ImageNet data to test with. I'll keep you updated if that doesn't fix the issue.
