I have been trying to install apex for two weeks in order to train a model. On my latest attempt, I ran into this error when installing.
Apex was installed with:
PyTorch version: 1.4.0+cu100
CUDA is 10.0 in the virtual env
Command line used:
python setup.py install --cuda_ext --cpp_ext
Stacktrace:
copying apex/pyprof/prof/reduction.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
copying apex/pyprof/prof/softmax.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
copying apex/pyprof/prof/usage.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
copying apex/pyprof/prof/utility.py -> build/lib.linux-x86_64-3.6/apex/pyprof/prof
running build_ext
building 'apex_C' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/csrc
gcc -pthread -B /home/getalp/kelodjoe/anaconda3/envs/env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/TH -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/THC -I/home/getalp/kelodjoe/anaconda3/envs/env/include/python3.6m -c csrc/flatten_unflatten.cpp -o build/temp.linux-x86_64-3.6/csrc/flatten_unflatten.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=apex_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
g++ -pthread -shared -B /home/getalp/kelodjoe/anaconda3/envs/env/compiler_compat -L/home/getalp/kelodjoe/anaconda3/envs/env/lib -Wl,-rpath=/home/getalp/kelodjoe/anaconda3/envs/env/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.6/csrc/flatten_unflatten.o -o build/lib.linux-x86_64-3.6/apex_C.cpython-36m-x86_64-linux-gnu.so
building 'amp_C' extension
gcc -pthread -B /home/getalp/kelodjoe/anaconda3/envs/env/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/TH -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/getalp/kelodjoe/anaconda3/envs/env/include/python3.6m -c csrc/amp_C_frontend.cpp -o build/temp.linux-x86_64-3.6/csrc/amp_C_frontend.o -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
/usr/local/cuda-10.0/bin/nvcc -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/TH -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/getalp/kelodjoe/anaconda3/envs/env/include/python3.6m -c csrc/multi_tensor_sgd_kernel.cu -o build/temp.linux-x86_64-3.6/csrc/multi_tensor_sgd_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_75,code=sm_75 -std=c++11
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/usr/local/cuda-10.0/bin/nvcc -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/TH -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/getalp/kelodjoe/anaconda3/envs/env/include/python3.6m -c csrc/multi_tensor_scale_kernel.cu -o build/temp.linux-x86_64-3.6/csrc/multi_tensor_scale_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_75,code=sm_75 -std=c++11
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
/usr/local/cuda-10.0/bin/nvcc -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/TH -I/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/THC -I/usr/local/cuda-10.0/include -I/home/getalp/kelodjoe/anaconda3/envs/env/include/python3.6m -c csrc/multi_tensor_axpby_kernel.cu -o build/temp.linux-x86_64-3.6/csrc/multi_tensor_axpby_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=amp_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_75,code=sm_75 -std=c++11
/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
....
ld/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing apex-0.1-py3.6-linux-x86_64.egg
creating /data1/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg
Extracting apex-0.1-py3.6-linux-x86_64.egg to /data1/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages
Adding apex 0.1 to easy-install.pth file
Installed /data1/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg
Processing dependencies for apex==0.1
Finished processing dependencies for apex==0.1
(env) kelodjoe@decore2:~/apex$
In the end, as you can see above, it says apex was installed, but the warning about -Wstrict-prototypes bothers me, especially because when I run my set of commands to train my model, I get an error like this:
RuntimeError: CUDA error: invalid device function (multi_tensor_apply at csrc/multi_tensor_apply.cuh:108)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f7ea6c64193 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0x183f (0x7f7ea0dd379f in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, s
Traceback (most recent call last):
File "train.py", line 391, in <module>
main(params)
File "train.py", line 309, in main
trainer.mlm_step(lang1, lang2, params.lambda_mlm)
File "/data1/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 781, in mlm_step
self.optimize(loss)
File "/data1/home/getalp/kelodjoe/eXP/Flaubert/xlm/trainer.py", line 250, in optimize
scaled_loss.backward()
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 241, in post_backward_no_master_weights
post_backward_models_are_masters(scaler, params, stashed_grads)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/_process_optimizer.py", line 120, in post_backward_models_are_masters
scale_override=grads_have_scale/out_scale)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 117, in unscale
1./scale)
File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
*args)
RuntimeError: CUDA error: invalid device function (multi_tensor_apply at csrc/multi_tensor_apply.cuh:108)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f7ea6c64193 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0x183f (0x7f7ea0dd379f in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x1679 (0x7f7ea0dcff39 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x200cc (0x7f7ea0dc30cc in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1a634 (0x7f7ea0dbd634 in /home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
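One thing that may be worth checking for an "invalid device function" error is whether the extension was compiled for the architecture of the GPU it runs on. The sketch below pulls the -gencode targets out of an nvcc command line like the ones in the build log above and compares them with a device's compute capability. It assumes the usual CUDA rule that a binary built for sm_XY runs only on devices with the same major version X and a minor version of at least Y. The function names and the example capabilities are hypothetical and only for illustration; on the real machine the capability could be read with torch.cuda.get_device_capability(0).

```python
import re

# -gencode flags copied from the nvcc lines in the build log above.
NVCC_FLAGS = (
    "-gencode=arch=compute_70,code=sm_70 "
    "-gencode=arch=compute_61,code=sm_61 "
    "-gencode=arch=compute_75,code=sm_75"
)


def compiled_archs(nvcc_cmd):
    """Return the set of (major, minor) targets named by code=sm_XY flags."""
    archs = set()
    for sm in re.findall(r"code=sm_(\d+)", nvcc_cmd):
        # "70" -> (7, 0): the last digit is the minor version.
        archs.add((int(sm[:-1]), int(sm[-1])))
    return archs


def has_binary_for(capability, nvcc_cmd):
    """True if some compiled target can run on a device with this capability.

    Assumption: same major version, compiled minor <= device minor.
    """
    major, minor = capability
    return any(
        built_major == major and built_minor <= minor
        for built_major, built_minor in compiled_archs(nvcc_cmd)
    )


if __name__ == "__main__":
    print("compiled for:", sorted(compiled_archs(NVCC_FLAGS)))
    # (3, 7) and (7, 5) are made-up example devices, not the GPU in this report.
    for cap in [(3, 7), (7, 5)]:
        print(cap, "covered" if has_binary_for(cap, NVCC_FLAGS) else "NOT covered")
```

If the GPU's capability is not covered, rebuilding on the machine that has the target GPU visible would be the first thing to try.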
The GPU I use reports CUDA 10.1. Do you think the error comes from that? If so, how can I change the CUDA version in my virtual env? I tried exporting CUDA 10.1 before installing apex, but the virtual env always reports CUDA 10.0.
I used these commands before installing apex:
export CUDA_VISIBLE_DEVICES=0
export CUDA_ROOT=/usr/local/cuda-10.1
export PATH=${CUDA_ROOT}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_ROOT}/lib64:${LD_LIBRARY_PATH}
export CUDA_HOME=${CUDAROOT}
and it still reports 10.0, whereas the GPU I use to train my model after installing apex runs with CUDA 10.1.
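To see which toolkit directory the exports actually select, a small check like the one below may help. It is only a sketch: the function name guess_cuda_home is hypothetical, and the lookup order (CUDA_HOME, then CUDA_PATH, then nvcc on PATH, then /usr/local/cuda) is an assumption about how extension builds commonly locate CUDA, not a copy of the apex or PyTorch code.

```python
import os
import shutil


def guess_cuda_home(env=None, which=shutil.which):
    """Approximate which CUDA toolkit directory a build might pick up."""
    env = os.environ if env is None else env
    # An unset or empty variable is skipped. For example,
    # CUDA_HOME=${CUDAROOT} expands to an empty string if CUDAROOT is undefined.
    home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if home:
        return home
    nvcc = which("nvcc", path=env.get("PATH"))
    if nvcc:
        # .../cuda-10.1/bin/nvcc -> .../cuda-10.1
        return os.path.dirname(os.path.dirname(nvcc))
    # Assumed last-resort default on Linux.
    return "/usr/local/cuda"


if __name__ == "__main__":
    for name in ("CUDA_ROOT", "CUDA_HOME", "CUDA_PATH"):
        print(name, "=", repr(os.environ.get(name)))
    print("nvcc on PATH:", shutil.which("nvcc"))
    print("guessed toolkit dir:", guess_cuda_home())
```

Running it in the same shell as the exports, right before python setup.py install, shows whether CUDA_HOME is empty and which nvcc comes first on PATH.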