Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cuda Error: an illegal memory access was encountered #182

Closed
jamestang0219 opened this issue Oct 10, 2016 · 13 comments
Closed

Cuda Error: an illegal memory access was encountered #182

jamestang0219 opened this issue Oct 10, 2016 · 13 comments
Assignees
Labels

Comments

@jamestang0219
Copy link

jamestang0219 commented Oct 10, 2016

When I'm training the LSTM model, the error occur. Here is the partial training log:

`I1010 05:43:00.041766 119630 TrainerInternal.cpp:204] ___fc_layer_0__.w0   avg_abs_val=0.175171    max_val=1.02055     avg_abs_grad=0.0117175   max_grad=1.19871
I1010 05:43:00.042105 119630 TrainerInternal.cpp:204] ___fc_layer_0__.wbias avg_abs_val=0.127698    max_val=0.430612    avg_abs_grad=0.0627174   max_grad=1.82515
I1010 05:43:00.042553 119630 TrainerInternal.cpp:204] ___lstmemory_0__.w0  avg_abs_val=0.133314    max_val=0.794175    avg_abs_grad=0.0206954   max_grad=2.81496
I1010 05:43:00.042837 119630 TrainerInternal.cpp:204] ___lstmemory_0__.wbias avg_abs_val=0.115955    max_val=0.508302    avg_abs_grad=0.101172    max_grad=14.5017
I1010 05:43:00.043148 119630 TrainerInternal.cpp:204] ___fc_layer_1__.w0   avg_abs_val=0.274876    max_val=0.900992    avg_abs_grad=0.382739    max_grad=2.3795
I1010 05:43:00.043421 119630 TrainerInternal.cpp:204] ___fc_layer_1__.wbias avg_abs_val=0.217184    max_val=0.217184    avg_abs_grad=0.373559    max_grad=0.37356

I1010 05:43:00.043450 119630 TrainerInternal.cpp:162]  Batch=6400 samples=819200 AvgCost=0.294956 CurrentCost=0.31619 Eval: classification_error_evaluator=0.128513  CurrentEval: classification_error_evaluator=0.1375
...................
I1010 05:43:06.412021 119630 TrainerInternal.cpp:162]  Batch=6420 samples=821760 AvgCost=0.294991 CurrentCost=0.30623 Eval: classification_error_evaluator=0.128552  CurrentEval: classification_error_evaluator=0.141016
...................
I1010 05:43:13.165990 119630 TrainerInternal.cpp:162]  Batch=6440 samples=824320 AvgCost=0.29512 CurrentCost=0.336758 Eval: classification_error_evaluator=0.128653  CurrentEval: classification_error_evaluator=0.160938
...................
I1010 05:43:19.573108 119630 TrainerInternal.cpp:162]  Batch=6460 samples=826880 AvgCost=0.295171 CurrentCost=0.311461 Eval: classification_error_evaluator=0.128692  CurrentEval: classification_error_evaluator=0.141406
...................F1010 05:43:26.722699 119642 hl_cuda_device.cc:646] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
    @     0x7f1c2eccadaa  (unknown)
    @     0x7f1c2eccace4  (unknown)
    @     0x7f1c2ecca6e6  (unknown)
    @     0x7f1c2eccd687  (unknown)
    @           0x8ae00b  hl_stream_synchronize()
    @           0x8ca3b0  hl_max_sequence_backward()
    @           0x6c860e  paddle::GpuMatrix::maxSequenceBackward()
    @           0x5f71db  paddle::MaxLayer::backward()
    @           0x67228e  paddle::NeuralNetwork::backward()
    @           0x65324c  paddle::TrainerThread::backward()
    @           0x65337d  paddle::TrainerThread::computeThread()
    @     0x7f1c2e847a60  (unknown)
    @     0x7f1c2f883184  start_thread
    @     0x7f1c2dfaf37d  (unknown)
    @              (nil)  (unknown)
/usr/local/bin/paddle: line 46: 119630 Aborted                 (core dumped) ${DEBUGGER} $MYDIR/../opt/paddle/bin/paddle_trainer ${@:2}`

and I checked nvidia source manager before the error occured:

Mon Oct 10 05:12:26 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.93     Driver Version: 352.93         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   52C    P0    66W / 125W |   1497MiB /  4095MiB |     41%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           Off  | 0000:00:04.0     Off |                  N/A |
| N/A   56C    P0    55W / 125W |   1257MiB /  4095MiB |     47%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           Off  | 0000:00:05.0     Off |                  N/A |
| N/A   54C    P0    61W / 125W |   1079MiB /  4095MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           Off  | 0000:00:06.0     Off |                  N/A |
| N/A   58C    P0    60W / 125W |   1049MiB /  4095MiB |     70%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1484MiB |
|    1    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1243MiB |
|    2    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1066MiB |
|    3    119630    C   ...ocal/bin/../opt/paddle/bin/paddle_trainer  1036MiB |
+-----------------------------------------------------------------------------+`

The model can be initialized successfully, but when Paddle trained samples, the error will occur.

I've tried several times, each time get the same error.

@backyes
Copy link
Contributor

backyes commented Oct 10, 2016

Please show us more details, such as command line, paddle version, model config file.

@jamestang0219
Copy link
Author

@backyes
paddle version:
=> paddle version PaddlePaddle 0.8.0b, compiled with with_avx: ON with_gpu: ON with_double: OFF with_python: ON with_rdma: OFF with_glog: ON with_gflags: ON with_metric_learning: with_timer: OFF with_predict_sdk:

cmd line:
20 model=lstm.py 21 paddle train \ 22 --config=$model \ 23 --save_dir=./2classoutput \ 24 --trainer_count=4 \ 25 --log_period=100 \ 26 --num_passes=16 \ 27 --use_gpu=true \ 28 --show_parameter_stats_period=1000 \ 29 --test_all_data_in_one_period=1 \ 30 2>&1 | tee 'lstm_train_2class.log'

@hedaoyuan
Copy link
Contributor

hi @jamestang0219
Try the cpu model with --use_gpu=false. Look at whether there will be the same error.

@jamestang0219
Copy link
Author

@hedaoyuan without gpu, it works well but too slow. i wanna use 4 gpus to boost the train.

@hedaoyuan
Copy link
Contributor

There may be a bug with this hl_max_sequence_backward api, but not sure what kind of input data will cause the problem of illegal memory access. @jamestang0219 can you try --trainer_count=1 --use_gpu=true, whether it is the same problem.

@jamestang0219
Copy link
Author

@hedaoyuan same error occurred. but I change the batch size, the problem never occur. I don't think the batch size will cause cuda's error, maybe there are some bugs in hl_max_sequence_backward api.

@hedaoyuan
Copy link
Contributor

@jamestang0219
This commit adds some checking code to determine if there is problem with the input value.
Can you try with this commit?

@jamestang0219
Copy link
Author

@hedaoyuan
hello, I'm wondering if there are some problems with the input value, the error must appear every time that I use the same data to train, but it didn't appear for each time. Only for a large data set, for example more that 500,000 sentences data, this error may occur. If I split the large data set into 2 or more pieces, and only train one split, this error never occur.

@hedaoyuan
Copy link
Contributor

There may be problem with the input value, but not every time a memory error occurs.

@jamestang0219
Copy link
Author

@hedaoyuan
the input value will be mapped to the embedding vector, you mean in this procedure the error occurred?

@hedaoyuan
Copy link
Contributor

@jamestang0219 May be.
hl_max_sequence_backward interface is relatively simple, only when there is a problem in the input value, will encounter memory cross-border issues.
The root cause of this problem, may be input is a sequence with length equal zero.
I do not know how to check your data, however, if the program runs into these two errors(assert(tmpId >= 0); assert(tmpId < inputHeight);), then you can narrow the scope of the problem.

@jamestang0219
Copy link
Author

@hedaoyuan
Thank you, I will try to train my data set again using ur branch, maybe tomorrow. If I get the result, I will reply you in this issue topic.

@reyoung
Copy link
Collaborator

reyoung commented Nov 21, 2016

No response for too long, reopen if there is still some problems here.

@reyoung reyoung closed this as completed Nov 21, 2016
zhhsplendid pushed a commit to zhhsplendid/Paddle that referenced this issue Sep 25, 2019
gglin001 pushed a commit to graphcore/Paddle-fork that referenced this issue Dec 8, 2021
zhoutianzi666 pushed a commit to zhoutianzi666/Paddle that referenced this issue May 23, 2022
DesmonDay added a commit to DesmonDay/Paddle that referenced this issue Dec 8, 2022
* add get_epoch_finish interface

* add return

* delete return
zmxdream pushed a commit to zmxdream/Paddle that referenced this issue Dec 24, 2022
* add get_epoch_finish interface

* add return

* delete return
zmxdream added a commit that referenced this issue Jan 30, 2023
* add set slot_num for psgpuwraper (#177)

* add set slot_num_for_pull_feature for psgpuwarper

* Add get_epoch_finish python interface (#182)

* add get_epoch_finish interface

* add return

* delete return

* add unzip op (#183)

* fix miss key for error dataset (#186)

* fix miss key for error dataset

* fix miss key for error dataset

Co-authored-by: yangjunchao <[email protected]>

* add excluded_train_pair and infer_node_type (#187)

* support return of degree (#188)

* fix task stuck in barrier (#189)

Co-authored-by: yangjunchao <[email protected]>

* check node/feature format when loading (#190)

* check node&feature format when loading

* check node&feature format when loading (2£ (2)

* degrade log (#191)

* [PGLBOX]fix conflict

* [PGLBOX]fix conflict

* [PGLBOX]replace LodTensor with phi::DenseTensor

* [PGLBOX]fix gpu_primitives.h include path

* [PGLBOX]from platform::PADDLE_CUDA_NUM_THREADS to phi::PADDLE_CUDA_NUM_THREADS

* [PGLBOX]fix unzip example code

* [PGLBOX]fix unzip example code

* [PGLBOX]fix unzip example code

* [PGLBOX]fix unzip example code

* [PGLBOX]fix unzip ut

* [PGLBOX]fix unzip ut

* [PGLBOX]fix code style

* [PGLBOX]fix code style

* [PGLBOX]fix code style

* fix code style

* fix code style

* fix unzip ut

* fix unzip ut

* fix unzip ut

* fix unzip

* fix code stype

* add ut

* add c++ ut & fix train_mode_ set

* fix load into memory

* fix c++ ut

* fix c++ ut

* fix c++ ut

* fix c++ ut

* fix code style

* fix collective

* fix unzip_op.cc

* fix barrier

* fix code style

* fix barrier

* fix barrier

* fix code styple

* fix unzip

* add unzip.py

* add unzip.py

* fix unzip.py

---------

Co-authored-by: chao9527 <[email protected]>
Co-authored-by: Siming Dai <[email protected]>
Co-authored-by: huwei02 <[email protected]>
Co-authored-by: yangjunchao <[email protected]>
qizhaoaoe pushed a commit to qizhaoaoe/Paddle that referenced this issue Mar 3, 2023
lizexu123 pushed a commit to lizexu123/Paddle that referenced this issue Feb 23, 2024
WAYKEN-TSE pushed a commit to WAYKEN-TSE/Paddle that referenced this issue Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants