Deal with Non-Deterministic Behavior (Ensure Determinism?) #3168
Comments
I have encountered similar issues while using a self-coded LSTM layer to train a translation model. I am sure I did not introduce any random factors in my own code or in the config .prototxt (the random seeds remained the same each time), yet each time I ran the training procedure it yielded different loss values, except for the initial one. Moreover, I had turned off the cuDNN flag while compiling, so I guess there might still be some random factors in the training-related parts of Caffe. FYI, I ran the experiments on a single GPU.
I think the cuDNN non-deterministic behavior is caused by resetting the diffs every time in CuDNNConvolutionLayer::Backward_gpu(). Actually, the diffs are already cleared in Net::ClearParamDiffs(). It seems that this is not a bug in the multi-GPU case, but it is when "iter_size > 1":

```cpp
template <typename Dtype>
void CuDNNConvolutionLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = NULL;
  Dtype* weight_diff = NULL;
  if (this->param_propagate_down_[0]) {
    weight = this->blobs_[0]->gpu_data();
    weight_diff = this->blobs_[0]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[0]->count(), Dtype(0), weight_diff);
  }
  Dtype* bias_diff = NULL;
  if (this->bias_term_ && this->param_propagate_down_[1]) {
    bias_diff = this->blobs_[1]->mutable_gpu_diff();
    caffe_gpu_set(this->blobs_[1]->count(), Dtype(0), bias_diff);
  }
```
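If that is indeed the cause, a minimal sketch of the suggested change would be to drop the explicit reset and leave clearing to Net::ClearParamDiffs(), so that diffs keep accumulating across the iter_size sub-iterations (sketch only, not a tested patch; the rest of Backward_gpu is assumed unchanged):

```cpp
template <typename Dtype>
void CuDNNConvolutionLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = NULL;
  Dtype* weight_diff = NULL;
  if (this->param_propagate_down_[0]) {
    weight = this->blobs_[0]->gpu_data();
    weight_diff = this->blobs_[0]->mutable_gpu_diff();
    // No caffe_gpu_set() here: Net::ClearParamDiffs() already zeroed the diff
    // at the start of the iteration, and zeroing it again would throw away
    // gradients accumulated across iter_size backward passes.
  }
  Dtype* bias_diff = NULL;
  if (this->bias_term_ && this->param_propagate_down_[1]) {
    bias_diff = this->blobs_[1]->mutable_gpu_diff();
    // Same here: keep accumulating into the existing diff.
  }
  // ... remainder of the backward pass unchanged ...
}
```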
@FicusRong These two lines seem to have been introduced in #3160. I'll take a look. Thanks for reporting!
I encountered problem 3 in multi-GPU training. I use two data input layers (one for images and the other for multi-dimensional labels), and the program crashed with the following error (training is fine on a single GPU). Is there currently any workaround for this problem?

*** Aborted at 1453395416 (unix time) try "date -d @1453395416" if you are using GNU date ***
@ronghanghu To fix 1, would it be okay if the default algorithms (bwd_filter_algo_ and bwd_data_algo_) were changed to 1 (deterministic, according to the cuDNN docs)? I couldn't find any information on the impact this would have on performance. Should we expect any side effects besides performance if we manually set those algorithms to 1 and rebuild Caffe as a temporary fix (instead of disabling cuDNN)? NVIDIA's fork already has this change.
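For anyone who wants to try this while waiting for an official fix, a rough sketch of the kind of change being discussed (not a tested patch): in CuDNNConvolutionLayer<Dtype>::Reshape in cudnn_conv_layer.cpp, pin the backward algorithms instead of using the defaults, and keep querying the workspace sizes for whatever algorithm is selected.

```cpp
// Sketch only: force the deterministic cuDNN backward algorithms.
for (int i = 0; i < bottom.size(); ++i) {
  // ALGO_1 is documented by cuDNN as deterministic; the current defaults
  // (BWD_DATA_ALGO_0 / BWD_FILTER_ALGO_3) are not.
  bwd_filter_algo_[i] = CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1;
  bwd_data_algo_[i]   = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1;
  // Workspace sizes must still be queried for these choices via
  // cudnnGetConvolutionBackwardFilterWorkspaceSize() and
  // cudnnGetConvolutionBackwardDataWorkspaceSize().
}
```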
This is fixed with the switch to the new parallelism in #4563. The non-determinism of cuDNN can be addressed by setting the deterministic backward algorithms, as suggested above.
I have run into the same issue.
I am facing the same issue. I tried rodrigoberriel's solution but still got non-deterministic results when training. Is there a way to get consistent results for each run without disabling cuDNN, since disabling it slows down training?
Although there have been a lot of efforts in Caffe (such as the unified RNG) to ensure reproducible and deterministic results, Caffe is currently still non-deterministic in several ways, as described below:

1. The cuDNN convolution backward pass is non-deterministic by default, since it uses CUDNN_CONVOLUTION_BWD_DATA_ALGO_0 and CUDNN_CONVOLUTION_BWD_FILTER_ALGO_3.
2. Multi-GPU training is numerically non-deterministic.
3. Multi-GPU training can break outright in some configurations (for example, nets that use more than one data input layer).

1 & 2 (numerical non-determinism) can cause tests that rely on deterministic behavior (such as TestSnapshot in test_gradient_based_solver.cpp) to fail, while 3 can result in bugs like #2977.
This thread is opened to discuss how to cope with these issues (and possibly ensure determinism in Caffe?)
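As a quick way to observe the numerical non-determinism (1 & 2), one can run the same solver twice with a fixed seed and compare the resulting weights. The sketch below is illustrative only: it assumes a recent Caffe build with the solver registry, a hypothetical solver.prototxt, and the learnable_params() accessor (older trees expose params() instead).

```cpp
#include <string>
#include "caffe/caffe.hpp"

using namespace caffe;  // NOLINT(build/namespaces)

// Hypothetical helper: run a few solver iterations from a fresh net with a
// fixed seed and return the first weight of the first learnable parameter
// as a cheap fingerprint of the run.
float RunOnce(const std::string& solver_prototxt) {
  Caffe::set_mode(Caffe::GPU);
  Caffe::set_random_seed(1337);  // identical seed for every run
  SolverParameter param;
  ReadProtoFromTextFileOrDie(solver_prototxt, &param);
  boost::shared_ptr<Solver<float> > solver(
      SolverRegistry<float>::CreateSolver(param));
  solver->Step(20);  // a handful of iterations is enough to see divergence
  return solver->net()->learnable_params()[0]->cpu_data()[0];
}

int main(int argc, char** argv) {
  ::google::InitGoogleLogging(argv[0]);
  const std::string solver = "solver.prototxt";  // hypothetical solver config
  const float a = RunOnce(solver);
  const float b = RunOnce(solver);
  // With the default cuDNN backward algorithms, a and b usually differ.
  LOG(INFO) << "run 1: " << a << "  run 2: " << b
            << (a == b ? "  (deterministic)" : "  (non-deterministic)");
  return 0;
}
```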