Program crashes when a multi-threaded batch contains a specific number of samples #732
It looks like a bug in Fluid; we will try to figure it out soon, thanks!
(See models/fluid/sequence_tagging_for_ner/network_conf.py, lines 112 to 123 at commit 1875ab1.)
This demo only fails in parallel training. To be clear, we use device_count to refer to the number of parallel devices. When the sample size is smaller than device_count, some devices receive no input data. By design, the operators on a device are not run if that device received no data. It seems the logic goes wrong when the sparse update is enabled in the lookup-table gradient operator.
You can work around this problem by setting the input batch size to a multiple of the device count. We are working on a fix for this bug.
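A minimal sketch of that workaround, assuming a plain Python sample reader and a hypothetical device_count variable (the wrapper name and parameters below are illustrative, not part of the Paddle API):

```python
# Sketch of a reader wrapper that keeps every batch size a multiple of the
# parallel device count. `reader` is any sample-yielding generator function
# (e.g. the one built in train.py); `device_count` is the number of parallel
# devices. All names here are illustrative placeholders.

def batch_multiple_of_devices(reader, batch_size, device_count, drop_tail=True):
    def batch_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            # Trim the final partial batch to the largest multiple of
            # device_count so no device is left without input data.
            usable = (len(batch) // device_count) * device_count
            if usable > 0:
                yield batch[:usable]
            elif not drop_tail:
                # Fewer leftover samples than devices: pad by repeating
                # samples so every device still gets at least one.
                yield (batch * device_count)[:device_count]
    return batch_reader
```

Choosing batch_size itself as a multiple of device_count then guarantees every batch, including the trimmed tail, divides evenly across devices.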
The problem with this workaround is that we have to consider both batch_size and the number of samples in the last batch of each pass at the same time, keeping both at suitable values. For a training set of a particular size, this is fairly troublesome. Please let us know once the issue is fixed.
Is there any further progress? If so, please let us know. Thanks!
I ran into a rather strange bug. When training the model with
models/fluid/sequence_tagging_for_ner/train.py on branch jzy2,
using the CoNLL-2003 training set:
With batch_size set to 200 there is no error and training runs normally the whole way through. But if I set batch_size=35, the error below appears.
In fact, it is not only BATCH_SIZE=35 that fails; BATCH_SIZE=50 also fails, because the number of training samples % 50 = 35, so the last batch contains 35 samples. BATCH_SIZE=34 fails as well.
Since the samples are shuffled before use, this cannot be caused by any particular sample. Judging from the error mentioned above, the thread pool is at fault; the exact cause still needs to be tracked down by the relevant developers. To reproduce, you can also concatenate several copies of the original data/train file into a larger training set, which triggers the same problem.
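A quick way to inspect which batch_size settings leave an awkward tail batch, as a hedged sketch: num_samples and device_count below are placeholders (num_samples is chosen only so that num_samples % 50 == 35, matching the report; substitute the real training-set size and device count).

```python
# Illustrative helper: size of the last batch in one pass over the data,
# and whether it divides evenly across the parallel devices.

def tail_batch(num_samples, batch_size):
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size

num_samples = 14035   # placeholder: any N with N % 50 == 35
device_count = 4      # placeholder for the number of parallel devices
for batch_size in (35, 50, 34):
    tail = tail_batch(num_samples, batch_size)
    print(f"batch_size={batch_size}: last batch = {tail}, "
          f"multiple of {device_count} devices: {tail % device_count == 0}")
```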