Some findings about testing MKL_DNN on laptop #5462

kuke · 2017-11-08T03:24:25Z

We ran the mkl_dnn benchmark test in a Docker container on a Dell XPS 15 laptop and found that:

The batch size of training samples is limited by the laptop's memory (8 GB) to at most 48, which is smaller than the minimum batch size used in the benchmark test on the server.
When the batch size is too small (<= 8), the training cost becomes NaN. The test script may need to be modified to avoid this.
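
One way the test script could surface the NaN early, instead of silently continuing, is a small guard on the cost value. This is only a sketch: `check_cost` and its arguments are hypothetical names, not part of the existing benchmark script, and it assumes the cost is available as a plain Python float after each batch.

```python
import math


def check_cost(cost, batch_id):
    """Return the cost unchanged, or raise if it is NaN or infinite.

    Hypothetical helper: `cost` is assumed to be a Python float taken
    from the trainer after each batch, `batch_id` is used only for the
    error message.
    """
    if not math.isfinite(cost):
        raise FloatingPointError(
            "non-finite training cost %r at batch %d" % (cost, batch_id)
        )
    return cost
```

Calling this once per batch would make the benchmark stop at the first bad batch, which makes it easier to tell which batch size and learning rate triggered the problem.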

tensor-tang · 2017-11-08T05:58:38Z

Thanks kuke

I highly recommend expanding the memory for the benchmark, since 8 GB is smaller than even some GPUs (12 GB of memory).
For very deep topologies such as ResNet, we can only choose a very small batch size,
which cannot show the best performance of MKL-DNN or MKLML.
When the batch size is reduced, the learning rate should be reduced too. Since VGG does not have a batch norm layer, the cost easily becomes NaN.
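
As an illustration of lowering the learning rate together with the batch size, the sketch below uses the common linear scaling heuristic (learning rate proportional to batch size). This is an assumption for illustration, not necessarily the rule the benchmark script uses, and `scaled_learning_rate` and the numbers in the usage line are hypothetical.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Scale the learning rate linearly with the batch size.

    Hypothetical helper: `base_lr` is a learning rate known to train
    stably at `base_batch_size`; the result is the proportionally
    smaller (or larger) rate for `batch_size`.
    """
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Example with made-up values: a rate tuned for batch size 64,
# rescaled for batch size 8 (one eighth of the base rate).
lr_small = scaled_learning_rate(0.01, 64, 8)
```

Linear scaling is only a starting point. A network without batch normalization, such as VGG, may need an even smaller rate at very small batch sizes.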

luotao1 · 2017-11-08T06:10:32Z

> since 8G is even smaller than some GPU(12G memory)

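We chose laptops and desktops with modest memory for performance testing mainly because they belong to the consumer market. In most learning scenarios a GPU is more than what is needed, and using a GPU for learning is a significant upgrade cost for students. If MKLDNN can run most models with the resources users already have, that could be especially good news for beginners.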

> since vgg do not have batch norm layer, it's very easy to nan

If VGG is prone to NaN at small batch sizes, could we consider testing other networks?

tensor-tang · 2017-11-08T06:44:27Z

After reducing the learning rate, the NaN no longer occurs.

tensor-tang mentioned this issue Nov 8, 2017

reduce the lr in case of nan in small batchsize #5477

Merged

tensor-tang closed this as completed Dec 7, 2017
