
Using fleet.save_inference_model fails with ERROR: A protocol message was rejected because it was too big (more than 67108864 bytes). #3225

Open
lijun900302 opened this issue Aug 29, 2019 · 9 comments
@lijun900302

from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
fleet.save_inference_model(executor=exe, dirname=model_dir, feeded_var_names=feed_var_names, target_vars=[auc_var, batch_auc_var])

Saving the model with the call above fails:

/usr/local/lib/python2.7/dist-packages/paddle/fluid/io.py:1084: UserWarning: save_inference_model specified the param program_only to True, It will not save params of Program.
[libprotobuf ERROR /paddle/build/third_party/protobuf/src/extern_protobuf/src/google/protobuf/io/coded_stream.cc:208] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Traceback (most recent call last):
  File "/paddle/task-20190828201029-87895/dnn_dense_interaction.py", line 374, in <module>
    train()
  File "/paddle/task-20190828201029-87895/dnn_dense_interaction.py", line 363, in train
    fleet.save_inference_model(executor=exe, dirname=model_dir, feeded_var_names=feed_var_names, target_vars=[auc_var, batch_auc_var])
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/incubate/fleet/parameter_server/distribute_transpiler/__init__.py", line 157, in save_inference_model
    program = Program.parse_from_string(program_desc_str)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 3315, in parse_from_string
    p.desc = core.ProgramDesc(binary_str)
paddle.fluid.core_avx.EnforceNotMet: Fail to parse program_desc from binary string. at [/paddle/paddle/fluid/framework/program_desc.cc:95]
PaddlePaddle Call Stacks:
0   0x7fa349ef999ap  void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 506
1   0x7fa349efa6a5p  paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 165
2   0x7fa34a0c97bep  paddle::framework::ProgramDesc::ProgramDesc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 782
3   0x7fa349fbbb66p
4   0x7fa349f26a14p
5   0x4eef5ep
6   0x4eeb66p
7   0x4aaafbp
8   0x4c166dp  PyEval_EvalFrameEx + 22413
9   0x4b9b66p  PyEval_EvalCodeEx + 774
10  0x4c1f56p  PyEval_EvalFrameEx + 24694
11  0x4b9b66p  PyEval_EvalCodeEx + 774
12  0x4c17c6p  PyEval_EvalFrameEx + 22758
13  0x4b9b66p  PyEval_EvalCodeEx + 774
14  0x4c1f56p  PyEval_EvalFrameEx + 24694
15  0x4b9b66p  PyEval_EvalCodeEx + 774
16  0x4eb69fp
17  0x4e58f2p  PyRun_FileExFlags + 130
18  0x4e41a6p  PyRun_SimpleFileExFlags + 390
19  0x4938cep  Py_Main + 1358
20  0x7fa3bcf7a830p  __libc_start_main + 240
21  0x493299p  _start + 41

@lijun900302
Author

lijun900302 commented Aug 30, 2019

Adding this didn't help either:
image

@MrChengmo
Contributor

I am trying to reproduce this locally and will keep following up.

@MrChengmo
Contributor

One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example: https://github.com/PaddlePaddle/models/tree/8bca0e4311b444a61024c2a5dd755a22b47487da/legacy/ctr

@lijun900302
Author

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

@MrChengmo
Contributor

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

> I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

Hello, I tried to reproduce your error locally with your network definition and code, but the same problem did not occur: the model saved successfully. Please share your runtime environment and Paddle version so I can keep following up.
Also, when saving a distributed model with fleet, we recommend save_persistables; at inference time, build the same network without the optimizer, then call load_persistables. See this fleet example: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr
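For reference, the save/load flow recommended above looks roughly like this with the Fluid 1.x fleet API. This is a minimal sketch that assumes a running fleet job (fleet.init() and training already done); build_net, model_dir, and the executor setup are hypothetical placeholders, not the reporter's actual code:

```python
# Sketch of the recommended flow: save persistables during training,
# reload them into an optimizer-free copy of the same network for inference.
# build_net and model_dir are hypothetical placeholders.
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet

exe = fluid.Executor(fluid.CPUPlace())

# --- training side: save persistable variables, not the inference program ---
if fleet.is_first_worker():
    fleet.save_persistables(executor=exe, dirname=model_dir)

# --- inference side: rebuild the SAME topology, without optimizer.minimize() ---
infer_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
    loss, auc_var = build_net()  # identical network definition, no optimizer
exe.run(startup_prog)
fluid.io.load_persistables(executor=exe, dirname=model_dir,
                           main_program=infer_prog)
```

Since only variables (not a serialized ProgramDesc) cross the save/load boundary here, this path sidesteps parsing a very large protobuf program message.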

@lijun900302
Author

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

> I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

> Hello, I tried to reproduce your error locally with your network definition and code, but the same problem did not occur: the model saved successfully. Please share your runtime environment and Paddle version so I can keep following up. Also, when saving a distributed model with fleet, we recommend save_persistables; at inference time, build the same network without the optimizer, then call load_persistables. See this fleet example: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr

I tried save_persistables and it didn't help. paddle-fleet-release:v1.5, running in a Docker container, which should not matter.

@MrChengmo
Contributor

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

> I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

> Hello, I tried to reproduce your error locally with your network definition and code, but the same problem did not occur: the model saved successfully. Please share your runtime environment and Paddle version so I can keep following up. Also, when saving a distributed model with fleet, we recommend save_persistables; at inference time, build the same network without the optimizer, then call load_persistables. See this fleet example: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr

> I tried save_persistables and it didn't help. paddle-fleet-release:v1.5, running in a Docker container, which should not matter.

Hi, does save_persistables fail with the same error? Also, what does your distributed setup look like: how many trainers and how many pservers? When training asynchronously with dataset, two settings are essential: set DistributeTranspilerConfig().sync_mode = False and DistributeTranspilerConfig().runtime_split_send_recv = True.
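The two settings above, and how the config reaches the optimizer, can be sketched as follows. This is a configuration fragment assuming the Fluid 1.5 fleet API; role, endpoint, and network setup are elided:

```python
# Async-training config for fleet with dataset (Fluid 1.5 sketch).
from paddle.fluid.transpiler import DistributeTranspilerConfig
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet

config = DistributeTranspilerConfig()
config.sync_mode = False                # fully asynchronous training
config.runtime_split_send_recv = True   # required for dataset-based async training

# The config is passed as the strategy of the distributed optimizer:
# optimizer = fleet.distributed_optimizer(optimizer, strategy=config)
```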

@lijun900302
Author

save_persistables gives the same error. Other models run fine; I wanted to try this one because its training AUC is higher. Usually 16 pservers. The settings above are already in place.

@MrChengmo
Contributor

> save_persistables gives the same error. Other models run fine; I wanted to try this one because its training AUC is higher. Usually 16 pservers. The settings above are already in place.

Hello, we cannot reproduce your problem in our environment. Another user reported a similar problem earlier; please check whether that issue helps.
