
Using fleet.save_inference_model fails with ERROR: A protocol message was rejected because it was too big (more than 67108864 bytes). #3225

Open
lijun900302 opened this issue Aug 29, 2019 · 9 comments
@lijun900302

from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
fleet.save_inference_model(executor=exe, dirname=model_dir, feeded_var_names=feed_var_names, target_vars=[auc_var, batch_auc_var])

Saving the model with the call above fails:

/usr/local/lib/python2.7/dist-packages/paddle/fluid/io.py:1084: UserWarning: save_inference_model specified the param program_only to True, It will not save params of Program.
[libprotobuf ERROR /paddle/build/third_party/protobuf/src/extern_protobuf/src/google/protobuf/io/coded_stream.cc:208] A protocol message was rejected because it was too big (more than 67108864 bytes). To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
Traceback (most recent call last):
  File "/paddle/task-20190828201029-87895/dnn_dense_interaction.py", line 374, in <module>
    train()
  File "/paddle/task-20190828201029-87895/dnn_dense_interaction.py", line 363, in train
    fleet.save_inference_model(executor=exe, dirname=model_dir, feeded_var_names=feed_var_names, target_vars=[auc_var, batch_auc_var])
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/incubate/fleet/parameter_server/distribute_transpiler/__init__.py", line 157, in save_inference_model
    program = Program.parse_from_string(program_desc_str)
  File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/framework.py", line 3315, in parse_from_string
    p.desc = core.ProgramDesc(binary_str)
paddle.fluid.core_avx.EnforceNotMet: Fail to parse program_desc from binary string. at [/paddle/paddle/fluid/framework/program_desc.cc:95]
PaddlePaddle Call Stacks:
0   0x7fa349ef999ap  void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 506
1   0x7fa349efa6a5p  paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 165
2   0x7fa34a0c97bep  paddle::framework::ProgramDesc::ProgramDesc(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 782
3   0x7fa349fbbb66p
4   0x7fa349f26a14p
5   0x4eef5ep
6   0x4eeb66p
7   0x4aaafbp
8   0x4c166dp  PyEval_EvalFrameEx + 22413
9   0x4b9b66p  PyEval_EvalCodeEx + 774
10  0x4c1f56p  PyEval_EvalFrameEx + 24694
11  0x4b9b66p  PyEval_EvalCodeEx + 774
12  0x4c17c6p  PyEval_EvalFrameEx + 22758
13  0x4b9b66p  PyEval_EvalCodeEx + 774
14  0x4c1f56p  PyEval_EvalFrameEx + 24694
15  0x4b9b66p  PyEval_EvalCodeEx + 774
16  0x4eb69fp
17  0x4e58f2p  PyRun_FileExFlags + 130
18  0x4e41a6p  PyRun_SimpleFileExFlags + 390
19  0x4938cep  Py_Main + 1358
20  0x7fa3bcf7a830p  __libc_start_main + 240
21  0x493299p  _start + 41

@lijun900302
Author

lijun900302 commented Aug 30, 2019

Adding this didn't help either:
image

@MrChengmo
Contributor

I am trying to reproduce this locally and will keep following up.

@MrChengmo
Contributor

One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example: https://github.com/PaddlePaddle/models/tree/8bca0e4311b444a61024c2a5dd755a22b47487da/legacy/ctr

@lijun900302
Author

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

@MrChengmo
Contributor

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

> I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

Hello, I tried to reproduce your error locally with your network definition and code, but the same problem did not occur: the model saved successfully. Please share your runtime environment and Paddle version so I can keep following up.
Also, when saving a distributed model with fleet, we recommend save_persistables; at inference time, build the same network without the optimizer, then call load_persistables. See this fleet example: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr
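For reference, the save/load flow recommended above looks roughly like this with the Fluid 1.x fleet API. This is a minimal sketch that assumes a running fleet job (fleet.init() and training already done); build_net, model_dir, and the executor setup are hypothetical placeholders, not the reporter's actual code:

```python
# Sketch of the recommended flow: save persistables during training,
# reload them into an optimizer-free copy of the same network for inference.
# build_net and model_dir are hypothetical placeholders.
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet

exe = fluid.Executor(fluid.CPUPlace())

# --- training side: save persistable variables, not the inference program ---
if fleet.is_first_worker():
    fleet.save_persistables(executor=exe, dirname=model_dir)

# --- inference side: rebuild the SAME topology, without optimizer.minimize() ---
infer_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(infer_prog, startup_prog):
    loss, auc_var = build_net()  # identical network definition, no optimizer
exe.run(startup_prog)
fluid.io.load_persistables(executor=exe, dirname=model_dir,
                           main_program=infer_prog)
```

Since only variables (not a serialized ProgramDesc) cross the save/load boundary here, this path sidesteps parsing a very large protobuf program message.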

@lijun900302
Author

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

> I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

> Hello, I tried to reproduce your error locally with your network definition and code, but the same problem did not occur: the model saved successfully. Please share your runtime environment and Paddle version so I can keep following up. Also, when saving a distributed model with fleet, we recommend save_persistables; at inference time, build the same network without the optimizer, then call load_persistables. See this fleet example: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr

I tried save_persistables and it didn't help. paddle-fleet-release:v1.5, running in a Docker container, which should not matter.

@MrChengmo
Contributor

> One prominent problem found while reproducing so far: building the network and the program takes abnormally long. For a wide&deep implementation, see the official example:

> I'm not implementing wide&deep; as I said on QQ, I take the second-order interaction values out of the FM and feed them, together with the DNN input, into a softmax.

> Hello, I tried to reproduce your error locally with your network definition and code, but the same problem did not occur: the model saved successfully. Please share your runtime environment and Paddle version so I can keep following up. Also, when saving a distributed model with fleet, we recommend save_persistables; at inference time, build the same network without the optimizer, then call load_persistables. See this fleet example: https://github.com/PaddlePaddle/Fleet/tree/develop/examples/distribute_ctr

> I tried save_persistables and it didn't help. paddle-fleet-release:v1.5, running in a Docker container, which should not matter.

Hi, does save_persistables fail with the same error? Also, what does your distributed setup look like: how many trainers and how many pservers? When training asynchronously with dataset, two settings are essential: set DistributeTranspilerConfig().sync_mode = False and DistributeTranspilerConfig().runtime_split_send_recv = True.
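The two settings above, and how the config reaches the optimizer, can be sketched as follows. This is a configuration fragment assuming the Fluid 1.5 fleet API; role, endpoint, and network setup are elided:

```python
# Async-training config for fleet with dataset (Fluid 1.5 sketch).
from paddle.fluid.transpiler import DistributeTranspilerConfig
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet

config = DistributeTranspilerConfig()
config.sync_mode = False                # fully asynchronous training
config.runtime_split_send_recv = True   # required for dataset-based async training

# The config is passed as the strategy of the distributed optimizer:
# optimizer = fleet.distributed_optimizer(optimizer, strategy=config)
```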

@lijun900302
Author

save_persistables gives the same error. Other models run fine; I wanted to try this one because its training AUC is higher. Usually 16 pservers. The settings above are already in place.

@MrChengmo
Contributor

> save_persistables gives the same error. Other models run fine; I wanted to try this one because its training AUC is higher. Usually 16 pservers. The settings above are already in place.

Hello, we cannot reproduce your problem in our environment. Another user reported a similar problem earlier; please check whether that issue helps.
