-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checkpoint M2: lookup table checkpoint #11490
Conversation
@@ -276,8 +320,8 @@ void AsyncGRPCServer::TryToRegisterNewOne(const std::string& rpc_name, | |||
return; | |||
} | |||
|
|||
VLOG(4) << "register send rpc_name:" << rpc_name | |||
<< ", handler:" << rpc_call_map_[kRequestSend]; | |||
LOG(INFO) << "TryToRegisterNewOne on RPC NAME: " << rpc_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use VLOG(level)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个日志为啥需要修改?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个日志之前是错误的, 不管接收到的rpc_name
是什么,只打印 handler:" << rpc_call_map_[kRequestSend]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
req_id?
@@ -243,28 +247,40 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope, | |||
|
|||
PADDLE_ENFORCE(!rpc_service_); | |||
std::string endpoint = Attr<std::string>("endpoint"); | |||
int checkpoint_notify_id = Attr<int>(kCheckpointBlockId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkpoint_notify_id
这个名字不太好
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to checkpoint_notify_block_id
import os | ||
|
||
pserver_program.global_block().create_var( | ||
name="loopup_table_path", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loopup_table_path
字符串有问题
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
checkpoint_save_block = pserver_program.create_block(pre_block_idx) | ||
checkpoint_save_block.append_op( | ||
type='save', | ||
inputs={'X': [self.table_name]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table_name
为啥是input,而且应该也不是var吧?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.table_name是var, 已修改attr
python/paddle/fluid/io.py
Outdated
@@ -1068,6 +1187,29 @@ def save_trainer_args(dirname, trainer_id, trainer_args): | |||
|
|||
|
|||
def load_trainer_args(checkpoint_dir, serial, trainer_id, trainer_args): | |||
""" | |||
trainer will load some args from it's independent directory, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's independent
多了一个空格 :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dirname(str): The directory path | ||
main_program(Program): Find the variable named table_name in main_program | ||
pserver_id(int): the serial number in pserver_endpoints list | ||
table_name(str): lookup table name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns:
前加一行空格
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
paddle/fluid/operators/save_op.cc
Outdated
AddComment(R"DOC( | ||
Save operator | ||
|
||
This operator will serialize and write a tensor variable to file on disk. | ||
This operator will serialize and write a tensor/selected rows variable to file on disk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoDTensor / SelectedRows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -224,6 +266,8 @@ void AsyncGRPCServer::StartServer() { | |||
reqs.reserve(kRequestBufSize); | |||
|
|||
for (int i = 0; i < kRequestBufSize; i++) { | |||
LOG(INFO) << "TryToRegisterNewOne on RPC NAME: " << rpc_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这些日志太多了,用VLOG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
日志打印有点小问题。
inputs={'X': [self.table_name]}, | ||
outputs={}, | ||
attrs={ | ||
'file_path': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'file_path': ""
put this 'file_path' do not be used in save lookup table variable
into comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
LOG(INFO) << "sync_mode:" << sync_mode << ", fan_in:" << fan_in | ||
<< ", end_point:" << endpoint; | ||
<< ", end_point:" << endpoint | ||
<< ", CheckpointNotify Id: " << checkpoint_notify_block_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"checkpoint_notify_block_id" => "checkpoint_block_id"
"CheckpointNotify Id" => "checkpoint_block_id "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
std::shared_ptr<framework::ExecutorPrepareContext> ckpt_pre_context = nullptr; | ||
if (checkpoint_notify_block_id != -1) { | ||
auto ctx = executor.Prepare(*program, checkpoint_notify_block_id); | ||
ckpt_pre_context = std::move(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
看着有点奇怪,加点注释。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
paddle/fluid/operators/save_op.cc
Outdated
#include "paddle/fluid/platform/device_context.h" | ||
|
||
namespace paddle { | ||
namespace operators { | ||
|
||
// define LOOKUP_TABLE_PATH for checkpoint notify to save lookup table variables | ||
// to directory specified. | ||
constexpr char LOOKUP_TABLE_PATH[] = "lookup_table_path"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://google.github.io/styleguide/cppguide.html#Constant_Names
LOOKUP_TABLE_PATH
=> kLookupTablePath
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, we can merge it now~ Thanks for the great work!
A new feature for #11410, it does not complete currently.
project: #10868