Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpoint M2: lookup table checkpoint #11490

Merged
merged 75 commits into from
Jun 26, 2018

Conversation

seiriosPlus
Copy link
Collaborator

@seiriosPlus seiriosPlus commented Jun 14, 2018

A new feature for #11410, it does not complete currently.
project: #10868

@@ -276,8 +320,8 @@ void AsyncGRPCServer::TryToRegisterNewOne(const std::string& rpc_name,
return;
}

VLOG(4) << "register send rpc_name:" << rpc_name
<< ", handler:" << rpc_call_map_[kRequestSend];
LOG(INFO) << "TryToRegisterNewOne on RPC NAME: " << rpc_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use VLOG(level)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个日志为啥需要修改?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个日志之前是错误的, 不管接收到的rpc_name是什么,只打印 handler:" << rpc_call_map_[kRequestSend]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

req_id?

@@ -243,28 +247,40 @@ void ListenAndServOp::RunImpl(const framework::Scope &scope,

PADDLE_ENFORCE(!rpc_service_);
std::string endpoint = Attr<std::string>("endpoint");
int checkpoint_notify_id = Attr<int>(kCheckpointBlockId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkpoint_notify_id 这个名字不太好

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to checkpoint_notify_block_id

import os

pserver_program.global_block().create_var(
name="loopup_table_path",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loopup_table_path 字符串有问题

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

checkpoint_save_block = pserver_program.create_block(pre_block_idx)
checkpoint_save_block.append_op(
type='save',
inputs={'X': [self.table_name]},
Copy link
Member

@jacquesqiao jacquesqiao Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table_name 为啥是input,而且应该也不是var吧?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.table_name是var, 已修改attr

@@ -1068,6 +1187,29 @@ def save_trainer_args(dirname, trainer_id, trainer_args):


def load_trainer_args(checkpoint_dir, serial, trainer_id, trainer_args):
"""
trainer will load some args from it's independent directory,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's independent 多了一个空格 :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dirname(str): The directory path
main_program(Program): Find the variable named table_name in main_program
pserver_id(int): the serial number in pserver_endpoints list
table_name(str): lookup table name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns:

前加一行空格

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

AddComment(R"DOC(
Save operator

This operator will serialize and write a tensor variable to file on disk.
This operator will serialize and write a tensor/selected rows variable to file on disk.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoDTensor / SelectedRows

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -224,6 +266,8 @@ void AsyncGRPCServer::StartServer() {
reqs.reserve(kRequestBufSize);

for (int i = 0; i < kRequestBufSize; i++) {
LOG(INFO) << "TryToRegisterNewOne on RPC NAME: " << rpc_name
Copy link
Contributor

@gongweibao gongweibao Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这些日志太多了,用VLOG

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@gongweibao gongweibao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

日志打印有点小问题。

inputs={'X': [self.table_name]},
outputs={},
attrs={
'file_path':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'file_path': ""

put this 'file_path' do not be used in save lookup table variable into comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


LOG(INFO) << "sync_mode:" << sync_mode << ", fan_in:" << fan_in
<< ", end_point:" << endpoint;
<< ", end_point:" << endpoint
<< ", CheckpointNotify Id: " << checkpoint_notify_block_id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"checkpoint_notify_block_id" => "checkpoint_block_id"
"CheckpointNotify Id" => "checkpoint_block_id "

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

std::shared_ptr<framework::ExecutorPrepareContext> ckpt_pre_context = nullptr;
if (checkpoint_notify_block_id != -1) {
auto ctx = executor.Prepare(*program, checkpoint_notify_block_id);
ckpt_pre_context = std::move(ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

看着有点奇怪,加点注释。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

#include "paddle/fluid/platform/device_context.h"

namespace paddle {
namespace operators {

// define LOOKUP_TABLE_PATH for checkpoint notify to save lookup table variables
// to directory specified.
constexpr char LOOKUP_TABLE_PATH[] = "lookup_table_path";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

jacquesqiao
jacquesqiao previously approved these changes Jun 26, 2018
Copy link
Member

@jacquesqiao jacquesqiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, we can merge it now~ Thanks for the great work!

@seiriosPlus seiriosPlus merged commit b20fa02 into PaddlePaddle:develop Jun 26, 2018
@seiriosPlus seiriosPlus deleted the ckpt_m2 branch June 26, 2018 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants