"Serialize LoDTensor, Save/Restore model" #4602

Merged
merged 34 commits into from
Oct 24, 2017

Conversation

dzhwinter
Contributor

@dzhwinter dzhwinter commented Oct 5, 2017

This might be a better place for review.

@dzhwinter dzhwinter changed the title "add model format design doc" "[WIP]add model format design doc" Oct 5, 2017
@helinwang
Contributor

helinwang commented Oct 5, 2017

Option 1: Use Protobuf to serialize tensor.

Protobuf is very inefficient for large chunks of data. For example [1]:

[screenshot: an example protobuf message definition]

Its corresponding encoded binary is:

[screenshot: the encoded binary of that message]

[1] From Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

As you can see from the second image, each repeated field contains a field tag. This is very bad for both speed and storage, especially since we will need to serialize gradients and parameters between trainers and pservers.

Option 2: Write custom tensor serialization.

This is the option taken in this PR. The user needs to write the encoding and decoding code.

Option 3: Only use Protobuf to serialize the tensor metadata, serialize the tensor memory block directly, and pack them together.

This option depends on Protobuf, but offers forward and backward compatibility.
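A minimal sketch of the Option 3 layout (hypothetical helper names, not the PR's actual code): a length-prefixed placeholder standing in for the protobuf metadata, followed by the raw tensor memory with no per-element tags. For brevity the length fields are written in host byte order, which is assumed little-endian here.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Option 3 sketch: [desc_len][data_len][desc bytes][raw value bytes].
// `desc` stands in for the serialized protobuf metadata (LoDTensorDesc).
std::string PackTensor(const std::string& desc, const std::vector<float>& data) {
  std::string out;
  uint32_t desc_len = static_cast<uint32_t>(desc.size());
  uint32_t data_len = static_cast<uint32_t>(data.size() * sizeof(float));
  out.append(reinterpret_cast<const char*>(&desc_len), sizeof(desc_len));
  out.append(reinterpret_cast<const char*>(&data_len), sizeof(data_len));
  out.append(desc);
  // Dump the tensor memory block directly -- no field tag per element.
  out.append(reinterpret_cast<const char*>(data.data()), data_len);
  return out;
}

void UnpackTensor(const std::string& buf, std::string* desc,
                  std::vector<float>* data) {
  uint32_t desc_len = 0, data_len = 0;
  std::memcpy(&desc_len, buf.data(), sizeof(desc_len));
  std::memcpy(&data_len, buf.data() + sizeof(desc_len), sizeof(data_len));
  const char* body = buf.data() + sizeof(desc_len) + sizeof(data_len);
  desc->assign(body, desc_len);
  data->resize(data_len / sizeof(float));
  std::memcpy(data->data(), body + desc_len, data_len);
}
```

Only the two small length fields carry framing overhead; the value bytes are copied verbatim, which is why this option avoids protobuf's per-field tag cost.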


The parameters are saved as a binary file. The protobuf message has a [64M size limit](https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream#CodedInputStream.SetTotalBytesLimit.details), so we design a particular format for tensor serialization. For speed, we dump the tensor memory directly to disk and save the necessary metadata, such as the `dims` and `name` of the tensor, and even the `LoD` information in [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/paddle/framework/lod_tensor.md). In detail, the format is shown in the table below.

```text
[offset] [type]                               [value] [description]
0000     32 bit little-endian signed integer  ??      HeaderLength, the length of LoDTensorDesc
0004     32 bit little-endian signed integer  ??      ContentLength, the length of LoDTensor Buffer
0008     raw bytes                            ??      TensorDesc
...
00100    raw bytes                            ??      Tensor Value
```

Contributor Author:

In fact, I think protobuf may be a better choice than the current design. To break through the 64M limitation, we can divide a big parameter into smaller ones, so the only remaining concern is the packed parameter size.
According to @helinwang's comment, every repeated message carries a small tag, which is not a good fit for chunked data.
We will do some benchmark experiments and choose the better one.

@wangkuiyi
Collaborator

wangkuiyi commented Oct 5, 2017

Should we do Option 3 -- write a protobuf message VarDesc and/or LoDTensorDesc, followed by the tensor data?

@dzhwinter
Contributor Author

dzhwinter commented Oct 5, 2017

I think tensor serialization efficiency and size are trivial concerns when saving a model/checkpoint. Many models in real applications range in magnitude from 10M to 100M, which means a single tensor cannot violate the 64M size limit. For pack/unpack efficiency, protobuf has a `packed` option (see the protobuf documentation for details) that saves the repeated-message tag cost. I'm not sure it is efficient enough to cover our user scenarios.

The only pain point of tensor serialization efficiency is swapping tensors frequently between nodes, pservers, etc. There, every small improvement in pack/unpack gives us a lot of benefit in saving precious bandwidth.

Maybe we need some measurements to choose a good one.

@dzhwinter
Contributor Author

dzhwinter commented Oct 6, 2017

Here are some test results for the three options above. Time cost measures one round of tensor serialization and deserialization. We tried different tensor sizes, such as 10x10, 100x100, and 1000x1000, and averaged the time cost over 1000 runs; smaller is better.

| option  | time cost (s) |
| ------- | ------------- |
| option1 | 0.075905      |
| option2 | 0.00283767    |
| option3 | 0.00294829    |

Option 3 is the best trade-off between speed/efficiency and maintenance difficulty.

Here is the code we tested: benchmark of tensor serialization.
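For reference, a rough sketch of how one round of raw-memory serialization could be timed; this is an illustrative stand-in under assumed names, not the linked benchmark code, and it only exercises the memcpy core of options 2 and 3:

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Time `rounds` serialize/deserialize round trips of an `elems`-element
// float tensor using raw memcpy, and return the average seconds per round.
// `restored` receives the last round-trip result so correctness can be checked.
double AverageRoundTripSeconds(size_t elems, int rounds,
                               std::vector<float>* restored) {
  std::vector<float> tensor(elems, 1.0f);
  std::vector<char> buf(elems * sizeof(float));
  restored->assign(elems, 0.0f);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < rounds; ++i) {
    std::memcpy(buf.data(), tensor.data(), buf.size());     // serialize
    std::memcpy(restored->data(), buf.data(), buf.size());  // deserialize
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count() / rounds;
}
```

The protobuf variant (option 1) would replace the two memcpy calls with message packing and parsing, which is where the measured gap comes from.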

@dzhwinter dzhwinter changed the title "[WIP]add model format design doc" "[WIP]Serialize LoDTensor, Save/Restore model, Checkpoint" Oct 9, 2017
@dzhwinter dzhwinter changed the title "[WIP]Serialize LoDTensor, Save/Restore model, Checkpoint" "[WIP]Serialize LoDTensor, Save/Restore model" Oct 9, 2017
@dzhwinter dzhwinter changed the title "[WIP]Serialize LoDTensor, Save/Restore model" "Serialize LoDTensor, Save/Restore model" Oct 9, 2017
@dzhwinter
Contributor Author

dzhwinter commented Oct 9, 2017

When I implemented the checkpoint feature, I found some problems we need to solve.

  1. We need to support running operators asynchronously. Some time-consuming operators will block the training process for a long time. For example, the checkpoint operator runs every n steps, and we do not need to wait for it to finish. The same applies to SendOp, especially when a trainer sends to a parameter server.

  2. To save the model or a checkpoint, the ProgramDesc also needs to be saved. But currently an operator cannot touch the ProgramDesc; only the executor can access it.

  3. Should we save the topology of the ProgramDesc before pruning or after pruning? Which module should take care of it?

  4. Obviously, when it comes to cluster training, we need to merge all the model partitions together. Should that be a global function of the master, or something else?


The model is the output of the training process. A complete model consists of two parts: the **topology** and the **parameters**. To support business deployment, the model format must be self-contained and must not expose any training source code.

As a result, in PaddlePaddle the **topology** is represented as a [ProgramDesc](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/doc/design/program.md), which describes the model structure. The **parameters** contain all the trainable weights in the model; we must support large parameters and high efficiency read/write for speed.
Contributor:
"high efficiency read/write for speed." -> "efficient serialization/deserialization".

Contributor Author:
Done.


## Implementation

The topology is saved as plain text; in detail, a self-complete protobuf file.
Contributor:
What does "self-complete" mean in this context?

Contributor Author:
"self-contain" Done.

[offset] [type] [value] [description]
0000 32 bit integer ?? HeaderLength, the length of LoDTensorDesc
0004 32 bit integer ?? ContentLength, the length of LodTensor Buffer
0008 32 bit integer ?? TensorDesc
Contributor:
TensorDesc is not "32 bit integer", it's just a sequence of bytes.

Contributor Author:
Done

0008 32 bit integer ?? TensorDesc
0012 32 bit integer ?? TensorDesc
...
00100 32 bit integer ?? Tensor Value
Contributor:
Tensor Value is not "32 bit integer", it's just a sequence of bytes.

Contributor Author:
Done.

0008 32 bit integer ?? TensorDesc
0012 32 bit integer ?? TensorDesc
...
00100 32 bit integer ?? Tensor Value
helinwang commented Oct 10, 2017:
Since we are just dumping the memory (e.g., a float32 array) into "Tensor Value", we need to specify the endianness of the elements in Tensor Value.

Contributor Author:
Done.


```text
[offset] [type] [value] [description]
0000 32 bit integer ?? HeaderLength, the length of LoDTensorDesc
```

helinwang commented Oct 10, 2017:
32 bit integer -> 32 bit little-endian signed integer

Contributor Author:
Done.

```text
[offset] [type] [value] [description]
0000 32 bit integer ?? HeaderLength, the length of LoDTensorDesc
0004 32 bit integer ?? ContentLength, the length of LodTensor Buffer
```

helinwang commented Oct 10, 2017:
32 bit integer -> 32 bit little-endian signed integer

Contributor Author:
Done
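The little-endian fix discussed above can be made explicit in code. A small sketch with hypothetical helper names (not the PR's implementation) that writes and reads a 32-bit signed integer in little-endian order regardless of host byte order:

```cpp
#include <cstdint>
#include <string>

// Append a 32-bit signed integer in little-endian byte order,
// independent of the host's native endianness.
void AppendInt32LE(std::string* out, int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  out->push_back(static_cast<char>(u & 0xFF));
  out->push_back(static_cast<char>((u >> 8) & 0xFF));
  out->push_back(static_cast<char>((u >> 16) & 0xFF));
  out->push_back(static_cast<char>((u >> 24) & 0xFF));
}

// Read a 32-bit little-endian signed integer at byte offset `off`.
int32_t ReadInt32LE(const std::string& s, size_t off) {
  uint32_t b0 = static_cast<uint8_t>(s[off]);
  uint32_t b1 = static_cast<uint8_t>(s[off + 1]);
  uint32_t b2 = static_cast<uint8_t>(s[off + 2]);
  uint32_t b3 = static_cast<uint8_t>(s[off + 3]);
  return static_cast<int32_t>(b0 | (b1 << 8) | (b2 << 16) | (b3 << 24));
}
```

Shifting and masking bytes rather than memcpy-ing the integer keeps the on-disk format stable even on a big-endian host.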

PADDLE_ENFORCE(ctx->HasInput("Step"),
"Input(Step) of Checkpoint should not be null.");
std::string absolutePath = ctx->Attrs().Get<std::string>("absolutePath");
PADDLE_ENFORCE(absolutePath != "",
Contributor:
!absolutePath.empty()

Maybe add some regex checking here too, in case the path is set to something like " ".

Contributor Author:
Good point! Fixed the empty check.
But I think we should leave the regex check to the Python side, because:

  1. we need to check that absolutePath is a valid path, which is related to the client and should be done in the client language.
  2. In our implementation, we assume the user's input is correct; we only check whether a required input has been filled.
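The whitespace-only case mentioned in the review could even be caught without a regex; a hypothetical sketch (not the PR's code):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True if the path is empty or contains only whitespace (e.g. " "),
// which a plain !absolutePath.empty() check would not catch.
bool IsBlankPath(const std::string& path) {
  return std::all_of(path.begin(), path.end(),
                     [](unsigned char c) { return std::isspace(c) != 0; });
}
```

Note that std::all_of on an empty range returns true, so the empty string is classified as blank without a separate check.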

// 2. The checkpoint op needs at least two threads,
// because checkpointing happens periodically, so a thread must wait
// for the timer/steps to reach the condition.
auto* Step = ctx.Input<Tensor>("Step");

// because checkpointing happens periodically, so a thread must wait
// for the timer/steps to reach the condition.
auto* Step = ctx.Input<Tensor>("Step");
const int* curr_step = Step->data<int>();
Contributor:

curr_step is not changed in this function, so the pointer is not needed;
just `const int curr_step` is OK.

Contributor Author:
Done.

}

// flag indicate this op may be skipped.
mutable bool run_once = false;
Contributor:

run_once should be a private member named run_once_.
External operations shouldn't be able to change it.

Contributor Author:

Done.


protected:
void InferShape(framework::InferShapeContextBase* ctx) const override {
PADDLE_ENFORCE(ctx->HasOutputs("Out"),
Contributor:

PADDLE_ENFORCE_NOT_NULL

Contributor Author:

Just keeping the same style as the other op implementations.

@@ -103,5 +111,139 @@ void LoDTensor::ShrinkInLevel(size_t level, size_t elem_begin,
lod_ = new_lod;
}

std::string LoDTensor::SerializeToString() const {
Contributor:

Make SerializeToString an external function, such as SerializeToString(LoDTensor).

There may be more such serialization functions, such as SerializeToString(OperatorBase); do not change the definition of the original class.

It is better not to insert methods that have no relation to computation into class LoDTensor.

LoDTensor serves as a concept for computation; keep it clean.

Also, this function is too long; break it up and keep the code clean.

If so much code is added that has no relation to the definition or operation of the LoD or Tensor concepts, it is better to place it inside namespace detail or in another source file.

Contributor Author:

After a face-to-face talk with @Superjom, my opinion on this question is as follows.

  1. Currently, we only need to serialize the in-memory content into a byte stream, namely SerializeToString(LoDTensor) and SerializeToString(Tensor). OperatorBase and the other concepts all have their Desc in protobuf; we do not need a serialization implementation for any other class.

  2. Implementing DeserializeFromString returns a Tensor filled with values; if we don't bind the serialize interface to the Tensor instance, we need another copy of the Tensor.

  3. The offset_ and type in Tensor are hidden. We need to figure them out.

Thanks for this comment!

Contributor:

With DeserializeFromString(LoDTensor*) there is no need to copy a Tensor; filling the data in place seems possible.

}

void LoDTensor::DeserializeFromString(const std::string& s,
const platform::Place& dst_place) {
Contributor:

so is this function.

Contributor Author:

see above.
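The free-function shape discussed in this thread could look like the following sketch, using a minimal stand-in struct rather than the real LoDTensor (all names here are assumptions for illustration):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Minimal stand-in for the real tensor type.
struct SimpleTensor {
  std::vector<float> data;
};

// Free function instead of a member: keeps the tensor class focused on
// computation, as suggested in the review.
std::string SerializeToString(const SimpleTensor& t) {
  return std::string(reinterpret_cast<const char*>(t.data.data()),
                     t.data.size() * sizeof(float));
}

// Fills an existing tensor in place, so no extra Tensor copy is returned.
void DeserializeFromString(const std::string& s, SimpleTensor* t) {
  t->data.resize(s.size() / sizeof(float));
  std::memcpy(t->data.data(), s.data(), s.size());
}
```

The in-place output parameter mirrors the reviewer's point that DeserializeFromString(LoDTensor*) avoids the extra copy that returning a new tensor would require.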

@helinwang
Contributor

helinwang commented Oct 10, 2017

We need to support running operators asynchronously. Some time-consuming operators will block the training process for a long time. For example, the checkpoint operator runs every n steps, and we do not need to wait for it to finish. The same applies to SendOp, especially when a trainer sends to a parameter server.

Agreed, I think it is very important. I think the executor should:

  1. be able to tell whether an OP has completed, and only schedule an OP when all of its dependencies have completed (e.g., a recv OP will only be scheduled when the send OP has completed), and
  2. have a thread pool so that one blocking (e.g., IO-intensive) OP does not block everything else.

CC: @QiJune @tonyyang-svail @wangkuiyi

To save the model or a checkpoint, the ProgramDesc also needs to be saved. But currently an operator cannot touch the ProgramDesc; only the executor can access it.

Maybe we need to put the ProgramDesc into the global scope.

Should we save the topology of the ProgramDesc before pruning or after pruning? Which module should take care of it?

Maybe we need to save the input of Prune (the ProgramDesc) into the global scope.

Obviously, when it comes to cluster training, we need to merge all the model partitions together. Should that be a global function of the master, or something else?

Do you mean who will "merge" the different saved model shards? I think we should just put them into the save folder with the save prefix name, like: save.00000-of-00002, save.00001-of-00002.

Actually, TensorFlow does something similar, but with one more index file:

a.index
a.data-00000-of-00001
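The naming scheme above can be generated mechanically; a small illustrative helper (hypothetical, not part of the PR):

```cpp
#include <cstdio>
#include <string>

// Build a sharded save-file name like "save.00000-of-00002".
std::string ShardName(const std::string& prefix, int shard, int total) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%s.%05d-of-%05d",
                prefix.c_str(), shard, total);
  return std::string(buf);
}
```

Zero-padded fixed-width indices keep the shard files lexicographically sorted, so a plain directory listing enumerates them in order.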

@typhoonzero
Contributor

@helinwang

the "executor" should treat all OPs the same way. It should focus solely on executing OP, and should not look for the specific OP type. E.g., no following code:

Agreed that "session" is an abstract concept. So there should also be a "real" instance on each node when doing distributed training, and that must be the "executor". My point is that each node should have an "executor" process (same as the paddle v1 trainer), which executes all ops, and the trainer is responsible for saving variables and status. I think this will be much simpler than adding ops to the graph.

@helinwang
Contributor

helinwang commented Oct 12, 2017

@typhoonzero

the trainer is responsible to save variables and status

Thanks for the reply! By trainer, do you mean the Python process locally, or the executor running on the cluster? I think the model should be saved on the cloud; if we save the variables from Python, it's hard to upload them to the cloud. If we save from the executor, then we probably need to implement saving as an OP, since the executor only executes OPs.

@typhoonzero
Contributor

@helinwang

By trainer do you mean the Python process locally, or the executor running on the cluster?

The executor running on the cluster.

Well, I agree with saving status using an op, which fits the design better. By the way, are we going to implement sess.save()? It could also simply be converted into adding an op to the graph.

@helinwang
Contributor

helinwang commented Oct 12, 2017

@typhoonzero

are we going to implement sess.save()? It could also simply be converted into adding an op to the graph.

That's a good idea; we haven't decided on the "easy to use" Python API for saving yet (we definitely need one) :)

typhoonzero
typhoonzero previously approved these changes Oct 16, 2017
@typhoonzero left a comment:

LGTM++

Pull/rebase the develop branch before merging, please!

@JiayiFeng JiayiFeng dismissed Superjomn’s stale review October 24, 2017 21:11

It has been a huge PR. We can merge it now and refine it in the future.

9 participants