"Serialize LoDTensor, Save/Restore model" #4602

Merged
merged 34 commits into from
Oct 24, 2017

Conversation

dzhwinter
Contributor

@dzhwinter dzhwinter commented Oct 5, 2017

This might be a better place for review.

@dzhwinter dzhwinter changed the title "add model format design doc" "[WIP]add model format design doc" Oct 5, 2017
@helinwang
Contributor

helinwang commented Oct 5, 2017

Option 1: Use Protobuf to serialize tensor.

Protobuf is very inefficient for large chunks of data. For example [1]:

[screenshot: an example protobuf message definition]

Its corresponding encoded binary is:

[screenshot: the encoded binary of that message]

[1] From Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

As you can see from the second image, each repeated field contains a field tag. This is very bad for both speed and storage, especially since we will need to serialize gradients and parameters between trainers and pservers.

Option 2: Write custom tensor serialization.

This is the option taken in this PR. The user needs to write the encoding and decoding code.

Option 3: Only use Protobuf to serialize the tensor metadata, serialize the tensor memory block directly, and pack them together.

This option depends on Protobuf, but offers forward and backward compatibility.
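A minimal sketch of the Option 3 layout (hypothetical helper names, not the PR's actual code): a length-prefixed placeholder standing in for the protobuf metadata, followed by the raw tensor memory with no per-element tags. For brevity the length fields are written in host byte order, which is assumed little-endian here.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Option 3 sketch: [desc_len][data_len][desc bytes][raw value bytes].
// `desc` stands in for the serialized protobuf metadata (LoDTensorDesc).
std::string PackTensor(const std::string& desc, const std::vector<float>& data) {
  std::string out;
  uint32_t desc_len = static_cast<uint32_t>(desc.size());
  uint32_t data_len = static_cast<uint32_t>(data.size() * sizeof(float));
  out.append(reinterpret_cast<const char*>(&desc_len), sizeof(desc_len));
  out.append(reinterpret_cast<const char*>(&data_len), sizeof(data_len));
  out.append(desc);
  // Dump the tensor memory block directly -- no field tag per element.
  out.append(reinterpret_cast<const char*>(data.data()), data_len);
  return out;
}

void UnpackTensor(const std::string& buf, std::string* desc,
                  std::vector<float>* data) {
  uint32_t desc_len = 0, data_len = 0;
  std::memcpy(&desc_len, buf.data(), sizeof(desc_len));
  std::memcpy(&data_len, buf.data() + sizeof(desc_len), sizeof(data_len));
  const char* body = buf.data() + sizeof(desc_len) + sizeof(data_len);
  desc->assign(body, desc_len);
  data->resize(data_len / sizeof(float));
  std::memcpy(data->data(), body + desc_len, data_len);
}
```

Only the two small length fields carry framing overhead; the value bytes are copied verbatim, which is why this option avoids protobuf's per-field tag cost.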


The parameters are saved as a binary file. The protobuf message has a [64M size limit](https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream#CodedInputStream.SetTotalBytesLimit.details), so we design a particular format for tensor serialization. For speed, we dump the tensor memory directly to disk and save the necessary metadata, such as the `dims` and `name` of the tensor, and even the `LoD` information in [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/paddle/framework/lod_tensor.md). In detail, the format is shown in the table below.

```text
[offset] [type]                               [value] [description]
0000     32 bit little-endian signed integer  ??      HeaderLength, the length of LoDTensorDesc
0004     32 bit little-endian signed integer  ??      ContentLength, the length of LoDTensor Buffer
0008     raw bytes                            ??      TensorDesc
...
00100    raw bytes                            ??      Tensor Value
```

Contributor Author:

In fact, I think protobuf may be a better choice than the current design. To break through the 64M limitation, we can divide a big parameter into smaller ones, so the only remaining concern is the packed parameter size.
According to @helinwang's comment, every repeated message carries a small tag, which is not a good fit for chunked data.
We will do some benchmark experiments and choose the better one.

@wangkuiyi
Collaborator

wangkuiyi commented Oct 5, 2017

Should we do Option 3 -- write a protobuf message VarDesc and/or LoDTensorDesc, followed by the tensor data?

@dzhwinter
Contributor Author

dzhwinter commented Oct 5, 2017

I think tensor serialization efficiency and size are trivial concerns when saving a model/checkpoint. Many models in real applications range in magnitude from 10M to 100M, which means a single tensor cannot violate the 64M size limit. For pack/unpack efficiency, protobuf has a `packed` option (see the protobuf documentation for details) that saves the repeated-message tag cost. I'm not sure it is efficient enough to cover our user scenarios.

The only pain point of tensor serialization efficiency is swapping tensors frequently between nodes, pservers, etc. There, every small improvement in pack/unpack gives us a lot of benefit in saving precious bandwidth.

Maybe we need some measurements to choose a good one.

@dzhwinter
Contributor Author

dzhwinter commented Oct 6, 2017

Here are some test results for the three options above. Time cost measures one round of tensor serialization and deserialization. We tried different tensor sizes, such as 10x10, 100x100, and 1000x1000, and averaged the time cost over 1000 runs; smaller is better.

| option  | time cost (s) |
| ------- | ------------- |
| option1 | 0.075905      |
| option2 | 0.00283767    |
| option3 | 0.00294829    |

Option 3 is the best trade-off between speed/efficiency and maintenance difficulty.

Here is the code we tested: benchmark of tensor serialization.
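For reference, a rough sketch of how one round of raw-memory serialization could be timed; this is an illustrative stand-in under assumed names, not the linked benchmark code, and it only exercises the memcpy core of options 2 and 3:

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Time `rounds` serialize/deserialize round trips of an `elems`-element
// float tensor using raw memcpy, and return the average seconds per round.
// `restored` receives the last round-trip result so correctness can be checked.
double AverageRoundTripSeconds(size_t elems, int rounds,
                               std::vector<float>* restored) {
  std::vector<float> tensor(elems, 1.0f);
  std::vector<char> buf(elems * sizeof(float));
  restored->assign(elems, 0.0f);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < rounds; ++i) {
    std::memcpy(buf.data(), tensor.data(), buf.size());     // serialize
    std::memcpy(restored->data(), buf.data(), buf.size());  // deserialize
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count() / rounds;
}
```

The protobuf variant (option 1) would replace the two memcpy calls with message packing and parsing, which is where the measured gap comes from.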

@dzhwinter dzhwinter changed the title "[WIP]add model format design doc" "[WIP]Serialize LoDTensor, Save/Restore model, Checkpoint" Oct 9, 2017
@dzhwinter dzhwinter changed the title "[WIP]Serialize LoDTensor, Save/Restore model, Checkpoint" "[WIP]Serialize LoDTensor, Save/Restore model" Oct 9, 2017
@dzhwinter dzhwinter changed the title "[WIP]Serialize LoDTensor, Save/Restore model" "Serialize LoDTensor, Save/Restore model" Oct 9, 2017
@dzhwinter
Contributor Author

dzhwinter commented Oct 9, 2017

When I implemented the checkpoint feature, I found some problems we need to solve.

  1. We need to support running operators asynchronously. Some time-consuming operators will block the training process for a long time. For example, the checkpoint operator runs every n steps, and we do not need to wait for it to finish. The same applies to SendOp, especially when a trainer sends to a parameter server.

  2. To save the model or a checkpoint, the ProgramDesc also needs to be saved. But currently an operator cannot touch the ProgramDesc; only the executor can access it.

  3. Should we save the topology of the ProgramDesc before pruning or after pruning? Which module should take care of it?

  4. Obviously, when it comes to cluster training, we need to merge all the model partitions together. Should that be a global function of the master, or something else?


The model is the output of the training process. A complete model consists of two parts: the **topology** and the **parameters**. To support business deployment, the model format must be self-contained and must not expose any training source code.

As a result, in PaddlePaddle the **topology** is represented as a [ProgramDesc](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/doc/design/program.md), which describes the model structure. The **parameters** contain all the trainable weights in the model; we must support large parameters and high efficiency read/write for speed.
Contributor:
"high efficiency read/write for speed." -> "efficient serialization/deserialization".

Contributor Author:
Done.


## Implementation

The topology is saved as plain text; in detail, a self-complete protobuf file.
Contributor:
What does "self-complete" mean in this context?

Contributor Author:
"self-contain" Done.

[offset] [type] [value] [description]
0000 32 bit integer ?? HeaderLength, the length of LoDTensorDesc
0004 32 bit integer ?? ContentLength, the length of LodTensor Buffer
0008 32 bit integer ?? TensorDesc
Contributor:
TensorDesc is not "32 bit integer", it's just a sequence of bytes.

Contributor Author:
Done

0008 32 bit integer ?? TensorDesc
0012 32 bit integer ?? TensorDesc
...
00100 32 bit integer ?? Tensor Value
Contributor:
Tensor Value is not "32 bit integer", it's just a sequence of bytes.

Contributor Author:
Done.

0008 32 bit integer ?? TensorDesc
0012 32 bit integer ?? TensorDesc
...
00100 32 bit integer ?? Tensor Value
helinwang commented Oct 10, 2017:
Since we are just dumping the memory (e.g., a float32 array) into "Tensor Value", we need to specify the endianness of the elements in Tensor Value.

Contributor Author:
Done.


```text
[offset] [type] [value] [description]
0000 32 bit integer ?? HeaderLength, the length of LoDTensorDesc
```

helinwang commented Oct 10, 2017:
32 bit integer -> 32 bit little-endian signed integer

Contributor Author:
Done.

```text
[offset] [type] [value] [description]
0000 32 bit integer ?? HeaderLength, the length of LoDTensorDesc
0004 32 bit integer ?? ContentLength, the length of LodTensor Buffer
```

helinwang commented Oct 10, 2017:
32 bit integer -> 32 bit little-endian signed integer

Contributor Author:
Done
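The little-endian fix discussed above can be made explicit in code. A small sketch with hypothetical helper names (not the PR's implementation) that writes and reads a 32-bit signed integer in little-endian order regardless of host byte order:

```cpp
#include <cstdint>
#include <string>

// Append a 32-bit signed integer in little-endian byte order,
// independent of the host's native endianness.
void AppendInt32LE(std::string* out, int32_t v) {
  uint32_t u = static_cast<uint32_t>(v);
  out->push_back(static_cast<char>(u & 0xFF));
  out->push_back(static_cast<char>((u >> 8) & 0xFF));
  out->push_back(static_cast<char>((u >> 16) & 0xFF));
  out->push_back(static_cast<char>((u >> 24) & 0xFF));
}

// Read a 32-bit little-endian signed integer at byte offset `off`.
int32_t ReadInt32LE(const std::string& s, size_t off) {
  uint32_t b0 = static_cast<uint8_t>(s[off]);
  uint32_t b1 = static_cast<uint8_t>(s[off + 1]);
  uint32_t b2 = static_cast<uint8_t>(s[off + 2]);
  uint32_t b3 = static_cast<uint8_t>(s[off + 3]);
  return static_cast<int32_t>(b0 | (b1 << 8) | (b2 << 16) | (b3 << 24));
}
```

Shifting and masking bytes rather than memcpy-ing the integer keeps the on-disk format stable even on a big-endian host.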

PADDLE_ENFORCE(ctx->HasInput("Step"),
"Input(Step) of Checkpoint should not be null.");
std::string absolutePath = ctx->Attrs().Get<std::string>("absolutePath");
PADDLE_ENFORCE(absolutePath != "",
Contributor:
!absolutePath.empty()

Maybe add some regex checking here too, in case the path is set to something like " ".

Contributor Author:
Good point! Fixed the empty check.
But I think we should leave the regex check to the Python side, because:

  1. we need to check that absolutePath is a valid path, which is related to the client and should be done in the client language.
  2. In our implementation, we assume the user's input is correct; we only check whether a required input has been filled.
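The whitespace-only case mentioned in the review could even be caught without a regex; a hypothetical sketch (not the PR's code):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True if the path is empty or contains only whitespace (e.g. " "),
// which a plain !absolutePath.empty() check would not catch.
bool IsBlankPath(const std::string& path) {
  return std::all_of(path.begin(), path.end(),
                     [](unsigned char c) { return std::isspace(c) != 0; });
}
```

Note that std::all_of on an empty range returns true, so the empty string is classified as blank without a separate check.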

// 2. The checkpoint op needs at least two threads,
// because checkpointing happens periodically, so a thread must wait
// for the timer/steps to reach the condition.
auto* Step = ctx.Input<Tensor>("Step");

// because checkpointing happens periodically, so a thread must wait
// for the timer/steps to reach the condition.
auto* Step = ctx.Input<Tensor>("Step");
const int* curr_step = Step->data<int>();
Contributor:

curr_step is not changed in this function, so the pointer is not needed;
just `const int curr_step` is OK.

Contributor Author:
Done.

}

// flag indicate this op may be skipped.
mutable bool run_once = false;
Contributor:

run_once should be a private member named run_once_.
External operations shouldn't be able to change it.

Contributor Author:

Done.


protected:
void InferShape(framework::InferShapeContextBase* ctx) const override {
PADDLE_ENFORCE(ctx->HasOutputs("Out"),
Contributor:

PADDLE_ENFORCE_NOT_NULL

Contributor Author:

Just keeping the same style as the other op implementations.

@@ -103,5 +111,139 @@ void LoDTensor::ShrinkInLevel(size_t level, size_t elem_begin,
lod_ = new_lod;
}

std::string LoDTensor::SerializeToString() const {
Contributor:

Make SerializeToString an external function, such as SerializeToString(LoDTensor).

There may be more such serialization functions, such as SerializeToString(OperatorBase); do not change the definition of the original class.

It is better not to insert methods that have no relation to computation into class LoDTensor.

LoDTensor serves as a concept for computation; keep it clean.

Also, this function is too long; break it up and keep the code clean.

If so much code is added that has no relation to the definition or operation of the LoD or Tensor concepts, it is better to place it inside namespace detail or in another source file.

Contributor Author:

After a face-to-face talk with @Superjom, my opinion on this question is as follows.

  1. Currently, we only need to serialize the in-memory content into a byte stream, namely SerializeToString(LoDTensor) and SerializeToString(Tensor). OperatorBase and the other concepts all have their Desc in protobuf; we do not need a serialization implementation for any other class.

  2. Implementing DeserializeFromString returns a Tensor filled with values; if we don't bind the serialize interface to the Tensor instance, we need another copy of the Tensor.

  3. The offset_ and type in Tensor are hidden. We need to figure them out.

Thanks for this comment!

Contributor:

With DeserializeFromString(LoDTensor*) there is no need to copy a Tensor; filling the data in place seems possible.

}

void LoDTensor::DeserializeFromString(const std::string& s,
const platform::Place& dst_place) {
Contributor:

so is this function.

Contributor Author:

see above.
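The free-function shape discussed in this thread could look like the following sketch, using a minimal stand-in struct rather than the real LoDTensor (all names here are assumptions for illustration):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Minimal stand-in for the real tensor type.
struct SimpleTensor {
  std::vector<float> data;
};

// Free function instead of a member: keeps the tensor class focused on
// computation, as suggested in the review.
std::string SerializeToString(const SimpleTensor& t) {
  return std::string(reinterpret_cast<const char*>(t.data.data()),
                     t.data.size() * sizeof(float));
}

// Fills an existing tensor in place, so no extra Tensor copy is returned.
void DeserializeFromString(const std::string& s, SimpleTensor* t) {
  t->data.resize(s.size() / sizeof(float));
  std::memcpy(t->data.data(), s.data(), s.size());
}
```

The in-place output parameter mirrors the reviewer's point that DeserializeFromString(LoDTensor*) avoids the extra copy that returning a new tensor would require.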

@helinwang
Contributor

helinwang commented Oct 10, 2017

We need to support running operators asynchronously. Some time-consuming operators will block the training process for a long time. For example, the checkpoint operator runs every n steps, and we do not need to wait for it to finish. The same applies to SendOp, especially when a trainer sends to a parameter server.

Agreed, I think it is very important. I think the executor should:

  1. be able to tell whether an OP has completed, and only schedule an OP when all of its dependencies have completed (e.g., a recv OP will only be scheduled when the send OP has completed), and
  2. have a thread pool so that one blocking (e.g., IO-intensive) OP does not block everything else.

CC: @QiJune @tonyyang-svail @wangkuiyi

To save the model or a checkpoint, the ProgramDesc also needs to be saved. But currently an operator cannot touch the ProgramDesc; only the executor can access it.

Maybe we need to put the ProgramDesc into the global scope.

Should we save the topology of the ProgramDesc before pruning or after pruning? Which module should take care of it?

Maybe we need to save the input of Prune (the ProgramDesc) into the global scope.

Obviously, when it comes to cluster training, we need to merge all the model partitions together. Should that be a global function of the master, or something else?

Do you mean who will "merge" the different saved model shards? I think we should just put them into the save folder with the save prefix name, like: save.00000-of-00002, save.00001-of-00002.

Actually, TensorFlow does something similar, but with one more index file:

a.index
a.data-00000-of-00001
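The naming scheme above can be generated mechanically; a small illustrative helper (hypothetical, not part of the PR):

```cpp
#include <cstdio>
#include <string>

// Build a sharded save-file name like "save.00000-of-00002".
std::string ShardName(const std::string& prefix, int shard, int total) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%s.%05d-of-%05d",
                prefix.c_str(), shard, total);
  return std::string(buf);
}
```

Zero-padded fixed-width indices keep the shard files lexicographically sorted, so a plain directory listing enumerates them in order.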

@typhoonzero
Contributor

@helinwang

the "executor" should treat all OPs the same way. It should focus solely on executing OP, and should not look for the specific OP type. E.g., no following code:

Agreed that "session" is an abstract concept. So there should also be a "real" instance on each node when doing distributed training, and that must be the "executor". My point is that each node should have an "executor" process (same as the paddle v1 trainer), which executes all ops, and the trainer is responsible for saving variables and status. I think this will be much simpler than adding ops to the graph.

@helinwang
Contributor

helinwang commented Oct 12, 2017

@typhoonzero

the trainer is responsible to save variables and status

Thanks for the reply! By trainer, do you mean the Python process locally, or the executor running on the cluster? I think the model should be saved on the cloud; if we save the variables from Python, it's hard to upload them to the cloud. If we save from the executor, then we probably need to implement saving as an OP, since the executor only executes OPs.

@typhoonzero
Contributor

@helinwang

By trainer do you mean the Python process locally, or the executor running on the cluster?

The executor running on the cluster.

Well, I agree with saving status using an op, which fits the design better. By the way, are we going to implement sess.save()? It could also simply be converted into adding an op to the graph.

@helinwang
Contributor

helinwang commented Oct 12, 2017

@typhoonzero

are we going to implement sess.save()? It could also simply be converted into adding an op to the graph.

That's a good idea; we haven't decided on the "easy to use" Python API for saving yet (we definitely need one) :)

typhoonzero
typhoonzero previously approved these changes Oct 16, 2017
@typhoonzero left a comment:

LGTM++

Pull/rebase the develop branch before merging, please!

@JiayiFeng JiayiFeng dismissed Superjomn’s stale review October 24, 2017 21:11

It has been a huge PR. We can merge it now and refine it in the future.

9 participants