"Serialize LoDTensor, Save/Restore model" #4602
"Serialize LoDTensor, Save/Restore model" #4602
Conversation
doc/design/model_format.md (outdated)

The parameters are saved as a binary file. As we all know, a protobuf message has a [64M size limit](https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream#CodedInputStream.SetTotalBytesLimit.details). So we design a particular format for tensor serialization: for speed, we dump the memory to disk and save the necessary metadata, such as the `dims` and `name` of the tensor, and even the `LoD` information in [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/paddle/framework/lod_tensor.md). In detail, as the table below shows:
In fact, I think protobuf may be a better choice than the current design. To break through the 64M limitation, we can divide big parameters into smaller ones, so the only remaining concern is the packed parameter size.
According to @helinwang's comment, every repeated message carries a small tag, which does not fit chunked data.
We will do some benchmark experiments and choose the better one.
Should we go with Option 3, i.e., write a protobuf message?
I think tensor serialization efficiency and size are trivial concerns when saving a model/checkpoint. Many real-world models range from 10M to 100M, which means a single tensor cannot violate the 64M size limit. For pack/unpack efficiency, protobuf has a field option called `packed` (see the protobuf docs for details) that saves the per-element tag cost of repeated fields; I'm not sure it is efficient enough to cover our user scenarios. The only real pain point of tensor serialization efficiency is swapping tensors frequently between nodes, pservers, etc.; every small improvement in pack/unpack gives us a lot of benefit in saving precious bandwidth. Maybe we need some measurement numbers to choose a good one.
Here are some results from testing the three options above. Time cost measures one round of tensor serialization and deserialization. We tried different tensor sizes, such as 10x10, 100x100, and 1000x1000, and averaged the time cost over 1000 runs. In the table, a smaller time cost is better.
Option 3 is the best trade-off between speed/efficiency and maintenance difficulty. Here is the code we tested.
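For context, a round-trip timing harness along these lines could look like the sketch below. It assumes the `SerializeToString`/`DeserializeFromString` interface from this PR and is not the actual benchmark code behind the numbers above.

```cpp
#include <chrono>
#include <string>

#include "paddle/framework/lod_tensor.h"

// Average time (microseconds) for one serialize + deserialize round.
double BenchmarkRoundTrip(const paddle::framework::LoDTensor& src,
                          const paddle::platform::Place& place,
                          int rounds = 1000) {
  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < rounds; ++i) {
    std::string bytes = src.SerializeToString();  // pack
    paddle::framework::LoDTensor dst;
    dst.DeserializeFromString(bytes, place);      // unpack
  }
  auto end = std::chrono::high_resolution_clock::now();
  auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  return static_cast<double>(us.count()) / rounds;
}
```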
While completing the checkpoint feature, I found some problems we need to solve.
doc/design/model_format.md (outdated)

The model is the output of the training process. One complete model consists of two parts, namely, the **topology** and the **parameters**. To support business deployment, the model format must be self-completed and must not expose any training source code.

As a result, in PaddlePaddle, the **topology** is represented as a [ProgramDesc](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/doc/design/program.md), which describes the model structure. The **parameters** contain all the trainable weights in the model; we must support large parameters, and high efficiency read/write for speed.
"high efficiency read/write for speed." -> "efficient serialization/deserialization".
Done.
doc/design/model_format.md (outdated)

## Implementation

The topology is saved as plain text, in detail, a self-complete protobuf file.
What does "self-complete" mean in this context?
"self-contain" Done.
doc/design/model_format.md (outdated)

[offset] [type]          [value] [description]
0000     32 bit integer  ??      HeaderLength, the length of LoDTensorDesc
0004     32 bit integer  ??      ContentLength, the length of LoDTensor Buffer
0008     32 bit integer  ??      TensorDesc
TensorDesc is not "32 bit integer", it's just a sequence of bytes.
Done
doc/design/model_format.md (outdated)

0008     32 bit integer  ??      TensorDesc
0012     32 bit integer  ??      TensorDesc
...
00100    32 bit integer  ??      Tensor Value
Tensor Value is not "32 bit integer", it's just a sequence of bytes.
Done.
doc/design/model_format.md (outdated)

0008     32 bit integer  ??      TensorDesc
0012     32 bit integer  ??      TensorDesc
...
00100    32 bit integer  ??      Tensor Value
Since we are just dumping the memory (e.g., a float32 array) into "Tensor Value", we need to specify the endianness of the elements in Tensor Value.
Done.
doc/design/model_format.md (outdated)

[offset] [type]          [value] [description]
0000     32 bit integer  ??      HeaderLength, the length of LoDTensorDesc
32 bit integer -> 32 bit little-endian signed integer
Done.
doc/design/model_format.md (outdated)

[offset] [type]          [value] [description]
0000     32 bit integer  ??      HeaderLength, the length of LoDTensorDesc
0004     32 bit integer  ??      ContentLength, the length of LoDTensor Buffer
32 bit integer -> 32 bit little-endian signed integer
Done
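Putting the comments in this thread together (lengths as 32-bit little-endian signed integers up front, desc and value as raw byte sequences), a minimal packing sketch might look like this; the helper names are hypothetical, not the merged code:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: append a 32-bit value in little-endian byte order,
// independent of the host machine's endianness.
void AppendInt32LE(std::string* out, int32_t v) {
  char bytes[4];
  for (int i = 0; i < 4; ++i) {
    bytes[i] = static_cast<char>((v >> (8 * i)) & 0xFF);
  }
  out->append(bytes, 4);
}

// Hypothetical helper: pack the serialized LoDTensorDesc and the raw tensor
// buffer into the layout described in the table above.
std::string PackLoDTensor(const std::string& desc_bytes, const char* data,
                          int32_t data_len) {
  std::string out;
  AppendInt32LE(&out, static_cast<int32_t>(desc_bytes.size()));  // HeaderLength
  AppendInt32LE(&out, data_len);                                 // ContentLength
  out.append(desc_bytes);                              // TensorDesc (bytes)
  out.append(data, data_len);                          // Tensor Value (bytes)
  return out;
}
```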
paddle/operators/checkpoint_op.cc (outdated)

PADDLE_ENFORCE(ctx->HasInput("Step"),
               "Input(Step) of Checkpoint should not be null.");
std::string absolutePath = ctx->Attrs().Get<std::string>("absolutePath");
PADDLE_ENFORCE(absolutePath != "",
!absolutePath.empty()
Maybe some more regex checking is needed here as well, in case the path is set to something like " ".
Good point! Fixed the empty check.
But I think we should leave the regex check to the Python side, because:
- we need to check that absolutePath is a valid path, which is client-specific and should be done in the client language;
- in our implementation, we assume the user's input is correct and only check that required inputs are filled in.
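A whitespace-aware version of that check could look like the sketch below; the `IsBlank` helper is hypothetical, and the PR itself only fixed the empty check:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: true if the path is empty or contains only whitespace.
bool IsBlank(const std::string& s) {
  return std::all_of(s.begin(), s.end(),
                     [](unsigned char c) { return std::isspace(c) != 0; });
}

// Illustrative use in InferShape:
// PADDLE_ENFORCE(!IsBlank(absolutePath),
//                "Attr(absolutePath) of Checkpoint should be a non-blank path.");
```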
paddle/operators/checkpoint_op.cc (outdated)

// 2. The checkpoint op needs at least two threads: because checkpointing
// happens periodically, one thread must wait for the timer/steps to
// reach the trigger condition.
auto* Step = ctx.Input<Tensor>("Step");
variable name should be lowercase
https://google.github.io/styleguide/cppguide.html#Variable_Names
Done.
paddle/operators/checkpoint_op.cc
Outdated
// Because checkpoint will happen every, so need a thread wait | ||
// the timer/steps to reach the condition. | ||
auto* Step = ctx.Input<Tensor>("Step"); | ||
const int* curr_step = Step->data<int>(); |
curr_step is not changed in this function, so the pointer is not needed; just `const int curr_step` is fine.
Done.
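Combined with the lowercase-name comment above, the suggested form would be roughly (a sketch of the suggestion, not the merged diff):

```cpp
auto* step = ctx.Input<Tensor>("Step");    // lowercase name per the style guide
const int curr_step = *step->data<int>();  // read the value; never mutated here
```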
paddle/operators/checkpoint_op.cc
Outdated
} | ||
|
||
// flag indicate this op may be skipped. | ||
mutable bool run_once = false; |
run_once should be a private member named run_once_; external operations shouldn't be able to change it.
Done.
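In miniature, the suggestion looks like this; the class name and base class are assumed for illustration:

```cpp
class CheckpointOp : public framework::OperatorBase {  // base class assumed
 public:
  // ... constructor and Run() unchanged ...

 private:
  // Flag indicating this op may be skipped; private so external code cannot
  // change it, named with a trailing underscore per the Google C++ style guide.
  mutable bool run_once_ = false;
};
```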
paddle/operators/save_restore_op.cc (outdated)

protected:
 void InferShape(framework::InferShapeContextBase* ctx) const override {
   PADDLE_ENFORCE(ctx->HasOutputs("Out"),
PADDLE_ENFORCE_NOT_NULL
Just keeping the same style as the other op implementations.
@@ -103,5 +111,139 @@ void LoDTensor::ShrinkInLevel(size_t level, size_t elem_begin,
  lod_ = new_lod;
}

std::string LoDTensor::SerializeToString() const {
Make `SerializeToString` an external function, such as `SerializeToString(LoDTensor)`. There may be more such serialization functions, such as `SerializeToString(OperatorBase)`; do not change the definition of the original class. It is better not to insert methods that have no relation to computation into class `LoDTensor`. `LoDTensor` serves as a concept for computation; keep it clean.
Also, this function is too long; break it up and keep the code clean. If so much code that has no relation to the definition or operation of the `LoD` or `Tensor` concepts is added, placing it inside namespace `detail` or in another source file is better.
After a face-to-face talk with @Superjom, my opinion on this question is as follows.

- Currently, we only need to serialize the in-memory content into a byte stream, namely `SerializeToString(LoDTensor)` and `SerializeToString(Tensor)`. OperatorBase and the other concepts all have their `Desc` in protobuf, so we do not need serialization implementations for any other class.
- Implementing `DeserializeFromString` to return a Tensor filled with values means that, if we don't bind the serialize interface to the Tensor instance, we need another copy of the Tensor.
- The `offset_` and `type` in Tensor are hidden; we need to figure them out.

Thanks for this comment!
`DeserializeFromString(LoDTensor*)` has no need to copy a Tensor; filling the data in-place seems possible.
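As declarations, the proposal might look like the sketch below; these are hypothetical signatures in a separate serialization unit, not what this PR merged:

```cpp
#include <string>

#include "paddle/framework/lod_tensor.h"

namespace paddle {
namespace framework {
namespace detail {

// Pack the in-memory tensor content into a byte stream.
std::string SerializeToString(const LoDTensor& tensor);

// Fill an existing tensor in-place from a byte stream, avoiding an extra
// copy of the Tensor.
void DeserializeFromString(const std::string& bytes,
                           const platform::Place& dst_place,
                           LoDTensor* tensor);

}  // namespace detail
}  // namespace framework
}  // namespace paddle
```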
}

void LoDTensor::DeserializeFromString(const std::string& s,
                                      const platform::Place& dst_place) {
so is this function.
see above.
Agree, I think it is very important. I think the executor should:
CC: @QiJune @tonyyang-svail @wangkuiyi
Do you mean who will "merge" the different saved model shards? I think we should just put them into the save folder, with the same prefix name. TensorFlow actually does something similar, but with one more index file.
Agree that "session" is an abstract concept. So there will also be a "real" instance on each node when doing distributed training, and that must be the "executor". My point is that each node should have an "executor" process (same as the Paddle v1 trainer), which executes all ops, and that the trainer is responsible for saving variables and status. I think this will be much simpler than adding ops to the graph during implementation.
Thanks for the reply! By trainer, do you mean the local Python process or the executor running on the cluster? I think the model should be saved on the cloud; if we save the variables from Python, it's hard to upload them to the cloud. If we save from the executor, then we probably need to implement it as an op, since the executor only executes ops.
The executor running on the cluster. Well, I agree with saving status using an op, which fits the design better. By the way, are we going to implement an easy-to-use Python API for saving?
That's a good idea; we haven't decided on the "easy to use" (we definitely need one) Python API for saving yet :)
LGTM++
Pull/rebase the develop branch before merging, please!
It has been a huge PR. We can merge it now and refine it in the future.
Splitting it into smaller PRs might be better for review.