New api about checkpoint and models #10878
Conversation
python/paddle/fluid/io.py
Outdated
checkpoint_dir = os.getcwd()
raise ValueError("The values of 'checkpoint_dir' should not be None")

if trainer_args and not isinstance(trainer_args, dict):
What are the details of trainer_args? Should we use a class instead of a dict-typed parameter? A bare dict is confusing for users.
trainer_args is not for users, only for developers. Currently trainer_args contains just step_id and epoch_id; more arguments may need to be saved in the checkpoint later.
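For illustration, the dict being discussed would look roughly like this today (any keys beyond step_id and epoch_id are hypothetical):

# hypothetical trainer_args dict; only step_id and epoch_id exist today
trainer_args = {
    "epoch_id": 3,   # epoch the trainer has reached so far
    "step_id": 120,  # step inside the current epoch
}
# passed along to save_checkpoint(...) so a later run can resume from here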
Please use clear parameters; a bare dict will confuse other developers. Use explicit step_id and epoch_id parameters, or a class as the configuration parameter.
I wrote two functions to make it clearer.
python/paddle/fluid/io.py
Outdated
save_trainer_args(cur_dir, trainer_id, trainer_args)

if is_chief:
    save_persist_vars_without_grad(executor, cur_dir, main_program)
It looks like gradient variables are all non-persistent, so maybe the function name could be shortened to save_persistent_vars? BTW, persist is a verb; we need the adjective: persistent.
First, I found that some variables named "X@GRAD" are persistent. Second, a plain save_persistent_vars would not filter out RAW variables.
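In other words, the save path needs a predicate that keeps persistable variables while skipping gradient, RAW-typed, and trainer-scoped ones. A rough sketch of such a filter, with illustrative names rather than the exact code in this PR:

def _is_checkpoint_var(var):
    # illustrative predicate: only persistable variables are candidates
    if not var.persistable:
        return False
    # skip gradient variables even when they are marked persistable
    if "@GRAD" in var.name:
        return False
    # skip trainer-/block-scoped variables created by the transpiler
    if ".trainer_" in var.name or ".block" in var.name:
        return False
    # a RAW-type check would also belong here in the real filter
    return True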
python/paddle/fluid/io.py
Outdated
if "@GRAD" in var.name: | ||
return False | ||
|
||
if ".trainer_" in var.name: |
Can you add some comments to explain the meaning of the hard-coded .block and .trainer_ strings?
done
python/paddle/fluid/io.py
Outdated
_lru_delete(checkpoint_dir, max_num_checkpoints)


def load_checkpoint(executor, checkpoint_dir=None, main_program=None):
def need_load_checkpoint(checkpoint_dir):
I'm confused by this function: a name like need_load_checkpoint ("need to ...") suggests it should return a boolean, so that we could use it as:

if need_load_checkpoint(xxx):
    # resume from the checkpoint
else:
    # train from the beginning
I renamed need_load_checkpoint to get_latest_checkpoint_serial to make it more meaningful.
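A sketch of how the renamed helper might then be used from the trainer; the -1 "no checkpoint" return value and the executor/startup_program names are assumptions based on other snippets in this thread:

serial = get_latest_checkpoint_serial(checkpoint_dir)
if serial >= 0:
    # a checkpoint exists: restore variables and trainer args from it
    load_checkpoint(executor, checkpoint_dir, main_program)
else:
    # no checkpoint found: run startup and train from scratch
    executor.run(startup_program)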
python/paddle/fluid/io.py
Outdated
""" | ||
if checkpoint_dir is None: | ||
checkpoint_dir = os.getcwd() | ||
raise ValueError("The values of 'checkpoint_dir' should not be None") |
checkpoint_dir should not be None.
checkpoint_dir should be checked in Trainer.py; a proper directory will be given by Trainer.py.
python/paddle/fluid/io.py
Outdated
_lru_delete(checkpoint_dir, max_num_checkpoints)


def load_checkpoint(executor, checkpoint_dir=None, main_program=None):
def get_latest_checkpoint_serial(checkpoint_dir):
It seems we don't need get_latest_checkpoint_serial; it does the same thing as _get_latest_checkpoint_dir. Alternatively, we can rename _get_latest_checkpoint_dir to get_latest_checkpoint_serial.
python/paddle/fluid/io.py
Outdated
:param main_program
"""

if checkpoint_dir is None:
    checkpoint_dir = os.getcwd()
    raise ValueError(
        "The values of 'checkpoint_dir' or 'serial' should not be None")
"The values of 'checkpoint_dir' or 'serial' should not be None")
Seems here only check checkpoint_dir
?
python/paddle/fluid/io.py
Outdated
if serial < 0:
    return
if main_program is None:
    raise ValueError("The values of 'main_program'should not be None")
"The values of 'main_program'should not be None" is missing a space; the message should simply read: main_program should not be None.
python/paddle/fluid/io.py
Outdated
load_persist_vars_without_grad will load variables from a directory by an executor,
the variable named end with "@GRAD" will not be loaded.

:param executor
Please add detailed comments about the parameters.
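For example, a docstring along these lines would make the parameters self-explanatory (the wording and the dirname/program parameter names are only suggestions, not the final text):

def load_persist_vars_without_grad(executor, dirname, program):
    """
    Load persistable variables from dirname into the scope used by
    executor, skipping any variable whose name ends with "@GRAD".

    :param executor: the Executor that runs the load operators
    :param dirname: directory holding the saved variable files
    :param program: the Program whose persistable variables are restored
    """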
python/paddle/fluid/trainer.py
Outdated
@@ -193,14 +253,18 @@ def _dist_transpile_if_necessary(self, optimize_ops, params_grads):
current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port
# the unique trainer id, starting from 0, needed by trainer
# only
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.chief = self.trainer_id == 0
Why hard-code that trainer_id equals 0 and that chief is derived from it?
We initialize trainer_id = 0 and chief = True by default. When PaddlePaddle runs locally there is only one trainer; its trainer_id is 0 and it is obviously the chief. When PaddlePaddle runs in distributed mode, we read PADDLE_TRAINER_ID from the environment, and only one trainer will be the chief.
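A minimal sketch of that default-then-override logic, with attribute names taken from the diff above (the surrounding class is heavily abbreviated):

import os

class Trainer(object):
    def __init__(self):
        # local run: a single trainer which is also the chief
        self.trainer_id = 0
        self.chief = True

    def _dist_transpile_if_necessary(self):
        # distributed run: the id comes from the environment; only the
        # trainer with id 0 acts as the chief
        self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
        self.chief = self.trainer_id == 0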
Please add a unit test for the checkpoint feature.
I have some comments above; please address them. Thanks.
python/paddle/fluid/io.py
Outdated
if not os.path.isdir(trainer_dir):
    os.makedirs(trainer_dir)

return trainer_dir


def _lru_delete(dirname, max_num_checkpoints=3):
It seems this function does not implement a real LRU algorithm; scroll_delete would be a better name.
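For reference, the behavior under discussion is essentially "keep only the newest N checkpoint directories"; a standalone sketch of that idea, assuming a checkpoint_<serial> directory layout:

import os
import shutil

def _scroll_delete(dirname, max_num_checkpoints=3):
    # collect the serial numbers of existing checkpoint_<serial> dirs
    serials = []
    for name in os.listdir(dirname):
        if name.startswith("checkpoint_"):
            try:
                serials.append(int(name.split("_")[-1]))
            except ValueError:
                continue
    # remove everything except the newest max_num_checkpoints serials
    for serial in sorted(serials)[:-max_num_checkpoints]:
        shutil.rmtree(os.path.join(dirname, "checkpoint_" + str(serial)))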
python/paddle/fluid/io.py
Outdated
int(serial)
except ValueError:
serial = _get_dir_serial(cur_dir)
if serial == -1:
Please merge the two conditional statements.
python/paddle/fluid/trainer.py
Outdated
@@ -348,6 +423,41 @@ def _get_or_create_parallel_executor(self):
        loss_name=self.train_func_outputs[0].name)
    return self._get_parallel_executor()

def _clean_checkpoint(self):
    if not self.checkpoint:
Use assert instead of returning directly; otherwise we can't tell whether this function succeeded.
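Something like the following is what this suggests; the attribute name mirrors the diff, and the message text is only an illustration:

def _clean_checkpoint(self):
    # fail loudly if checkpointing was never configured, instead of
    # silently returning and hiding a misconfiguration
    assert self.checkpoint is not None, "checkpoint config must be set"
    # ... proceed to remove the checkpoint directories ...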
python/paddle/fluid/trainer.py
Outdated
return trainer_args

def _save_checkpoint(self, epoch_id, step_id):
    if not self.checkpoint:
Same reason here: please use assert.
python/paddle/fluid/io.py
Outdated
@@ -473,79 +478,143 @@ def save_checkpoint(executor,

:param executor
:param checkpoint_dir
:param max_num_checkpoints
:param save_interval_secs
:param trainer_id
These parameters need more detailed comments.
class TestCheckpoint(unittest.TestCase):
    def setUp(self):
        self.dirname = "/tmp/ckpt"
Better to use tempfile instead of the hard-coded path.
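A minimal sketch of that change in the test, with a tearDown added so runs do not leak files:

import shutil
import tempfile
import unittest

class TestCheckpoint(unittest.TestCase):
    def setUp(self):
        # let the OS pick a unique temporary directory instead of /tmp/ckpt
        self.dirname = tempfile.mkdtemp()

    def tearDown(self):
        # clean up the directory after each test
        shutil.rmtree(self.dirname, ignore_errors=True)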
LGTM. This feature is part of the user-facing API; please @typhoonzero double check.
python/paddle/fluid/io.py
Outdated
:param executor executor for save the value
:param checkpoint_dir the checkpoint directory
:param trainer_id currect trainer id
:param is_chief if the trainer id equals 0, the is_chief will be true
If we have is_chief, why do we still need to pass trainer_id?
Each trainer needs to save its own trainer arguments. Only the chief needs to save the persistable variables.
I have deleted the code about chief.
def _get_serial_dir(serial, checkpoint_dir):
    serial_folder = CHECKPOINT_PREFIX + CHECKPOINT_SEPARATOR + str(serial)
    return os.path.join(checkpoint_dir, serial_folder)

def load_persist_vars_without_grad(executor,
Why is this needed?
I wrote load_persist_vars_without_grad just because of the filter: I need a new filter to select which variables to load.
The test doesn't seem to pass?
I will check it.
load_model/save_model API
Related: update fluid Train API param_path to checkpoint_config #10828
Related: Incremental Learning Support for Fluid with Distribution #10870