New api about checkpoint and models #10878
Conversation
python/paddle/fluid/io.py
Outdated
checkpoint_dir = os.getcwd()
raise ValueError("The values of 'checkpoint_dir' should not be None")

if trainer_args and not isinstance(trainer_args, dict):
What are the details of trainer_args? Should we use a class instead of a dict-typed parameter? A bare dict is confusing for users.
trainer_args is not for users, only for developers. Currently trainer_args contains just step_id and epoch_id; more arguments may need to be saved in the checkpoint later.
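For illustration, the dict being discussed would look roughly like this today (any keys beyond step_id and epoch_id are hypothetical):

# hypothetical trainer_args dict; only step_id and epoch_id exist today
trainer_args = {
    "epoch_id": 3,   # epoch the trainer has reached so far
    "step_id": 120,  # step inside the current epoch
}
# passed along to save_checkpoint(...) so a later run can resume from here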
Please use clear parameters; a bare dict will confuse other developers. Use explicit step_id and epoch_id parameters, or a class as the configuration parameter.
I wrote two functions to make it clearer.
python/paddle/fluid/io.py
Outdated
save_trainer_args(cur_dir, trainer_id, trainer_args)

if is_chief:
    save_persist_vars_without_grad(executor, cur_dir, main_program)
It looks like gradient variables are all non-persistent, so maybe the function name could be shortened to save_persistent_vars? BTW, persist is a verb; we need the adjective: persistent.
First, I found that some variables named "X@GRAD" are persistent. Second, a plain save_persistent_vars would not filter out RAW variables.
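In other words, the save path needs a predicate that keeps persistable variables while skipping gradient, RAW-typed, and trainer-scoped ones. A rough sketch of such a filter, with illustrative names rather than the exact code in this PR:

def _is_checkpoint_var(var):
    # illustrative predicate: only persistable variables are candidates
    if not var.persistable:
        return False
    # skip gradient variables even when they are marked persistable
    if "@GRAD" in var.name:
        return False
    # skip trainer-/block-scoped variables created by the transpiler
    if ".trainer_" in var.name or ".block" in var.name:
        return False
    # a RAW-type check would also belong here in the real filter
    return True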
python/paddle/fluid/io.py
Outdated
if "@GRAD" in var.name: | ||
return False | ||
|
||
if ".trainer_" in var.name: |
Can you add some comments to explain the meaning of the hard-coded .block and .trainer_ strings?
done
python/paddle/fluid/io.py
Outdated
_lru_delete(checkpoint_dir, max_num_checkpoints)


def load_checkpoint(executor, checkpoint_dir=None, main_program=None):
def need_load_checkpoint(checkpoint_dir):
I'm confused by this function: a name like need_load_checkpoint ("need to ...") suggests it should return a boolean, so that we could use it as:

if need_load_checkpoint(xxx):
    # resume from the checkpoint
else:
    # train from the beginning
I renamed need_load_checkpoint to get_latest_checkpoint_serial to make it more meaningful.
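A sketch of how the renamed helper might then be used from the trainer; the -1 "no checkpoint" return value and the executor/startup_program names are assumptions based on other snippets in this thread:

serial = get_latest_checkpoint_serial(checkpoint_dir)
if serial >= 0:
    # a checkpoint exists: restore variables and trainer args from it
    load_checkpoint(executor, checkpoint_dir, main_program)
else:
    # no checkpoint found: run startup and train from scratch
    executor.run(startup_program)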
python/paddle/fluid/io.py
Outdated
""" | ||
if checkpoint_dir is None: | ||
checkpoint_dir = os.getcwd() | ||
raise ValueError("The values of 'checkpoint_dir' should not be None") |
checkpoint_dir should not be None.
checkpoint_dir should be checked in Trainer.py; a proper directory will be given by Trainer.py.
python/paddle/fluid/io.py
Outdated
_lru_delete(checkpoint_dir, max_num_checkpoints)


def load_checkpoint(executor, checkpoint_dir=None, main_program=None):
def get_latest_checkpoint_serial(checkpoint_dir):
It seems we don't need get_latest_checkpoint_serial; it does the same thing as _get_latest_checkpoint_dir. Alternatively, we can rename _get_latest_checkpoint_dir to get_latest_checkpoint_serial.
python/paddle/fluid/io.py
Outdated
:param main_program
"""

if checkpoint_dir is None:
    checkpoint_dir = os.getcwd()
    raise ValueError(
        "The values of 'checkpoint_dir' or 'serial' should not be None")
"The values of 'checkpoint_dir' or 'serial' should not be None")
Seems here only check checkpoint_dir
?
python/paddle/fluid/io.py
Outdated
if serial < 0:
    return
if main_program is None:
    raise ValueError("The values of 'main_program'should not be None")
"The values of 'main_program'should not be None" is missing a space; the message should simply read: main_program should not be None.
python/paddle/fluid/io.py
Outdated
load_persist_vars_without_grad will load variables from a directory by an executor,
the variable named end with "@GRAD" will not be loaded.

:param executor
Please add detailed comments about the parameters.
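For example, a docstring along these lines would make the parameters self-explanatory (the wording and the dirname/program parameter names are only suggestions, not the final text):

def load_persist_vars_without_grad(executor, dirname, program):
    """
    Load persistable variables from dirname into the scope used by
    executor, skipping any variable whose name ends with "@GRAD".

    :param executor: the Executor that runs the load operators
    :param dirname: directory holding the saved variable files
    :param program: the Program whose persistable variables are restored
    """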
python/paddle/fluid/trainer.py
Outdated
@@ -193,14 +253,18 @@ def _dist_transpile_if_necessary(self, optimize_ops, params_grads):
current_endpoint = os.getenv("PADDLE_CURRENT_IP", "") + ":" + port
# the unique trainer id, starting from 0, needed by trainer
# only
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
self.chief = self.trainer_id == 0
Why hard-code that trainer_id equals 0 and that chief is derived from it?
We initialize trainer_id = 0 and chief = True by default. When PaddlePaddle runs locally there is only one trainer; its trainer_id is 0 and it is obviously the chief. When PaddlePaddle runs in distributed mode, we read PADDLE_TRAINER_ID from the environment, and only one trainer will be the chief.
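A minimal sketch of that default-then-override logic, with attribute names taken from the diff above (the surrounding class is heavily abbreviated):

import os

class Trainer(object):
    def __init__(self):
        # local run: a single trainer which is also the chief
        self.trainer_id = 0
        self.chief = True

    def _dist_transpile_if_necessary(self):
        # distributed run: the id comes from the environment; only the
        # trainer with id 0 acts as the chief
        self.trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
        self.chief = self.trainer_id == 0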
Please add a unit test for the checkpoint feature.
I have some comments above; please address them. Thanks.
python/paddle/fluid/io.py
Outdated
if not os.path.isdir(trainer_dir):
    os.makedirs(trainer_dir)

return trainer_dir


def _lru_delete(dirname, max_num_checkpoints=3):
It seems this function does not implement a real LRU algorithm; scroll_delete would be a better name.
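For reference, the behavior under discussion is essentially "keep only the newest N checkpoint directories"; a standalone sketch of that idea, assuming a checkpoint_<serial> directory layout:

import os
import shutil

def _scroll_delete(dirname, max_num_checkpoints=3):
    # collect the serial numbers of existing checkpoint_<serial> dirs
    serials = []
    for name in os.listdir(dirname):
        if name.startswith("checkpoint_"):
            try:
                serials.append(int(name.split("_")[-1]))
            except ValueError:
                continue
    # remove everything except the newest max_num_checkpoints serials
    for serial in sorted(serials)[:-max_num_checkpoints]:
        shutil.rmtree(os.path.join(dirname, "checkpoint_" + str(serial)))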
python/paddle/fluid/io.py
Outdated
int(serial)
except ValueError:
serial = _get_dir_serial(cur_dir)
if serial == -1:
Please merge the two conditional statements.
python/paddle/fluid/trainer.py
Outdated
@@ -348,6 +423,41 @@ def _get_or_create_parallel_executor(self):
        loss_name=self.train_func_outputs[0].name)
    return self._get_parallel_executor()

def _clean_checkpoint(self):
    if not self.checkpoint:
Use assert instead of returning directly; otherwise we can't tell whether this function succeeded.
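Something like the following is what this suggests; the attribute name mirrors the diff, and the message text is only an illustration:

def _clean_checkpoint(self):
    # fail loudly if checkpointing was never configured, instead of
    # silently returning and hiding a misconfiguration
    assert self.checkpoint is not None, "checkpoint config must be set"
    # ... proceed to remove the checkpoint directories ...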
python/paddle/fluid/trainer.py
Outdated
return trainer_args

def _save_checkpoint(self, epoch_id, step_id):
    if not self.checkpoint:
Same reason here: please use assert.
python/paddle/fluid/io.py
Outdated
@@ -473,79 +478,143 @@ def save_checkpoint(executor,

:param executor
:param checkpoint_dir
:param max_num_checkpoints
:param save_interval_secs
:param trainer_id
These parameters need more detailed comments.
class TestCheckpoint(unittest.TestCase):
    def setUp(self):
        self.dirname = "/tmp/ckpt"
Better to use tempfile instead of the hard-coded path.
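A minimal sketch of that change in the test, with a tearDown added so runs do not leak files:

import shutil
import tempfile
import unittest

class TestCheckpoint(unittest.TestCase):
    def setUp(self):
        # let the OS pick a unique temporary directory instead of /tmp/ckpt
        self.dirname = tempfile.mkdtemp()

    def tearDown(self):
        # clean up the directory after each test
        shutil.rmtree(self.dirname, ignore_errors=True)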
LGTM. This feature is part of the user-facing API; please @typhoonzero double check.
python/paddle/fluid/io.py
Outdated
:param executor executor for save the value
:param checkpoint_dir the checkpoint directory
:param trainer_id currect trainer id
:param is_chief if the trainer id equals 0, the is_chief will be true
If we have is_chief, why do we still need to pass trainer_id?
Each trainer needs to save its own trainer arguments. Only the chief needs to save the persistable variables.
I have deleted the code about chief.
def _get_serial_dir(serial, checkpoint_dir):
    serial_folder = CHECKPOINT_PREFIX + CHECKPOINT_SEPARATOR + str(serial)
    return os.path.join(checkpoint_dir, serial_folder)

def load_persist_vars_without_grad(executor,
Why is this needed?
I wrote load_persist_vars_without_grad just because of the filter: I need a new filter to select which variables to load.
The test doesn't seem to pass?
I will check it.
load_model/save_model API
Related: update fluid Train API param_path to checkpoint_config #10828
Related: Incremental Learning Support for Fluid with Distribution #10870