
[Feature] Refactor FedRunner, optimize trainer module and optimize CI #415

Merged: 22 commits into alibaba:master on Nov 16, 2022

Conversation


@rayrayraykk (Collaborator) commented Oct 28, 2022

This PR is based on #408 and adds CI and documentation on top of it.

What's changed

  1. Refactor the runner into StandaloneRunner + DistributedRunner, and use get_runner to replace FedRunner (see the sketch after this list).
  2. Add a README for the Trainer module.
  3. Make num_val_batch, num_total_train_batch, ... (the values derived from cfg) @property methods. Users can still use setattr or = to change these values, but doing so bypasses the @property computation:
@property
def num_total_train_batch(self):
    if self.get('num_total_train_batch'):
        return self.get('num_total_train_batch')
    return self._calculate_batch_epoch_num(mode='train')[3]
  4. Optimize the docstrings for hook functions and the API reference.
  5. Bind each metric to one and only one key, THE_LARGER_THE_BETTER.
  6. Move functions from utils to xxx.utils.
  7. Optimize CI, as you can see below.
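
A minimal usage sketch of the new entry point (my own illustration: the import paths, builder helpers, and the example config path are assumptions and may differ from the merged code):

from federatedscope.core.auxiliaries.data_builder import get_data
from federatedscope.core.auxiliaries.runner_builder import get_runner
from federatedscope.core.auxiliaries.worker_builder import get_client_cls, get_server_cls
from federatedscope.core.configs.config import global_cfg

cfg = global_cfg.clone()
cfg.merge_from_file('path/to/your_config.yaml')  # hypothetical config file

data, cfg = get_data(config=cfg.clone())
# get_runner builds a StandaloneRunner or a DistributedRunner according to
# cfg.federate.mode, replacing the old FedRunner.
runner = get_runner(data=data,
                    server_class=get_server_cls(cfg),
                    client_class=get_client_cls(cfg),
                    config=cfg.clone())
runner.run()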

@rayrayraykk added the documentation and enhancement labels on Oct 28, 2022
@joneswong self-assigned this on Oct 30, 2022
@joneswong (Collaborator) left a comment

Good job! As the changes are huge, is it possible to supplement UT cases in this PR?

@xieyxclack (Collaborator) left a comment

LGTM, and thanks a lot for such great work on the refactoring and documentation! @rayrayraykk

@@ -0,0 +1,42 @@
name: UnitTests for Distributed Mode
Collaborator

I really appreciate the work on providing unit tests for distributed mode.

@@ -8,7 +8,7 @@ conda install -y nltk
# Speech and NLP
conda install -y sentencepiece textgrid typeguard -c conda-forge
conda install -y transformers==4.16.2 tokenizers==0.10.3 datasets -c huggingface -c conda-forge
conda install -y torchtext -c pytorch
conda install -y torchtext==0.9.0 -c pytorch
Collaborator

Why torchtext==0.9.0 here but torchtext==0.11.1 in README.md?

@@ -84,21 +83,15 @@ def evaluate(self, target_data_split_name='test'):

def update(self, model_parameters, strict=False):
self.model.load_state_dict(model_parameters, strict)
return self.get_model_para()
Collaborator

Why does update need a return value here? And I find that in GeneralTorchTrainer, update does not have a return value.

The dataset object and the updated configuration.

Note:
The available ``data.type`` is shown below:
Collaborator

It should be noted that it will be the developers' duty to update these notes in get_xxx when adding new items (such as data, model, ...).

Collaborator Author (@rayrayraykk), Nov 4, 2022

Yes, we can modify the developer guidance later

@@ -137,7 +155,7 @@ def get_model(model_config, local_data=None, backend='torch'):
from federatedscope.tabular.model import QuadraticModel
model = QuadraticModel(input_shape[-1], 1)

elif model_config.type.lower() in ['convnet2', 'convnet5', 'vgg11', 'lr']:
Collaborator

A bug here before? We call get_cnn when the model type is lr?

Collaborator Author (@rayrayraykk)

Yes, there was a bug, but it won't affect anything

@@ -3,6 +3,12 @@


class IIDSplitter(BaseSplitter):
"""
This splitter split dataset randomly .
Collaborator

split -> splits.
And I think the description is not correct here; maybe change it to "This splitter splits the dataset following the independent and identical distribution"?
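
(For context, a hedged usage sketch of IIDSplitter; the import path, constructor argument, and call convention are assumptions on my part and may differ from the actual code.)

import torch
from torch.utils.data import TensorDataset

from federatedscope.core.splitters.generic import IIDSplitter

# A toy dataset of 100 samples, split IID across 5 clients.
dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))
splitter = IIDSplitter(client_num=5)
client_subsets = splitter(dataset)           # one shard per client
print([len(sub) for sub in client_subsets])  # expected: [20, 20, 20, 20, 20]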

@@ -0,0 +1,461 @@
# Local Learning Abstraction: Trainer
Collaborator

The provided docs for the trainer are good and useful to me, thanks a lot.
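
(A hedged sketch of the hook-based customization the Trainer README describes; the trigger name, registration method, and ctx fields are assumptions and may differ from the actual docs/code. `trainer` is assumed to be an existing Trainer instance.)

def hook_on_batch_end_print_loss(ctx):
    # Hooks receive the shared Context (ctx) and can read or write the
    # variables it records, e.g. the loss of the current batch.
    print(f"batch loss: {float(ctx.loss_batch):.4f}")

trainer.register_hook_in_train(new_hook=hook_on_batch_end_print_loss,
                               trigger='on_batch_end',
                               insert_pos=-1)  # append after the existing hooks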

@@ -46,7 +45,9 @@ def clear(self, lifecycle):


class Context(LifecycleDict):
"""Record and pass variables among different hook functions
"""
Collaborator

I just need more time to understand how context works here, and please @DavdGao check this file, thx!

Collaborator

Should we support any combination of different modes (train/finetune/eval) and different splits (datasets)? Or customized split names, e.g. "train1"?

``ask_for_join_in_info`` ``callback_funcs_for_join_in_info()``
``address`` ``callback_funcs_for_address()``
``model_para`` ``callback_funcs_for_model_para()``
``ss_model_para`` ``callback_funcs_for_model_para()``
Collaborator

Maybe we can move ss_model_para out of the base_client later
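
(For readers skimming this thread, a hypothetical illustration of the message-type-to-callback dispatch that the table above documents; this is not the actual client registration API.)

from types import SimpleNamespace

def callback_funcs_for_join_in_info(msg):
    print('handling join-in info:', msg.content)

def callback_funcs_for_model_para(msg):
    print('handling model parameters:', msg.content)

# Each message type is bound to exactly one handler; the secret-sharing
# variant reuses the plain model_para handler.
handlers = {
    'ask_for_join_in_info': callback_funcs_for_join_in_info,
    'model_para': callback_funcs_for_model_para,
    'ss_model_para': callback_funcs_for_model_para,
}

msg = SimpleNamespace(msg_type='model_para', content='...')
handlers[msg.msg_type](msg)  # dispatch the incoming message by its type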

@@ -12,7 +12,7 @@ distribute:
client_host: '127.0.0.1'
client_port: 50052
role: 'client'
data_file: 'toy_data/client_1_data'
data_idx: 1
Collaborator

TODO: provide an example that loads data from separate data files in distributed mode

@rayrayraykk (Collaborator Author)

I have updated this PR according to your valuable suggestions, thx!

@joneswong (Collaborator) left a comment

Approved.

@joneswong merged commit 1001d54 into alibaba:master on Nov 16, 2022
@DavdGao (Collaborator) left a comment

Please see the inline comments.

@@ -64,7 +64,6 @@ def extend_training_cfg(cfg):
cfg.early_stop.delta = 0.0
# Early stop when no improve to last `patience` round, in ['mean', 'best']
cfg.early_stop.improve_indicator_mode = 'best'
cfg.early_stop.the_smaller_the_better = True
Collaborator (@DavdGao), Nov 16, 2022

Why do we remove cfg.early_stop.the_smaller_the_better here?

Collaborator Author (@rayrayraykk)

The key the_smaller_the_better / the_larger_the_better is bound to the metric name used as update_round_wise_key, so it's redundant.
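
(A hypothetical sketch of the idea, not the actual FederatedScope code: each metric name carries its own direction flag, so a separate global early-stop switch becomes redundant. The metric names and the dict below are illustrative.)

# Each metric is bound to exactly one THE_LARGER_THE_BETTER flag.
METRIC_DIRECTION = {
    'acc': True,        # larger is better
    'val_loss': False,  # smaller is better
}

def is_improvement(metric_name, new_value, best_value):
    larger_is_better = METRIC_DIRECTION[metric_name]
    return new_value > best_value if larger_is_better else new_value < best_value

print(is_improvement('val_loss', 0.31, 0.35))  # True: the loss decreased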

def num_train_batch(self):
if self.get('num_train_batch'):
return self.get('num_train_batch')
return self._calculate_batch_epoch_num(mode='train')[0]
Collaborator

The number of epochs is decided by the selected dataset. Maybe we should name the parameter split or dataset here rather than mode.

def num_train_batch_last_epoch(self):
if self.get('num_train_batch_last_epoch'):
return self.get('num_train_batch_last_epoch')
return self._calculate_batch_epoch_num(mode='train')[1]
Collaborator

The same as above.

@DavdGao (Collaborator) commented Nov 16, 2022

BTW, maybe we should unify the names of the subdirectories across the different tasks (core/cv/gfl/nlp/mf), e.g. trainers vs. trainer, workers vs. worker, and criterion vs. loss.
