
[Feature] Refactor FedRunner, optimize trainer module and optimize CI #415

Merged: 22 commits into alibaba:master on Nov 16, 2022

Conversation


@rayrayraykk (Collaborator) commented Oct 28, 2022

This PR is based on #408 and adds CI and documentation on top of it.

What's changed

  1. Refactor the runner into StandaloneRunner + DistributedRunner, and use get_runner to replace FedRunner (see the sketch after this list).
  2. Add a README for the Trainer module.
  3. Make num_val_batch, num_total_train_batch, ... (the values derived from cfg) @property methods. Users can still use setattr or = to change these values, but doing so bypasses the @property computation:
@property
def num_total_train_batch(self):
    if self.get('num_total_train_batch'):
        return self.get('num_total_train_batch')
    return self._calculate_batch_epoch_num(mode='train')[3]
  4. Optimize the docstrings for hook functions and the API reference.
  5. Bind each metric to one and only one key, THE_LARGER_THE_BETTER.
  6. Move functions from utils to xxx.utils.
  7. Optimize CI, as you can see below.
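
A minimal usage sketch of the new entry point (my own illustration: the import paths, builder helpers, and the example config path are assumptions and may differ from the merged code):

from federatedscope.core.auxiliaries.data_builder import get_data
from federatedscope.core.auxiliaries.runner_builder import get_runner
from federatedscope.core.auxiliaries.worker_builder import get_client_cls, get_server_cls
from federatedscope.core.configs.config import global_cfg

cfg = global_cfg.clone()
cfg.merge_from_file('path/to/your_config.yaml')  # hypothetical config file

data, cfg = get_data(config=cfg.clone())
# get_runner builds a StandaloneRunner or a DistributedRunner according to
# cfg.federate.mode, replacing the old FedRunner.
runner = get_runner(data=data,
                    server_class=get_server_cls(cfg),
                    client_class=get_client_cls(cfg),
                    config=cfg.clone())
runner.run()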

@rayrayraykk added the documentation and enhancement labels on Oct 28, 2022
@joneswong self-assigned this on Oct 30, 2022
@joneswong (Collaborator) left a comment

Good job! As the changes are huge, is it possible to supplement UT cases in this PR?

@xieyxclack (Collaborator) left a comment

LGTM, and thanks a lot for such great work on the refactoring and documentation! @rayrayraykk

@@ -0,0 +1,42 @@
name: UnitTests for Distributed Mode
Collaborator

I really appreciate the work on providing unit tests for distributed mode.

@@ -8,7 +8,7 @@ conda install -y nltk
# Speech and NLP
conda install -y sentencepiece textgrid typeguard -c conda-forge
conda install -y transformers==4.16.2 tokenizers==0.10.3 datasets -c huggingface -c conda-forge
conda install -y torchtext -c pytorch
conda install -y torchtext==0.9.0 -c pytorch
Collaborator

Why torchtext==0.9.0 here but torchtext==0.11.1 in README.md?

@@ -84,21 +83,15 @@ def evaluate(self, target_data_split_name='test'):

def update(self, model_parameters, strict=False):
self.model.load_state_dict(model_parameters, strict)
return self.get_model_para()
Collaborator

Why does update need a return value here? And I find that in GeneralTorchTrainer, update does not have a return value.

The dataset object and the updated configuration.

Note:
The available ``data.type`` is shown below:
Collaborator

It should be noted that it will be the developers' duty to update these notes in get_xxx when adding new items (such as data, model, ...).

Collaborator Author (@rayrayraykk), Nov 4, 2022

Yes, we can modify the developer guidance later

@@ -137,7 +155,7 @@ def get_model(model_config, local_data=None, backend='torch'):
from federatedscope.tabular.model import QuadraticModel
model = QuadraticModel(input_shape[-1], 1)

elif model_config.type.lower() in ['convnet2', 'convnet5', 'vgg11', 'lr']:
Collaborator

A bug here before? We call get_cnn when the model type is lr?

Collaborator Author (@rayrayraykk)

Yes, there was a bug, but it won't affect anything

@@ -3,6 +3,12 @@


class IIDSplitter(BaseSplitter):
"""
This splitter split dataset randomly .
Collaborator

split -> splits.
And I think the description is not correct here; maybe change it to "This splitter splits the dataset following the independent and identical distribution"?
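
(For context, a hedged usage sketch of IIDSplitter; the import path, constructor argument, and call convention are assumptions on my part and may differ from the actual code.)

import torch
from torch.utils.data import TensorDataset

from federatedscope.core.splitters.generic import IIDSplitter

# A toy dataset of 100 samples, split IID across 5 clients.
dataset = TensorDataset(torch.randn(100, 4), torch.randint(0, 2, (100,)))
splitter = IIDSplitter(client_num=5)
client_subsets = splitter(dataset)           # one shard per client
print([len(sub) for sub in client_subsets])  # expected: [20, 20, 20, 20, 20]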

@@ -0,0 +1,461 @@
# Local Learning Abstraction: Trainer
Collaborator

The provided docs for the trainer are good and useful to me, thanks a lot.
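
(A hedged sketch of the hook-based customization the Trainer README describes; the trigger name, registration method, and ctx fields are assumptions and may differ from the actual docs/code. `trainer` is assumed to be an existing Trainer instance.)

def hook_on_batch_end_print_loss(ctx):
    # Hooks receive the shared Context (ctx) and can read or write the
    # variables it records, e.g. the loss of the current batch.
    print(f"batch loss: {float(ctx.loss_batch):.4f}")

trainer.register_hook_in_train(new_hook=hook_on_batch_end_print_loss,
                               trigger='on_batch_end',
                               insert_pos=-1)  # append after the existing hooks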

@@ -46,7 +45,9 @@ def clear(self, lifecycle):


class Context(LifecycleDict):
"""Record and pass variables among different hook functions
"""
Collaborator

I just need more time to understand how context works here, and please @DavdGao check this file, thx!

Collaborator

Should we support any combination of different modes (train/finetune/eval) and different splits (datasets)? Or customized split names, e.g. "train1"?

``ask_for_join_in_info`` ``callback_funcs_for_join_in_info()``
``address`` ``callback_funcs_for_address()``
``model_para`` ``callback_funcs_for_model_para()``
``ss_model_para`` ``callback_funcs_for_model_para()``
Collaborator

Maybe we can move ss_model_para out of the base_client later
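
(For readers skimming this thread, a hypothetical illustration of the message-type-to-callback dispatch that the table above documents; this is not the actual client registration API.)

from types import SimpleNamespace

def callback_funcs_for_join_in_info(msg):
    print('handling join-in info:', msg.content)

def callback_funcs_for_model_para(msg):
    print('handling model parameters:', msg.content)

# Each message type is bound to exactly one handler; the secret-sharing
# variant reuses the plain model_para handler.
handlers = {
    'ask_for_join_in_info': callback_funcs_for_join_in_info,
    'model_para': callback_funcs_for_model_para,
    'ss_model_para': callback_funcs_for_model_para,
}

msg = SimpleNamespace(msg_type='model_para', content='...')
handlers[msg.msg_type](msg)  # dispatch the incoming message by its type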

@@ -12,7 +12,7 @@ distribute:
client_host: '127.0.0.1'
client_port: 50052
role: 'client'
data_file: 'toy_data/client_1_data'
data_idx: 1
Collaborator

TODO: provide an example that loads data from separate data files in distributed mode

@rayrayraykk (Collaborator Author)

I have updated this PR according to your valuable suggestions, thx!

@joneswong (Collaborator) left a comment

Approved.

@joneswong merged commit 1001d54 into alibaba:master on Nov 16, 2022
@DavdGao (Collaborator) left a comment

Please see the inline comments.

@@ -64,7 +64,6 @@ def extend_training_cfg(cfg):
cfg.early_stop.delta = 0.0
# Early stop when no improve to last `patience` round, in ['mean', 'best']
cfg.early_stop.improve_indicator_mode = 'best'
cfg.early_stop.the_smaller_the_better = True
Collaborator (@DavdGao), Nov 16, 2022

Why do we remove cfg.early_stop.the_smaller_the_better here?

Collaborator Author (@rayrayraykk)

The key the_smaller_the_better / the_larger_the_better is bound to the metric name used as update_round_wise_key, so it's redundant.
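
(A hypothetical sketch of the idea, not the actual FederatedScope code: each metric name carries its own direction flag, so a separate global early-stop switch becomes redundant. The metric names and the dict below are illustrative.)

# Each metric is bound to exactly one THE_LARGER_THE_BETTER flag.
METRIC_DIRECTION = {
    'acc': True,        # larger is better
    'val_loss': False,  # smaller is better
}

def is_improvement(metric_name, new_value, best_value):
    larger_is_better = METRIC_DIRECTION[metric_name]
    return new_value > best_value if larger_is_better else new_value < best_value

print(is_improvement('val_loss', 0.31, 0.35))  # True: the loss decreased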

def num_train_batch(self):
if self.get('num_train_batch'):
return self.get('num_train_batch')
return self._calculate_batch_epoch_num(mode='train')[0]
Collaborator

The number of epochs is decided by the selected dataset. Maybe we should name the parameter split or dataset here rather than mode.

def num_train_batch_last_epoch(self):
if self.get('num_train_batch_last_epoch'):
return self.get('num_train_batch_last_epoch')
return self._calculate_batch_epoch_num(mode='train')[1]
Collaborator

The same as above.

@DavdGao (Collaborator) commented Nov 16, 2022

BTW, maybe we should unify the names of the subdirectories across the different tasks (core/cv/gfl/nlp/mf), e.g. trainers vs. trainer, workers vs. worker, and criterion vs. loss.
