[Feature] Refactor FedRunner, optimize trainer module and optimize CI #415
Conversation
Good job! As the changes are huge, is it possible to supplement UT cases in this PR?
30a27e5 to 0377791
LGTM, and thanks a lot for such great work on the refactoring and documentation! @rayrayraykk
@@ -0,0 +1,42 @@
name: UnitTests for Distributed Mode
I really appreciate the work on providing unit tests for distributed mode.
@@ -8,7 +8,7 @@ conda install -y nltk
# Speech and NLP
conda install -y sentencepiece textgrid typeguard -c conda-forge
conda install -y transformers==4.16.2 tokenizers==0.10.3 datasets -c huggingface -c conda-forge
conda install -y torchtext -c pytorch
conda install -y torchtext==0.9.0 -c pytorch
Why `torchtext==0.9.0` here but `torchtext==0.11.1` in README.md?
@@ -84,21 +83,15 @@ def evaluate(self, target_data_split_name='test'):

    def update(self, model_parameters, strict=False):
        self.model.load_state_dict(model_parameters, strict)
        return self.get_model_para()
Why does `update` need a return value here? I find that in `GeneralTorchTrainer`, `update` does not have a return value.
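For illustration, here is a minimal, self-contained sketch of the two options being compared; the class and helpers below are invented for the example and are not FederatedScope's actual trainer API.

```python
import copy
import torch.nn as nn

class TinyTrainer:
    """Toy stand-in for a trainer (illustrative only)."""
    def __init__(self):
        self.model = nn.Linear(4, 2)

    def get_model_para(self):
        # copy so callers cannot mutate the live model state by accident
        return copy.deepcopy(self.model.state_dict())

    def update(self, model_parameters, strict=False):
        self.model.load_state_dict(model_parameters, strict=strict)
        # Option A: return the parameters that were actually applied so the
        # caller can verify or forward them.
        return self.get_model_para()
        # Option B (GeneralTorchTrainer-style): return nothing and let the
        # caller query get_model_para() explicitly when needed.

trainer = TinyTrainer()
applied = trainer.update(trainer.get_model_para())
```

Whichever option is chosen, the base trainer and `GeneralTorchTrainer` should agree on one convention.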
The dataset object and the updated configuration.

Note:
    The available ``data.type`` is shown below:
It should be noted that it is the developers' duty to later modify these notes in `get_xxx` when adding new items (such as data, model, ...).
Yes, we can modify the developer guidance later
@@ -137,7 +155,7 @@ def get_model(model_config, local_data=None, backend='torch'):
        from federatedscope.tabular.model import QuadraticModel
        model = QuadraticModel(input_shape[-1], 1)

    elif model_config.type.lower() in ['convnet2', 'convnet5', 'vgg11', 'lr']:
A bug here before? We call `get_cnn` when the model type is `lr`?
Yes, there was a bug, but it won't affect anything
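For context, here is a hedged sketch of the kind of dispatch being discussed; the function and the builders are illustrative stand-ins, not the real `get_model`/`get_cnn` code. The point is simply that `'lr'` routed through the CNN branch would reach a CNN builder.

```python
import torch.nn as nn

def get_model_sketch(model_type, input_dim, num_classes=2):
    """Illustrative dispatcher only; the real get_model differs."""
    if model_type in ('convnet2', 'convnet5', 'vgg11'):
        # stand-in for a get_cnn(...) style builder
        return nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(16, num_classes))
    elif model_type == 'lr':
        # logistic regression handled in its own branch, not via the CNN builder
        return nn.Linear(input_dim, num_classes)
    raise ValueError(f'Unknown model type: {model_type}')

model = get_model_sketch('lr', input_dim=10)
```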
@@ -3,6 +3,12 @@


class IIDSplitter(BaseSplitter):
    """
    This splitter split dataset randomly .
split -> splits. And I think the description is not correct here; maybe change it to "This splitter splits the dataset following the independent and identically distributed (IID) setting"?
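As a side note, a minimal sketch of what an IID split boils down to (illustrative, not the `IIDSplitter` in this PR): shuffle indices uniformly at random and partition them evenly, so each client's share follows the same distribution as the full dataset.

```python
import numpy as np

class IIDSplitterSketch:
    """Illustrative IID splitter; the class name and interface are assumed."""
    def __init__(self, client_num, seed=0):
        self.client_num = client_num
        self.rng = np.random.default_rng(seed)

    def __call__(self, dataset):
        # shuffle all indices, then cut them into client_num near-equal shards
        idx = self.rng.permutation(len(dataset))
        shards = np.array_split(idx, self.client_num)
        return [[dataset[i] for i in shard] for shard in shards]

# e.g. 10 labelled samples split IID across 3 clients
parts = IIDSplitterSketch(client_num=3)([(x, x % 2) for x in range(10)])
```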
@@ -0,0 +1,461 @@
# Local Learning Abstraction: Trainer
The provided docs for trainer are good and useful to me, thanks a lot
@@ -46,7 +45,9 @@ def clear(self, lifecycle):


class Context(LifecycleDict):
    """Record and pass variables among different hook functions
    """
I just need more time to understand how context works here, and please @DavdGao check this file, thx!
Should we support any combination of different `mode` (train/finetune/eval) and different `split` (dataset)? Or a customized split name, e.g. "train1"?
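To make the question concrete, here is a rough sketch (the names are assumptions, not the actual `Context` API) of keying per-run statistics by an arbitrary (`mode`, `split`) pair, which would also accommodate custom split names such as "train1".

```python
from collections import defaultdict

class SplitAwareContextSketch:
    """Toy context keyed by (mode, split); illustrative only."""
    def __init__(self):
        # batch counts keyed by (mode, split) instead of hard-coded fields
        self.num_batch = defaultdict(int)

    def set_num_batch(self, mode, split, value):
        self.num_batch[(mode, split)] = value

    def get_num_batch(self, mode, split):
        return self.num_batch[(mode, split)]

ctx = SplitAwareContextSketch()
ctx.set_num_batch('finetune', 'train1', 12)   # custom split name is fine
print(ctx.get_num_batch('finetune', 'train1'))
```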
``ask_for_join_in_info``    ``callback_funcs_for_join_in_info()``
``address``                 ``callback_funcs_for_address()``
``model_para``              ``callback_funcs_for_model_para()``
``ss_model_para``           ``callback_funcs_for_model_para()``
Maybe we can move `ss_model_para` out of the `base_client` later.
@@ -12,7 +12,7 @@ distribute:
  client_host: '127.0.0.1'
  client_port: 50052
  role: 'client'
  data_file: 'toy_data/client_1_data'
  data_idx: 1
TODO: provide an example that loads data from separate data files in distributed mode
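For that TODO, a hedged sketch of what a per-client loader could look like; the pickle format and the `'train'/'val'/'test'` layout are assumptions, and only the `toy_data/client_1_data` path comes from the old config.

```python
import pickle

def load_client_data(path='toy_data/client_1_data'):
    """Illustrative loader for a pre-split, per-client data file."""
    with open(path, 'rb') as f:
        # assumed layout: {'train': ..., 'val': ..., 'test': ...}
        return pickle.load(f)
```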
I have updated this PR according to your valuable suggestions, thanks!
Approved.
Please see the inline comments.
@@ -64,7 +64,6 @@ def extend_training_cfg(cfg):
    cfg.early_stop.delta = 0.0
    # Early stop when no improve to last `patience` round, in ['mean', 'best']
    cfg.early_stop.improve_indicator_mode = 'best'
    cfg.early_stop.the_smaller_the_better = True
Why do we remove `cfg.early_stop.the_smaller_the_better` here?
The key `the_smaller_the_better` / `the_larger_the_better` is bound to the metric name of `update_round_wise_key`, so it's redundant.
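A small sketch of that idea (the metric groupings and function name below are illustrative, not the code base's actual constants): the improvement direction is derived from the monitored metric itself, so a separate `the_smaller_the_better` flag would only duplicate, and possibly contradict, that information.

```python
# assumed groupings; the real constants may differ
THE_LARGER_THE_BETTER = {'acc', 'f1', 'auc'}
THE_SMALLER_THE_BETTER = {'loss', 'avg_loss', 'rmse'}

def is_improvement(metric_name, new_value, best_value):
    """Decide improvement from the metric's name, not from a config flag."""
    if metric_name in THE_LARGER_THE_BETTER:
        return new_value > best_value
    if metric_name in THE_SMALLER_THE_BETTER:
        return new_value < best_value
    raise KeyError(f'unknown metric: {metric_name}')

# e.g. early stopping that monitors validation loss
assert is_improvement('loss', new_value=0.31, best_value=0.35)
```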
    def num_train_batch(self):
        if self.get('num_train_batch'):
            return self.get('num_train_batch')
        return self._calculate_batch_epoch_num(mode='train')[0]
The number of epochs is decided by the selected dataset. Maybe we should name the parameter `split` or `dataset` here rather than `mode`.
    def num_train_batch_last_epoch(self):
        if self.get('num_train_batch_last_epoch'):
            return self.get('num_train_batch_last_epoch')
        return self._calculate_batch_epoch_num(mode='train')[1]
The same as above.
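To illustrate why the argument is really about the data split, a hedged sketch (the signature is assumed and much simpler than `_calculate_batch_epoch_num`): the per-epoch batch count depends only on the chosen split's size and the batch size.

```python
import math

def calculate_batch_num(split_size, batch_size, drop_last=False):
    """Batches per epoch for a given data split; illustrative only."""
    if drop_last:
        return split_size // batch_size
    return math.ceil(split_size / batch_size)

# e.g. a 'train' split with 1000 samples and batch size 64 -> 16 batches
print(calculate_batch_num(split_size=1000, batch_size=64))
```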
BTW, maybe we should unify the names of subdirectories, e.g. ...
This PR is based on #408, and I add CI and docs on top of it.

What's changed:
- Add `get_runner` to replace `FedRunner`.
- Turn `num_val_batch`, `num_total_train_batch`, ... (related to `cfg`) into `@property` methods. Users can still use `setattr` or `=` to change these values, but such changes will make the `@property` function invalid (see the sketch below).
- `THE_LARGER_THE_BETTER`
- `utils` to `xxx.utils`
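A minimal sketch of the `@property` caveat in the change list above (a toy class, not the actual `Context`/`LifecycleDict`): once a value has been stored explicitly, the property just returns it and no longer recomputes anything.

```python
class CtxSketch(dict):
    """Toy dict-backed context; illustrative only."""

    @property
    def num_train_batch(self):
        # an explicitly stored value wins; otherwise fall back to computing it
        if self.get('num_train_batch') is not None:
            return self['num_train_batch']
        return self._calculate_num_train_batch()

    def _calculate_num_train_batch(self):
        return 10  # stand-in for deriving the value from cfg and data

ctx = CtxSketch()
print(ctx.num_train_batch)       # 10, computed on the fly
ctx['num_train_batch'] = 5       # an explicit assignment ...
print(ctx.num_train_batch)       # ... makes the property return 5 instead
```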