
Refactor data-related interfaces & add interfaces for trainer and worker #365

Merged
merged 43 commits into alibaba:master on Oct 19, 2022

Conversation


@rayrayraykk commented Sep 6, 2022

Main Changes

Data

Data module overview

federatedscope.core.auxiliaries.data_builder.get_data -700 lines!

  • Load Dataset (federatedscope.core.data.utils.load_dataset):
    • Load local file to torch dataset, -300 lines!
  • Translate data (federatedscope.core.data.BaseDataTranslator)
    • Dataset -> ML split -> FL split -> DataLoader for FS
  • Convert mode (federatedscope.core.data.utils.convert_data_mode)
    • To adapt simulation mode and distributed mode
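
As a usage sketch of the refactored entry point (the two-value return convention is an assumption, inferred from the register_data example further below):

    from federatedscope.core.auxiliaries.data_builder import get_data

    # Builds the FS data object in three steps:
    # load dataset -> translate (ML/FL split, DataLoader) -> convert mode
    data, modified_cfg = get_data(config)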

Data Translator

Dataset -> (ML split) -> (FL split) -> DataLoader for FS

  • ML split (split_train_val_test):
    • Build train/val/test
  • FL split (split_to_client):
    • Build Data for each client
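
A minimal sketch of the intended use (the constructor signature is an assumption; the ClientData section below shows the per-client result):

    from federatedscope.core.data import BaseDataTranslator

    # Assumed constructor: a global cfg plus optional per-client cfgs
    translator = BaseDataTranslator(global_cfg, client_cfgs)
    # Dataset -> ML split -> FL split -> DataLoader for FS
    fs_data = translator(torch_dataset)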

Data interface

  • ClientData(federatedscope.core.data.ClientData)

    • A subclass of dict with train, val and test splits.

    • Convert dataset to DataLoader.

      • cfg.dataloader.type, cfg.dataloader.batch_size, cfg.dataloader.shuffle
    • Example:

        # Instantiate client_data for each Client
        client_data = ClientData(PyGDataLoader, 
                                 cfg, 
                                 train=train_data, 
                                 val=None, 
                                 test=test_data)
        # other_cfg with different batch size
        client_data.setup(other_cfg)
        print(client_data)
        
        >> {'train': PyGDataLoader(train_data), 'test': PyGDataLoader(test_data)}
  • StandaloneDataDict(federatedscope.core.data.StandaloneDataDict)

    • A subclass of dict with client_id as keys:
      • {1: ClientData, 2: ClientData, ...}
    • Responsible for training/evaluation method conversion:
      • Global evaluation, global training, etc.
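
For illustration, access then looks like the following sketch (construction is normally handled inside the data builder; the variable names here are illustrative):

    # fs_data behaves like {1: ClientData, 2: ClientData, ...}
    client_data = fs_data[1]
    train_loader = client_data['train']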

Trainer

  • BaseTrainer

    • Code without hook-like functions (for starters)

    • Example:

      import abc

      class BaseTrainer(abc.ABC):
          def __init__(self, model, data, device, **kwargs):
              self.model = model
              self.data = data
              self.device = device
              self.kwargs = kwargs
      
          @abc.abstractmethod
          def train(self):
              raise NotImplementedError
      
          @abc.abstractmethod
          def evaluate(self, target_data_split_name='test'):
              raise NotImplementedError
      
          @abc.abstractmethod
          def update(self, model_parameters, strict=False):
              raise NotImplementedError
      
          @abc.abstractmethod
          def get_model_para(self):
              raise NotImplementedError
      
          @abc.abstractmethod
          def print_trainer_meta_info(self):
              raise NotImplementedError

    Example of a trainer without hook-like functions (a fuller sketch follows below)

    • Less than 100 lines vs. 300+ lines
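
For illustration, a minimal hook-free trainer built on this interface could look as follows. This is a sketch, not the repository's implementation; the optimizer, loss, and the loss_total bookkeeping are illustrative choices:

    import torch

    class MyTorchTrainer(BaseTrainer):
        def __init__(self, model, data, device, **kwargs):
            super().__init__(model, data, device, **kwargs)
            self.criterion = torch.nn.CrossEntropyLoss()
            self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.01)

        def train(self):
            self.model.to(self.device)
            self.model.train()
            total_loss, num_samples = 0.0, 0
            for x, y in self.data['train']:
                x, y = x.to(self.device), y.to(self.device)
                self.optimizer.zero_grad()
                loss = self.criterion(self.model(x), y)
                loss.backward()
                self.optimizer.step()
                total_loss += loss.item() * y.size(0)
                num_samples += y.size(0)
            return num_samples, self.model.cpu().state_dict(), \
                {'loss_total': total_loss}

        def evaluate(self, target_data_split_name='test'):
            self.model.to(self.device)
            self.model.eval()
            total_loss, num_samples = 0.0, 0
            with torch.no_grad():
                for x, y in self.data[target_data_split_name]:
                    x, y = x.to(self.device), y.to(self.device)
                    total_loss += self.criterion(self.model(x),
                                                 y).item() * y.size(0)
                    num_samples += y.size(0)
            return {f'{target_data_split_name}_total': num_samples,
                    f'{target_data_split_name}_loss': total_loss}

        def update(self, model_parameters, strict=False):
            self.model.load_state_dict(model_parameters, strict)

        def get_model_para(self):
            return self.model.cpu().state_dict()

        def print_trainer_meta_info(self):
            return f'{self.__class__.__name__}: model={type(self.model).__name__}'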

FedRunner

  • Move data processing to the Data module
    • Merging data, global evaluation, etc.

Worker

  • BaseClient & BaseServer

    • Example

      import abc
      import logging

      logger = logging.getLogger(__name__)

      class BaseClient(Worker):
          def __init__(self, ID, state, config, model, strategy):
              super(BaseClient, self).__init__(ID, state, config, model, strategy)
              self.msg_handlers = dict()
      
          def register_handlers(self, msg_type, callback_func):
              if msg_type in self.msg_handlers.keys():
                  logger.warning(f"Overwriting msg_handlers {msg_type}.")
              self.msg_handlers[msg_type] = callback_func

          def _register_default_handlers(self):
              pass
      
          @abc.abstractmethod
          def run(self):
              raise NotImplementedError
      
          @abc.abstractmethod
          def callback_funcs_for_model_para(self, message):
              raise NotImplementedError
      
          @abc.abstractmethod
          def callback_funcs_for_assign_id(self, message):
              raise NotImplementedError
      
          @abc.abstractmethod
          def callback_funcs_for_join_in_info(self, message):
              raise NotImplementedError
      
          @abc.abstractmethod
          def callback_funcs_for_address(self, message):
              raise NotImplementedError
      
          @abc.abstractmethod
          def callback_funcs_for_evaluate(self, message):
              raise NotImplementedError
      
          @abc.abstractmethod
          def callback_funcs_for_finish(self, message):
              raise NotImplementedError
      
          @abc.abstractmethod
          def callback_funcs_for_converged(self, message):
              raise NotImplementedError
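
As a usage note, a concrete subclass would wire message types to these callbacks via register_handlers; the message-type strings below are illustrative:

    # Sketch: inside a concrete client's _register_default_handlers
    self.register_handlers('model_para', self.callback_funcs_for_model_para)
    self.register_handlers('evaluate', self.callback_funcs_for_evaluate)
    self.register_handlers('finish', self.callback_funcs_for_finish)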

@rayrayraykk linked an issue Sep 7, 2022 that may be closed by this pull request
@rayrayraykk added the enhancement and Feature labels and removed the Feature label Sep 9, 2022
@rayrayraykk changed the title Refactor data-related interfaces → [WIP]Refactor data-related interfaces Sep 13, 2022
@rayrayraykk changed the title [WIP]Refactor data-related interfaces → Refactor data-related interfaces Sep 13, 2022
@rayrayraykk changed the title Refactor data-related interfaces → [WIP] Refactor data-related interfaces Sep 14, 2022
@joneswong left a comment

Generally, looks good to me. As all of this design and implementation has been discussed before, there are just two minor questions I want to discuss: (1) Do these class names make sense to you? E.g., the dict containing client-wise data is called "standalone", the class responsible for converting a vanilla torch dataset into a federated counterpart is called a "translator", etc. (2) Is it necessary to make the methods and attributes that would not be accessed outside the class private (i.e., self._balabala)? @xieyxclack @yxdyc @DavdGao @Osier-Yi

@joneswong self-assigned this Oct 3, 2022
joneswong previously approved these changes Oct 10, 2022
@joneswong left a comment

LGTM


@xieyxclack closed this Oct 10, 2022
@xieyxclack reopened this Oct 10, 2022
xieyxclack previously approved these changes Oct 10, 2022
@xieyxclack left a comment

LGTM, please see the inline comments for minor suggestions, thx!

        self.model.load_state_dict(model_parameters, strict)

    def get_model_para(self):
        return self.model.cpu().state_dict()
Collaborator:

Why do we move it to cpu?

Collaborator Author:

I follow the principle of GeneralTorchTrainer, see here.

        raise NotImplementedError

    @abc.abstractmethod
    def print_trainer_meta_info(self):
Collaborator:

Is it necessary for an FL course?

Collaborator Author:

This function is called in FedRunner.

@@ -1,7 +1,7 @@
 from federatedscope.register import register_data


-def MyData(config):
+def MyData(config, client_cfgs):
Collaborator:

Shall we provide client_cfgs in MyData?

Collaborator Author:

I believe client_cfgs will be useful in MyData when using personalized configs (e.g., cfg.dataloader.batch_size varies across clients).

Collaborator:

So we can set client_cfgs=None by default here since it just serves as an example.

Collaborator Author:

Agree!

loss = self.criterion(outputs, y)

# _hook_on_batch_backward
self.optimizer.zero_grad()
Collaborator:

I suggest putting this line before the forward process (i.e., outputs = self.model(x)).

Collaborator Author:

If so, should we also modify GeneralTorchTrainer, whose convention I follow?
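
For reference, the suggested ordering would be (a sketch reusing the names from the snippet above):

    self.optimizer.zero_grad()         # clear stale gradients first
    outputs = self.model(x)            # forward pass
    loss = self.criterion(outputs, y)
    loss.backward()                    # _hook_on_batch_backward
    self.optimizer.step()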


# _hook_on_fit_end
return num_samples, self.model.cpu().state_dict(), \
    {'loss_total': total_loss}
Collaborator:

We can provide the avg_loss here

@@ -10,9 +10,7 @@ federate:
 data:
   root: data/
   type: shakespeare
-  batch_size: 64
Collaborator:

Same as above

@@ -11,9 +11,7 @@ data:
   root: data/
   type: femnist
   splits: [0.6,0.2,0.2]
-  batch_size: 64
Collaborator:

Same as above

@@ -11,9 +11,7 @@ data:
   root: data/
   type: femnist
   splits: [0.6,0.2,0.2]
-  batch_size: 64
Collaborator:

Same as above

@@ -11,7 +11,6 @@ federate:
 data:
   root: data/
   type: synthetic
-  batch_size: 64
Collaborator:

Same as above

@@ -8,9 +8,7 @@ federate:
 data:
   root: data/
   type: shakespeare
-  batch_size: 64
Collaborator:

Same as above


xieyxclack commented Oct 10, 2022

@rayrayraykk @joneswong
(1) IMO, some names such as standalone and translator might be confusing for new users, but we can provide docs and comments to explain. (And I cannot give better names for them at this time)
(2) Since we have mostly not distinguished private methods/attributes from others in this version, maybe we can leave it as a TODO item and fix them later.

yxdyc previously approved these changes Oct 13, 2022
@yxdyc left a comment

LGTM

        datadict = self.split_to_client(train, val, test)
        return datadict

    def split_train_val_test(self, dataset):
Collaborator:

So what is the recommended way to implement customized split funcs? e.g., VMF and HMF datasets need to be split according to user/item ids.

Collaborator Author:

In that case, we use a dummy splitter, which treats MF datasets as FL datasets.

@DavdGao left a comment

Please see the inline comments

@@ -616,22 +584,22 @@ def get_data(config):
     # will restore the user-specified one after the generation
     setup_seed(12345)
Collaborator:

This random seed is out of the control of cfg.seed.

Collaborator Author:

The seed is set to generate data; we'd better keep it fixed, or we could use a cfg.data.seed instead of cfg.seed.


# DataLoader related args
cfg.dataloader = CN()
cfg.dataloader.type = 'base'
Collaborator:

Maybe list all the options for cfg.dataloader.type in an annotation.

cfg.dataloader.batch_size = 64
cfg.dataloader.shuffle = True
cfg.dataloader.num_workers = 0
cfg.dataloader.drop_last = False
Collaborator:

Is drop_last only valid for the training dataloader?

Collaborator Author:

Thanks for your suggestion! I've updated it accordingly.

# Split train/val/test to client
if len(train) > 0:
    split_train = self.splitter(train)
    if self.global_cfg.data.consistent_label_distribution:
Collaborator:

What's the meaning of consistent_label_distribution?

Collaborator Author:

When using a splitter, the train/val/test splits might be non-IID. With consistent_label_distribution set to True, the ML split is IID.

        self.kwargs = kwargs

    @abc.abstractmethod
    def train(self):
Collaborator:

Do we need an abstract method for finetuning?

Collaborator Author:

I think finetuning is optional; if someone needs it, they can implement finetuning in their own way.

# Noise multiplier
tmp = cfg.sgdmf.constant * np.power(sample_ratio, 2) * (
    cfg.federate.total_round_num * ctx.num_total_train_batch) * np.log(
        1. / cfg.sgdmf.delta)
noise_multipler = np.sqrt(tmp / np.power(cfg.sgdmf.epsilon, 2))
-ctx.scale = max(cfg.sgdmf.theta, 1.) * noise_multipler * np.power(
+ctx.scale = max(cfg.dataloader.theta, 1.) * noise_multipler * np.power(
Collaborator:

Since theta is only used in sgdmf, is it appropriate to place theta under the dataloader namespace (where all users can see it)?

Collaborator Author:

We can add a docstring to explain this, as many other args in dataloader are optional (e.g., sizes is graph-related).

@rayrayraykk (Author):

I changed the version to 0.2.1.

@joneswong left a comment

approved.

@joneswong merged commit 84a3722 into alibaba:master Oct 19, 2022
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:

  • Encapsulation of Trainer class
  • Cannot set different data related parameters for different clients

5 participants