Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor splitter&transform; Modify some data related config; Add external dataset. #33

Merged
merged 28 commits into from
Apr 25, 2022
Merged

Conversation

rayrayraykk
Copy link
Collaborator

@rayrayraykk rayrayraykk commented Apr 20, 2022

  1. Refactor splitter;
  2. Modify some data-related config, like transforms and args;
  3. Add some external datasets from torchvision and torchtext(datasets from torchaudio are coming soon).
  4. Remove old config.py which is not used.
  5. Refactor transforms.
  6. Add a uni-test with MNIST and IMDB for the external dataset.

@rayrayraykk rayrayraykk added the enhancement New feature or request label Apr 21, 2022
@rayrayraykk rayrayraykk changed the title Refactor splitter; Modify some data related config; Add external dataset. Refactor splitter&transform; Modify some data related config; Add external dataset. Apr 21, 2022
Copy link
Collaborator

@joneswong joneswong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a discussion about this pr

Copy link
Collaborator

@DavdGao DavdGao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please see the detailed comments

federatedscope/core/configs/cfg_data.py Outdated Show resolved Hide resolved


def get_transform(config, package):
transform_funcs = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not override the default values by None

Copy link
Collaborator

@joneswong joneswong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good but some minor issues need to be discussed/resolved

Copy link
Collaborator

@yxdyc yxdyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job. Plz see the inline comments. After merging this PR, I will integrate more audio datasets from wenet based on the new dataset customization style.

@@ -118,6 +118,238 @@ def _generate_data(client_num=5,
return data, config


def load_external_data(config=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function appears to do a variety of things (with several internal sub-functions). It would be nice to have a document at the beginning that explains its input and output and the logical work flow.

targets = label_to_index(targets)
data = pad_sequence(data).transpose(0,
1)[:, :raw_args['max_len'], :]
data_list.append([(x, y) for x, y in zip(data, targets)])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the IterableDataset, we may need a more general and principled load function. I will add and test more speech data based on IterableDataset and we can improve this in future discussions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will add a TODO tag here.

dataset_func = getattr(import_module('torch_geometric.datasets'), name)
raise NotImplementedError

load_data = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be a global variable with the name DATA_LOAD_FUNCS, and be put in a more forward position.

if config.data.splitter == 'lda':
from federatedscope.core.splitters.generic import LDASplitter
splitter = LDASplitter(client_num, **args)
# graph splitter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make these graph splitters applicable to other data formats? That is, they can use the same splitting algorithms but with a bit different pre/post processes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will add and modify these splitters in the next PR.

import federatedscope.register as register


def get_transform(config, package):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple documentation can be added to tell users the input/output of the transform.

cfg.data.splitter_args = [] # args for splitter, eg. [{'alpha': 0.5}]
cfg.data.transform = [] # transform for x, eg. [['ToTensor'], ['Normalize', {'mean': [0.1307], 'std': [0.3081]}]]
cfg.data.target_transform = [] # target_transform for y, use as above
cfg.data.pre_transform = [] # pre_transform for `torch_geometric` dataset, use as above
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the three args, how about naming them as feat_transform, label_transform, and graph_pre_transform?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't agree. For the torchvision and torch_geometric dataset, we will use **transform_funcs as their args directly.

@@ -0,0 +1,3 @@
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No newline here. Check whether the linter is used.

'val': DataLoader()
}
}
config: `CN` from `federatedscope/core/configs/config.py`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is confusing for users why we take a configand return a config. It would be better by naming it modifed_cfg with more informative explanation.

Other modifications look good to me

Copy link
Collaborator

@joneswong joneswong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved

@joneswong joneswong merged commit 2bc22c0 into alibaba:master Apr 25, 2022
cheneydon pushed a commit to cheneydon/FederatedScope that referenced this pull request Jun 14, 2022
Co-authored-by: yuexiang.xyx <[email protected]>

Update `tests/run.py` for Jenkins server (alibaba#4)

just a workaround

Feature/synchronize (alibaba#3)

sync with the master branch of our original gitlab

Feature/config refactor (alibaba#5)

refactored configuration-related code

modify README; minor fix (alibaba#6)

Updated README

fix gan cra loss_batch-> loss_task bug

improved the environments set-up guidance

improved the environments set-up guidance

improved the environments set-up guidance

Fix setup requirements.

Update required python version to 3.9.

updated auto-doc component according to the latest changes

[Feature] Add dropout and log training metric. (alibaba#11)

* Add dropout option for CNN and NLP model; Add training metric to logs.

* allow users to determine whether to conduct evaluation on a specific split

* Enable metric in global eval for users to determine whether to conduct evaluation on a specific split.

* fix minor bug when importing nlp loss

* Replace and remove `validate` with `evaluate(target_data_split_name=split)` to keep code clean.

enabled the log file name valid in windows environment (alibaba#13)

* enabled the log file name valid in windows environment

update readme (alibaba#15)

* update README

added a demo for black-box optimization (alibaba#14)

- added a demo for black-box optimization
- enabled installation with cuda10

[Bugfix] fixed the invalid logger set-up if the logging is used before we call setup_logger (alibaba#17)

* fixed the invalid logger set-up if the `logging` is used before we call `setup_logger`

Change source of `download_url` from our own and fix `README` (alibaba#20)

* Change source of `download_url` from our own `utils.py` and fix `README.md`.

add logo (alibaba#26)

- add logo
- add more icons

modify grpc_comm according to official tutorial (alibaba#25)

fix path issue

fix wrong logger usage

reformatted

Communication efficiency optimization (alibaba#19)

* minor fixed for distributed mode

* For the communication efficiency: dynamic type selection in gRPC servicer; transformer & parser

Refactored the logger by reducing its redundancy and fixed some minor issues (alibaba#29)

* Reducing the redundancy of the logger

Update test_mf.py

modify the unit test of mf task

Refactor splitter&transform; Modify some data related config; Add external dataset. (alibaba#33)

[Feature] FedEx (alibaba#37)

[Feature] FedEx (alibaba#37)

[Hotfix] print the missing ``Final`` results (alibaba#41)

* hotfix for the missing ``Final`` results print

Add pre-trained transformers as NLP model.

TODO:@ZHEN, please fix online aggregator when the device is not specific.

Add a example for transformers.

Fix url. (alibaba#46)

- added the local training baseline
- enabled each client has its own early-stopper

formatted by linter

formatted by linter

not use early_stopper in non-local mode

bugfix for the cast "sample_client_num = -1"

added global training mode via a proxy client that holds all data

Fix un-consistent device for the PIA test

added local fine-tuning before local evaluation

linter format

bugfix for fedex

update README (alibaba#49)

Feature/attack doc (alibaba#50)

* improved the doc for attack module

added API comments (alibaba#52)

Fix docs about graph. (alibaba#51)

Add api ref for mf task and context (alibaba#53)

* add mf api reference and modify README.md

typos fix

Fix minor bugs

Timeout strategy and minimal received number (alibaba#36)

* For async: timeout strategy and minimal received number

modify api reference (alibaba#56)

update doc of core (alibaba#57)

Add datasets from hugging face.

Formatted and fix minor bugs.

Add datasets and scripts for openml.

Modify the example `yaml` of openml datasets.

Add materials (paper lists, tutorials) (alibaba#60)

* add FL paper list

Add paper lists (alibaba#61)

* add FL paper list

fixed some missing API reference in fs.core (alibaba#54)

As the title says.

update release version (alibaba#64)

Update graph paper list. (alibaba#65)

Add paper list for FedHPO (alibaba#67)

* added paper list for fedhpo

rename and modify some val

Add paper list for FedRec (alibaba#68)

add paper list for FedRec

added pfl paper list (alibaba#72)

added pfl paper list

hotfix for transformers to avoid import error

updated pfl paper list (alibaba#73)

updated pfl paper list

fix url in dblp_new.py (alibaba#76)

update README

update

debug squad model

update

update

update
AnthonyXuan pushed a commit to AnthonyXuan/FederatedScope that referenced this pull request Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants