Load data from files #558

Merged: 3 commits, Mar 29, 2023
15 changes: 12 additions & 3 deletions .github/workflows/test_distribute.yml
@@ -32,14 +32,23 @@ jobs:
     - name: Install FS
       run: |
         pip install -e .[test]
-    - name: Test Distributed (LR on toy)
+    - name: Test Distributed (LR on toy with a unified file)
       run: |
         python scripts/distributed_scripts/gen_data.py
-        python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_server.yaml &
+        python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_server_no_data.yaml &
         sleep 2
         python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_1.yaml &
         sleep 2
         python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_2.yaml &
         sleep 2
         python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_3.yaml
-        [ $? -eq 1 ] && exit 1 || echo "Passed"
+        [ $? -eq 1 ] && exit 1 || echo "Passed"
+    - name: Test Distributed (LR on toy with multiple files)
+      run: |
+        python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_server.yaml data.file_path 'toy_data/server_data' distribute.data_idx -1 &
+        sleep 2
+        python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_1.yaml data.file_path 'toy_data/client_1_data' distribute.data_idx -1 &
+        sleep 2
+        python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_2.yaml data.file_path 'toy_data/client_2_data' distribute.data_idx -1 &
+        sleep 2
+        python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_3.yaml data.file_path 'toy_data/client_3_data' distribute.data_idx -1
18 changes: 9 additions & 9 deletions README.md
@@ -202,20 +202,20 @@ The distributed mode in FederatedScope denotes running multiple procedures to bu

 To run with distributed mode, you only need to:

-- Prepare isolated data file and set up `cfg.distribute.data_file = PATH/TO/DATA` for each participant;
+- Prepare isolated data file and set up `cfg.data.file_path = PATH/TO/DATA` for each participant;
 - Change `cfg.federate.mode = 'distributed'`, and specify the role of each participant by `cfg.distribute.role = 'server'/'client'`.
 - Set up a valid address by `cfg.distribute.server_host/client_host = x.x.x.x` and `cfg.distribute.server_port/client_port = xxxx`. (Note that for a server, you need to set up `server_host` and `server_port` to listen for messages, while for a client, you need to set up `client_host` and `client_port` for listening as well as `server_host` and `server_port` for joining an FL course)

 We prepare a synthetic example for running with distributed mode:

 ```bash
 # For server
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_server.yaml distribute.data_file 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_server.yaml data.file_path 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx

 # For clients
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_1.yaml distribute.data_file 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx distribute.client_host x.x.x.x distribute.client_port xxxx
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_2.yaml distribute.data_file 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx distribute.client_host x.x.x.x distribute.client_port xxxx
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_3.yaml distribute.data_file 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx distribute.client_host x.x.x.x distribute.client_port xxxx
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_1.yaml data.file_path 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx distribute.client_host x.x.x.x distribute.client_port xxxx
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_2.yaml data.file_path 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx distribute.client_host x.x.x.x distribute.client_port xxxx
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_3.yaml data.file_path 'PATH/TO/DATA' distribute.server_host x.x.x.x distribute.server_port xxxx distribute.client_host x.x.x.x distribute.client_port xxxx
 ```

 An executable example with generated toy data can be run with (a script can be found in `scripts/run_distributed_lr.sh`):
@@ -224,14 +224,14 @@ An executable example with generated toy data can be run with (a script can be f
 python scripts/distributed_scripts/gen_data.py

 # Firstly start the server that is waiting for clients to join in
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_server.yaml distribute.data_file toy_data/server_data distribute.server_host 127.0.0.1 distribute.server_port 50051
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_server.yaml data.file_path toy_data/server_data distribute.server_host 127.0.0.1 distribute.server_port 50051

 # Start the client #1 (with another process)
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_1.yaml distribute.data_file toy_data/client_1_data distribute.server_host 127.0.0.1 distribute.server_port 50051 distribute.client_host 127.0.0.1 distribute.client_port 50052
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_1.yaml data.file_path toy_data/client_1_data distribute.server_host 127.0.0.1 distribute.server_port 50051 distribute.client_host 127.0.0.1 distribute.client_port 50052
 # Start the client #2 (with another process)
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_2.yaml distribute.data_file toy_data/client_2_data distribute.server_host 127.0.0.1 distribute.server_port 50051 distribute.client_host 127.0.0.1 distribute.client_port 50053
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_2.yaml data.file_path toy_data/client_2_data distribute.server_host 127.0.0.1 distribute.server_port 50051 distribute.client_host 127.0.0.1 distribute.client_port 50053
 # Start the client #3 (with another process)
-python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_3.yaml distribute.data_file toy_data/client_3_data distribute.server_host 127.0.0.1 distribute.server_port 50051 distribute.client_host 127.0.0.1 distribute.client_port 50054
+python federatedscope/main.py --cfg scripts/distributed_scripts/distributed_configs/distributed_client_3.yaml data.file_path toy_data/client_3_data distribute.server_host 127.0.0.1 distribute.server_port 50051 distribute.client_host 127.0.0.1 distribute.client_port 50054
 ```

 And you can observe the results as (the IP addresses are anonymized with 'x.x.x.x'):
54 changes: 54 additions & 0 deletions federatedscope/contrib/data/load_from_files.py
@@ -0,0 +1,54 @@
+import os
+import pickle
+
+from federatedscope.register import register_data
+from federatedscope.core.data.utils import convert_data_mode
+from federatedscope.core.auxiliaries.utils import setup_seed
+
+
+def load_data_from_file(config, client_cfgs=None):
+    from federatedscope.core.data import DummyDataTranslator
+
+    file_path = config.data.file_path
+
+    if not os.path.exists(file_path):
+        raise ValueError(f'The file {file_path} does not exist.')
+
+    with open(file_path, 'br') as file:
+        data = pickle.load(file)
+    # The layout of the data is expected to be one of:
+    # (1) a unified file containing all participants' data:
+    # {
+    #   'client_id': {
+    #       'train/val/test': {
+    #           'x/y': np.ndarray
+    #       }
+    #   }
+    # }
+    # (2) an isolated file containing one participant's data:
+    # {
+    #   'train/val/test': {
+    #       'x/y': np.ndarray
+    #   }
+    # }
+
+    # translator = DummyDataTranslator(config, client_cfgs)
+    # data = translator(data)
+
+    # Convert `StandaloneDataDict` to `ClientData` when in distribute mode
+    data = convert_data_mode(data, config)
+
+    # Restore the user-specified seed after the data generation
+    setup_seed(config.seed)
+
+    return data, config
+
+
+def call_file_data(config, client_cfgs):
+    if config.data.type == "file":
+        # All the data (clients and server) are loaded from a single file
+        data, modified_config = load_data_from_file(config, client_cfgs)
+        return data, modified_config
+
+
+register_data("file", call_file_data)
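The two pickle layouts described in the comments of `load_data_from_file` can be illustrated with a minimal sketch. It uses only `pickle` and NumPy and does not import FederatedScope; the helpers `make_split` and `make_participant`, the integer client ids, the array sizes, and the temporary file names are all assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np


def make_split(n, dim=5, seed=0):
    """Hypothetical helper: build one {'x': ..., 'y': ...} split."""
    rng = np.random.default_rng(seed)
    return {'x': rng.normal(size=(n, dim)), 'y': rng.integers(0, 2, size=n)}


def make_participant(seed):
    """Hypothetical helper: one participant's train/val/test splits."""
    return {
        'train': make_split(8, seed=seed),
        'val': make_split(2, seed=seed + 1),
        'test': make_split(2, seed=seed + 2),
    }


# Layout (1): a unified file keyed by participant id (ids are assumed here).
unified = {cid: make_participant(10 * cid) for cid in (1, 2, 3)}
# Layout (2): an isolated file holding a single participant's data.
isolated = make_participant(0)

tmp_dir = tempfile.mkdtemp()
paths = {}
for name, obj in (('all_data', unified), ('client_1_data', isolated)):
    paths[name] = os.path.join(tmp_dir, name)
    with open(paths[name], 'wb') as f:
        pickle.dump(obj, f)

# Read both files back the same way the loader does (binary mode + pickle).
with open(paths['all_data'], 'rb') as f:
    loaded_unified = pickle.load(f)
with open(paths['client_1_data'], 'rb') as f:
    loaded_isolated = pickle.load(f)

print(sorted(loaded_unified.keys()))         # [1, 2, 3]
print(loaded_isolated['train']['x'].shape)   # (8, 5)
```

In this PR's workflow, the unified file is paired with a per-participant `distribute.data_idx` in the YAML configs, while the isolated files are run with `distribute.data_idx -1`.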
15 changes: 9 additions & 6 deletions federatedscope/core/auxiliaries/data_builder.py
@@ -126,13 +126,16 @@ def get_data(config, client_cfgs=None):

     # Apply translator to non-FL dataset to transform it into its federated
     # counterpart
-    translator = getattr(import_module('federatedscope.core.data'),
-                         DATA_TRANS_MAP[config.data.type.lower()])(
-                             modified_config, client_cfgs)
-    data = translator(dataset)
+    if dataset is not None:
+        translator = getattr(import_module('federatedscope.core.data'),
+                             DATA_TRANS_MAP[config.data.type.lower()])(
+                                 modified_config, client_cfgs)
+        data = translator(dataset)

-    # Convert `StandaloneDataDict` to `ClientData` when in distribute mode
-    data = convert_data_mode(data, modified_config)
+        # Convert `StandaloneDataDict` to `ClientData` when in distribute mode
+        data = convert_data_mode(data, modified_config)
+    else:
+        data = None

     # Restore the user-specified seed after the data generation
     setup_seed(config.seed)
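The guard added in this hunk lets a participant without a local dataset (presumably the case exercised by `distributed_server_no_data.yaml`) skip both the translation and the mode conversion. A minimal sketch of that control flow follows; `build_data`, `translate`, and `convert` are hypothetical stand-ins, not FederatedScope functions.

```python
def build_data(dataset, translate, convert):
    """Hypothetical guard: skip translation and conversion without a dataset."""
    if dataset is None:
        return None
    return convert(translate(dataset))


def translate(dataset):
    # Toy stand-in for the translator: key each shard by a participant id.
    return {cid: rows for cid, rows in enumerate(dataset, start=1)}


def convert(federated):
    # Toy stand-in for the mode conversion: keep participant 1's shard.
    return federated[1]


print(build_data(None, translate, convert))           # None
print(build_data([[1, 2], [3]], translate, convert))  # [1, 2]
```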
7 changes: 5 additions & 2 deletions federatedscope/core/auxiliaries/model_builder.py
@@ -1,4 +1,5 @@
 import logging
+import numpy as np
 import federatedscope.register as register

 logger = logging.getLogger(__name__)
@@ -63,7 +64,7 @@ def get_shape_from_data(data, model_config, backend='torch'):

     if isinstance(data_representative, dict):
         if 'x' in data_representative:
-            shape = data_representative['x'].shape
+            shape = np.asarray(data_representative['x']).shape
             if len(shape) == 1:  # (batch, ) = (batch, 1)
                 return 1
             else:
@@ -121,7 +122,9 @@ def get_model(model_config, local_data=None, backend='torch'):
     ``mf.model.model_builder.get_mfnet()``
     =================================== ==============================
     """
-    if local_data is not None:
+    if model_config.type.lower() in ['xgb_tree', 'gbdt_tree', 'random_forest']:
+        input_shape = None
+    elif local_data is not None:
         input_shape = get_shape_from_data(local_data, model_config, backend)
     else:
         input_shape = model_config.input_shape
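The switch to `np.asarray(...).shape` matters when `'x'` is stored as a plain Python list, which has no `.shape` attribute. The sketch below illustrates this; `infer_input_dim` is a hypothetical stand-in rather than the actual `get_shape_from_data`, and returning the full shape for multi-dimensional inputs is an assumption, since the hunk truncates that branch.

```python
import numpy as np


def infer_input_dim(sample):
    """Hypothetical stand-in for shape inference on a {'x': ...} dict."""
    shape = np.asarray(sample['x']).shape
    if len(shape) == 1:  # (batch,) is treated as (batch, 1)
        return 1
    return shape  # assumption: the elided branch returns the full shape


as_list = {'x': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}
as_array = {'x': np.zeros((4, 3))}
flat = {'x': [1.0, 2.0, 3.0]}

print(hasattr(as_list['x'], 'shape'))  # False: lists carry no shape
print(infer_input_dim(as_list))        # (2, 3)
print(infer_input_dim(as_array))       # (4, 3)
print(infer_input_dim(flat))           # 1
```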
2 changes: 1 addition & 1 deletion federatedscope/core/configs/README.md
@@ -36,6 +36,7 @@ The configurations related to the data/dataset are defined in `cfg_data.py`.
|:--------------------------------------------:|:-----:|:---------- |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `data.root` | (string) 'data' | The folder where the data file located. `data.root` would be used together with `data.type` to load the dataset. | - |
| `data.type` | (string) 'toy' | Dataset name | CV: 'femnist', 'celeba' ; NLP: 'shakespeare', 'subreddit', 'twitter'; Graph: 'cora', 'citeseer', 'pubmed', 'dblp_conf', 'dblp_org', 'csbm', 'epinions', 'ciao', 'fb15k-237', 'wn18', 'fb15k' , 'MUTAG', 'BZR', 'COX2', 'DHFR', 'PTC_MR', 'AIDS', 'NCI1', 'ENZYMES', 'DD', 'PROTEINS', 'COLLAB', 'IMDB-BINARY', 'IMDB-MULTI', 'REDDIT-BINARY', 'IMDB-BINARY', 'IMDB-MULTI', 'HIV', 'ESOL', 'FREESOLV', 'LIPO', 'PCBA', 'MUV', 'BACE', 'BBBP', 'TOX21', 'TOXCAST', 'SIDER', 'CLINTOX', 'graph_multi_domain_mol', 'graph_multi_domain_small', 'graph_multi_domain_mix', 'graph_multi_domain_biochem'; MF: 'vflmovielens1m', 'vflmovielens10m', 'hflmovielens1m', 'hflmovielens10m', 'vflnetflix', 'hflnetflix'; Tabular: 'toy', 'synthetic'; External dataset: 'DNAME@torchvision', 'DNAME@torchtext', 'DNAME@huggingface_datasets', 'DNAME@openml'. |
+| `data.file_path` | (string) '' | The path to the data file; only takes effect when `data.type = 'file'` | - |
| `data.args` | (list) [] | Args for the external dataset | Used for external dataset, eg. `[{'download': False}]` |
| `data.save_data` | (bool) False | Whether to save the generated toy data | - |
| `data.splitter` | (string) '' | Splitter name for standalone dataset | Generic splitter: 'lda'; Graph splitter: 'louvain', 'random', 'rel_type', 'graph_type', 'scaffold', 'scaffold_lda', 'rand_chunk' |
@@ -238,7 +239,6 @@ The configurations related to FL settings are defined in `cfg_fl_setting.py`.
| `distribute.client_host` | (string) '0.0.0.0' | The host of client's ip address for communication | - |
| `distribute.client_port` | (string) 50050 | The port of client's ip address for communication | - |
| `distribute.role` | (string) 'client' </br> Choices: {'server', 'client'} | The role of the worker | - |
-| `distribute.data_file` | (string) 'data' | The path to the data file | - |
| `distribute.data_idx` | (int) -1 | Specifies the data index in distributed mode when adopting a centralized dataset for simulation (formatted as {data_idx: data/dataloader}). | `data_idx=-1` means that the entire dataset is owned by the participant. For other invalid values (except -1), the index is randomly sampled in simulation. |
| `distribute.` </br>`grpc_max_send_message_length` | (int) 100 * 1024 * 1024 | The maximum length of sent messages | - |
| `distribute.` </br>`grpc_max_receive_message_length` | (int) 100 * 1024 * 1024 | The maximum length of received messages | - |
3 changes: 3 additions & 0 deletions federatedscope/core/configs/cfg_data.py
@@ -42,6 +42,9 @@ def extend_data_cfg(cfg):
     cfg.data.test_target_transform = []
     cfg.data.test_pre_transform = []

+    # data.file_path takes effect when data.type = 'file'
+    cfg.data.file_path = ''
+
     # DataLoader related args
     cfg.dataloader = CN()
     cfg.dataloader.type = 'base'
4 changes: 2 additions & 2 deletions federatedscope/cross_backends/distributed_tf_client_3.yaml
@@ -13,12 +13,12 @@ distribute:
   client_host: '127.0.0.1'
   client_port: 50054
   role: 'client'
-  data_file: 'toy_data/client_3_data'
 trainer:
   type: 'general'
 eval:
   freq: 10
 data:
-  type: 'toy'
+  type: 'file'
+  file_path: 'toy_data/client_3_data'
 model:
   type: 'lr'
4 changes: 2 additions & 2 deletions federatedscope/cross_backends/distributed_tf_server.yaml
@@ -11,12 +11,12 @@ distribute:
   server_host: '127.0.0.1'
   server_port: 50051
   role: 'server'
-  data_file: 'toy_data/server_data'
 trainer:
   type: 'general'
 eval:
   freq: 10
 data:
-  type: 'toy'
+  type: 'file'
+  file_path: 'toy_data/server_data'
 model:
   type: 'lr'
8 changes: 4 additions & 4 deletions federatedscope/vertical_fl/dataloader/utils.py
@@ -15,8 +15,8 @@ def batch_iter(data, batch_size, shuffled=True):
     """

     assert 'x' in data and 'y' in data
-    data_x = data['x']
-    data_y = data['y']
+    data_x = np.asarray(data['x'])
+    data_y = np.asarray(data['y'])
     data_size = len(data_y)
     num_batches_per_epoch = math.ceil(data_size / batch_size)

@@ -44,8 +44,8 @@ def __init__(self,
                  use_full_trainset=True,
                  feature_frac=1.0):
         assert 'x' in data
-        self.data_x = data['x']
-        self.data_y = data['y'] if 'y' in data else None
+        self.data_x = np.asarray(data['x'])
+        self.data_y = np.asarray(data['y']) if 'y' in data else None
         self.data_size = self.data_x.shape[0]
         self.feature_size = self.data_x.shape[1]
         self.replace = replace
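Converting with `np.asarray` up front lets data loaded from a pickle file as nested lists be indexed by an index array and expose `.shape`, which the lines after the conversion rely on. Below is a minimal sketch of batching under that assumption; `one_epoch_batches` is a hypothetical, single-epoch variant written for illustration and is not the library's `batch_iter`.

```python
import math

import numpy as np


def one_epoch_batches(data, batch_size, shuffled=True, seed=0):
    """Hypothetical one-epoch batching; not the library's batch_iter."""
    assert 'x' in data and 'y' in data
    # Converting first lets list-based data be indexed by an index array.
    data_x = np.asarray(data['x'])
    data_y = np.asarray(data['y'])
    data_size = len(data_y)
    num_batches = math.ceil(data_size / batch_size)

    index = np.arange(data_size)
    if shuffled:
        index = np.random.default_rng(seed).permutation(index)

    for b in range(num_batches):
        picked = index[b * batch_size:(b + 1) * batch_size]
        yield {'x': data_x[picked], 'y': data_y[picked]}


data = {'x': [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]], 'y': [0, 1, 0, 1, 0]}
batches = list(one_epoch_batches(data, batch_size=2, shuffled=False))
print([b['x'].shape for b in batches])  # [(2, 2), (2, 2), (1, 2)]
print(batches[-1]['y'].tolist())        # [0]
```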
Original file line number Diff line number Diff line change
@@ -18,6 +18,7 @@ trainer:
 eval:
   freq: 10
 data:
-  type: 'toy'
+  type: 'file'
+  file_path: 'toy_data/all_data'
 model:
   type: 'lr'
Original file line number Diff line number Diff line change
@@ -12,12 +12,13 @@ distribute:
   client_host: '127.0.0.1'
   client_port: 50053
   role: 'client'
-  data_idx: 1
+  data_idx: 2
 trainer:
   type: 'general'
 eval:
   freq: 10
 data:
-  type: 'toy'
+  type: 'file'
+  file_path: 'toy_data/all_data'
 model:
   type: 'lr'