Add demo for vertical federated learning #9103

Merged (1 commit, Apr 28, 2023)

61 changes: 6 additions & 55 deletions demo/nvflare/README.md
@@ -3,61 +3,12 @@
 This directory contains a demo of Federated Learning using
 [NVFlare](https://nvidia.github.io/NVFlare/).

-## Training with CPU only
+## Horizontal Federated XGBoost

-To run the demo, first build XGBoost with the federated learning plugin enabled (see the
-[README](../../plugin/federated/README.md)).
+For horizontal federated learning using XGBoost (data is split row-wise), check out the `horizontal` directory
+(see the [README](horizontal/README.md)).

-Install NVFlare (note that currently NVFlare only supports Python 3.8):
-```shell
-pip install nvflare
-```
+## Vertical Federated XGBoost

-Prepare the data:
-```shell
-./prepare_data.sh
-```
-
-Start the NVFlare federated server:
-```shell
-/tmp/nvflare/poc/server/startup/start.sh
-```
-
-In another terminal, start the first worker:
-```shell
-/tmp/nvflare/poc/site-1/startup/start.sh
-```
-
-And the second worker:
-```shell
-/tmp/nvflare/poc/site-2/startup/start.sh
-```
-
-Then start the admin CLI:
-```shell
-/tmp/nvflare/poc/admin/startup/fl_admin.sh
-```
-
-In the admin CLI, run the following command:
-```shell
-submit_job hello-xgboost
-```
-
-Once the training finishes, the model file should be written into
-`/tmp/nvlfare/poc/site-1/run_1/test.model.json` and `/tmp/nvflare/poc/site-2/run_1/test.model.json`
-respectively.
-
-Finally, shutdown everything from the admin CLI, using `admin` as password:
-```shell
-shutdown client
-shutdown server
-```
-
-## Training with GPUs
-
-To demo with Federated Learning using GPUs, make sure your machine has at least 2 GPUs.
-Build XGBoost with the federated learning plugin enabled along with CUDA, but with NCCL
-turned off (see the [README](../../plugin/federated/README.md)).
-
-Modify `config/config_fed_client.json` and set `use_gpus` to `true`, then repeat the steps
-above.
+For vertical federated learning using XGBoost (data is split column-wise), check out the `vertical` directory
+(see the [README](vertical/README.md)).
23 changes: 0 additions & 23 deletions demo/nvflare/config/config_fed_client.json

This file was deleted.

22 changes: 0 additions & 22 deletions demo/nvflare/config/config_fed_server.json

This file was deleted.

63 changes: 63 additions & 0 deletions demo/nvflare/horizontal/README.md
@@ -0,0 +1,63 @@
# Experimental Support of Horizontal Federated XGBoost using NVFlare

This directory contains a demo of Horizontal Federated Learning using
[NVFlare](https://nvidia.github.io/NVFlare/).

## Training with CPU only

To run the demo, first build XGBoost with the federated learning plugin enabled (see the
[README](../../plugin/federated/README.md)).

Install NVFlare (note that currently NVFlare only supports Python 3.8):
```shell
pip install nvflare
```

Prepare the data:
```shell
./prepare_data.sh
```

Start the NVFlare federated server:
```shell
/tmp/nvflare/poc/server/startup/start.sh
```

In another terminal, start the first worker:
```shell
/tmp/nvflare/poc/site-1/startup/start.sh
```

And the second worker:
```shell
/tmp/nvflare/poc/site-2/startup/start.sh
```

Then start the admin CLI:
```shell
/tmp/nvflare/poc/admin/startup/fl_admin.sh
```

In the admin CLI, run the following command:
```shell
submit_job horizontal-xgboost
```

Once the training finishes, the model files should be written to
`/tmp/nvflare/poc/site-1/run_1/test.model.json` and `/tmp/nvflare/poc/site-2/run_1/test.model.json`,
respectively.
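
The saved file should be a regular XGBoost JSON model, so it can be loaded back into plain (non-federated) XGBoost for inspection or local prediction. A minimal sketch, assuming the site-1 path above:

```python
import xgboost as xgb

# Load the model written by site 1 (path follows the POC workspace layout above).
bst = xgb.Booster()
bst.load_model("/tmp/nvflare/poc/site-1/run_1/test.model.json")

# Inspect the trained ensemble, e.g. print the first two trees as text dumps.
for i, tree in enumerate(bst.get_dump()[:2]):
    print(f"--- tree {i} ---")
    print(tree)
```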

Finally, shut down everything from the admin CLI, using `admin` as the password:
```shell
shutdown client
shutdown server
```

## Training with GPUs

To run the demo with GPUs, make sure your machine has at least 2 GPUs.
Build XGBoost with the federated learning plugin enabled along with CUDA, but with NCCL
turned off (see the [README](../../plugin/federated/README.md)).

Modify `config/config_fed_client.json` and set `use_gpus` to `true`, then repeat the steps
above.
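
If you prefer not to edit the JSON by hand, a short helper like the one below can flip the flag. It is only a sketch: it assumes the client config lives at `config/config_fed_client.json` relative to the demo and that the GPU switch is a `use_gpus` key somewhere in that file.

```python
import json

CONFIG = "config/config_fed_client.json"  # assumed path; adjust to your checkout

def enable_gpus(node):
    """Recursively set every "use_gpus" key in the config to True."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "use_gpus":
                node[key] = True
            else:
                enable_gpus(value)
    elif isinstance(node, list):
        for item in node:
            enable_gpus(item)

with open(CONFIG) as f:
    cfg = json.load(f)
enable_gpus(cfg)
with open(CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)
```
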
demo/nvflare/horizontal/prepare_data.sh
@@ -15,8 +15,8 @@ split -n l/${world_size} --numeric-suffixes=1 -a 1 ../data/agaricus.txt.train ag
 split -n l/${world_size} --numeric-suffixes=1 -a 1 ../data/agaricus.txt.test agaricus.txt.test-site-

 nvflare poc -n 2 --prepare
-mkdir -p /tmp/nvflare/poc/admin/transfer/hello-xgboost
-cp -fr config custom /tmp/nvflare/poc/admin/transfer/hello-xgboost
+mkdir -p /tmp/nvflare/poc/admin/transfer/horizontal-xgboost
+cp -fr config custom /tmp/nvflare/poc/admin/transfer/horizontal-xgboost
 cp server-*.pem client-cert.pem /tmp/nvflare/poc/server/
 for id in $(eval echo "{1..$world_size}"); do
 cp server-cert.pem client-*.pem /tmp/nvflare/poc/site-"$id"/
59 changes: 59 additions & 0 deletions demo/nvflare/vertical/README.md
@@ -0,0 +1,59 @@
# Experimental Support of Vertical Federated XGBoost using NVFlare

This directory contains a demo of Vertical Federated Learning using
[NVFlare](https://nvidia.github.io/NVFlare/).

## Training with CPU only

To run the demo, first build XGBoost with the federated learning plugin enabled (see the
[README](../../plugin/federated/README.md)).

Install NVFlare (note that currently NVFlare only supports Python 3.8):
```shell
pip install nvflare
```

Prepare the data (note that this step downloads the HIGGS dataset, which is 2.6 GB compressed and 7.5 GB
uncompressed, so make sure you have enough disk space and a fast internet connection):
```shell
./prepare_data.sh
```
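
In a vertical setting each site keeps all rows but only a subset of the columns, and only one site holds the label. The snippet below is purely illustrative of that idea; it is not the actual `prepare_data.sh` logic, and the file names and the 50/50 column split are made up.

```python
import pandas as pd

# Illustration only: split a dataset column-wise across two "sites".
# In HIGGS the first column is the label; keep it with site 1.
df = pd.read_csv("HIGGS.csv", header=None)

n_cols = df.shape[1]
site1 = df.iloc[:, : n_cols // 2]   # label plus the first half of the features
site2 = df.iloc[:, n_cols // 2 :]   # the remaining features, no label

site1.to_csv("higgs.data.site-1", index=False, header=False)
site2.to_csv("higgs.data.site-2", index=False, header=False)
```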

Start the NVFlare federated server:
```shell
/tmp/nvflare/poc/server/startup/start.sh
```

In another terminal, start the first worker:
```shell
/tmp/nvflare/poc/site-1/startup/start.sh
```

And the second worker:
```shell
/tmp/nvflare/poc/site-2/startup/start.sh
```

Then start the admin CLI:
```shell
/tmp/nvflare/poc/admin/startup/fl_admin.sh
```

In the admin CLI, run the following command:
```shell
submit_job vertical-xgboost
```

Once the training finishes, the model files should be written to
`/tmp/nvflare/poc/site-1/run_1/test.model.json` and `/tmp/nvflare/poc/site-2/run_1/test.model.json`,
respectively.

Finally, shut down everything from the admin CLI, using `admin` as the password:
```shell
shutdown client
shutdown server
```

## Training with GPUs

GPUs are not yet supported by vertical federated XGBoost.
68 changes: 68 additions & 0 deletions demo/nvflare/vertical/custom/controller.py
@@ -0,0 +1,68 @@
"""
Example of training controller with NVFlare
===========================================
"""
import multiprocessing

from nvflare.apis.client import Client
from nvflare.apis.fl_context import FLContext
from nvflare.apis.impl.controller import Controller, Task
from nvflare.apis.shareable import Shareable
from nvflare.apis.signal import Signal
from trainer import SupportedTasks

import xgboost.federated


class XGBoostController(Controller):
def __init__(self, port: int, world_size: int, server_key_path: str,
server_cert_path: str, client_cert_path: str):
"""Controller for federated XGBoost.

Args:
port: the port for the gRPC server to listen on.
world_size: the number of sites.
server_key_path: the path to the server key file.
server_cert_path: the path to the server certificate file.
client_cert_path: the path to the client certificate file.
"""
super().__init__()
self._port = port
self._world_size = world_size
self._server_key_path = server_key_path
self._server_cert_path = server_cert_path
self._client_cert_path = client_cert_path
self._server = None

def start_controller(self, fl_ctx: FLContext):
self._server = multiprocessing.Process(
target=xgboost.federated.run_federated_server,
args=(self._port, self._world_size, self._server_key_path,
self._server_cert_path, self._client_cert_path))
self._server.start()

def stop_controller(self, fl_ctx: FLContext):
if self._server:
self._server.terminate()

def process_result_of_unknown_task(self, client: Client, task_name: str,
client_task_id: str, result: Shareable,
fl_ctx: FLContext):
self.log_warning(fl_ctx, f"Unknown task: {task_name} from client {client.name}.")

def control_flow(self, abort_signal: Signal, fl_ctx: FLContext):
self.log_info(fl_ctx, "XGBoost training control flow started.")
if abort_signal.triggered:
return
task = Task(name=SupportedTasks.TRAIN, data=Shareable())
self.broadcast_and_wait(
task=task,
min_responses=self._world_size,
fl_ctx=fl_ctx,
wait_time_after_min_received=1,
abort_signal=abort_signal,
)
if abort_signal.triggered:
return

self.log_info(fl_ctx, "XGBoost training control flow finished.")
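
For context, `start_controller` above simply spawns XGBoost's federated gRPC server in a child process, and `stop_controller` tears it down. A standalone sketch of the same call, with placeholder port and certificate paths (the demo passes the real values through the controller's constructor arguments):

```python
import multiprocessing

import xgboost.federated

PORT = 9091        # placeholder; must match what the clients connect to
WORLD_SIZE = 2     # number of participating sites

# Same call the controller makes: run the federated gRPC server in its own process.
server = multiprocessing.Process(
    target=xgboost.federated.run_federated_server,
    args=(PORT, WORLD_SIZE, "server-key.pem", "server-cert.pem", "client-cert.pem"),
)
server.start()
# ... clients join via the federated communicator and train ...
server.terminate()
```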