The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training, either distributed or non-distributed, on Kubernetes. Please check out here for an introduction to DGL and its distributed training philosophy.
- Kubernetes >= 1.16
You can deploy the operator with default settings by running the following commands:
git clone https://github.com/Qihoo360/dgl-operator
cd dgl-operator
kubectl create -f deploy/v1alpha1/dgl-operator.yaml
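As a quick sanity check, you can confirm the operator pod came up. The exact pod name and namespace depend on what deploy/v1alpha1/dgl-operator.yaml creates, so the grep below avoids guessing either:

```shell
# Optional: confirm the operator pod is running
# (pod name and namespace are whatever the manifest creates)
kubectl get pods --all-namespaces | grep dgl-operator
```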
You can check whether the DGLJob custom resource definition is installed via:
kubectl get crd
The output should include dgljobs.qihoo.net, like the following:
NAME                AGE
...
dgljobs.qihoo.net   1m
...
You can create a DGL job by defining a DGLJob config file. See the GraphSAGE.yaml or GraphSAGE_dist.yaml example config files for launching a single-node or multi-node GraphSAGE training job; a sketch of the manifest structure follows the commands below. You may change the config file based on your requirements.
# standalone GraphSAGE
cat examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
cat examples/v1alpha1/GraphSAGE_dist.yaml
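For orientation, the skeleton below shows roughly how a DGLJob manifest is shaped. The apiVersion is inferred from the dgljobs.qihoo.net CRD above, but the spec field names, image, and command are illustrative assumptions (the operator follows a launcher/worker pattern in the spirit of mpi-operator); treat the bundled example files as the authoritative schema.

```yaml
# Minimal DGLJob sketch -- field names under `spec` are assumptions;
# see examples/v1alpha1/GraphSAGE_dist.yaml for the authoritative schema.
apiVersion: qihoo.net/v1alpha1   # group/version from the dgljobs.qihoo.net CRD
kind: DGLJob
metadata:
  name: graphsage-dist
spec:
  dglReplicaSpecs:               # assumed layout, mirroring mpi-operator-style jobs
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: dgl
              image: your-registry/graphsage:latest   # hypothetical image
              command: ["python3", "train_dist.py"]   # hypothetical entrypoint
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: dgl
              image: your-registry/graphsage:latest
```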
Deploy the DGLJob resource to start training:
# standalone GraphSAGE
kubectl create -f examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
kubectl create -f examples/v1alpha1/GraphSAGE_dist.yaml
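Once submitted, you can watch the job with standard kubectl commands. The launcher pod name below is a placeholder; list the pods to find the real one:

```shell
# List DGLJob resources (the `dgljobs` plural comes from the CRD shown above)
kubectl get dgljobs

# Watch the pods the operator creates for the job
kubectl get pods

# Follow training logs from the launcher pod (substitute the real pod name)
kubectl logs -f <launcher-pod-name>
```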
Please check out these previous works that helped inspire the creation of the DGL Operator:
- PaddleFlow/paddle-operator - Elastic Deep Learning Training based on Kubernetes by leveraging EDL and Volcano.
- kubeflow/mpi-operator - Kubernetes Operator for Allreduce-style Distributed Training.