While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes
- the global utilization of the cluster, and
- the waiting time of job submitters.

For more about the EDL project, please refer to the invited blog post on the official Kubernetes blog.
EDL includes two parts:
- a Kubernetes controller for the elastic scheduling of distributed deep learning jobs, and
- changes that make PaddlePaddle a fault-tolerant deep learning framework.

This directory contains the Kubernetes controller. For more information about fault tolerance, please refer to the design.
We deployed EDL on a real Kubernetes cluster, dlnel.com, which is open to graduate students of Tsinghua University. The performance test report of EDL on this cluster is here.

To build EDL, install the vendored dependencies with Glide and then compile the binary:

    glide install --strip-vendor
    go build -o path/to/output github.com/paddlepaddle/edl/cmd/edl