Autoscaling Experiment. #399
Cluster Resources: CPU: 2348 (cores); GPU: 0 (GPU cards).
NOTE: The performance of the HDFS VFS mount in the internal CPU cluster is not stable (sometimes `ls` takes 10+ seconds), which makes the results of each experiment vary widely. So I made a report using:
TestCase1: Submit the fault-tolerant jobs
> AUTO_SCALING=ON JOB_COUNT=8 ./run.sh start case1
> JOB_COUNT=8 ./run.sh start case1
TestCase2
|
Great!!!
|
@helinwang I pushed a PR, #447, to reproduce the experiment. I think it fixes 1. and 2. of your comment above. |
Thanks @Yancey1989! For fault-tolerant mode, I get the error log below; it seems that the
EDIT: Now I don't get the above error; sometimes I get (also in fault-tolerant mode):
I have to apply the following fix to make it train (but it no longer uses recordio):
Btw, the current ft code does not use the master server. Maybe we can upload the recordio files to a public folder, and use the master server in the ft training. |
For documentation purposes, in case anyone else runs into the problem: for a non-fault-tolerant job (
And to make the trainer run longer, we need to do:
|
Hi @helinwang @typhoonzero, I have updated the PR #447, please review the code :) The updates are as follows:
|
Remaining problems:
|
Maybe we can add a definition/description about |
@Yancey1989 I sent a PR to improve scale down: #456
controller.yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: training-job-controller
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: training-job-controller
    spec:
      containers:
      - name: training-job-controller
        image: helinwang/training-job-controller
        imagePullPolicy: Always
        command: ["/controller", "-logtostderr", "-log_level", "debug"]
Maybe we can overcome this problem by running more experiment passes? Currently we run 5 experiment passes, and in each experiment pass 20 passes of MNIST data are trained. Maybe we can reduce the
Another way is to run pods that copy the data to every node, and have the trainer and master use the host mount... Another issue: the current |
The controller needs a ClusterRoleBinding to bind to the ClusterRole (admin). In the internal CPU cluster, please submit it to the paddlecloud namespace.
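A minimal sketch of such a binding, assuming the controller runs under the default service account of the paddlecloud namespace; the binding name and service account here are assumptions, not the actual cluster setup:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: training-job-controller-admin   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin                           # the ClusterRole mentioned above
subjects:
- kind: ServiceAccount
  name: default                         # assumption: no dedicated service account
  namespace: paddlecloud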
As the discussion on
Hi, producing the dataset in a Docker image is an easy way...
Sure, the experiment does not use train.py, I will remove it. |
Here is one run for test case 1 today:
|
A few questions:
case1: With autoscaling enabled, the average job execution time became longer, the overall cluster CPU utilization did not rise noticeably, and it seemingly did not reach a very high utilization either? Is that because each pod requests a relatively large amount of CPU?
case2: Without autoscaling enabled, the number of trainers also becomes very small in the last two rows; is that because the jobs have finished? At that point, does nginx have a large number of pending pods? |
Same questions here:
case1: Why are there pending pods when autoscaling=ON?
case2: Why does autoscaling=ON take 450s while OFF takes only 212s?
Also, since the total time of case2 is fixed (e.g., both 450s), consider adding a total trainer running time metric (the integral of the trainer count over time; the higher the better)? |
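A hedged illustration of the proposed "total trainer running time" metric, assuming a per-second log with comma-separated lines of the form timestamp,trainer_count (the log name and format are assumptions, mirroring the case logs used later in this thread): with a 1-second sampling step the integral reduces to a sum of the second column, e.g. `awk -F, '{s += $2} END {print s}' <per-second-trainer-log>`. A higher value means more trainer-seconds of useful work within the fixed experiment window.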
@putcn This approach is very good, and intuitive enough. But it may require @Yancey1989 to also record the detailed numbers during execution. |
@typhoonzero @Yancey1989 I modified the program so it can output the per-second status, and pushed it into @Yancey1989's PR: https://github.com/PaddlePaddle/cloud/pull/453/commits |
Cool!! Thanks @helinwang @putcn |
Update PR #452
BUG
|
sure, will do |
Updated on #453
|
Remaining problem:
|
Hi @helinwang
I think the reason is the same as #465. There is a Calico Pod requesting 250m CPU running on each Node, and the Trainer Pod requests 5 CPUs in the experiment, so some CPU is always idle.
I submitted a PR to fix this problem: #467 |
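A minimal sketch of the scheduling arithmetic described above; the pod name, image, and node size are assumptions for illustration, not the actual trainer manifest. With Calico reserving 250m CPU on every node, the remaining allocatable CPU is no longer a multiple of the 5-CPU trainer request, so a fraction of each node stays idle.

# Illustration only; not the actual trainer manifest.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-example            # hypothetical name
spec:
  containers:
  - name: trainer
    image: trainer-image           # placeholder, not the real image
    resources:
      requests:
        cpu: "5"                   # per-trainer request cited above
# Example accounting on an assumed 16-core node:
#   calico:       0.25 CPU
#   3 trainers:  15.00 CPU
#   remaining:    0.75 CPU -- too small for another 5-CPU trainer, so it idles.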
Update #453
|
Update:
|
Problems:
wuyi added: This may be due to the job still mounting HDFS to the trainers? @Yancey1989: The job always mounts the hostPath by default; it's a global configuration... From yanxu
TODO:
|
Update #453
|
Update #453
|
Known problem: Sometimes case 2 gets stuck at:
@helinwang FROM yanxu |
In the end we truncate the experiment at timestamp 550 (after that the job gets killed, so the experiment data is meaningless). The awk below averages the second column over samples with timestamp <= 550.
case1:
$ cat case1-mnist-OFF-20-1-ON-400-round_*/mnist-case1-pass0.log|awk -F, '{if ($1<=550) {a=a+$2; b=b+1}} END {print a/b}'
case2:
$ cat case2-mnist-OFF-6-1-ON-400-round_*/mnist-case2.log|awk -F, '{if ($1<=550) {a=a+$2; b=b+1}} END {print a/b}' |
Pushed a PR: #470
|
It seems something is wrong with the state of the autoscaler; I have to kill the pod between every run of case 1 to see the expected experiment result.
|
@Yancey1989 is that a bug in the case 1 job-submission scripts, that they do not delete jobs after they finish? |
@typhoonzero, yes, I fixed it with a workaround, but I think it's a bug in the controller or the cloud server. |