
Autoscaling Experiment. #399

Closed
helinwang opened this issue Oct 18, 2017 · 39 comments

@helinwang (Collaborator):

No description provided.

@Yancey1989 (Collaborator) commented Oct 24, 2017

@gongweibao (Collaborator):

@Yancey1989 (Collaborator) commented Oct 26, 2017:

Cluster Resources

CPU: 2348 cores
GPU: 0 GPU cards

NOTE: The performance of the HDFS VFS mount in the internal CPU cluster is not stable (sometimes ls takes 10+ seconds), which makes the results vary widely between experiment runs. So I made a report using sleep 180 instead of python train.py train as the baseline.
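For reference, a minimal sketch of what that baseline submission could look like. It reuses the TestCase1 submit command shown below and only swaps the entry; the job name and quoting are illustrative, not the exact commands used:

```bash
# Baseline run: replace the training entry with a fixed-length sleep,
# so the measured job duration does not depend on the HDFS mount performance.
# JOBNAME is assumed to be set the same way as for the real jobs.
paddlecloud submit -jobname $JOBNAME \
        -cpu 10 \
        -gpu 0 \
        -memory 8Gi \
        -parallelism 30 \
        -pscpu 6 \
        -pservers 10 \
        -psmemory 5Gi \
        -entry "sleep 180" \
        -faulttolerant \
        ./mnist
```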

TestCase1

Submit the fault-tolerant jobs:

> paddlecloud submit -jobname $JOBNAME \
        -cpu 10 \
        -gpu 0 \
        -memory 8Gi \
        -parallelism 30 \
        -pscpu 6 \
        -pservers 10 \
        -psmemory 5Gi \
        -entry $ENTRY \
        -faulttolerant \
        ./mnist

  • AUTO_SCALING ON

> AUTO_SCALING=ON JOB_COUNT=8 ./run.sh start case1

| PASS | AVG RUNNING TIME (s) | AVG PENDING TIME (s) | JOB RUNNING TIME (s) | AVG CLUSTER CPU UTILS (%) |
|------|----------------------|----------------------|----------------------|---------------------------|
| 0 | 306 | 43 | 193,378,371,366,383,378,191,193 | 62.03 |
| AVG | 306 | 43 | N/A | 62.03 |

  • AUTO_SCALING OFF

> JOB_COUNT=8 ./run.sh start case1

| PASS | AVG RUNNING TIME (s) | AVG PENDING TIME (s) | JOB RUNNING TIME (s) | AVG CLUSTER CPU UTILS (%) |
|------|----------------------|----------------------|----------------------|---------------------------|
| 0 | 250 | 64 | 420,191,194,193,377,190,220,222 | 61.49 |
| AVG | 250 | 64 | N/A | 61.49 |

TestCase2

  • AUTO_SCALING ON

AUTO_SCALING=ON JOB_COUNT=3 ./run.sh start case2

| TIME (s) | NGINX PODS | RUNNING TRAINERS | CLUSTER CPU UTILS (%) |
|----------|------------|------------------|-----------------------|
| 1 | 144 | 0 | 18.40 |
| 4 | 200 | 15 | 31.94 |
| 8 | 262 | 15 | 39.86 |
| 12 | 335 | 45 | 61.97 |
| 17 | 400 | 52 | 76.66 |
| 53 | 391 | 52 | 76.83 |
| 56 | 302 | 52 | 65.46 |
| 60 | 238 | 52 | 57.28 |
| 63 | 210 | 52 | 53.71 |
| 67 | 202 | 52 | 55.75 |
| 70 | 200 | 52 | 55.49 |
| 83 | 191 | 52 | 54.34 |
| 86 | 121 | 52 | 45.40 |
| 89 | 110 | 52 | 43.99 |
| 92 | 102 | 52 | 42.97 |
| 95 | 100 | 52 | 42.72 |
| 116 | 180 | 52 | 52.94 |
| 120 | 199 | 52 | 55.37 |
| 127 | 200 | 52 | 55.49 |
| 141 | 200 | 61 | 59.33 |
| 148 | 268 | 61 | 68.02 |
| 152 | 339 | 61 | 77.09 |
| 156 | 385 | 61 | 82.96 |
| 205 | 385 | 49 | 77.85 |
| 209 | 385 | 41 | 74.45 |
| 213 | 385 | 46 | 76.58 |
| 217 | 390 | 46 | 77.21 |
| 221 | 400 | 46 | 78.49 |
| 260 | 400 | 24 | 69.12 |
| 264 | 400 | 38 | 75.09 |
| 327 | 400 | 29 | 71.25 |
| 395 | 400 | 24 | 69.12 |
| 399 | 400 | 14 | 64.86 |
| 450 | 400 | 0 | 58.90 |

| PASS | AVG RUNNING TIME (s) | AVG PENDING TIME (s) | JOB RUNNING TIME (s) | AVG CLUSTER CPU UTILS (%) |
|------|----------------------|----------------------|----------------------|---------------------------|
| 0 | 341 | 2 | 205,438,382 | 67.55 |
| AVG | 341 | 2 | N/A | 67.55 |
  • AUTO_SCALING OFF

AUTO_SCALING=OFF JOB_COUNT=3 ./run.sh start case2

| TIME (s) | NGINX PODS | RUNNING TRAINERS | CLUSTER CPU UTILS (%) |
|----------|------------|------------------|-----------------------|
| 1 | 140 | 0 | 17.89 |
| 4 | 203 | 15 | 32.33 |
| 7 | 267 | 15 | 40.50 |
| 11 | 328 | 30 | 54.68 |
| 14 | 393 | 32 | 63.84 |
| 18 | 400 | 45 | 78.07 |
| 54 | 334 | 45 | 69.63 |
| 57 | 264 | 45 | 60.69 |
| 61 | 207 | 45 | 53.41 |
| 64 | 202 | 45 | 52.77 |
| 67 | 201 | 45 | 52.64 |
| 70 | 200 | 45 | 52.51 |
| 83 | 191 | 45 | 51.36 |
| 86 | 112 | 45 | 41.27 |
| 88 | 103 | 45 | 40.12 |
| 94 | 100 | 45 | 39.74 |
| 114 | 137 | 45 | 44.46 |
| 117 | 199 | 45 | 52.39 |
| 120 | 200 | 45 | 52.51 |
| 147 | 251 | 45 | 59.03 |
| 152 | 339 | 45 | 70.27 |
| 156 | 400 | 45 | 78.07 |
| 204 | 400 | 38 | 75.09 |
| 208 | 400 | 1 | 59.33 |
| 212 | 400 | 0 | 58.90 |

| PASS | AVG RUNNING TIME (s) | AVG PENDING TIME (s) | JOB RUNNING TIME (s) | AVG CLUSTER CPU UTILS (%) |
|------|----------------------|----------------------|----------------------|---------------------------|
| 0 | 199 | 2 | 204,197,198 | 59.65 |
| AVG | 199 | 2 | N/A | 59.65 |

@helinwang (Collaborator, Author) commented Oct 26, 2017:

Great!!!

  1. Maybe we can change the "WAITING" in 0 mnist0 WAITING 5.60 0 0 to "NOT EXISTS" or "N/A"? To me, waiting sounds more like pending, which we already have.

  2. Can you put a link to the trainer Python file in the experiment doc and keep the Python file up to date, so everyone is on the same page?

  3. The experiment only ran once; we need to run it multiple times (e.g., 10-30 times; we probably need a script to run the experiment multiple times, otherwise doing it manually takes too much effort, see the sketch after this list) to make the measurement statistically sound. Also, the number of jobs running on a cluster with 2348 cores would probably be more than 2, so maybe each run should have more than 2 jobs (e.g., 5-20 jobs).

  4. Maybe after clearing all the bugs related to the experiment, the final experiment could be longer (e.g., 5-30 mins, by adding more passes); a typical machine learning task takes more than 220s :) .
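A minimal sketch of such a wrapper, assuming the experiment is driven by ./run.sh as in the commands above and that its report goes to stdout; the RUNS variable and the out/run_*.log naming are hypothetical:

```bash
#!/bin/bash
# Run the experiment several times and keep each report for later aggregation.
set -e

RUNS=${RUNS:-10}          # how many repetitions to run (hypothetical knob)
mkdir -p out

for i in $(seq 1 "$RUNS"); do
    echo "=== experiment run $i/$RUNS ==="
    # Same invocation as above; adjust AUTO_SCALING / JOB_COUNT as needed.
    AUTO_SCALING=ON JOB_COUNT=8 ./run.sh start case1 | tee "out/run_${i}.log"
done
```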

@Yancey1989 (Collaborator):

@helinwang I pushed a PR (#447) to reproduce the experiment; I think it fixes points 1 and 2 of your comment above.

@helinwang (Collaborator, Author) commented Oct 27, 2017:

Thanks! @Yancey1989 For fault-tolerant mode, I get the error log below; it seems the ETCD_IP environment variable is not set in fault-tolerant mode (with the newest paddlecloud built from the code in PR #447):

$ ./control_case1.sh start 1 ON
$ kc logs -f mnist0-trainer-w98q9
label selector: paddle-job-master=mnist0, desired: 1
Starting training job:  ..., num_gradient_servers: 2, trainer_id:  1, version: 
I1027 22:58:52.600109    28 Util.cpp:166] commandline:  --num_gradient_servers=2 --ports_num_for_sparse=1 --use_gpu=0 --trainer_id=1 --trainer_count=10 --num_passes=1 --ports_num=1 --port=7164 
[INFO 2017-10-27 22:58:52,696 layers.py:2556] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-10-27 22:58:52,697 layers.py:2684] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-10-27 22:58:52,698 layers.py:2556] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-10-27 22:58:52,699 layers.py:2684] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I1027 22:58:52.706857    28 GradientMachine.cpp:94] Initing parameters..
I1027 22:58:52.708870    28 GradientMachine.cpp:101] Init parameters done.
t=2017-10-27T22:58:57+0000 lvl=eror msg="Init etcd connection failed" error="dial tcp: lookup None on 11.1.0.10:53: no such host" stack="[github.com/PaddlePaddle/Paddle/go/pserver/client/etcd_client.go:145 c/cclient.go:129 _obj/_cgo_gotypes.go:109]"
t=2017-10-27T22:59:07+0000 lvl=eror msg="Init etcd connection failed" error="dial tcp: lookup None on 11.1.0.10:53: no such host" stack="[github.com/PaddlePaddle/Paddle/go/pserver/client/etcd_client.go:145 c/cclient.go:129 _obj/_cgo_gotypes.go:109]"

EDIT:

Now I don't get the above error; sometimes I get this instead (also in fault-tolerant mode):

$ paddlecloud logs mnist0
==========================mnist0-trainer-7nz62==========================
[INFO 2017-10-27 23:53:57,573 layers.py:2556] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-10-27 23:53:57,574 layers.py:2684] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-10-27 23:53:57,575 layers.py:2556] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-10-27 23:53:57,575 layers.py:2684] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I1027 23:53:57.583065    29 GradientMachine.cpp:94] Initing parameters..
I1027 23:53:57.585075    29 GradientMachine.cpp:101] Init parameters done.
I1027 23:53:57.776031    29 NewRemoteParameterUpdater.cpp:130] NewRemoteParameterUpdater initialized
job returned 0...setting pod return message...
===============================
termination log wroted...
==========================mnist0-trainer-l21dr==========================
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
learning_rate_args: ""
async_lagged_grad_discard_ratio: 1.5
I1027 23:53:57.313676    28 NewRemoteParameterUpdater.cpp:125] paddle_begin_init_params done
I1027 23:53:57.313689    28 NewRemoteParameterUpdater.cpp:130] NewRemoteParameterUpdater initialized
job returned 0...setting pod return message...
===============================
termination log wroted...

I have to apply the following fix to make it train (but it no longer uses recordio):

--- a/doc/autoscale/experiment/mnist/train_ft.py
+++ b/doc/autoscale/experiment/mnist/train_ft.py
@@ -2,6 +2,7 @@ from PIL import Image
 import numpy as np
 import paddle.v2 as paddle
 import paddle.v2.dataset.common as common
+import paddle.v2.dataset.mnist as mnist
 import os
 import sys
 import glob
@@ -138,20 +139,14 @@ def main():
             if event.batch_id % 100 == 0:
                 print "Pass %d, Batch %d, Cost %f, %s" % (
                     event.pass_id, event.batch_id, event.cost, event.metrics)
-        if isinstance(event, paddle.event.EndPass):
-            result = trainer.test(
-                    reader=paddle.batch(
-                    cluster_reader_recordio(TRAINER_ID, TRAINER_COUNT, "test"),
-                    batch_size=2))
-            print "Test with Pass %d, Cost %f, %s\n" % (
-                event.pass_id, result.cost, result.metrics)
 
     trainer.train(
         reader=paddle.batch(
-            cluster_reader_recordio(TRAINER_ID, TRAINER_COUNT, "train"),
+            #cluster_reader_recordio(TRAINER_ID, TRAINER_COUNT, "train"),
+            mnist.train(),

Btw, the current fault-tolerant code does not use the master server. Maybe we can upload the recordio files to a public folder and use the master server in fault-tolerant training.

@helinwang (Collaborator, Author):

For documentation purposes, in case anyone else runs into this problem:

For a non-fault-tolerant job (./control_case1.sh start 1 OFF), I have to comment out the line below to make it start training reliably:

     elif sys.argv[1] == "train":
-        prepare_dataset()
+        #prepare_dataset()
         main()

And to make the trainer run longer, we need to do:

-        num_passes=1)
+        num_passes=100)

@Yancey1989 (Collaborator) commented Oct 28, 2017:

Hi @helinwang @typhoonzero, I have updated PR #447, please follow the code :) The updates are as follows:

  • Fix train_ft.py to fetch records from the master.
  • Support PASSES and JOB_COUNT args so the experiment can be run for multiple passes (these are experiment passes, distinct from training passes), for example:
    > PASSES=5 JOB_COUNT=10 ./control_case1.sh start
    The above command runs the experiment 5 times and submits 10 jobs each time.
  • Generate an experiment report after all passes are finished.

@Yancey1989 (Collaborator):

Remaining problems:

  1. Scale down does not work (Auto-scaling controller does not scale-down the Job #450).
  2. The result of the experiment is unstable because of the unstable HDFS mount.

@jacquesqiao (Member):

Maybe we can add descriptions defining AVG_RUNNINT_TIME, AVG_PENDING_TIME and JOB_RUNNING_TIME.

@helinwang (Collaborator, Author) commented Oct 30, 2017:

> Scale down does not work

@Yancey1989 I sent a PR to improve scale down: #456 (pushed to Docker Hub: helinwang/training-job-controller). But I could not test it because I got the following error after starting the autoscaler; it seems to be an RBAC-related issue, do you have any idea?

E1030 21:10:27.490697 1 reflector.go:201] github.com/PaddlePaddle/cloud/go/controller/controller.go:100 : Failed to list *api.TrainingJob: User "system:serviceaccount:helinwang-baidu-com:default" cannot list trainingjobs.paddlepaddle.org at the cluster scope. (get TrainingJobs.paddlepaddle.org)

controller.yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: training-job-controller
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: training-job-controller
    spec:
      containers:
      - name: training-job-controller
        image: helinwang/training-job-controller
        imagePullPolicy: Always
        command: ["/controller", "-logtostderr", "-log_level", "debug"]

> The result of the experiment is unstable because of the unstable HDFS mount.

Maybe we can overcome this problem by running more experiment passes? Currently we run 5 experiment passes, and in each experiment pass 20 passes of the MNIST data are trained. Maybe we can reduce the 20 to a smaller number, such as 3, so we can increase the 5 experiment passes to maybe 20.

Another way is to run pods that copy the data to every node, and have the trainer and master use the host mount (a sketch follows below)...
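A minimal sketch of that idea, assuming the recordio files can be fetched over HTTP and that a hostPath like /mnt/mnist is acceptable on every node; the DaemonSet name, image, URL, and paths are all hypothetical, not the actual experiment config:

```bash
# Run one data-copy pod per node via a DaemonSet, then point the trainer/master
# volumes at the hostPath instead of the HDFS mount.
cat <<'EOF' | kubectl create -f -
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: mnist-data-copy
spec:
  template:
    metadata:
      labels:
        name: mnist-data-copy
    spec:
      containers:
      - name: copy
        image: busybox
        # Download the dataset once per node, then keep the pod alive.
        command: ["sh", "-c",
          "wget -O /data/mnist.recordio http://example.com/mnist.recordio && while true; do sleep 3600; done"]
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        hostPath:
          path: /mnt/mnist
EOF
```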


Another issue: the current train.py does not use cloud_reader (with the master server); we may want to use it. Otherwise the variable for test case 1 is more than just "autoscaling ON/OFF".
Edit: oh, my mistake, maybe we are no longer using train.py?

@Yancey1989 (Collaborator) commented Oct 31, 2017:

> I sent a PR to improve scale down: #456. But I could not test it because I got this error after starting the autoscaler; it seems to be an RBAC-related issue, do you have any idea?

The controller needs a ClusterRoleBinding to bind it to the ClusterRole (admin); in the internal CPU cluster, please submit it to the paddlecloud namespace. (A sketch of such a binding follows.)
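A minimal sketch of creating such a binding with kubectl, assuming the controller runs under the default service account of the paddlecloud namespace; the binding name is hypothetical, and the internal cluster may require a different ClusterRole (for example cluster-admin) to list the TrainingJob resource at the cluster scope:

```bash
# Bind the controller's service account to the admin ClusterRole cluster-wide,
# so it can list/watch trainingjobs.paddlepaddle.org resources.
kubectl create clusterrolebinding paddle-controller-admin \
    --clusterrole=admin \
    --serviceaccount=paddlecloud:default
```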

> Another way is to run pods that copy the data to every node, and have the trainer and master use the host mount...

As discussed on Hi, packaging the dataset into the Docker image is an easy way...

> Edit: oh, my mistake, maybe we are no longer using train.py?

Sure, the experiment does not use train.py; I will remove it.

@helinwang (Collaborator, Author) commented Oct 31, 2017:

Here is one run of test case 1 today:

PASSES=5 JOB_COUNT=1 FAULT_TOLERANT=ON ./control_case1.sh start

| PASS_NUM | AVG_RUNNINT_TIME | AVG_PENDING_TIME | JOB_RUNNING_TIME | CPU_UTILS |
|----------|------------------|------------------|------------------|-----------|
| 0 | 2360 | 5 | 2360 | 13.69 |
| 1 | 320 | 5 | 320 | 14.54 |
| 2 | 1630 | 0 | 1630 | 13.91 |
| 3 | 4570 | 0 | 4570 | 14.80 |
| 4 | 5870 | 0 | 5870 | 14.94 |
| TOTALLY | 2950 | 2 | N/A | 14.38 |

@typhoonzero (Collaborator):

A few questions:

case1: With autoscaling enabled, the average job running time became longer, and the overall cluster CPU utilization did not increase noticeably; it also does not seem to reach a very high utilization. Is this because each pod requests a relatively large amount of CPU?

case2: Without autoscaling, in the last two rows the number of trainers also becomes very small. Is that because the jobs have finished? Were there a lot of pending Nginx pods at that point?

@helinwang (Collaborator, Author) commented Oct 31, 2017:

Same questions here:

case1: Why is there pending time when autoscaling=ON?

case2: Why does autoscaling=ON take 450s, while OFF only takes 212s?

Also, since the total time of case 2 is fixed (e.g., both 450s), consider adding a total trainer running time metric (the integral of the trainer count over time; higher is better)? A sketch of how to compute it from a time-series log follows.
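A minimal sketch of computing that integral with awk, in the same spirit as the avg cpu util one-liners later in this thread. It assumes a comma-separated time-series log where column 1 is the timestamp and column 3 is the running trainer count; the file name follows the naming mentioned later in the thread and both are assumptions, so adjust to the actual log layout:

```bash
# Total trainer running time = sum over samples of
# (trainer count) * (time since previous sample),
# i.e. a rectangle-rule integral of the trainer count over time.
awk -F, 'NR > 1 { total += $3 * ($1 - prev) } { prev = $1 } END { print total }' \
    ./out/mnist-case2-pass0.log
```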

@putcn commented Oct 31, 2017:

Maybe AVG RUNNING TIME and AVG CPU UTIL do not really reflect the overall performance improvement, because the averaging is very one-dimensional: adding a few values and dividing by the count cannot capture the shape of the whole curve.
For case 1, could we also collect a time series of the CPU utilization of each pod and of the whole cluster, so that we could generate a figure like the following:
[image]
The two charts correspond to autoscaling on and off. The x-axis is time and the y-axis is CPU utilization.
Every line except the red one is the CPU utilization curve of a single pod; from these we can see how each pod is scheduled, which should show that with autoscaling on, the pending time is very short.
The red curve is the overall CPU utilization. Integrating it over time (the area between the x-axis and the red curve) gives a CPU utilization score: the higher the score, the more work the CPUs did, and for the same amount of work, less time means higher efficiency.
Not sure if I explained this clearly...

@typhoonzero (Collaborator):

@putcn This approach is very good, and intuitive enough. But it probably requires @Yancey1989 to also record the detailed values during the runs.

@helinwang (Collaborator, Author) commented Nov 1, 2017:

@typhoonzero @Yancey1989 I modified the program so that it outputs per-second stats, and pushed it to @Yancey1989's PR: https://github.com/PaddlePaddle/cloud/pull/453/commits

@Yancey1989 (Collaborator):

Cool!! Thanks @helinwang @putcn

@Yancey1989 (Collaborator) commented Nov 1, 2017:

Update PR #452

  1. There is no need to run python main.py print_info; the time series data will be generated in ./out/mnist-case[1|2]-pass[0-9].log.
  2. Average data will be generated at ./out/mnist-case[1|2]-result.csv.
  3. Package the dataset and recordio files in the Docker image registry.baidu.com/paddlepaddle/paddlecloud-job:mnist.
  4. Added two columns to the time series data: running trainers for each job and CPU utils for each job. @putcn could you help add these two columns to the figure?

BUG

  1. Also found a bug where the auto-scaling job hangs after it has been scaled up and then scaled down: Can not fetch new task after some trainers have been scaled down (Paddle#5279)

@putcn commented Nov 1, 2017:

sure, will do

@helinwang (Collaborator, Author) commented Nov 1, 2017:

Updated on #453

  1. Changed case 2 so that it runs around 10 min for each experiment, no matter how many jobs have been submitted. The experiment stops after Nginx is scaled back to 400.

  2. No longer rm ./out every time; each test has a different output folder depending on the configuration.
    So now we can run multiple experiments in a loop:

    $ for i in `seq 1 2`; do echo pass $i; TAG=round_$i JOB_COUNT=5 ./run.sh start case2; done
    pass 1
    outputing output to folder: ./out/mnist-OFF-5-1-ON-400-case_case2-round_1
    
  3. Changed the master's default timeout duration from 20 min to 16 s, and chunks per task from 10 to 1. Pushed the image to registry.baidu.com/paddlepaddle/paddlecloud-job:mnist .

@helinwang (Collaborator, Author) commented Nov 2, 2017:

Remaining problems:

  1. The max cluster CPU util is capped around 85%; it would be great if we could explain to readers why this happens.

  2. In case 1 there are many pending jobs. Maybe we need to be more aggressive about scaling down; now we only scale down when util reaches 100%, maybe we should scale down when reaching 90% or 95%.

@helinwang (Collaborator, Author) commented Nov 2, 2017:

Test case 2 graphs:

The most important graphs are: # of Nginx pods, # of trainers, and cluster CPU utils.

TODO: right now the trainers finish too early; we need to increase the number of trainer training passes.

Autoscaling ON:
auto_on_0

Autoscaling OFF:
auto_off_0

@Yancey1989 (Collaborator):

> TODO: right now the trainers finish too early; we need to increase the number of trainer training passes.

It has now been changed to run 30 passes.

Test case 1 graphs:

Autoscaling OFF
case1-mnist-off-20-10-on-400-yx

Autoscaling ON
case1-mnist-on-20-1-on-400-yx

Test case 2 graphs:

Autoscaling OFF
case2-mnist-off-6-1-on-400-yx

Autoscaling ON
case2-mnist-on-12-1-on-400-yx

@Yancey1989 (Collaborator):

Hi @helinwang

> The max cluster CPU util is capped around 85%; it would be great if we could explain to readers why this happens.

I think the reason is the same as #465: there is a Calico pod requesting 250m CPU running on each node, and the trainer pod requests 5 CPUs in the experiment, so some CPU is always idle.

> In case 1 there are many pending jobs. Maybe we need to be more aggressive about scaling down; now we only scale down when util reaches 100%, maybe we should scale down when reaching 90% or 95%.

I submitted a PR to fix this problem: #467

@Yancey1989 (Collaborator):

Update #453

  1. Run case1 for 10 mins for each experiment, the same as case2.
  2. Upload the logs to the ./experiment/result folder.
  3. Update the controller Docker image to registry.baidu.com/paddlepaddle/controller:yx2, built from Fix scale up with no assigned node #467.

@helinwang (Collaborator, Author) commented Nov 2, 2017:

Update:

  1. A convenient way to debug the controller is to start it locally: ../../../go/cmd/controller/controller -kubeconfig ~/.kube/config | tee clog.txt
  2. I started the controller locally and killed the controller in the cluster; if it is needed it has to be restarted (I am not sure which yaml to start it with, so I did not restart it).
  3. Commits pushed to Fix scale up with no assigned node #467:
    • Print less log, fix unit test
    • Increase max_load_desired from 0.9 to 0.97
    • Scale all by target rather than by diff, fix TrainerJob maybe nil
    • Fix crash, add logs
  4. Merged Fix scale up with no assigned node #467 into Add testcase2 scripts #453, pushed to Add testcase2 scripts #453.
  5. Built the latest controller from Add testcase2 scripts #453, pushed to registry.baidu.com/paddlepaddle/controller:yx2.

@helinwang (Collaborator, Author) commented Nov 2, 2017:

Problems:

  1. Not very related to the experiment, but kubectl get events sometimes shows:

    2m 2m 1 mnist4-trainer-xlqk1 Pod Warning FailedMount kubelet, yq01-jpaas-paddle01-wrk18.yq01.baidu.com Unable to mount volumes for pod "mnist4-trainer-xlqk1_helinwang-baidu-com(e8f619c6-c024-11e7-aa74-6c92bf4727a8)": timeout expired waiting for volumes to attach/mount for pod "helinwang-baidu-com"/"mnist4-trainer-xlqk1". list of unattached/unmounted volumes=[public mulan default-token-hkc45]


wuyi added:

This may be due to the job still mounting HDFS into the trainers? @Yancey1989


The job always mounts the hostPath by default; it's a global configuration...

From yanxu

@helinwang (Collaborator, Author):

TODO:

  1. Let's always use the latest Add testcase2 scripts #453 for the experiment, as it contains changes to the experiment scripts and train_ft.py.
  2. Let's make our experiment command lines exactly the same:
    I have found that when PASSES=2, the experiment logs for PASS1 and PASS2 differ. So let's always use PASSES=1.
    • case 1:
      TAG=round_0 AUTO_SCALING=ON PASSES=1 JOB_COUNT=20 ./run.sh start case1
      TAG=round_1 AUTO_SCALING=ON PASSES=1 JOB_COUNT=20 ./run.sh start case1
      ...
      TAG=round_0 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=20 ./run.sh start case1
      TAG=round_1 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=20 ./run.sh start case1
    • case 2:
      TAG=round_0 AUTO_SCALING=ON PASSES=1 JOB_COUNT=5 ./run.sh start case2
      TAG=round_1 AUTO_SCALING=ON PASSES=1 JOB_COUNT=5 ./run.sh start case2
      ...
      TAG=round_0 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=5 ./run.sh start case2
      TAG=round_1 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=5 ./run.sh start case2

@helinwang (Collaborator, Author) commented Nov 3, 2017:

Latest Graphs:
case 2:
Autoscale OFF
1

Autoscale ON
2

case 1:
Autoscale ON
3

Autoscale OFF
4

@Yancey1989 (Collaborator) commented Nov 3, 2017:

Update #453

  1. Package train_ft.py in the Docker image, and modify run.sh to use the new image to run the job.
  2. Fix the HDFS mount timeout.
  3. Push the time series data under the log folder.

@helinwang (Collaborator, Author) commented Nov 3, 2017:

Update #453

  1. Plotter: support averaging inputs with different timestamps.
    Usage: DATA_MAX=550 DATA_PATHS='case2-mnist-ON*/*.log' python ../python/ploter.py

@helinwang (Collaborator, Author) commented Nov 4, 2017:

Known problem:

Sometimes case 2 gets stuck at:

waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv
waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv
waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv
waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv


@helinwang Fixed this problem; we need to kill the trainingjob first, and then kill the job.

From yanxu
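A minimal sketch of that deletion order, assuming the job is named mnist0, that the TrainingJob custom resource is exposed to kubectl as trainingjob, and that the trainer job is named mnist0-trainer; all of these names are illustrative, not the exact commands used:

```bash
# Delete the TrainingJob custom resource first, so the controller stops
# managing it, then delete the underlying Kubernetes job.
kubectl delete trainingjob mnist0 --namespace paddlecloud
kubectl delete job mnist0-trainer --namespace paddlecloud
```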

@helinwang (Collaborator, Author) commented Nov 4, 2017:

We will truncate the experiment at timestamp 550 (after that the jobs get killed, so the data is meaningless).
avg pending time is not affected.
avg cpu util is affected; it is computed with the following command lines:

case1:

$ cat case1-mnist-OFF-20-1-ON-400-round_*/mnist-case1-pass0.log|awk -F, '{if ($1<=550) {a=a+$2; b=b+1}} END {print a/b}'

case2:

$ cat case2-mnist-OFF-6-1-ON-400-round_*/mnist-case2.log|awk -F, '{if ($1<=550) {a=a+$2; b=b+1}} END {print a/b}'

@Yancey1989 (Collaborator) commented Nov 7, 2017:

Pushed a PR: #470

  1. Fix a bug where the controller rewrote the node resources when executing DryRun.
  2. Fix the long pending time with a workaround: shrink the time interval between submitting jobs.
  3. Fix pods remaining after deleting the job with a workaround: kubectl delete pod `kubectl get pods | grep -v Terminating | awk '{print $1}'`, already added to the stop function in case1.sh and case2.sh.
  4. Rerun the experiment and update the log files: out/case1-mnist-ON-20-1-ON-400-round_{0-9}, out/case1-mnist-OFF-20-1-ON-400-round_{0-3}

@helinwang (Collaborator, Author):

Something seems wrong with the state of the autoscaler; I have to kill the pod between every run of case 1 to see the expected experiment result.

for i in `seq 2 9`; do TAG=round_$i AUTO_SCALING=ON PASSES=1 JOB_COUNT=20 ./run.sh start case1; kc delete po `kc get pod --namespace paddlecloud|grep training |awk  '{print $1}'` --namespace paddlecloud; sleep 10; done

@typhoonzero (Collaborator):

> Something seems wrong with the state of the autoscaler; I have to kill the pod between every run of case 1 to see the expected experiment result.

@Yancey1989 is that a bug in the case 1 job-submit scripts, i.e. they do not delete jobs after they finish?

@Yancey1989 (Collaborator):

@typhoonzero Yes, I fixed it with a workaround, but I think it's a bug in the controller or the cloud server.
