
Feature/memory profiler #7983

Closed

Conversation

@dzhwinter dzhwinter commented Jan 30, 2018

Add a memory profiler for memory debugging and benchmarking. While doing this, I found that the profiler's structure needs refactoring to make it readable.

Use with profiler.profiler('CPU', 'total') as prof to wrap the code to be profiled.
profiler.reset_profiler() can be used to clear the previous records.

A simple usage is as follows:

import numpy as np
import paddle.fluid as fluid              # paddle.v2.fluid in older releases
import paddle.fluid.profiler as profiler

image = fluid.layers.data(name='x', shape=[784], dtype='float32')
# ...
avg_cost = fluid.layers.mean(x=cost)
optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)
opts = optimizer.minimize(avg_cost)
accuracy = fluid.evaluator.Accuracy(input=predict, label=label)

place = fluid.CPUPlace()  # or fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
accuracy.reset(exe)

with profiler.profiler('CPU', 'total') as prof:
    for iter in range(10):
        if iter == 2:
            profiler.reset_profiler()
        x = np.random.random((32, 784)).astype("float32")
        y = np.random.randint(0, 10, (32, 1)).astype("int64")

        outs = exe.run(fluid.default_main_program(),
                       feed={'x': x, 'y': y},
                       fetch_list=[avg_cost] + accuracy.metrics)
        acc = np.array(outs[1])
        pass_acc = accuracy.eval(exe)

Please see dzhwinter/benchmark#80 for a usage demo.

-----------  Configuration Arguments -----------
batch_size: 16
data_format: NCHW
data_set: flowers
device: GPU
learning_rate: 0.001
pass_num: 1
------------------------------------------------

------------------------->     Profiling Report     <-------------------------

Place: CUDA  Total Time:0ms  Total Memory:9065.64MB  Sorted by total time in descending order in the same thread

Event                            Calls       Total       Min.        Max.        Ave.        Total Memory. Min Memory.  Max Memory.  Ave Memory.
thread0::elementwise_add_grad    16          206.254     0.037888    63.5791     12.8909     4648.4      0.00683594  196.001     290.525
thread0::conv2d_grad             13          112.87      2.88768     19.3096     8.68234     4727.98     0.00683594  196.141     363.691
thread0::dropout                 10          77.7945     0.043072    35.1161     7.77945     1121.58     0.0629883   392         112.158
thread0::conv2d                  13          55.5693     0.971776    12.2081     4.27456     337.579     6.12524     196         25.9676
thread0::batch_norm_grad         14          14.7364     0.074752    3.76525     1.0526      4649.8      0.0358887   196.001     332.129
thread0::batch_norm              14          13.8792     0.083968    4.15642     0.991369    729.579     0.0358887   196.001     52.1128
thread0::relu_grad               14          8.91693     0.027648    2.32448     0.636923    4649.76     0.0314941   196         332.126
thread0::elementwise_add         16          7.9863      0.027648    2.15555     0.499144    533.579     0.00634766  196         33.3487
thread0::relu                    14          7.06045     0.022528    1.92        0.504318    925.581     0.0314941   196         66.1129
thread0::pool2d_grad             5           5.06368     0.152576    2.57126     1.01274     4703.47     6.12524     196         940.693
thread0::dropout_grad            10          4.85581     0.034816    2.26202     0.485581    4649.73     0.0314941   196         464.973
thread0::adam                    60          3.85926     0.026624    0.975872    0.0643211   9065.61     0           0           151.094
thread0::fill_zeros_like         66          2.56307     0.017408    0.636928    0.0388344   4649.7      0.000488281 196         70.45
thread0::pool2d                  5           1.72131     0.079872    0.797664    0.344262    2297.58     1.53149     49.0002     459.517
thread0::elementwise_mul         60          1.66064     0.0256      0.048128    0.0276773   9065.61     0.000244141 0.000244141 151.094
thread0::fill_constant           61          0.998304    0.014336    0.035744    0.0163656   4648.38     0.000244141 0.000244141 76.203
thread0::mul_grad                3           0.518048    0.067584    0.36352     0.172683    4648.4      0.230957    50.5317     1549.47
thread0::mul                     3           0.433088    0.041984    0.344       0.144363    4648.1      0.00634766  0.0314941   1549.37
thread0::fetch                   2           0.275616    0.027136    0.24848     0.137808    9065.64     0           0           4532.82

@kuke left a comment


Two things need to be discussed: 1) Is the way of counting the memory reasonable? 2) There are too many outputs in the profiling report; some of them may need to be removed.

@@ -277,6 +292,17 @@ void ParseEvents(std::vector<std::vector<Event>>& events,
// max time
event_items[index].max_time =
std::max(event_time, event_items[index].max_time);

// total memory used
event_items[index].total_time += event_memory_used;
Contributor

total_time -> total_memory_used

Contributor Author

Done
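For clarity, the accumulation being fixed here can be sketched in Python. EventItem's fields and the accumulate helper below are simplified stand-ins for the C++ structures in ParseEvents, not Paddle's actual API:

```python
# Sketch of per-event aggregation mirroring the ParseEvents logic.
# EventItem and the field names are simplified stand-ins, not Paddle code.
from dataclasses import dataclass


@dataclass
class EventItem:
    calls: int = 0
    total_time: float = 0.0
    min_time: float = float("inf")
    max_time: float = 0.0
    total_memory_used: float = 0.0


def accumulate(item, event_time, event_memory_used):
    item.calls += 1
    item.total_time += event_time                      # time goes here
    item.min_time = min(item.min_time, event_time)
    item.max_time = max(item.max_time, event_time)
    # The bug fixed in this review: memory must accumulate into
    # total_memory_used, not total_time.
    item.total_memory_used += event_memory_used
    return item


item = EventItem()
accumulate(item, 2.0, 100.0)
accumulate(item, 4.0, 50.0)
print(item.total_time, item.total_memory_used)  # 6.0 150.0
```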

for (size_t j = 0; j < events_table[i].size(); ++j) {
EventItem& event_item = events_table[i][j];

app_total_time += event_item.total_time;
Contributor

It is not correct to count the total time like this. There may be overlap between different events, and there may also be gaps between the end time of one event and the start time of the next.
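To illustrate the point, here is a small sketch (with hypothetical intervals, not Paddle code) comparing the naive per-event sum, the merged busy time, and the wall-clock span; all three differ once events overlap or leave gaps:

```python
def merged_busy_time(intervals):
    """Total time covered by at least one event (overlaps counted once)."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:  # gap: close the current run
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                   # overlap: extend current run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Two overlapping events, then one after a gap.
events = [(0.0, 4.0), (1.0, 3.0), (8.0, 10.0)]
naive_sum = sum(end - start for start, end in events)
span = events[-1][1] - events[0][0]
print(naive_sum, merged_busy_time(events), span)  # 8.0 6.0 10.0
```

Summing per-event durations (8.0) double-counts the overlap, while the wall-clock span (10.0) includes the idle gap; neither equals the merged busy time (6.0).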

EventItem& event_item = events_table[i][j];

app_total_time += event_item.total_time;
app_total_memory += event_item.total_memory_used;
Contributor

Could you please explain how the memory occupation of every operator is obtained? Memory may be allocated and freed frequently, and is often reused, so total_memory_used calculated in this way would be meaningless.

@qingqing01
Contributor

  1. Is the way to count the memory reasonable;

I also have a doubt about this. Better to give some results and a reasonable analysis. You can use test_profiler.py to test.

Now the time profiler works with multithreading. Is this memory counting also suitable for multithreading?

Contributor

@chengduoZH left a comment


It is great to count memory usage in the profiler!
But counting memory usage is different from counting elapsed time. Timing does not care about what happens inside the operator, whereas memory counting should record the peak memory used inside the operator.
Consider conv_op: it creates a col buffer to store the result of im2col, and when conv_op finishes, that memory is released.
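A minimal sketch of that high-water-mark idea (the MemoryTracker class and the byte counts are hypothetical, for illustration only):

```python
class MemoryTracker:
    """Track current and peak memory across alloc/free calls."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def alloc(self, nbytes):
        self.current += nbytes
        self.peak = max(self.peak, self.current)

    def free(self, nbytes):
        self.current -= nbytes


# Hypothetical conv-like op: a temporary im2col buffer is allocated,
# used, then freed before the op returns.
t = MemoryTracker()
t.alloc(100)   # output tensor, survives the op
t.alloc(400)   # temporary im2col buffer
t.free(400)    # released when the op finishes
print(t.current, t.peak)  # 100 500
```

Measuring only the memory still held after the op (100) misses the real peak inside it (500), which is exactly the case described above.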


double event_memory_used = rit->MemoryUsed(events[i][j]);
double total_memory_used =
static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);
Contributor

double total_memory_used = static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);

==>

double total_memory_used = static_cast<double>(rit->GetMemoryUsed() + event_memory_used)  * kMegabyte;

where kMegabyte equals 1.0 / (1024 * 1024).

Contributor

event_memory_used means the memory allocated by this operator.
total_memory_used means the total memory used up to now; it should include event_memory_used.

Contributor Author

We would have to overload the placement new operator to achieve that.

for (auto& item : event_items) {
item.ave_time = item.total_time / item.calls;
item.ave_memory_used = item.total_memory_used / item.calls;
Contributor

I don't think item.ave_memory_used is necessary; "average memory used" is confusing.

Contributor

I tend to count the memory used in the most common case.

Contributor Author

deleted.

@kuke
Contributor

kuke commented Feb 1, 2018

[screenshot of a profiling report, 2018-02-01]

About the memory usage: why are there cases where Ave > Max, and cases where both Min and Max equal 0 but Ave is some positive number?

Please consider seriously the way the memory is counted.

@dzhwinter
Contributor Author

Please use the latest profiler code.

@dzhwinter
Contributor Author

Added the profiler to the uniform_random op and ran only that one op; the results are below.
Expected memory allocation:
fill_constant: 1000 * 784 * 4 bytes = 2.99072265625 MB
fetch_op: the same memory, copied out

The result matches expectations. An extra 0.002 MB is allocated because the allocator adds a magic number to validate alloc and free.
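The expected allocation can be checked with a line of arithmetic (1000 × 784 float32 elements at 4 bytes each):

```python
nbytes = 1000 * 784 * 4            # 1000 x 784 float32 tensor
mb = nbytes / (1024 * 1024)
print(mb)  # 2.99072265625
# The report shows 2.99219 MB; the extra ~0.0015 MB is the allocator's
# magic-number padding used to validate alloc/free.
```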

------------------------->     Profiling Report     <-------------------------

Place: CPU	Total Time:9.31106ms	Total Memory:2.99219MB	Sorted by total time in descending order in the same thread

Event                      Calls       Total       Min.        Max.        Ave.        Total Memory. Min Memory.  Max Memory.
thread0::uniform_random    1           5.17344     5.17344     5.17344     5.17344     0           2.99219     2.99219
thread0::fetch             1           1.42045     1.42045     1.42045     1.42045     2.99219     2.99219     2.99219

@dzhwinter
Contributor Author

dzhwinter commented Feb 1, 2018

As a comparison, to check whether the profiler's accumulation over a whole model is correct:

Test results before adding the profiler:
mnist, batch size 128, uses 41 MB.

@QiJune dzhwinter/benchmark#67

Have tested on CPU based on the latest develop branch of the Paddle code.

  • mnist
FLAGS_do_memory_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python mnist.py --device=CPU --batch_size=128 --iterations=2 --pass_num=1

maximum memory usage: 50622464 --> 43061248

  • vgg16
FLAGS_do_memory_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python vgg16.py --device=CPU --batch_size=128

maximum memory usage: 1729540096 --> 1132953600

  • resnet50
FLAGS_do_memory_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python resnet50.py --device=CPU --batch_size=128 --model=resnet_cifar10 --data_set=cifar10

maximum memory usage: 1275125760 --> 663941120
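As a sanity check (my arithmetic, not from the thread), the byte counts above convert to MB as follows; the mnist "after" figure is consistent with the ~41M quoted earlier:

```python
def to_mb(nbytes):
    """Convert a byte count to mebibytes."""
    return nbytes / (1024 * 1024)


# (model, max memory before, max memory after) in bytes, from the thread.
results = [
    ("mnist",    50622464,   43061248),
    ("vgg16",    1729540096, 1132953600),
    ("resnet50", 1275125760, 663941120),
]
for name, before, after in results:
    print(f"{name}: {to_mb(before):.1f} MB -> {to_mb(after):.1f} MB")
# mnist: 48.3 MB -> 41.1 MB
```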

@dzhwinter
Contributor Author

dzhwinter commented Feb 1, 2018

Results counted with the profiler, visible in the benchmark table:

mnist batch=64: 21 MB

mnist batch=128: 41 MB

@dzhwinter dzhwinter closed this Feb 1, 2018