
Feature/memory profiler #7983

Closed

Conversation

@dzhwinter dzhwinter commented Jan 30, 2018

Add a memory profiler for memory debugging and benchmarking. While doing this, I found that the profiler's structure needs refactoring to make it readable.

Use with profiler.profiler('CPU', 'total') as prof to wrap the code to be profiled.
profiler.reset_profiler() can be used to clear the previous records.

A simple usage is as follows:

import numpy as np
import paddle.fluid as fluid              # paddle.v2.fluid in older releases
import paddle.fluid.profiler as profiler

image = fluid.layers.data(name='x', shape=[784], dtype='float32')
# ...
avg_cost = fluid.layers.mean(x=cost)
optimizer = fluid.optimizer.Momentum(learning_rate=0.001, momentum=0.9)
opts = optimizer.minimize(avg_cost)
accuracy = fluid.evaluator.Accuracy(input=predict, label=label)

place = fluid.CPUPlace()  # or fluid.CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
accuracy.reset(exe)

with profiler.profiler('CPU', 'total') as prof:
    for iter in range(10):
        if iter == 2:
            profiler.reset_profiler()
        x = np.random.random((32, 784)).astype("float32")
        y = np.random.randint(0, 10, (32, 1)).astype("int64")

        outs = exe.run(fluid.default_main_program(),
                       feed={'x': x, 'y': y},
                       fetch_list=[avg_cost] + accuracy.metrics)
        acc = np.array(outs[1])
        pass_acc = accuracy.eval(exe)

Please see dzhwinter/benchmark#80 for a usage demo.

-----------  Configuration Arguments -----------
batch_size: 16
data_format: NCHW
data_set: flowers
device: GPU
learning_rate: 0.001
pass_num: 1
------------------------------------------------

------------------------->     Profiling Report     <-------------------------

Place: CUDA  Total Time:0ms  Total Memory:9065.64MB  Sorted by total time in descending order in the same thread

Event                            Calls       Total       Min.        Max.        Ave.        Total Memory. Min Memory.  Max Memory.  Ave Memory.
thread0::elementwise_add_grad    16          206.254     0.037888    63.5791     12.8909     4648.4      0.00683594  196.001     290.525
thread0::conv2d_grad             13          112.87      2.88768     19.3096     8.68234     4727.98     0.00683594  196.141     363.691
thread0::dropout                 10          77.7945     0.043072    35.1161     7.77945     1121.58     0.0629883   392         112.158
thread0::conv2d                  13          55.5693     0.971776    12.2081     4.27456     337.579     6.12524     196         25.9676
thread0::batch_norm_grad         14          14.7364     0.074752    3.76525     1.0526      4649.8      0.0358887   196.001     332.129
thread0::batch_norm              14          13.8792     0.083968    4.15642     0.991369    729.579     0.0358887   196.001     52.1128
thread0::relu_grad               14          8.91693     0.027648    2.32448     0.636923    4649.76     0.0314941   196         332.126
thread0::elementwise_add         16          7.9863      0.027648    2.15555     0.499144    533.579     0.00634766  196         33.3487
thread0::relu                    14          7.06045     0.022528    1.92        0.504318    925.581     0.0314941   196         66.1129
thread0::pool2d_grad             5           5.06368     0.152576    2.57126     1.01274     4703.47     6.12524     196         940.693
thread0::dropout_grad            10          4.85581     0.034816    2.26202     0.485581    4649.73     0.0314941   196         464.973
thread0::adam                    60          3.85926     0.026624    0.975872    0.0643211   9065.61     0           0           151.094
thread0::fill_zeros_like         66          2.56307     0.017408    0.636928    0.0388344   4649.7      0.000488281 196         70.45
thread0::pool2d                  5           1.72131     0.079872    0.797664    0.344262    2297.58     1.53149     49.0002     459.517
thread0::elementwise_mul         60          1.66064     0.0256      0.048128    0.0276773   9065.61     0.000244141 0.000244141 151.094
thread0::fill_constant           61          0.998304    0.014336    0.035744    0.0163656   4648.38     0.000244141 0.000244141 76.203
thread0::mul_grad                3           0.518048    0.067584    0.36352     0.172683    4648.4      0.230957    50.5317     1549.47
thread0::mul                     3           0.433088    0.041984    0.344       0.144363    4648.1      0.00634766  0.0314941   1549.37
thread0::fetch                   2           0.275616    0.027136    0.24848     0.137808    9065.64     0           0           4532.82

@kuke left a comment


Two things need to be discussed: 1) Is the way of counting the memory reasonable? 2) There are too many outputs in the profiling report; some of them may need to be removed.

@@ -277,6 +292,17 @@ void ParseEvents(std::vector<std::vector<Event>>& events,
// max time
event_items[index].max_time =
std::max(event_time, event_items[index].max_time);

// total memory used
event_items[index].total_time += event_memory_used;
Contributor

total_time -> total_memory_used

Contributor Author

Done
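For clarity, the accumulation being fixed here can be sketched in Python. EventItem's fields and the accumulate helper below are simplified stand-ins for the C++ structures in ParseEvents, not Paddle's actual API:

```python
# Sketch of per-event aggregation mirroring the ParseEvents logic.
# EventItem and the field names are simplified stand-ins, not Paddle code.
from dataclasses import dataclass


@dataclass
class EventItem:
    calls: int = 0
    total_time: float = 0.0
    min_time: float = float("inf")
    max_time: float = 0.0
    total_memory_used: float = 0.0


def accumulate(item, event_time, event_memory_used):
    item.calls += 1
    item.total_time += event_time                      # time goes here
    item.min_time = min(item.min_time, event_time)
    item.max_time = max(item.max_time, event_time)
    # The bug fixed in this review: memory must accumulate into
    # total_memory_used, not total_time.
    item.total_memory_used += event_memory_used
    return item


item = EventItem()
accumulate(item, 2.0, 100.0)
accumulate(item, 4.0, 50.0)
print(item.total_time, item.total_memory_used)  # 6.0 150.0
```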

for (size_t j = 0; j < events_table[i].size(); ++j) {
EventItem& event_item = events_table[i][j];

app_total_time += event_item.total_time;
Contributor

It is not correct to count the total time like this. There may be overlap between different events, and there may also be gaps between the end time of one event and the start time of the next.
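To illustrate the point, here is a small sketch (with hypothetical intervals, not Paddle code) comparing the naive per-event sum, the merged busy time, and the wall-clock span; all three differ once events overlap or leave gaps:

```python
def merged_busy_time(intervals):
    """Total time covered by at least one event (overlaps counted once)."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:  # gap: close the current run
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                   # overlap: extend current run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Two overlapping events, then one after a gap.
events = [(0.0, 4.0), (1.0, 3.0), (8.0, 10.0)]
naive_sum = sum(end - start for start, end in events)
span = events[-1][1] - events[0][0]
print(naive_sum, merged_busy_time(events), span)  # 8.0 6.0 10.0
```

Summing per-event durations (8.0) double-counts the overlap, while the wall-clock span (10.0) includes the idle gap; neither equals the merged busy time (6.0).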

EventItem& event_item = events_table[i][j];

app_total_time += event_item.total_time;
app_total_memory += event_item.total_memory_used;
Contributor

Could you please explain how the memory occupation of every operator is obtained? Memory may be allocated and freed frequently, and is often reused, so total_memory_used calculated in this way would be meaningless.

@qingqing01
Contributor

  1. Is the way to count the memory reasonable;

I also have a doubt about this. Better to give some results and a reasonable analysis. You can use test_profiler.py to test.

Now the time profiler works with multithreading. Is this memory counting also suitable for multithreading?

Contributor

@chengduoZH left a comment


It is great to count memory usage in the profiler!
But counting memory usage is different from counting elapsed time. Timing does not care about what happens inside the operator, whereas memory counting should record the peak memory used inside the operator.
Consider conv_op: it creates a col buffer to store the result of im2col, and when conv_op finishes, that memory is released.
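A minimal sketch of that high-water-mark idea (the MemoryTracker class and the byte counts are hypothetical, for illustration only):

```python
class MemoryTracker:
    """Track current and peak memory across alloc/free calls."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def alloc(self, nbytes):
        self.current += nbytes
        self.peak = max(self.peak, self.current)

    def free(self, nbytes):
        self.current -= nbytes


# Hypothetical conv-like op: a temporary im2col buffer is allocated,
# used, then freed before the op returns.
t = MemoryTracker()
t.alloc(100)   # output tensor, survives the op
t.alloc(400)   # temporary im2col buffer
t.free(400)    # released when the op finishes
print(t.current, t.peak)  # 100 500
```

Measuring only the memory still held after the op (100) misses the real peak inside it (500), which is exactly the case described above.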


double event_memory_used = rit->MemoryUsed(events[i][j]);
double total_memory_used =
static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);
Contributor

double total_memory_used = static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);

==>

double total_memory_used = static_cast<double>(rit->GetMemoryUsed() + event_memory_used)  * kMegabyte;

where kMegabyte equals 1.0 / (1024 * 1024).

Contributor

event_memory_used means the memory allocated by this operator.
total_memory_used means the total memory used up to now; it should include event_memory_used.

Contributor Author

We would have to overload the placement new operator to achieve that.

for (auto& item : event_items) {
item.ave_time = item.total_time / item.calls;
item.ave_memory_used = item.total_memory_used / item.calls;
Contributor

I don't think item.ave_memory_used is necessary; "average memory used" is confusing.

Contributor

I tend to count the memory used in the most common case.

Contributor Author

deleted.

@kuke
Contributor

kuke commented Feb 1, 2018

[screenshot of a profiling report, 2018-02-01]

About the memory usage: why are there cases where Ave > Max, and cases where both Min and Max equal 0 but Ave is some positive number?

Please consider seriously the way the memory is counted.

@dzhwinter
Contributor Author

Please use the latest profiler code.

@dzhwinter
Contributor Author

Added the profiler to the uniform_random op and ran only that one op; the results are below.
Expected memory allocation:
fill_constant: 1000 * 784 * 4 bytes = 2.99072265625 MB
fetch_op: the same memory, copied out

The result matches expectations. An extra 0.002 MB is allocated because the allocator adds a magic number to validate alloc and free.
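The expected allocation can be checked with a line of arithmetic (1000 × 784 float32 elements at 4 bytes each):

```python
nbytes = 1000 * 784 * 4            # 1000 x 784 float32 tensor
mb = nbytes / (1024 * 1024)
print(mb)  # 2.99072265625
# The report shows 2.99219 MB; the extra ~0.0015 MB is the allocator's
# magic-number padding used to validate alloc/free.
```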

------------------------->     Profiling Report     <-------------------------

Place: CPU	Total Time:9.31106ms	Total Memory:2.99219MB	Sorted by total time in descending order in the same thread

Event                      Calls       Total       Min.        Max.        Ave.        Total Memory. Min Memory.  Max Memory.
thread0::uniform_random    1           5.17344     5.17344     5.17344     5.17344     0           2.99219     2.99219
thread0::fetch             1           1.42045     1.42045     1.42045     1.42045     2.99219     2.99219     2.99219

@dzhwinter
Contributor Author

dzhwinter commented Feb 1, 2018

As a comparison, to check whether the profiler's accumulation over a whole model is correct:

Test results before adding the profiler:
mnist, batch size 128, uses 41 MB.

@QiJune dzhwinter/benchmark#67

Have tested on CPU based on the latest develop branch of the Paddle code.

  • mnist
FLAGS_do_memory_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python mnist.py --device=CPU --batch_size=128 --iterations=2 --pass_num=1

maximum memory usage: 50622464 --> 43061248

  • vgg16
FLAGS_do_memory_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python vgg16.py --device=CPU --batch_size=128

maximum memory usage: 1729540096 --> 1132953600

  • resnet50
FLAGS_do_memory_benchmark=true GLOG_vmodule=executor=2 GLOG_logtostderr=1 python resnet50.py --device=CPU --batch_size=128 --model=resnet_cifar10 --data_set=cifar10

maximum memory usage: 1275125760 --> 663941120
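As a sanity check (my arithmetic, not from the thread), the byte counts above convert to MB as follows; the mnist "after" figure is consistent with the ~41M quoted earlier:

```python
def to_mb(nbytes):
    """Convert a byte count to mebibytes."""
    return nbytes / (1024 * 1024)


# (model, max memory before, max memory after) in bytes, from the thread.
results = [
    ("mnist",    50622464,   43061248),
    ("vgg16",    1729540096, 1132953600),
    ("resnet50", 1275125760, 663941120),
]
for name, before, after in results:
    print(f"{name}: {to_mb(before):.1f} MB -> {to_mb(after):.1f} MB")
# mnist: 48.3 MB -> 41.1 MB
```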

@dzhwinter
Contributor Author

dzhwinter commented Feb 1, 2018

Results counted with the profiler, visible in the benchmark table:

mnist batch=64: 21 MB

mnist batch=128: 41 MB

@dzhwinter dzhwinter closed this Feb 1, 2018