Feature/memory profiler #7983
Conversation
Two things need to be discussed: 1) Is the way the memory is counted reasonable? 2) There are too many outputs in the profiling report; some of them may need to be removed.
paddle/platform/profiler.cc
Outdated
@@ -277,6 +292,17 @@ void ParseEvents(std::vector<std::vector<Event>>& events,
      // max time
      event_items[index].max_time =
          std::max(event_time, event_items[index].max_time);

      // total memory used
      event_items[index].total_time += event_memory_used;
total_time -> total_memory_used
Done
paddle/platform/profiler.cc
Outdated
for (size_t j = 0; j < events_table[i].size(); ++j) {
  EventItem& event_item = events_table[i][j];

  app_total_time += event_item.total_time;
It is not correct to count the total time like this. There may be overlap between different events, and there may also be a gap between the end time of one event and the start time of the next.
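For reference, a minimal sketch of one way to compute a wall-clock total that is robust to overlap and gaps: merge the event intervals before summing. The Interval struct and its start/end timestamps are assumptions for illustration, not the profiler's actual types.

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: assumes each event exposes start/end timestamps
// in milliseconds; EventItem in this PR does not necessarily store these.
struct Interval {
  double start;
  double end;
};

// Wall-clock time covered by the events: sort by start time and merge
// overlapping intervals, so overlap is not counted twice and gaps
// between events are not counted at all.
double TotalCoveredTime(std::vector<Interval> intervals) {
  if (intervals.empty()) return 0.0;
  std::sort(intervals.begin(), intervals.end(),
            [](const Interval& a, const Interval& b) { return a.start < b.start; });
  double total = 0.0;
  double cur_start = intervals[0].start;
  double cur_end = intervals[0].end;
  for (std::size_t i = 1; i < intervals.size(); ++i) {
    if (intervals[i].start <= cur_end) {
      cur_end = std::max(cur_end, intervals[i].end);  // overlapping: extend block
    } else {
      total += cur_end - cur_start;                   // gap: close current block
      cur_start = intervals[i].start;
      cur_end = intervals[i].end;
    }
  }
  total += cur_end - cur_start;
  return total;
}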
paddle/platform/profiler.cc
Outdated
  EventItem& event_item = events_table[i][j];

  app_total_time += event_item.total_time;
  app_total_memory += event_item.total_memory_used;
Could you please explain how to obtain the memory occupation of every operator? I feel that memory may be malloced and freed frequently, and is often reused, so total_memory_used calculated in this way would be meaningless.
I also have doubts about this. It would be better to give some results and a reasonable analysis. The time profiler is now suitable for multithreading; is this memory counting also suitable for multithreading?
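On the multithreading question, a minimal sketch of one way to keep a running byte count safe under concurrent allocation, assuming allocator alloc/free hooks exist. This is not the PR's implementation; per-thread counters, mirroring the per-thread event lists the time profiler already keeps, would be another option.

#include <atomic>
#include <cstdint>

// Illustrative only: a process-wide live-byte counter updated from the
// allocator's alloc/free paths. std::atomic keeps the count correct when
// several threads allocate concurrently.
static std::atomic<uint64_t> g_live_bytes{0};

void OnAlloc(uint64_t size) { g_live_bytes.fetch_add(size, std::memory_order_relaxed); }
void OnFree(uint64_t size) { g_live_bytes.fetch_sub(size, std::memory_order_relaxed); }

uint64_t CurrentLiveBytes() { return g_live_bytes.load(std::memory_order_relaxed); }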
It is great to count memory usage in the profiler!
But counting memory usage is different from counting time. Timing does not care about what happens inside the operator, while memory counting should record the peak memory used inside the operator.
Think about conv_op: it creates a col buffer to store the result of im2col, and when conv_op finishes, that memory is released.
  double event_memory_used = rit->MemoryUsed(events[i][j]);
  double total_memory_used =
      static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);
double total_memory_used = static_cast<double>(rit->GetMemoryUsed()) / (1024 * 1024);
==>
double total_memory_used = static_cast<double>(rit->GetMemoryUsed() + event_memory_used) * kMegabyte;
where kMegabyte equals 1.0 / 1024 / 1024.
event_memory_used means the memory created by this operator. total_memory_used means the total memory that has been used up to now; it should include event_memory_used.
We must overload the placement new operator to reach that goal.
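A minimal sketch of what a counting placement-new overload could look like; the MemoryTracker tag type and the global counter are hypothetical, not code from this PR.

#include <atomic>
#include <cstddef>
#include <new>

// Hypothetical tag type and counter used only to illustrate the idea.
struct MemoryTracker {};
static std::atomic<std::size_t> g_tracked_bytes{0};

// Custom placement-style operator new: records the requested size, then
// forwards to the global allocator.
void* operator new(std::size_t size, MemoryTracker) {
  g_tracked_bytes.fetch_add(size, std::memory_order_relaxed);
  return ::operator new(size);
}

// Matching placement delete, called only if a constructor throws.
void operator delete(void* ptr, MemoryTracker) noexcept { ::operator delete(ptr); }

int main() {
  int* p = new (MemoryTracker{}) int(7);  // counted allocation
  delete p;                               // ordinary delete still frees it
  return g_tracked_bytes.load() >= sizeof(int) ? 0 : 1;
}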
paddle/platform/profiler.cc
Outdated
for (auto& item : event_items) {
  item.ave_time = item.total_time / item.calls;
  item.ave_memory_used = item.total_memory_used / item.calls;
I don't think item.ave_memory_used is necessary; "average memory used" is confusing.
I tend to count the memory used in most cases.
Deleted.
Please use the latest profiler code.
Added the profiler to the uniform_random op and ran only this one op; the results are below. The results match expectations: an extra 0.002 MB is allocated because the allocator stores a magic number to validate alloc and free.

-------------------------> Profiling Report <-------------------------

Place: CPU    Total Time: 9.31106 ms    Total Memory: 2.99219 MB    Sorted by total time in descending order in the same thread

Event                      Calls   Total     Min.      Max.      Ave.      Total Memory   Min Memory   Max Memory
thread0::uniform_random    1       5.17344   5.17344   5.17344   5.17344   0              2.99219      2.99219
thread0::fetch             1       1.42045   1.42045   1.42045   1.42045   2.99219        2.99219      2.99219
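For context on the "magic number" overhead: if the allocator prepends per-allocation metadata carrying a magic number that is checked on free, every allocation reports slightly more than its payload. The sketch below only illustrates that mechanism; the header layout and sizes are made up, not Paddle's actual allocator metadata, and the exact overhead depends on the allocator.

#include <cstdint>
#include <cstdio>

// Made-up header: a real allocator stores its own metadata, but the idea is
// the same -- a magic number written at alloc time and checked at free time.
struct AllocHeader {
  uint64_t magic;           // e.g. 0xDEADBEEF, validated on free
  uint64_t requested_size;  // lets the free path cross-check the size
};

int main() {
  const uint64_t tensor_bytes = 784ull * 1000ull * 4ull;  // example float tensor
  const uint64_t reported = tensor_bytes + sizeof(AllocHeader);
  std::printf("payload %llu B, reported %llu B, overhead %zu B per allocation\n",
              (unsigned long long)tensor_bytes, (unsigned long long)reported,
              sizeof(AllocHeader));
  return 0;
}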
As a comparison, to check whether the profiler's accumulation over a whole model is correct, the test results before adding the profiler were: @QiJune dzhwinter/benchmark#67
maximum memory usage: 50622464 --> 43061248
maximum memory usage: 1729540096 --> 1132953600
maximum memory usage: 1275125760 --> 663941120
The results measured with the profiler can be seen in the benchmark table: mnist batch=64, 21M; mnist batch=128, 41M.
I hacked the memory profiler for memory debugging and benchmarking. While doing this job, I found that the structure of the profiler needs to be improved to make it readable.
Use with profiler.profiler('CPU', 'total') as prof: to wrap the code being profiled. profiler.reset_profiler() can be used to clear the previous records.
For a simple usage demo, please see dzhwinter/benchmark#80.