[Done] Memory Management: Buddy Allocator #2674
Conversation
  munlock(p, size);
  }
  free(p);
}

#ifndef PADDLE_ONLY_CPU

- void* GPUAllocator::Alloc(size_t size) {
+ void* GPUAllocator::Alloc(size_t& index, size_t size) {
In order to support fallback allocation when the standard allocation fails, we need the index parameter so that the buddy allocator knows which method to use to release the memory.
What is index? I vaguely remember that in Majel, the index here is the device ID, but in our design, we have a GPUAllocator instance for each GPU?
Because index denotes which system allocator has been used.
* Free will be added soon
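A minimal sketch of the idea, with hypothetical stand-in helpers (in the real code the two paths would be something like device memory first and a pinned-host fallback second): Alloc reports through index which path served the request, and Free dispatches on that stored index.

#include <cstddef>
#include <cstdlib>

// Stand-ins for the two allocation paths; the real allocator would use
// device memory first and a fallback (e.g. pinned host memory) second.
static void* PrimaryAlloc(size_t size) { return std::malloc(size); }
static void PrimaryFree(void* p) { std::free(p); }
static void* FallbackAlloc(size_t size) { return std::malloc(size); }
static void FallbackFree(void* p) { std::free(p); }

class GPUAllocatorSketch {
 public:
  // `index` is an output parameter recording which path served the
  // request, so the buddy allocator can release the block the same way.
  void* Alloc(size_t& index, size_t size) {
    if (void* p = PrimaryAlloc(size)) {
      index = 0;  // served by the standard path
      return p;
    }
    index = 1;  // standard allocation failed: use the fallback path
    return FallbackAlloc(size);
  }

  void Free(void* p, size_t index) {
    if (index == 0) {
      PrimaryFree(p);
    } else {
      FallbackFree(p);
    }
  }
};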
cache_(system_allocator->UseGpu()),
system_allocator_(std::move(system_allocator)) {}

BuddyAllocator::~BuddyAllocator() {
Are there multiple instances of BuddyAllocator in one trainer?
BuddyAllocator is a singleton; there is one instance per GPU/CPU device.
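As a sketch of what "one instance per device" can look like (a hypothetical accessor; the names are not from this PR): allocators are created lazily, one per device ID, and the same instance is always returned.

#include <mutex>
#include <unordered_map>

class BuddyAllocator { /* ... as in this PR ... */ };

// Hypothetical accessor: one lazily created BuddyAllocator per device.
// Here -1 denotes the CPU and 0..N-1 denote GPUs (an assumption).
BuddyAllocator* GetBuddyAllocator(int device_id) {
  static std::mutex mtx;
  static std::unordered_map<int, BuddyAllocator*> instances;
  std::lock_guard<std::mutex> guard(mtx);
  auto it = instances.find(device_id);
  if (it == instances.end()) {
    it = instances.emplace(device_id, new BuddyAllocator).first;
  }
  return it->second;
}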
// Allocate a new maximum sized block
size_t index = 0;
void* p = system_allocator_->Alloc(index, max_chunk_size_);
Should we allocate more than max_chunk_size_ if there's not enough in the pool_, so that allocated memory is contiguous and introduces fewer memory fragments? Or could max_chunk_size_ be something like 1G to do that?
Yes, we can allocate a chunk size bigger than max_chunk_size_, but it will not be managed by the buddy allocator. You can check this line: https://github.com/PaddlePaddle/Paddle/pull/2674/files#diff-dd894d330dd6a0deb01afe3fe24b1752R59
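The branch in question looks roughly like this (a simplified sketch, not a verbatim copy of the diff; AllocFromPool is a hypothetical placeholder for the normal buddy path):

#include <cstddef>

struct SystemAllocatorIface {
  void* Alloc(size_t& index, size_t size);
};

class BuddyAllocatorSketch {
 public:
  void* Alloc(size_t size) {
    // Requests larger than max_chunk_size_ bypass the buddy pool:
    // they go straight to the system allocator and are never split,
    // merged, or tracked in pool_.
    if (size > max_chunk_size_) {
      size_t index = 0;
      return system_allocator_->Alloc(index, size);
    }
    return AllocFromPool(size);  // hypothetical: normal buddy path
  }

 private:
  void* AllocFromPool(size_t size);
  size_t max_chunk_size_;
  SystemAllocatorIface* system_allocator_;
};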
Well, I mean:
- shall we set max_chunk_size_ >= 1G so that subsequent alloc ops will be faster, or
- shall we allocate 10 * max_chunk_size_ in RefillPool for performance?
There are two situations here.
For GPU, it's bad to specify max_chunk_size_ >= 1G or 10 * max_chunk_size_. It's better to set max_chunk_size_ according to the current device's resources.
size_t GpuMaxChunkSize() {
size_t total = 0;
size_t available = 0;
GpuMemoryUsage(available, total);
// Reserving the rest memory for page tables, etc.
size_t reserving = (1 - FLAGS_fraction_of_gpu_memory_to_use) * total;
// If available less than minimum chunk size, no usable memory exists.
available = std::max(available, GpuMinChunkSize()) - GpuMinChunkSize();
// If available less than reserving, no usable memory exists.
size_t usable = std::max(available, reserving) - reserving;
return usable;
}
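As a concrete example (the numbers are illustrative, not from the PR): suppose FLAGS_fraction_of_gpu_memory_to_use is 0.9 on a 12 GB card with 11 GB currently available. Then reserving = 0.1 * 12 GB ≈ 1.2 GB and usable = max(11 GB, 1.2 GB) - 1.2 GB ≈ 9.8 GB, so the maximum chunk size adapts to what the device can actually spare.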
For CPU, again, a too-large memory chunk should not be managed by the buddy allocator; it's one-time usage.
size_t CpuMaxAllocSize() {
// For distributed systems, it requires configuring and limiting
// the fraction of memory to use.
return FLAGS_fraction_of_cpu_memory_to_use * CpuTotalPhysicalMemory();
}
size_t CpuMinChunkSize() {
  // The minimum chunk size allowed is 256 bytes.
  return 1 << 8;
}
size_t CpuMaxChunkSize() {
  // The maximum chunk size allowed is roughly 3% of CPU memory.
  return CpuMaxAllocSize() / 32;
}
For a 16 GB node, 3% means roughly 500 MB, which I think is good enough. FLAGS_fraction_of_cpu_memory_to_use is meant to be exposed to Kubernetes.
Great explanation! I totally agree with you!
Maybe a minimum chunk size of 4K is best for performance, because the default Linux memory page size is 4K.
@typhoonzero That's a good idea. But 4K means 4096 bytes -> 1024 floats; if we frequently allocate small chunks, like 256, 128, 64, or 32 floats, each of them will be padded to 4K. Doesn't that waste memory?
Probably use 4K for CPU. For GPU, maybe a default of 4K is not a good idea.
Yes, only for CPU.
Do we need some unit test cases to check that the buddy allocator is correctly splitting and merging memory blocks?
TEST(BuddyAllocator, CPUMultAlloc) {
paddle::platform::CPUPlace cpu;
std::vector<void *> ps;
ps.reserve(8);
for (auto size : {256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) {
ps.emplace_back(paddle::memory::Alloc(cpu, size));
}
for (auto p : ps) {
paddle::memory::Free(cpu, p);
}
}
@gangliao I saw this test. I mean to check the correct buddy size after a split, and the merged size after a merge, to verify the allocator's internal behavior. I'm not sure whether this is needed, though.
@typhoonzero Yeah, I guess this information is already in here.
For instance, a split:
512 + 536870400 = 536870912
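Reading those logged numbers: a 536870912-byte (512 MB) chunk is split into a 512-byte block handed out for the request plus a 536870400-byte remainder that returns to the pool; a later merge recombines the pair into the original 536870912 bytes.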
It's hard to review this PR; maybe take a look at this page, which explains how it works.
The website looks great!
TEST(BuddyAllocator, CPUMultAlloc) {
paddle::platform::CPUPlace cpu;
std::unordered_map<void *, size_t> ps;
size_t total_size = paddle::memory::Used(cpu);
EXPECT_EQ(total_size, 0UL);
for (auto size :
{128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) {
ps[paddle::memory::Alloc(cpu, size)] = size;
// Buddy Allocator doesn't manage too large memory chunk
if (paddle::memory::Used(cpu) == total_size) continue;
size_t aligned_size = align(size, cpu);
total_size += aligned_size;
// check memory block is allocated and split correctly
EXPECT_EQ(total_size, paddle::memory::Used(cpu));
}
for (auto p : ps) {
// check each memory address is aligned
EXPECT_EQ(is_aligned(p.first), true);
paddle::memory::Free(cpu, p.first);
// Buddy Allocator doesn't manage too large memory chunk
if (paddle::memory::Used(cpu) == total_size) continue;
size_t aligned_size = align(p.second, cpu);
total_size -= aligned_size;
// check memory block is free and merged correctly
EXPECT_EQ(total_size, paddle::memory::Used(cpu));
}
}

I updated the memory test as the above code snippet.
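For reference, here is a plausible sketch of what the align and is_aligned helpers used by the test do (the exact definitions in the test file may differ; in particular, whether per-block metadata is added before rounding is an assumption here):

#include <cstddef>
#include <cstdint>

// Round `size` up to the next multiple of `alignment` (a power of two);
// e.g. rounding 100 up with alignment 256 gives 256.
inline size_t AlignUpSketch(size_t size, size_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}

// True when pointer `p` sits on an `alignment`-byte boundary.
inline bool IsAlignedSketch(const void* p, size_t alignment) {
  return reinterpret_cast<uintptr_t>(p) % alignment == 0;
}

In the test above, the alignment would presumably come from the minimum chunk size (256 bytes for CPU, per CpuMinChunkSize).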
The code LGTM!
Anyway, I think we need another LGTM, since this PR is important and big.
This PR is blocking my other work, so I will merge it first; any comments are welcome.
Please start the review from here, @wangkuiyi @typhoonzero.