[Done] Memory Management: Buddy Allocator #2674

Merged
merged 46 commits into from
Jul 14, 2017

Conversation

@gangliao gangliao commented Jun 29, 2017

Please start review from here @wangkuiyi @typhoonzero

munlock(p, size);
}
free(p);
}

#ifndef PADDLE_ONLY_CPU

void* GPUAllocator::Alloc(size_t size) {
void* GPUAllocator::Alloc(size_t& index, size_t size) {
Contributor Author

@gangliao gangliao Jun 30, 2017

In order to support fallback allocation when the standard allocation fails, we need the index parameter so that the buddy allocator knows which method to use to release the memory.

Collaborator

What is index? I vaguely remember that in Majel, the index here is the device ID, but in our design, we have a GPUAllocator instance for each GPU?

Contributor Author

Because index denotes which system allocator was used.
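
To make the role of index concrete, here is a minimal sketch, not the PR's code. It assumes a hypothetical `FallbackAllocator` with two stand-in methods (`std::malloc` as the primary path, `operator new` as the fallback) and a made-up byte budget that forces the fallback. The point is only that the index recorded at allocation time selects the matching release path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical sketch: the class name, the budget, and the two stand-in
// allocation methods are assumptions for illustration only.
class FallbackAllocator {
 public:
  explicit FallbackAllocator(std::size_t primary_budget)
      : primary_budget_(primary_budget) {}

  // Sets index to 0 when the primary method served the request and to 1
  // when the fallback method was tried. Returns nullptr on failure.
  void* Alloc(std::size_t& index, std::size_t size) {
    if (size == 0) return nullptr;
    if (size <= primary_budget_ - primary_used_) {
      if (void* p = std::malloc(size)) {
        primary_used_ += size;
        index = 0;
        return p;
      }
    }
    index = 1;
    return ::operator new(size, std::nothrow);
  }

  // The caller passes back the index it received from Alloc so that the
  // memory is released by the same method that produced it.
  void Free(void* p, std::size_t size, std::size_t index) {
    assert(index <= 1);
    if (p == nullptr) return;
    if (index == 0) {
      std::free(p);
      primary_used_ -= size;
    } else {
      ::operator delete(p);
    }
  }

 private:
  std::size_t primary_budget_;
  std::size_t primary_used_ = 0;
};
```

With a budget of 1024 bytes, a 512-byte request is served by the primary path (index 0) and a 4096-byte request by the fallback (index 1); each pointer must then be freed with the index it came with.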

@gangliao gangliao changed the title [WIP] Basic CPU/GPU hardware memory info and allocation statistics [WIP] Buddy Allocator Jul 5, 2017
cache_(system_allocator->UseGpu()),
system_allocator_(std::move(system_allocator)) {}

BuddyAllocator::~BuddyAllocator() {
Contributor

Are there multiple instances of BuddyAllocator in one trainer?

Contributor Author

@gangliao gangliao Jul 11, 2017


// Allocate a new maximum sized block
size_t index = 0;
void* p = system_allocator_->Alloc(index, max_chunk_size_);
Contributor

Should we allocate more than max_chunk_size_ when there isn't enough in pool_, so that the allocated memory is contiguous and there is less fragmentation? Or could max_chunk_size_ be something like 1G to achieve that?

Contributor Author

Yes, we can allocate a chunk bigger than max_chunk_size_, but it will not be managed by the buddy allocator. You can check this line: https://github.com/PaddlePaddle/Paddle/pull/2674/files#diff-dd894d330dd6a0deb01afe3fe24b1752R59

@typhoonzero
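
The behavior described in the reply above can be sketched as follows. This is a toy model under stated assumptions, not the allocator from the PR: `ToyAllocator` and `pool_used` are hypothetical names, and the only thing modeled is that a request larger than the maximum chunk size bypasses the managed pool, so the pool's usage counter does not change.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy model: only the routing decision is represented.
struct ToyAllocator {
  std::size_t max_chunk_size;
  std::size_t pool_used = 0;  // bytes handed out from the managed pool

  // Returns true when the request is served from the managed pool and
  // false when it would go straight to the system allocator.
  bool Alloc(std::size_t size) {
    assert(size > 0);
    if (size > max_chunk_size) return false;  // not managed by the pool
    pool_used += size;
    return true;
  }
};
```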

Contributor

Well, I mean:

  1. Shall we set max_chunk_size_ >= 1G so that subsequent alloc ops are faster?
  2. Or shall we allocate 10 * max_chunk_size_ in RefillPool for performance?

Contributor Author

@gangliao gangliao Jul 13, 2017

There are two situations here.

For GPU, it's bad to specify max_chunk_size_ >= 1G or 10 * max_chunk_size_. It's better to set max_chunk_size_ according to the current device's resources.

size_t GpuMaxChunkSize() {
  size_t total = 0;
  size_t available = 0;

  GpuMemoryUsage(available, total);

  // Reserve the remaining memory for page tables, etc.
  size_t reserving = (1 - FLAGS_fraction_of_gpu_memory_to_use) * total;

  // If available is less than the minimum chunk size, no usable memory exists.
  available = std::max(available, GpuMinChunkSize()) - GpuMinChunkSize();

  // If available is less than reserving, no usable memory exists.
  size_t usable = std::max(available, reserving) - reserving;

  return usable;
}
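
As a worked example of the arithmetic in GpuMaxChunkSize, the sketch below repeats the same steps with the device query and the flag replaced by plain parameters. The function name `UsableGpuBytes` and the sample numbers are assumptions chosen so the result can be checked by hand; a 64-bit size_t is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Same steps as GpuMaxChunkSize, but with explicit inputs (hypothetical
// helper, not part of the PR).
std::size_t UsableGpuBytes(std::size_t available, std::size_t total,
                           double fraction_to_use, std::size_t min_chunk) {
  assert(fraction_to_use >= 0.0 && fraction_to_use <= 1.0);

  // Memory held back for page tables, etc.
  std::size_t reserving =
      static_cast<std::size_t>((1.0 - fraction_to_use) * total);

  // max() before the subtraction clamps the result at zero instead of
  // letting the unsigned value wrap around.
  available = std::max(available, min_chunk) - min_chunk;
  return std::max(available, reserving) - reserving;
}
```

With 16 GiB total, 12 GiB available, a fraction of 0.75, and a 256-byte minimum chunk, the reserve is 4 GiB and the result is 8 GiB minus 256 bytes. With only 1 GiB available, the result is 0.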

For CPU, again, a memory chunk that is too large should not be managed by the buddy allocator; it's for one-time use.

size_t CpuMaxAllocSize() {
  // For distributed systems, it requires configuring and limiting
  // the fraction of memory to use.
  return FLAGS_fraction_of_cpu_memory_to_use * CpuTotalPhysicalMemory();
}

size_t CpuMinChunkSize() {
  // The minimum chunk size that can be allocated is 256 bytes.
  return 1 << 8;
}

size_t CpuMaxChunkSize() {
  // The maximum chunk size that can be allocated is roughly 3% of CPU memory.
  return CpuMaxAllocSize() / 32;
}

For a 16 GB node, 3% is roughly 500 MB, which I think is good enough.
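
The "roughly 3%" figure comes from dividing by 32 (1/32 = 3.125%). A small sketch of that arithmetic, assuming the fraction flag is 1.0 (the thread does not state its default) and using a hypothetical helper name:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring CpuMaxAllocSize() / 32 with explicit inputs.
std::size_t CpuMaxChunkFor(std::size_t total_bytes, double fraction_to_use) {
  assert(fraction_to_use > 0.0 && fraction_to_use <= 1.0);
  return static_cast<std::size_t>(fraction_to_use * total_bytes) / 32;
}
```

For a 16 GiB node this gives 512 MiB at a fraction of 1.0 and 256 MiB at a fraction of 0.5.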

FLAGS_fraction_of_cpu_memory_to_use is meant to be exposed to Kubernetes.

Contributor

Great explanation! I totally agree with you!

Maybe a minimum chunk size of 4K is best for performance, because the default Linux memory page size is 4K.

Contributor Author

@gangliao gangliao Jul 14, 2017

@typhoonzero That's a good idea. But 4K means 4096 bytes -> 1024 floats.
If we frequently allocate small chunks, like 256, 128, 32, or 64 floats, each of them will be padded to 4K. Wouldn't that waste memory?
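
To put numbers on the padding concern, here is a small sketch that rounds a request up to a multiple of the minimum chunk size and reports the padding. It assumes 4-byte floats and ignores any per-block metadata; `PadTo` and `WastedBytes` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");

// Round size up to the next multiple of min_chunk.
std::size_t PadTo(std::size_t size, std::size_t min_chunk) {
  assert(min_chunk > 0);
  return (size + min_chunk - 1) / min_chunk * min_chunk;
}

// Padding bytes added when num_floats floats are stored in chunks that are
// multiples of min_chunk.
std::size_t WastedBytes(std::size_t num_floats, std::size_t min_chunk) {
  std::size_t bytes = num_floats * sizeof(float);
  return PadTo(bytes, min_chunk) - bytes;
}
```

With a 4096-byte minimum, 32 floats (128 bytes) carry 3968 bytes of padding and 256 floats (1024 bytes) carry 3072. With a 256-byte minimum, the same requests carry 128 and 0 bytes of padding.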

Contributor Author

@gangliao gangliao Jul 14, 2017

Probably use 4K for CPU. For GPU, a default of 4K may not be a good idea.

Contributor

Yes, only for CPU.

@gangliao gangliao changed the title [WIP] Buddy Allocator [Done] Buddy Allocator Jul 11, 2017
@gangliao gangliao changed the title [Done] Buddy Allocator [Done] Memory Management: Buddy Allocator Jul 11, 2017
@typhoonzero
Contributor

Do we need a unit test case to check that the buddy allocator is correctly splitting and merging memory blocks?

@gangliao
Contributor Author

gangliao commented Jul 12, 2017

@typhoonzero

TEST(BuddyAllocator, CPUMultAlloc) {
  paddle::platform::CPUPlace cpu;

  std::vector<void *> ps;
  ps.reserve(8);

  for (auto size : {256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) {
    ps.emplace_back(paddle::memory::Alloc(cpu, size));
  }

  for (auto p : ps) {
    paddle::memory::Free(cpu, p);
  }
}
69: [ RUN      ] BuddyAllocator.CPUMultAlloc
69: I0712 20:44:41.779470 2569728960 buddy_allocator.cc:55] Allocate 256 bytes from chunk size 512
69: I0712 20:44:41.779475 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115d70000 at address 0x115d70040
69: I0712 20:44:41.779479 2569728960 buddy_allocator.cc:240] Split block (0x115d70000, 536870912) into
69: I0712 20:44:41.779484 2569728960 buddy_allocator.cc:244] Left block (0x115d70000, 512)
69: I0712 20:44:41.779489 2569728960 buddy_allocator.cc:251] Insert right block (0x115d70200, 536870400)
69: I0712 20:44:41.779494 2569728960 buddy_allocator.cc:55] Allocate 1024 bytes from chunk size 1280
69: I0712 20:44:41.779496 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115d70200 at address 0x115d70240
69: I0712 20:44:41.779500 2569728960 buddy_allocator.cc:240] Split block (0x115d70200, 536870400) into
69: I0712 20:44:41.779505 2569728960 buddy_allocator.cc:244] Left block (0x115d70200, 1280)
69: I0712 20:44:41.779510 2569728960 buddy_allocator.cc:251] Insert right block (0x115d70700, 536869120)
69: I0712 20:44:41.779515 2569728960 buddy_allocator.cc:55] Allocate 4096 bytes from chunk size 4352
69: I0712 20:44:41.779517 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115d70700 at address 0x115d70740
69: I0712 20:44:41.779521 2569728960 buddy_allocator.cc:240] Split block (0x115d70700, 536869120) into
69: I0712 20:44:41.779525 2569728960 buddy_allocator.cc:244] Left block (0x115d70700, 4352)
69: I0712 20:44:41.779531 2569728960 buddy_allocator.cc:251] Insert right block (0x115d71800, 536864768)
69: I0712 20:44:41.779536 2569728960 buddy_allocator.cc:55] Allocate 16384 bytes from chunk size 16640
69: I0712 20:44:41.779538 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115d71800 at address 0x115d71840
69: I0712 20:44:41.779542 2569728960 buddy_allocator.cc:240] Split block (0x115d71800, 536864768) into
69: I0712 20:44:41.779548 2569728960 buddy_allocator.cc:244] Left block (0x115d71800, 16640)
69: I0712 20:44:41.779553 2569728960 buddy_allocator.cc:251] Insert right block (0x115d75900, 536848128)
69: I0712 20:44:41.779558 2569728960 buddy_allocator.cc:55] Allocate 65536 bytes from chunk size 65792
69: I0712 20:44:41.779561 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115d75900 at address 0x115d75940
69: I0712 20:44:41.779566 2569728960 buddy_allocator.cc:240] Split block (0x115d75900, 536848128) into
69: I0712 20:44:41.779572 2569728960 buddy_allocator.cc:244] Left block (0x115d75900, 65792)
69: I0712 20:44:41.779575 2569728960 buddy_allocator.cc:251] Insert right block (0x115d85a00, 536782336)
69: I0712 20:44:41.779580 2569728960 buddy_allocator.cc:55] Allocate 262144 bytes from chunk size 262400
69: I0712 20:44:41.779589 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115d85a00 at address 0x115d85a40
69: I0712 20:44:41.779593 2569728960 buddy_allocator.cc:240] Split block (0x115d85a00, 536782336) into
69: I0712 20:44:41.779600 2569728960 buddy_allocator.cc:244] Left block (0x115d85a00, 262400)
69: I0712 20:44:41.779604 2569728960 buddy_allocator.cc:251] Insert right block (0x115dc5b00, 536519936)
69: I0712 20:44:41.779609 2569728960 buddy_allocator.cc:55] Allocate 1048576 bytes from chunk size 1048832
69: I0712 20:44:41.779613 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115dc5b00 at address 0x115dc5b40
69: I0712 20:44:41.779616 2569728960 buddy_allocator.cc:240] Split block (0x115dc5b00, 536519936) into
69: I0712 20:44:41.779623 2569728960 buddy_allocator.cc:244] Left block (0x115dc5b00, 1048832)
69: I0712 20:44:41.779628 2569728960 buddy_allocator.cc:251] Insert right block (0x115ec5c00, 535471104)
69: I0712 20:44:41.779633 2569728960 buddy_allocator.cc:55] Allocate 4194304 bytes from chunk size 4194560
69: I0712 20:44:41.779636 2569728960 buddy_allocator.cc:75] Allocation from existing memory block 0x115ec5c00 at address 0x115ec5c40
69: I0712 20:44:41.779640 2569728960 buddy_allocator.cc:240] Split block (0x115ec5c00, 535471104) into
69: I0712 20:44:41.779647 2569728960 buddy_allocator.cc:244] Left block (0x115ec5c00, 4194560)
69: I0712 20:44:41.779651 2569728960 buddy_allocator.cc:251] Insert right block (0x1162c5d00, 531276544)
69: I0712 20:44:41.779656 2569728960 buddy_allocator.cc:94] Free from address 0x115d70000
69: I0712 20:44:41.779661 2569728960 buddy_allocator.cc:114] Merging this block 0x115d70000 with its right buddy 0x115d70200
69: I0712 20:44:41.779665 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 512)
69: I0712 20:44:41.779670 2569728960 buddy_allocator.cc:94] Free from address 0x115d70200
69: I0712 20:44:41.779675 2569728960 buddy_allocator.cc:114] Merging this block 0x115d70200 with its right buddy 0x115d70700
69: I0712 20:44:41.779678 2569728960 buddy_allocator.cc:132] Merging this block 0x115d70200 with its left buddy 0x115d70000
69: I0712 20:44:41.779685 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 1792)
69: I0712 20:44:41.779688 2569728960 buddy_allocator.cc:94] Free from address 0x115d70700
69: I0712 20:44:41.779693 2569728960 buddy_allocator.cc:114] Merging this block 0x115d70700 with its right buddy 0x115d71800
69: I0712 20:44:41.779697 2569728960 buddy_allocator.cc:132] Merging this block 0x115d70700 with its left buddy 0x115d70000
69: I0712 20:44:41.779703 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 6144)
69: I0712 20:44:41.779707 2569728960 buddy_allocator.cc:94] Free from address 0x115d71800
69: I0712 20:44:41.779711 2569728960 buddy_allocator.cc:114] Merging this block 0x115d71800 with its right buddy 0x115d75900
69: I0712 20:44:41.779716 2569728960 buddy_allocator.cc:132] Merging this block 0x115d71800 with its left buddy 0x115d70000
69: I0712 20:44:41.779721 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 22784)
69: I0712 20:44:41.779726 2569728960 buddy_allocator.cc:94] Free from address 0x115d75900
69: I0712 20:44:41.779729 2569728960 buddy_allocator.cc:114] Merging this block 0x115d75900 with its right buddy 0x115d85a00
69: I0712 20:44:41.779733 2569728960 buddy_allocator.cc:132] Merging this block 0x115d75900 with its left buddy 0x115d70000
69: I0712 20:44:41.779739 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 88576)
69: I0712 20:44:41.779743 2569728960 buddy_allocator.cc:94] Free from address 0x115d85a00
69: I0712 20:44:41.779747 2569728960 buddy_allocator.cc:114] Merging this block 0x115d85a00 with its right buddy 0x115dc5b00
69: I0712 20:44:41.779752 2569728960 buddy_allocator.cc:132] Merging this block 0x115d85a00 with its left buddy 0x115d70000
69: I0712 20:44:41.779757 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 350976)
69: I0712 20:44:41.779762 2569728960 buddy_allocator.cc:94] Free from address 0x115dc5b00
69: I0712 20:44:41.779765 2569728960 buddy_allocator.cc:114] Merging this block 0x115dc5b00 with its right buddy 0x115ec5c00
69: I0712 20:44:41.779769 2569728960 buddy_allocator.cc:132] Merging this block 0x115dc5b00 with its left buddy 0x115d70000
69: I0712 20:44:41.779775 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 1399808)
69: I0712 20:44:41.779779 2569728960 buddy_allocator.cc:94] Free from address 0x115ec5c00
69: I0712 20:44:41.779783 2569728960 buddy_allocator.cc:114] Merging this block 0x115ec5c00 with its right buddy 0x1162c5d00
69: I0712 20:44:41.779789 2569728960 buddy_allocator.cc:132] Merging this block 0x115ec5c00 with its left buddy 0x115d70000
69: I0712 20:44:41.779795 2569728960 buddy_allocator.cc:149] Inserting free block (0x115d70000, 536870912)
69: [       OK ] BuddyAllocator.CPUMultAlloc (0 ms)
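
The chunk sizes in the log above (512 for a 256-byte request, 1280 for 1024, and so on) are consistent with adding a per-block header to the request and rounding up to a multiple of the 256-byte minimum chunk size. The sketch below assumes a 64-byte header, inferred only from the 0x40 offset between each block address and the returned address; any header between 1 and 256 bytes would give the same sizes for these requests, so treat the constant as a guess.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: inferred from the 0x40 address offsets in the log.
const std::size_t kAssumedHeader = 64;
// CpuMinChunkSize() as given in this thread.
const std::size_t kMinChunk = 256;

// Request plus header, rounded up to a multiple of the minimum chunk size.
std::size_t ChunkSizeFor(std::size_t request) {
  // Guard against unsigned overflow in the rounding expression.
  assert(request <= static_cast<std::size_t>(-1) - kAssumedHeader - kMinChunk);
  return (request + kAssumedHeader + kMinChunk - 1) / kMinChunk * kMinChunk;
}
```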

@gangliao gangliao closed this Jul 12, 2017
@gangliao gangliao reopened this Jul 12, 2017
@typhoonzero
Contributor

@gangliao I saw this test. I mean checking the right buddy's size after a split and the merged size after a merge, to verify the allocator's internal behavior.

Well, I'm not sure whether this is needed.

@gangliao
Contributor Author

@typhoonzero Yeah, I guess this information is already in here.

@gangliao
Contributor Author

gangliao commented Jul 12, 2017

For instance,

Split:

69: I0712 20:44:41.779479 2569728960 buddy_allocator.cc:240] Split block (0x115d70000, 536870912) into
69: I0712 20:44:41.779484 2569728960 buddy_allocator.cc:244] Left block (0x115d70000, 512)
69: I0712 20:44:41.779489 2569728960 buddy_allocator.cc:251] Insert right block (0x115d70200, 536870400)

512 + 536870400 = 536870912
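
The same check can be applied to every split in the log. The sketch below is a toy model, not the allocator itself: `SplitBlock` and `Remainders` are hypothetical names, and the chunk sizes are copied from the "Left block" lines of the log.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Split {
  std::size_t left;   // size handed out
  std::size_t right;  // size put back into the pool
};

// Splitting a free block at chunk_size leaves the remainder on the right.
Split SplitBlock(std::size_t block_size, std::size_t chunk_size) {
  assert(chunk_size <= block_size);
  return Split{chunk_size, block_size - chunk_size};
}

// Applies successive splits, each one to the previous right remainder, and
// returns the sequence of right-block sizes.
std::vector<std::size_t> Remainders(std::size_t block_size,
                                    const std::vector<std::size_t>& chunks) {
  std::vector<std::size_t> out;
  for (std::size_t chunk : chunks) {
    Split s = SplitBlock(block_size, chunk);
    out.push_back(s.right);
    block_size = s.right;
  }
  return out;
}
```

Starting from 536870912 bytes and splitting off 512, 1280, 4352, 16640, 65792, 262400, 1048832, and 4194560 bytes in turn reproduces the "Insert right block" sizes in the log, ending at 531276544.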

@gangliao
Contributor Author

It's hard to review this PR. Maybe take a look at this page; it explains how it works.

@jacquesqiao
Member

The website looks great!

@gangliao
Contributor Author

gangliao commented Jul 14, 2017

TEST(BuddyAllocator, CPUMultAlloc) {
  paddle::platform::CPUPlace cpu;

  std::unordered_map<void *, size_t> ps;

  size_t total_size = paddle::memory::Used(cpu);
  EXPECT_EQ(total_size, 0UL);

  for (auto size :
       {128, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304}) {
    ps[paddle::memory::Alloc(cpu, size)] = size;

    // Buddy Allocator doesn't manage too large memory chunk
    if (paddle::memory::Used(cpu) == total_size) continue;

    size_t aligned_size = align(size, cpu);
    total_size += aligned_size;

    // check memory block is allocated and split correctly
    EXPECT_EQ(total_size, paddle::memory::Used(cpu));
  }

  for (auto p : ps) {
    // check each memory address is aligned
    EXPECT_EQ(is_aligned(p.first), true);
    paddle::memory::Free(cpu, p.first);

    // Buddy Allocator doesn't manage too large memory chunk
    if (paddle::memory::Used(cpu) == total_size) continue;

    size_t aligned_size = align(p.second, cpu);
    total_size -= aligned_size;

    // check memory block is free and merged correctly
    EXPECT_EQ(total_size, paddle::memory::Used(cpu));
  }
}

I updated the memory test as shown in the code snippet above.

Contributor

@typhoonzero typhoonzero left a comment

The code LGTM!
Anyway, I think we need another LGTM, since this PR is important and big.

@gangliao gangliao requested a review from QiJune July 14, 2017 12:10
@gangliao
Contributor Author

This PR is blocking my other work, so I will merge it first. Any comments are welcome.

@gangliao gangliao merged commit 48cf64e into PaddlePaddle:develop Jul 14, 2017
4 participants