
Why does MPI_Iprobe occupy so much memory #12974

Closed
heilengleng opened this issue Dec 10, 2024 · 12 comments

@heilengleng

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

4.1.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

configure parameters:
'--prefix=/var/test/opt/openmpi415tcp'
'--without-ucx' '--without-verbs'
'--enable-mca-no-build=btl-openib,osc-ucx,pml-ucx'
'--enable-mpi-thread-multiple'

Please describe the system on which you are running

uname -r
5.10.0-60.18.0.50.h1209.x86_64
gcc --version
gcc (GCC) 10.3.1
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         42 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           16
    Stepping:            7
    BogoMIPS:            4400.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    64 MiB (16 instances)
  L3:                    256 MiB (16 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Details of the problem

run command

mpirun -n 2 --bind-to none --mca pml ob1 --mca btl tcp,self --host 192.168.1.6,192.168.1.7 /var/test/build/p2pOneWay_8n8To8n8 100000000

This command sends 100 million 1 KB messages between two servers, about 50 million per server. Each server uses eight queues (eight threads) to send messages to the other server.

Running the top command to monitor the process shows that, after a certain point in time, its memory usage suddenly increases to 16 GB. The following figure shows the memory usage sampled every 2 s.

[image: process memory usage over time, sampled every 2 s]

Tracing the memory usage distribution with other tools shows that MPI_Iprobe accounts for a large amount of the memory, as shown in the following figure.
[image: memory profile showing large allocations attributed to MPI_Iprobe]

Does Open MPI have any special memory requirements or settings for this use case? Why does the memory usage suddenly skyrocket?

This memory usage seems unreasonable to me. Is there any solution?

@ggouaillardet
Contributor

Thanks for the report.
Can you please share your reproducer?

@heilengleng
Author

Thanks for the report. Can you please share your reproducer?
you mean the code?

@ggouaillardet
Contributor

ggouaillardet commented Dec 10, 2024

Yes, please trim your code down to a self-contained program that can be used to demonstrate the issue.

@ggouaillardet
Contributor

Note that there is no flow control in Open MPI. That means that if the sender keeps sending messages and the receiver cannot keep up, the receiver accumulates a large number of unexpected messages, and the associated memory allocations can ultimately lead to memory exhaustion.
Without seeing your program, I cannot tell whether this is the case here.

@heilengleng
Author

Note that there is no flow control in Open MPI. That means that if the sender keeps sending messages and the receiver cannot keep up, the receiver accumulates a large number of unexpected messages, and the associated memory allocations can ultimately lead to memory exhaustion. Without seeing your program, I cannot tell whether this is the case here.

Sorry, some of the application code is not easy to post. However, the key message-receiving function is implemented as shown below. After a message is received, it is put into a queue, and other threads handle the subsequent processing of the message.

void MpiMgr::receive(std::string &buffer, MPI_Comm comm, MPI_Status &status, int &dataLength)
{
    int waitTimes = 0;
    int flag = 0;
    // Busy-poll MPI_Iprobe for up to BUSY_WAIT_TIMES iterations, then fall back to
    // sleeping between probes (both constants are defined elsewhere in the application).
    while (!flag) {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (waitTimes >= BUSY_WAIT_TIMES) {
            std::this_thread::sleep_for(std::chrono::microseconds(WAIT_DURATION_FOR_IPROBE_IN_MICROSECOND));
        } else {
            waitTimes++;
        }
    }
    // Size the buffer to the probed message and receive exactly that message.
    MPI_Get_count(&status, MPI_CHAR, &dataLength);
    buffer.resize(dataLength);
    MPI_Recv(buffer.data(), (int)buffer.size(), MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
}

@heilengleng
Author

heilengleng commented Dec 11, 2024

Yes, please trim your code down to a self-contained program that can be used to demonstrate the issue.
In the complete test code (attached as mpi_example_04.txt below), two servers send 100 million 1 KB messages to each other using MPI_Probe. The same problem occurs: memory usage increases sharply. The test results are as follows:
server1:
[image: memory usage on server 1]
server2:
[image: memory usage on server 2]

Memory usage on one server climbs to almost 31 GB, but strangely, memory usage on the other server stays very small.
mpi_example_04.txt
@ggouaillardet

@ggouaillardet
Contributor

Flow control can be an issue here, but I cannot tell for sure without a reproducer.

Note that you do not really have to use a non-blocking probe (e.g. MPI_Iprobe()); MPI_Probe() will do the trick.

I think the code is otherwise legit, but you could also consider using MPI_Mprobe() and MPI_Mrecv().
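
A minimal sketch of that matched-probe approach, keeping the signature of the receive() shown earlier (the function name here is hypothetical), could look like this:

#include <mpi.h>
#include <string>

// Sketch only: receiveMatched is a hypothetical name mirroring MpiMgr::receive() above.
void receiveMatched(std::string &buffer, MPI_Comm comm, MPI_Status &status, int &dataLength)
{
    MPI_Message msg;
    // Blocking matched probe: the returned message handle can only be received by this
    // thread, so another thread cannot steal the probed message between probe and receive.
    MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg, &status);
    MPI_Get_count(&status, MPI_CHAR, &dataLength);
    buffer.resize(dataLength);
    // Receive exactly the message that was matched above.
    MPI_Mrecv(buffer.data(), dataLength, MPI_CHAR, &msg, MPI_STATUS_IGNORE);
}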

@ggouaillardet
Contributor

Assuming you send the messages with MPI_Send(), you can easily add synchronization (e.g. receive a zero-size message right after each send, and send a zero-size message back right after each message is received).
If that reduces the memory usage, it is a strong hint that the root cause is indeed unexpected messages caused by the lack of flow control.
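
A minimal sketch of that handshake (the names ACK_TAG, send_with_ack and recv_with_ack are hypothetical) could look like this:

#include <mpi.h>

// Sketch only: tag reserved for the zero-size acknowledgement messages.
static const int ACK_TAG = 9999;

// Sender side: after each payload, wait for a zero-size acknowledgement from the receiver.
void send_with_ack(const char *data, int len, int dest, int tag, MPI_Comm comm)
{
    MPI_Send(data, len, MPI_CHAR, dest, tag, comm);
    MPI_Recv(nullptr, 0, MPI_CHAR, dest, ACK_TAG, comm, MPI_STATUS_IGNORE);
}

// Receiver side: after each payload, send a zero-size acknowledgement back to the sender.
void recv_with_ack(char *buf, int maxlen, int src, int tag, MPI_Comm comm, MPI_Status *status)
{
    MPI_Recv(buf, maxlen, MPI_CHAR, src, tag, comm, status);
    MPI_Send(nullptr, 0, MPI_CHAR, status->MPI_SOURCE, ACK_TAG, comm);
}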

@heilengleng
Author

Flow control can be an issue here, but I cannot tell for sure without a reproducer.

Note that you do not really have to use a non-blocking probe (e.g. MPI_Iprobe()); MPI_Probe() will do the trick.

I think the code is otherwise legit, but you could also consider using MPI_Mprobe() and MPI_Mrecv().

#include <mpi.h>
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <string>
#include <boost/lockfree/queue.hpp>
#include <sstream>

const int MAX_MESSAGE_SIZE = 1000 * 1000 * 10;  // 10 MB
int num_msgs = 10000 * 10000;
int size_msg = 1024;
bool show_msg = false;

using namespace std;

auto pQueue = new boost::lockfree::queue<string*,
    boost::lockfree::fixed_sized<true>,
    boost::lockfree::capacity<1024>>();

void send_msgs(int source_rank, int dest_rank, int count) {
    // Pop exactly `count` messages from the shared queue and send each one to dest_rank.
    for (int i = 0; i < count; ++i) {
        string* pMsg = nullptr;
        // Wait until a produced message is available.
        while (!pQueue->pop(pMsg)) {
            std::this_thread::sleep_for(std::chrono::microseconds(1));
        }
        MPI_Send(pMsg->data(), (int)pMsg->size(), MPI_CHAR, dest_rank, 0, MPI_COMM_WORLD);
        if (show_msg) {
            ostringstream oss;
            oss << "Rank " << source_rank << ": Sent msg " << i + 1 << endl;
            cout << oss.str();
        }
        delete pMsg;
    }
}

void produce_msgs(int count) {
    for (int i = 0; i < count; ++i) {
        auto* pMsg = new string(size_msg, 'a');
        while (!pQueue->push(pMsg)) {
            // cout << "queue is full when pushing." << endl;
            std::this_thread::sleep_for(std::chrono::microseconds(1));
        }
    }
}

void recv_msgs(int source_rank, int dest_rank) {
    char* buffer = new char[MAX_MESSAGE_SIZE];
    MPI_Status status;

    int count = 0;
    auto start = chrono::high_resolution_clock::now();

    for (int i = 0; i < num_msgs; ++i) {
        MPI_Probe(source_rank, 0, MPI_COMM_WORLD, &status);  // Probing the incoming message
        int message_size;
        MPI_Get_count(&status, MPI_CHAR, &message_size);  // Getting the size of the incoming message
        MPI_Recv(buffer, message_size, MPI_CHAR, source_rank, 0, MPI_COMM_WORLD, &status);  // Receiving the message

        if (i == 0) {
            start = chrono::high_resolution_clock::now();
        }

        if (show_msg) {
            ostringstream oss;
            oss << "Rank " << dest_rank << ": Received message " << i + 1 << endl;
            cout << oss.str();
        }
    }

    auto end = chrono::high_resolution_clock::now();
    auto duration = chrono::duration_cast<chrono::microseconds>(end - start);
    cout << "Rank " << dest_rank << ": Received " << num_msgs << " messages!" << endl;
    cout << "Rank " << dest_rank << " messages per second: " << num_msgs / (duration.count() / 1000 / 1000.0) << endl;

    delete[] buffer;
}

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        cout << "The MPI implementation does not support MPI_THREAD_MULTIPLE." << endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  
    MPI_Comm_size(MPI_COMM_WORLD, &size); 

    if (size < 2) {
        cout << "This program requires at least two processes." << endl;
        MPI_Abort(MPI_COMM_WORLD, 1);  
    }

    if (rank == 0) {
        vector<thread> producing_threads;
        for (int i = 0; i < 16; ++i) {
            producing_threads.emplace_back(produce_msgs, num_msgs / 16);
        }
        thread sending_thread = thread(send_msgs, 0, 1, num_msgs);
        thread receiving_thread = thread(recv_msgs, 1, 0);

        for (auto& t : producing_threads) {
            if (t.joinable()) {
                t.join();
            }
        }
        if (sending_thread.joinable()) {
            sending_thread.join();
        }
        if (receiving_thread.joinable()) {
            receiving_thread.join();
        }
    } else if (rank == 1) {
        vector<thread> producing_threads;
        for (int i = 0; i < 16; ++i) {
            producing_threads.emplace_back(produce_msgs, num_msgs / 16);
        }
        thread sending_thread = thread(send_msgs, 1, 0, num_msgs);
        thread receiving_thread = thread(recv_msgs, 0, 1);

        for (auto& t : producing_threads) {
            if (t.joinable()) {
                t.join();
            }
        }
        if (sending_thread.joinable()) {
            sending_thread.join();
        }
        if (receiving_thread.joinable()) {
            receiving_thread.join();
        }
    }

    delete pQueue;

    MPI_Finalize();

    return 0;
}

@ggouaillardet
Contributor

Indeed, there is no flow control.
A simple trick you can try is to use MPI_Ssend() instead of MPI_Send() once every n messages (for example, every 100 messages).
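
Applied to the send loop from the reproducer, that trick could look like this minimal sketch (SYNC_INTERVAL and send_one are hypothetical names):

#include <mpi.h>
#include <string>

// Sketch only: issue a synchronous send every SYNC_INTERVAL messages so the sender
// cannot run arbitrarily far ahead of the receiver.
static const int SYNC_INTERVAL = 100;

void send_one(const std::string &msg, int i, int dest_rank)
{
    if ((i + 1) % SYNC_INTERVAL == 0) {
        // MPI_Ssend() does not complete until the matching receive has started.
        MPI_Ssend(msg.data(), (int)msg.size(), MPI_CHAR, dest_rank, 0, MPI_COMM_WORLD);
    } else {
        MPI_Send(msg.data(), (int)msg.size(), MPI_CHAR, dest_rank, 0, MPI_COMM_WORLD);
    }
}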

@bosilca
Member

bosilca commented Dec 11, 2024

@ggouaillardet is correct: your processes get desynchronized, and the receivers have to buffer the unexpected messages. The solution @ggouaillardet proposes (i.e., using an MPI_Ssend() regularly) is generic and only loosely synchronizing.

You can also just lower the eager size in Open MPI to force a handshake for each message. The outcome will be similar: an MPI_Send() will not complete before the corresponding MPI_Recv() has been posted, providing very strong synchronization between each pair of processes.
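
For reference, with the TCP BTL used in the original run command, the eager size is controlled by the btl_tcp_eager_limit MCA parameter; lowering it below the 1 KB message size should force the rendezvous path. The value below is only an illustration; check ompi_info for the exact defaults and constraints on your build:

mpirun -n 2 --bind-to none --mca pml ob1 --mca btl tcp,self \
       --mca btl_tcp_eager_limit 512 \
       --host 192.168.1.6,192.168.1.7 /var/test/build/p2pOneWay_8n8To8n8 100000000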

@heilengleng
Author

heilengleng commented Dec 12, 2024

Indeed, there is no flow control. A simple trick you can try is to use MPI_Ssend() instead of MPI_Send() once every n messages (for example, every 100 messages).

@ggouaillardet
Thank you very much. As you said, there is no flow control. After switching to MPI_Ssend(), the memory usage stays very small.
Thanks again.
