
Runtime failures on larger process counts #66

Open
krzikalla opened this issue Jul 29, 2021 · 2 comments

Comments

@krzikalla

I run into all kinds of trouble when I try running GPI-2 1.5 with somewhat larger process counts. Up to 128 processes everything is fine, but starting with 256 processes the program stops with all kinds of unreproducible errors. I have tried this on two InfiniBand clusters.

Has something changed from 1.3 to 1.5 so that a gaspi_proc_init with GASPI_TOPOLOGY_STATIC is no longer advisable for those process counts? And if I use GASPI_TOPOLOGY_NONE and connect only the neighbors by hand, can I still use the GASPI collectives?
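For reference, a minimal sketch of the GASPI_TOPOLOGY_NONE variant in question, assuming GPI-2's gaspi_config_t / build_infrastructure interface; the left/right neighbor ranks are placeholders and error checking is omitted:

#include "GASPI.h"

// Sketch only: disable the static connection build at init time and connect
// selected neighbors by hand. Whether collectives over GASPI_GROUP_ALL still
// work in this mode is exactly the open question above.
void InitWithoutStaticTopology(gaspi_rank_t left, gaspi_rank_t right)
{
  gaspi_config_t config;
  gaspi_config_get(&config);
  config.build_infrastructure = GASPI_TOPOLOGY_NONE;  // no connections at init
  gaspi_config_set(config);                           // must precede gaspi_proc_init

  gaspi_proc_init(GASPI_BLOCK);

  // Connect only the ranks this process actually communicates with.
  gaspi_connect(left, GASPI_BLOCK);
  gaspi_connect(right, GASPI_BLOCK);
}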

@krzikalla
Author

The following program fails reliably with 1024 processes on our cluster. Could someone please look at it? Apparently something in the collectives is broken (GPI-2 v1.5.0), or else it is my AllGatherValueImpl function.

//  mpicxx gaspi_segment.cpp -pthread -I$GPI2_HOME/include -L$GPI2_HOME/lib64 -lGPI2

#include <algorithm>   // std::fill_n, std::min
#include <iostream>
#include <stdexcept>   // std::runtime_error
#include <vector>
#include <mpi.h>
#include "GASPI.h"


// Throw on any GASPI error; GASPI_TIMEOUT is passed through as a non-fatal result.
inline gaspi_return_t CheckGaspiResult(gaspi_return_t result, const char* what)
{
  if (result != GASPI_SUCCESS && result != GASPI_TIMEOUT)
  {
    throw std::runtime_error(what);
  }
  return result;
}

#define GASPI_CHECK( X ) CheckGaspiResult((X), #X)


using RankIndexT = unsigned int;

struct GASPICommunicator
{
  gaspi_rank_t numProcs_;
  gaspi_rank_t ownRank_;
  gaspi_number_t maxReduceElems_;

  GASPICommunicator()
  {
    GASPI_CHECK(gaspi_allreduce_elem_max(&maxReduceElems_));
    GASPI_CHECK(gaspi_proc_rank(&ownRank_));
    GASPI_CHECK(gaspi_proc_num(&numProcs_));
  }

  // Emulates an allgather of one unsigned int per rank: every rank writes its
  // value into its own slot of a zero-initialized buffer and the buffers are
  // summed with gaspi_allreduce. Since the buffer holds numProcs_ elements,
  // the reduction is split into chunks of at most maxReduceElems_ elements.
  void AllGatherValueImpl(const unsigned int* values, unsigned int* data)
  {
    std::fill_n(data, numProcs_, 0u);
    data[ownRank_] = *values;
    gaspi_number_t remainingElems = gaspi_number_t(numProcs_);
    while (remainingElems > 0)
    {
      auto reduceElems = std::min(maxReduceElems_, remainingElems);
      GASPI_CHECK(gaspi_allreduce(data, data, reduceElems, GASPI_OP_SUM, GASPI_TYPE_UINT,
                                  GASPI_GROUP_ALL, GASPI_BLOCK));
      remainingElems -= reduceElems;
      data += reduceElems;
    }
  }
};

void CheckAllreduce()
{
  GASPICommunicator communicator;
  unsigned int value = communicator.ownRank_;
  // Initialize with an impossible rank value so missing entries are visible.
  std::vector<unsigned int> allData(communicator.numProcs_, ~0u);
  communicator.AllGatherValueImpl(&value, allData.data());
  // After the gather, slot i must contain rank i on every process.
  for (gaspi_rank_t i = 0; i < communicator.numProcs_; ++i)
  {
    if (allData[i] != i)
    {
      std::cout << "At rank " << value << " first fail at " << i << ", content is " << allData[i] << std::endl;
      return;
    }
  }
  std::cout << "At rank " << value << " all OK." << std::endl;
}


int main(int argc, char** argv)
{
  // MPI is initialized first so that mpirun can be used as the launcher.
  int provided_thread_level;
  if (MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided_thread_level) != MPI_SUCCESS)
  {
    return 1;
  }

  GASPI_CHECK(gaspi_proc_init(GASPI_BLOCK));

  CheckAllreduce();

  GASPI_CHECK(gaspi_proc_term(GASPI_BLOCK));
  MPI_Finalize();
  return 0;
}

@krzikalla
Author

Update on this issue: the cause seems to be the PCI_WR_ORDERING firmware setting. If it is set to per_mkey(0), everything is fine. If it is set to force_relax(1), races occur.
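For reference, a sketch of how this setting can be inspected and reverted with Mellanox's mlxconfig tool; the device path below is a placeholder, and the change typically only takes effect after a firmware reset or reboot:

# Placeholder device path; list devices with "mst status" first.
mlxconfig -d /dev/mst/mt4123_pciconf0 query PCI_WR_ORDERING
mlxconfig -d /dev/mst/mt4123_pciconf0 set PCI_WR_ORDERING=0   # per_mkey(0)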
