Kokkos::ViewAllocateWithoutInitializing is not working #317

mndevec · 2016-06-06T16:49:15Z

Allocations with initializations are sometimes expensive and unnecessary and I was frequently using Kokkos::ViewAllocateWithoutInitializing to avoid that. It seems that it has been disabled at some point, and view allocations seem to be always initialized with 0's now. Is there a new way to avoid initializations?

hcedwar · 2016-06-06T18:42:02Z

Are you getting a compile error or is initialization happening regardless of input?

mndevec · 2016-06-06T18:44:02Z

It is not a compilation issue. No matter what I set, the views are initialized with 0's.

dsunder · 2016-06-06T18:56:57Z

It could be that the memory was just allocated with mmap, in which case it would always be set to zero. For a large allocation this is the most likely explanation.

--Dan
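
The mmap behavior described above can be checked on the host with a small standalone program. This is a minimal sketch, assuming a Linux-like system where `MAP_ANONYMOUS` is available; `fresh_mapping_is_zero` is a hypothetical helper written for illustration and is not part of Kokkos. It only shows that freshly mapped anonymous pages read as zero even if an earlier mapping was overwritten and released; it says nothing about what a particular malloc implementation or the CUDA runtime does.

```cpp
#include <sys/mman.h>  // mmap, munmap (POSIX; MAP_ANONYMOUS is a Linux/BSD extension)

#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: maps `bytes` of anonymous memory, checks whether every
// byte reads as zero, optionally overwrites the region with a non-zero pattern,
// and unmaps it. Returns true if the region was all zeros when first obtained.
bool fresh_mapping_is_zero(std::size_t bytes, bool dirty_before_unmap) {
  assert(bytes > 0);
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return false;  // treat a failed mapping as "not verified"

  const unsigned char* c = static_cast<const unsigned char*>(p);
  bool all_zero = true;
  for (std::size_t i = 0; i < bytes; ++i) {
    if (c[i] != 0) {
      all_zero = false;
      break;
    }
  }

  // Leave non-zero data behind so a later mapping cannot be zero "by accident".
  if (dirty_before_unmap) std::memset(p, 0xAB, bytes);
  munmap(p, bytes);
  return all_zero;
}
```

Calling the helper twice, first with `dirty_before_unmap = true` and then again with the same size, mirrors the allocate, write, release, reallocate sequence: the second mapping should still read as zero, whether or not it lands at the same address.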

mndevec · 2016-06-06T19:13:56Z

Below is the test example I was using to test initialization. I allocate a view, change its values, and print it. Then I release that view, reallocate it, and print it again. Although it gets the exact same memory address, its values appear to be 0 in the second print.

#include <cstdlib>   // atoi
#include <iostream>
#include <Kokkos_Core.hpp>


typedef typename Kokkos::Cuda MyMemorySpace;
typedef typename Kokkos::Cuda MyExecSpace;
typedef Kokkos::View <int *, MyExecSpace> myview;

template <typename array_type>
struct LinearInitialization{
  typedef typename array_type::value_type idx;
  array_type array_sum;
  LinearInitialization(array_type arr_): array_sum(arr_){}

  KOKKOS_INLINE_FUNCTION
  void operator()(const size_t ii) const {
    array_sum(ii) = ii;
  }
};


template <typename array_type, typename MyExecSpace>
void linear_init(typename array_type::value_type num_elements, array_type arr){
  typedef Kokkos::RangePolicy<MyExecSpace> my_exec_space;
  Kokkos::parallel_for( my_exec_space(0, num_elements), LinearInitialization<array_type>(arr));
}


template <typename idx_array_type>
void print_1Dview(idx_array_type view, bool print_all = false){

  typedef typename idx_array_type::HostMirror host_type;
  typedef typename idx_array_type::size_type idx;
  host_type host_view = Kokkos::create_mirror_view (view);
  Kokkos::deep_copy (host_view , view);
  idx nr = host_view.dimension_0();
  if (!print_all){


    if (nr > 20){
      idx n = 10;
      for (idx i = 0; i < n; ++i){
        std::cout << host_view(i) << " ";
      }
      std::cout << "... ... ... ";

      for (idx i = nr-n; i < nr; ++i){
        std::cout << host_view(i) << " ";
      }
      std::cout << std::endl;
    }
    else {
      for (idx i = 0; i < nr; ++i){
        std::cout << host_view(i) << " ";
      }
      std::cout << std::endl;
    }
  }
  else {
    for (idx i = 0; i < nr; ++i){
      std::cout << host_view(i) << " ";
    }
    std::cout << std::endl;
  }
}




int  main (int  argc, char ** argv){
  Kokkos::initialize(argc, argv);
  MyExecSpace::print_configuration(std::cout);
  int nnz = 100;
  if (argc >= 2)
    nnz = atoi(argv[1]);

  std::cout << "Allocating and initializing view with size:" << nnz << std::endl;

  myview noInitializeView(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();

  std::cout << "noInitializeView.ptr_on_device():" << noInitializeView.ptr_on_device() << std::endl;

  linear_init<myview, MyExecSpace>(nnz, noInitializeView );
  MyExecSpace::fence();

  print_1Dview(noInitializeView);
  MyExecSpace::fence();

  noInitializeView = myview();
  MyExecSpace::fence();

  noInitializeView = myview(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();

  std::cout << "noInitializeView.ptr_on_device():" << noInitializeView.ptr_on_device() << std::endl;
  print_1Dview(noInitializeView);

  Kokkos::finalize();
  return 0;
}

mndevec · 2016-06-06T19:25:07Z

On Cuda this happens for any size I provided. But on OpenMP, as you said, it happens for larger allocations (> ~40000 on shannon).

dsunder · 2016-06-06T19:50:37Z

I just checked the view implementation: the constructor that accepts the without-initializing argument does not appear to touch the memory. I believe that what you are seeing is a side effect of the underlying system allocator. If you are using gcc, try linking against either tcmalloc or jemalloc (can be done with an LD_PRELOAD) and see if you observe the same behavior. Since they each use a per-thread arena to allocate memory, mmap is called less frequently (though this could be suboptimal when there are multiple NUMA regions). If you are using Intel it is harder to change the system allocator because we use Intel built-in methods to allocate memory.

--Dan

mndevec · 2016-06-07T22:47:06Z

Thanks Dan, it would take some time to try that out, but here is another thing about this issue.

Allocating 100M integers with a no-initialize view takes 0.149532 seconds in the Kokkos::Cuda space. Allocating the same memory with cudaMalloc takes 0.000432 seconds, and initializing it adds only 0.002089 seconds. I wonder why view allocations add that much overhead. I can post test code for this as well.

dsunder · 2016-06-07T23:29:49Z

Views do a lot more than just call malloc: a thread-safe record is also created that allows for reference counting, texture binding, bounds checking, and leak detection (among other things). I'm not surprised that it is significantly slower than a raw malloc. If this performance is a bottleneck for you and you do not need the RandomAccess memory trait, you can use unmanaged views and pass in your own pointer that you've allocated with cudaMalloc; this will also avoid initializing the memory. You could also look at using the MemoryPool provided by Kokkos, though I have no experience with its allocation performance.

When we designed Kokkos we assumed the allocation overhead would be in the noise and overshadowed by the benefits of using texture cache and leak detection. If this is showing up as a significant time sink, we may need to revisit some of our design decisions. Would you happen to have profile data that you can share with us?

Thanks,

--Dan

crtrott · 2016-06-08T05:06:16Z

Actually, I believe something seriously weird is going on with our allocations. I have occasionally seen initialization poking its head out in profiling data in places where I definitely did not expect it to show up, i.e. it took up way more time (like 100x more) than I would expect. I think we need to track this down and understand exactly what is going on here. Usually I was preoccupied with other things and customers were not complaining about it, but I think now it's time to find the root cause.

nmhamster · 2016-06-08T06:07:46Z

Is that from sampled profiling or from simple timers?

mndevec · 2016-06-08T14:33:50Z

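Those are simple timers, using Kokkos timers. Below are some timings for different allocation sizes.

[image: alloc]
https://cloud.githubusercontent.com/assets/15694785/15897962/ad3e7d42-2d53-11e6-8452-00eb53bfea9b.png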

dsunder · 2016-06-08T16:10:40Z

Hello Mehmet,

Could you try compiling your code with

-DKOKKOS_USING_EXP_VIEW=0

I don't believe that it will make a difference, but I want to rule it out.

Thanks,

--Dan S

crtrott · 2016-06-08T16:36:25Z

My data was coming from simple timing not sampling.

crtrott · 2016-06-08T16:37:38Z

Mehmet, what type of view was it? Can you post replication code?

mndevec · 2016-06-08T17:05:11Z

It is below. UVM is on. When I try it on Kokkos::OpenMP, it does not appear to be an issue.
Mehmet

#include <cstdlib>   // atoi
#include <iostream>
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>

typedef typename Kokkos::Cuda MyExecSpace;
typedef Kokkos::View <int *, MyExecSpace> myview;

int  main (int  argc, char ** argv){
  Kokkos::initialize(argc, argv);
  MyExecSpace::print_configuration(std::cout);
  int nnz = 100;
  if (argc >= 2)
    nnz = atoi(argv[1]);
  std::cout << "Allocating and initializing view with size:" << nnz << std::endl;
  int *a_d;

  MyExecSpace::fence();
  Kokkos::Impl::Timer timer1;
  cudaMalloc((void **) &a_d, sizeof(int) * nnz);   // Allocate array on device
  MyExecSpace::fence();
  cudaThreadSynchronize();
  std::cout << "\tCuda Allocation Time:" << timer1.seconds() << std::endl;

  Kokkos::View<int*, Kokkos::Cuda, Kokkos::MemoryUnmanaged> a_d_view (a_d, nnz);
  MyExecSpace::fence();
  timer1.reset();
  Kokkos::deep_copy (a_d_view, 42);
  MyExecSpace::fence();
  cudaThreadSynchronize ();
  std::cout << "\tKokkos::deep_copy fill time: " << timer1.seconds () << std::endl;


  MyExecSpace::fence();
  timer1.reset();
  myview noInitializeView(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();
  std::cout << "\tAllocation Time - 1:" << timer1.seconds() << std::endl;

  MyExecSpace::fence();
  timer1.reset();
  myview noInitializeView2(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();
  std::cout << "\tView Allocation Time-2:" << timer1.seconds() << std::endl;

  Kokkos::finalize();
  return 0;
}

crtrott · 2016-06-08T17:21:18Z

Hm, for me this works as expected. The deep_copy is running at 180 GB/s, and the view allocations take the same time as the cudaMalloc.

crtrott · 2016-06-08T17:21:36Z

Oh wait I didn't enable UVM ...

crtrott · 2016-06-08T17:25:17Z

With UVM the first View allocation is slow. On the other hand, that is the first UVM allocation happening in the system.

crtrott · 2016-06-08T17:27:17Z

Yeah, changing the cudaMalloc to a cudaMallocManaged makes that call slow, but now both View allocations are fast.

mndevec · 2016-06-08T17:39:19Z

Christian,
So is that a problem with UVM on? Do you experience it only at the first allocation?
If I change cudaMalloc to cudaMallocManaged, I get the below output, where all allocations are still slow.
Allocating and initializing view with size:1000000000
Cuda Allocation Time:1.54331
Kokkos::deep_copy fill time: 0.340411
Allocation Time - 1:1.51073
View Allocation Time-2:1.49391

crtrott · 2016-06-08T17:48:38Z

OK, did you set CUDA_MANAGED_FORCE_DEVICE_ALLOC=1?
Because a size of 1B should actually not work (you would use more than the available memory). So I think you are actually doing host-pinned allocations.
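
To put a number on the "1B size" remark: the timing program keeps three arrays alive at once (the raw allocation and the two views), so the request can be estimated with a few lines of arithmetic. This is only a sketch; `footprint_bytes` is a hypothetical helper, a 4-byte `int` is assumed, and the memory capacity of the GPU in question is not stated in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes needed for `count` arrays of `n` elements,
// each element being `elem_size` bytes wide.
std::uint64_t footprint_bytes(std::uint64_t n, std::uint64_t elem_size,
                              std::uint64_t count) {
  assert(elem_size > 0);  // a zero-sized element would make the estimate meaningless
  return n * elem_size * count;
}

// Converts a byte count to GiB (2^30 bytes) for easier comparison with the
// memory size a device reports.
double to_gib(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

With `n = 1000000000` and 4-byte elements, one array needs 4,000,000,000 bytes (about 3.7 GiB) and the three arrays together need 12,000,000,000 bytes (about 11.2 GiB).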

crtrott · 2016-06-08T17:50:24Z

Oh yeah, I just confirmed that: you didn't set CUDA_MANAGED_FORCE_DEVICE_ALLOC. Setting it to zero replicates your numbers. And yes, that is expected to be much slower because the OS has to shuffle around physical pages in order to free up a big contiguous chunk of memory.
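
Since a missing environment variable is easy to overlook, a benchmark could check it up front and print a warning before any timing is taken. The following is a minimal sketch: `env_flag_enabled` and `force_device_alloc_enabled` are hypothetical helpers, not Kokkos or CUDA functions, and treating any non-zero integer as "enabled" is an assumption based on the observation above that zero reproduces the slow numbers.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: true if the environment variable `name` is set to a
// non-zero integer; false if it is unset, empty, zero, or not a number.
bool env_flag_enabled(const char* name) {
  assert(name != nullptr);
  const char* value = std::getenv(name);
  if (value == nullptr || *value == '\0') return false;  // unset or empty
  char* end = nullptr;
  const long parsed = std::strtol(value, &end, 10);
  return end != value && parsed != 0;  // some digits were read and the value is non-zero
}

// Assumed usage: call this before Kokkos::initialize and warn when it returns
// false, because UVM allocations may then be much slower than expected.
bool force_device_alloc_enabled() {
  return env_flag_enabled("CUDA_MANAGED_FORCE_DEVICE_ALLOC");
}
```

The variable has to be set in the environment before the program starts, for example on the command line that launches the test.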

crtrott · 2016-06-08T17:51:42Z

Also, I believe my slow numbers for initialization were related to multi-dimensional views, and probably to the View initialization using a non-layout-aware algorithm, which results in bad memory access patterns.

mndevec · 2016-06-08T18:07:48Z

Christian,
Yes, setting CUDA_MANAGED_FORCE_DEVICE_ALLOC seems to make the allocations faster. It also solved the problem of getting initialized data on the views even with ViewAllocateWithoutInitializing.
Thanks a lot!

mndevec closed this as completed Jun 8, 2016

aprokop mentioned this issue Sep 3, 2017

When to use ViewAllocWithoutInitializing? #1073

Closed
