Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kokkos::ViewAllocateWithoutInitializing is not working #317

Closed
mndevec opened this issue Jun 6, 2016 · 24 comments
Closed

Kokkos::ViewAllocateWithoutInitializing is not working #317

mndevec opened this issue Jun 6, 2016 · 24 comments

Comments

@mndevec
Copy link

mndevec commented Jun 6, 2016

Allocations with initializations are sometimes expensive and unnecessary and I was frequently using Kokkos::ViewAllocateWithoutInitializing to avoid that. It seems that it has been disabled at some point, and view allocations seem to be always initialized with 0's now. Is there a new way to avoid initializations?

@hcedwar
Copy link
Contributor

hcedwar commented Jun 6, 2016

Are you getting a compile error or is initialization happening regardless of input?

@mndevec
Copy link
Author

mndevec commented Jun 6, 2016

It is not a compilation issue. No matter what I set, the views are initialized with 0's.

@dsunder
Copy link
Contributor

dsunder commented Jun 6, 2016

I could be the the memory was just allocated with mmap, the memory would then always be set to zero. If it is a large allocation this is most likely.

--Dan

On Jun 6, 2016, at 12:44, Mehmet Deveci [email protected] wrote:

It is not a compilation issue. No matter what I set, the views are initialized with 0's.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@mndevec
Copy link
Author

mndevec commented Jun 6, 2016

Below is the test example I was using to test initialization. I allocate a view change its values, print it. Then I release that one, and reallocate it and print it. Although it gets the exact same memory address its values appears to be 0 in the second print.

#include <iostream>
#include <Kokkos_Core.hpp>


typedef typename Kokkos::Cuda MyMemorySpace;
typedef typename Kokkos::Cuda MyExecSpace;
typedef Kokkos::View <int *, MyExecSpace> myview;

template <typename array_type>
struct LinearInitialization{
  typedef typename array_type::value_type idx;
  array_type array_sum;
  LinearInitialization(array_type arr_): array_sum(arr_){}

  KOKKOS_INLINE_FUNCTION
  void operator()(const size_t ii) const {
    array_sum(ii) = ii;
  }
};


template <typename array_type, typename MyExecSpace>
void linear_init(typename array_type::value_type num_elements, array_type arr){
  typedef Kokkos::RangePolicy<MyExecSpace> my_exec_space;
  Kokkos::parallel_for( my_exec_space(0, num_elements), LinearInitialization<array_type>(arr));
}


template <typename idx_array_type>
void print_1Dview(idx_array_type view, bool print_all = false){

  typedef typename idx_array_type::HostMirror host_type;
  typedef typename idx_array_type::size_type idx;
  host_type host_view = Kokkos::create_mirror_view (view);
  Kokkos::deep_copy (host_view , view);
  idx nr = host_view.dimension_0();
  if (!print_all){


    if (nr > 20){
      idx n = 10;
      for (idx i = 0; i < n; ++i){
        std::cout << host_view(i) << " ";
      }
      std::cout << "... ... ... ";

      for (idx i = nr-n; i < nr; ++i){
        std::cout << host_view(i) << " ";
      }
      std::cout << std::endl;
    }
    else {
      for (idx i = 0; i < nr; ++i){
        std::cout << host_view(i) << " ";
      }
      std::cout << std::endl;
    }
  }
  else {
    for (idx i = 0; i < nr; ++i){
      std::cout << host_view(i) << " ";
    }
    std::cout << std::endl;
  }
}




int  main (int  argc, char ** argv){
  Kokkos::initialize(argc, argv);
  MyExecSpace::print_configuration(std::cout);
  int nnz = 100;
  if (argc >= 2)
  nnz = atoi(argv[1]);

  std::cout << "Allocating and initializing view with size:" << nnz << std::endl;

  myview noInitializeView(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();

  std::cout << "noInitializeView.ptr_on_device():" << noInitializeView.ptr_on_device() << std::endl;

  linear_init<myview, MyExecSpace>(nnz, noInitializeView );
  MyExecSpace::fence();

  print_1Dview(noInitializeView);
  MyExecSpace::fence();

  noInitializeView = myview();
  MyExecSpace::fence();

  noInitializeView = myview(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();

  std::cout << "noInitializeView.ptr_on_device():" << noInitializeView.ptr_on_device() << std::endl;
  print_1Dview(noInitializeView);

  Kokkos::finalize();
  return 0;
}

@mndevec
Copy link
Author

mndevec commented Jun 6, 2016

On Cuda this happens for any size I provided. But on OpenMP, as you said it happens for larger allocations (> ~40000 on shannon)

@dsunder
Copy link
Contributor

dsunder commented Jun 6, 2016

I just checked the view implementation, the constructor that accepts the without initializing argument does not appear to touch the memory. I believe that what you are seeing is a side effect of the underlying system allocator. If you are using gcc, try linking against either tcmalloc or jemalloc (can be done with an LD_PRELOAD) and see if you observe the same behavior. Since they each use a per thread arena to allocate memory, mmap is called less frequently (though this could be sub optimal when there are multiple numa regions). If you are using Intel it is harder to change the system allocator because we use intel builtin methods to allocate memory.

--Dan

On Jun 6, 2016, at 13:25, Mehmet Deveci [email protected] wrote:

On Cuda this happens for any size matrix I provided. But on OpenMP, as you said it happens for larger allocations (> ~40000 on shannon)


You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.

@mndevec
Copy link
Author

mndevec commented Jun 7, 2016

Thanks Dan, it would take some time to try that out, but here is another thing about this issue.

Allocating 100M integers with a no initialize view takes 0.149532 seconds on Kokkos::Cuda space. Allocating same memory with cudaMalloc takes 0.000432, and initializing it would add only 0.002089 seconds. I wonder why view allocations add up that much overhead. I can post a test-code for this as well.

@dsunder
Copy link
Contributor

dsunder commented Jun 7, 2016

Views do a lot more than just calling malloc, there is also a thread safe record created that allows for reference counting, texture binding, bounds checking, and leak detection (among other things). I'm not surprised that it is significantly slower than just a raw malloc. If this performance is a bottleneck for you and you do not need the RandomAccess memory trait you can use unmanaged views and pass in your own pointer that you've allocated with cudaMalloc, this will also avoid initializing the memory. You could also look at using the MemoryPool provided by kokkos, though I have no experience with its allocation performance.

When we designed kokkos we assumed the allocation overhead would be in the noise and overshadowed by the benefits of using texture cache and leak detection. If this is a showing up as a significant time sink we may need to address some of our design decisions. Would you happen to have profile data that you can share with us?

Thanks,

--Dan

On Jun 7, 2016, at 16:47, Mehmet Deveci [email protected] wrote:

Thanks Dan, it would take some time to try that out, but here is another thing about this issue.

Allocating 100M integers with a no initialize view takes 0.149532 seconds on Kokkos::Cuda space. Allocating same memory with cudaMalloc takes 0.000432, and initializing it would add only 0.002089 seconds. I wonder why view allocations adds up that much overhead. I can post a test-code for this as well.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Actually I believe something serious weird is going on with our allocations. I have occasionally seen Initialization poking its head out in profiling data in places where I definitely did not expect it to show up. I.e. it took up way more time (like 100x more) than I would expect. I think we need to track this down and understand exactly what is going on here. Usually I was preoccupied with other things and customers were not complaining about it, but I think now its time to find the root cause.

@nmhamster
Copy link
Contributor

Is that sampled or simple timer for profiling?

@mndevec
Copy link
Author

mndevec commented Jun 8, 2016

That is simple timers using Kokkos timers. Below is some timings for different allocation sizes.
alloc

@dsunder
Copy link
Contributor

dsunder commented Jun 8, 2016

Hello Mehmet,

Could you try compiling your code with

-DKOKKOS_USING_EXP_VIEW=0

I don't believe that it will make a difference, but I want to rule it out.

Thanks,

--Dan S

On Wed, Jun 8, 2016 at 8:33 AM, Mehmet Deveci [email protected]
wrote:

That is simple timers using Kokkos timers. Below is some timings for
different allocation sizes.
[image: alloc]
https://cloud.githubusercontent.com/assets/15694785/15897962/ad3e7d42-2d53-11e6-8452-00eb53bfea9b.png


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#317 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AIMHEiuICpjflUHNlMXoIEvnJ8KrVF_Vks5qJtLQgaJpZM4IvGDW
.

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

My data was coming from simple timing not sampling.

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Mehmet what type of view was it? Can you post replication code.

@mndevec
Copy link
Author

mndevec commented Jun 8, 2016

It is below. UVM is on. When I try it on Kokkos::OpenMP, it does not appear to be an issue.
Mehmet

#include <iostream>
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>

typedef typename Kokkos::Cuda MyExecSpace;
typedef Kokkos::View <int *, MyExecSpace> myview;

int  main (int  argc, char ** argv){
  Kokkos::initialize(argc, argv);
  MyExecSpace::print_configuration(std::cout);
  int nnz = 100;
  if (argc >= 2)
  nnz = atoi(argv[1]);
  std::cout << "Allocating and initializing view with size:" << nnz << std::endl;
  int *a_d;

  MyExecSpace::fence();
  Kokkos::Impl::Timer timer1;
  cudaMalloc((void **) &a_d, sizeof(int) * nnz);   // Allocate array on device
  MyExecSpace::fence();
  cudaThreadSynchronize();
  std::cout << "\tCuda Allocation Time:" << timer1.seconds() << std::endl;

  Kokkos::View<int*, Kokkos::Cuda, Kokkos::MemoryUnmanaged> a_d_view (a_d, nnz);
  MyExecSpace::fence();
  timer1.reset();
  Kokkos::deep_copy (a_d_view, 42);
  MyExecSpace::fence();
  cudaThreadSynchronize ();
  std::cout << "\tKokkos::deep_copy fill time: " << timer1.seconds () << std::endl;


  MyExecSpace::fence();
  timer1.reset();
  myview noInitializeView(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();
  std::cout << "\tAllocation Time - 1:" << timer1.seconds() << std::endl;

  MyExecSpace::fence();
  timer1.reset();
  myview noInitializeView2(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
  MyExecSpace::fence();
  std::cout << "\tView Allocation Time-2:" << timer1.seconds() << std::endl;

  Kokkos::finalize();
  return 0;
}

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Hm for me this works as expected. The DeepCopy is running at 180GB/s the view allocations take the same time as the cuda malloc.

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Oh wait I didn't enable UVM ...

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

WIth UVM the first View allocation is slow. On the other hand that is the first UVM allocation happening in the system.

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Yeah making the cudaMalloc a cudaMallocManaged makes that slow, but now both VIew allocations are fast.

@mndevec
Copy link
Author

mndevec commented Jun 8, 2016

Christian,
So is that a problem with UVM on? Do you experience it only at the first allocation?
If I change cudaMalloc to cudaMallocManaged, I get the below output, where all allocations are still slow.
Allocating and initializing view with size:1000000000
Cuda Allocation Time:1.54331
Kokkos::deep_copy fill time: 0.340411
Allocation Time - 1:1.51073
View Allocation Time-2:1.49391

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Ok did you set CUDA_MANAGED_FORCE_DEVICE_ALLOC=1?
Because 1B size should actually not work (you use more than the available memory). So I think you actually do host pinned allocations.

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Oh yeah I just confirmed that. You didn't set CUDA_MANAGED_FORCE_DEVICE_ALLOC. Setting it to zero replicates your numbers. And yes that is expected to be much slower because the OS has to shuffle around physical pages in order to a get a consecutive big chunk of memory freed.

@crtrott
Copy link
Member

crtrott commented Jun 8, 2016

Also I believe my slow numbers for initialization were related to multi dimensional views, and probably the View Initialization using a non-layout aware algorithm which results in bad memory access patterns.

@mndevec
Copy link
Author

mndevec commented Jun 8, 2016

Christian,
Yes, setting CUDA_MANAGED_FORCE_DEVICE_ALLOC seems to make the allocations faster. It also solved the problem of getting initialized data on the views even with ViewAllocateWithoutInitializing.
Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants