-
Notifications
You must be signed in to change notification settings - Fork 440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kokkos::ViewAllocateWithoutInitializing is not working #317
Comments
Are you getting a compile error or is initialization happening regardless of input? |
It is not a compilation issue. No matter what I set, the views are initialized with 0's. |
I could be the the memory was just allocated with mmap, the memory would then always be set to zero. If it is a large allocation this is most likely. --Dan
|
Below is the test example I was using to test initialization. I allocate a view change its values, print it. Then I release that one, and reallocate it and print it. Although it gets the exact same memory address its values appears to be 0 in the second print. #include <iostream>
#include <Kokkos_Core.hpp>
typedef typename Kokkos::Cuda MyMemorySpace;
typedef typename Kokkos::Cuda MyExecSpace;
typedef Kokkos::View <int *, MyExecSpace> myview;
template <typename array_type>
struct LinearInitialization{
typedef typename array_type::value_type idx;
array_type array_sum;
LinearInitialization(array_type arr_): array_sum(arr_){}
KOKKOS_INLINE_FUNCTION
void operator()(const size_t ii) const {
array_sum(ii) = ii;
}
};
template <typename array_type, typename MyExecSpace>
void linear_init(typename array_type::value_type num_elements, array_type arr){
typedef Kokkos::RangePolicy<MyExecSpace> my_exec_space;
Kokkos::parallel_for( my_exec_space(0, num_elements), LinearInitialization<array_type>(arr));
}
template <typename idx_array_type>
void print_1Dview(idx_array_type view, bool print_all = false){
typedef typename idx_array_type::HostMirror host_type;
typedef typename idx_array_type::size_type idx;
host_type host_view = Kokkos::create_mirror_view (view);
Kokkos::deep_copy (host_view , view);
idx nr = host_view.dimension_0();
if (!print_all){
if (nr > 20){
idx n = 10;
for (idx i = 0; i < n; ++i){
std::cout << host_view(i) << " ";
}
std::cout << "... ... ... ";
for (idx i = nr-n; i < nr; ++i){
std::cout << host_view(i) << " ";
}
std::cout << std::endl;
}
else {
for (idx i = 0; i < nr; ++i){
std::cout << host_view(i) << " ";
}
std::cout << std::endl;
}
}
else {
for (idx i = 0; i < nr; ++i){
std::cout << host_view(i) << " ";
}
std::cout << std::endl;
}
}
int main (int argc, char ** argv){
Kokkos::initialize(argc, argv);
MyExecSpace::print_configuration(std::cout);
int nnz = 100;
if (argc >= 2)
nnz = atoi(argv[1]);
std::cout << "Allocating and initializing view with size:" << nnz << std::endl;
myview noInitializeView(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
MyExecSpace::fence();
std::cout << "noInitializeView.ptr_on_device():" << noInitializeView.ptr_on_device() << std::endl;
linear_init<myview, MyExecSpace>(nnz, noInitializeView );
MyExecSpace::fence();
print_1Dview(noInitializeView);
MyExecSpace::fence();
noInitializeView = myview();
MyExecSpace::fence();
noInitializeView = myview(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
MyExecSpace::fence();
std::cout << "noInitializeView.ptr_on_device():" << noInitializeView.ptr_on_device() << std::endl;
print_1Dview(noInitializeView);
Kokkos::finalize();
return 0;
} |
On Cuda this happens for any size I provided. But on OpenMP, as you said it happens for larger allocations (> ~40000 on shannon) |
I just checked the view implementation, the constructor that accepts the without initializing argument does not appear to touch the memory. I believe that what you are seeing is a side effect of the underlying system allocator. If you are using gcc, try linking against either tcmalloc or jemalloc (can be done with an LD_PRELOAD) and see if you observe the same behavior. Since they each use a per thread arena to allocate memory, mmap is called less frequently (though this could be sub optimal when there are multiple numa regions). If you are using Intel it is harder to change the system allocator because we use intel builtin methods to allocate memory. --Dan
|
Thanks Dan, it would take some time to try that out, but here is another thing about this issue. Allocating 100M integers with a no initialize view takes 0.149532 seconds on Kokkos::Cuda space. Allocating same memory with cudaMalloc takes 0.000432, and initializing it would add only 0.002089 seconds. I wonder why view allocations add up that much overhead. I can post a test-code for this as well. |
Views do a lot more than just calling malloc, there is also a thread safe record created that allows for reference counting, texture binding, bounds checking, and leak detection (among other things). I'm not surprised that it is significantly slower than just a raw malloc. If this performance is a bottleneck for you and you do not need the RandomAccess memory trait you can use unmanaged views and pass in your own pointer that you've allocated with cudaMalloc, this will also avoid initializing the memory. You could also look at using the MemoryPool provided by kokkos, though I have no experience with its allocation performance. When we designed kokkos we assumed the allocation overhead would be in the noise and overshadowed by the benefits of using texture cache and leak detection. If this is a showing up as a significant time sink we may need to address some of our design decisions. Would you happen to have profile data that you can share with us? Thanks, --Dan
|
Actually I believe something serious weird is going on with our allocations. I have occasionally seen Initialization poking its head out in profiling data in places where I definitely did not expect it to show up. I.e. it took up way more time (like 100x more) than I would expect. I think we need to track this down and understand exactly what is going on here. Usually I was preoccupied with other things and customers were not complaining about it, but I think now its time to find the root cause. |
Is that sampled or simple timer for profiling? |
Hello Mehmet, Could you try compiling your code with -DKOKKOS_USING_EXP_VIEW=0 I don't believe that it will make a difference, but I want to rule it out. Thanks, --Dan S On Wed, Jun 8, 2016 at 8:33 AM, Mehmet Deveci [email protected]
|
My data was coming from simple timing not sampling. |
Mehmet what type of view was it? Can you post replication code. |
It is below. UVM is on. When I try it on Kokkos::OpenMP, it does not appear to be an issue. #include <iostream>
#include <Kokkos_Core.hpp>
#include <impl/Kokkos_Timer.hpp>
typedef typename Kokkos::Cuda MyExecSpace;
typedef Kokkos::View <int *, MyExecSpace> myview;
int main (int argc, char ** argv){
Kokkos::initialize(argc, argv);
MyExecSpace::print_configuration(std::cout);
int nnz = 100;
if (argc >= 2)
nnz = atoi(argv[1]);
std::cout << "Allocating and initializing view with size:" << nnz << std::endl;
int *a_d;
MyExecSpace::fence();
Kokkos::Impl::Timer timer1;
cudaMalloc((void **) &a_d, sizeof(int) * nnz); // Allocate array on device
MyExecSpace::fence();
cudaThreadSynchronize();
std::cout << "\tCuda Allocation Time:" << timer1.seconds() << std::endl;
Kokkos::View<int*, Kokkos::Cuda, Kokkos::MemoryUnmanaged> a_d_view (a_d, nnz);
MyExecSpace::fence();
timer1.reset();
Kokkos::deep_copy (a_d_view, 42);
MyExecSpace::fence();
cudaThreadSynchronize ();
std::cout << "\tKokkos::deep_copy fill time: " << timer1.seconds () << std::endl;
MyExecSpace::fence();
timer1.reset();
myview noInitializeView(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
MyExecSpace::fence();
std::cout << "\tAllocation Time - 1:" << timer1.seconds() << std::endl;
MyExecSpace::fence();
timer1.reset();
myview noInitializeView2(Kokkos::ViewAllocateWithoutInitializing("test"), nnz);
MyExecSpace::fence();
std::cout << "\tView Allocation Time-2:" << timer1.seconds() << std::endl;
Kokkos::finalize();
return 0;
}
|
Hm for me this works as expected. The DeepCopy is running at 180GB/s the view allocations take the same time as the cuda malloc. |
Oh wait I didn't enable UVM ... |
WIth UVM the first View allocation is slow. On the other hand that is the first UVM allocation happening in the system. |
Yeah making the cudaMalloc a cudaMallocManaged makes that slow, but now both VIew allocations are fast. |
Christian, |
Ok did you set CUDA_MANAGED_FORCE_DEVICE_ALLOC=1? |
Oh yeah I just confirmed that. You didn't set CUDA_MANAGED_FORCE_DEVICE_ALLOC. Setting it to zero replicates your numbers. And yes that is expected to be much slower because the OS has to shuffle around physical pages in order to a get a consecutive big chunk of memory freed. |
Also I believe my slow numbers for initialization were related to multi dimensional views, and probably the View Initialization using a non-layout aware algorithm which results in bad memory access patterns. |
Christian, |
Allocations with initializations are sometimes expensive and unnecessary and I was frequently using Kokkos::ViewAllocateWithoutInitializing to avoid that. It seems that it has been disabled at some point, and view allocations seem to be always initialized with 0's now. Is there a new way to avoid initializations?
The text was updated successfully, but these errors were encountered: