
CUDA programming C++

The most common deep learning frameworks, such as TensorFlow and PyTorch, rely on kernel calls to run parallel computations on the GPU and accelerate such networks. The most widely used interface for programming the GPU is CUDA, created by NVIDIA. This repository keeps track of my progress in this area. It is based mainly on what I am learning, step by step, from my master in deep learning run by Deep Learning Italia Academy, from the Udemy course CUDA programming Masterclass with C++, and of course from the NVIDIA documentation.

My purpose is to deepen my knowledge of parallel programming!


In this repository:

  • Hello World

    I learned key concepts such as host (CPU) and device (GPU) computation, context switching, and the apparent parallel execution on the CPU; the difference between a process and a thread, and how threads share memory. There are two levels of parallelism: (1) task level and (2) data level, and parallelism is not the same as concurrency. Finally, I was able to launch a kernel using the grid and block parameters, as in the sketch below.
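    A minimal sketch of such a launch (the kernel name and launch configuration here are illustrative, not the repository's actual code):

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Runs on the device (GPU): each thread prints its own coordinates.
    __global__ void hello_kernel() {
        printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
    }

    int main() {
        // <<<grid, block>>>: launch 2 blocks of 4 threads each.
        hello_kernel<<<2, 4>>>();
        // Kernel launches are asynchronous; wait for the device to finish.
        cudaDeviceSynchronize();
        return 0;
    }
    ```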

  • Threads Organization

    Figuring out how and which threads execute the kernel function is often difficult. I have learned to use the built-in variables blockIdx, threadIdx, blockDim and gridDim (of type dim3/uint3) to identify them; see the example below.
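    A small example that prints those built-in variables (kernel name and dimensions are made up for illustration):

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread prints the built-in variables that identify it.
    __global__ void print_ids() {
        printf("gridDim=(%d,%d) blockIdx=(%d,%d) blockDim=(%d,%d) threadIdx=(%d,%d)\n",
               gridDim.x, gridDim.y, blockIdx.x, blockIdx.y,
               blockDim.x, blockDim.y, threadIdx.x, threadIdx.y);
    }

    int main() {
        dim3 grid(2, 2);   // 2 x 2 blocks in the grid
        dim3 block(4, 2);  // 4 x 2 threads in each block
        print_ids<<<grid, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }
    ```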

  • Unique Index Calculation

    Identifying a unique thread ID can be difficult, especially when using 2- or even 3-dimensional grids and blocks. Here I solve this problem; one common linearization is sketched below.
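    A sketch of one common way to compute a unique global index for a 2D grid of 2D blocks (the exact formula depends on the linearization you pick; the kernel name is illustrative):

    ```cuda
    __global__ void unique_index(int *out) {
        int threadsPerBlock = blockDim.x * blockDim.y;
        int blockId  = blockIdx.y * gridDim.x + blockIdx.x;     // linear block id in the grid
        int threadId = threadIdx.y * blockDim.x + threadIdx.x;  // linear thread id inside the block
        int gid      = blockId * threadsPerBlock + threadId;    // globally unique thread id
        out[gid] = gid;
    }
    ```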

  • Memory Transfer

    In addition to processing data on the GPU, we also need to transfer data from the CPU to the GPU and transfer the results back; the typical pattern is sketched below.
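    A minimal sketch of the usual allocate / copy / compute / copy-back pattern (buffer names and sizes are illustrative):

    ```cuda
    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_in = (float*)malloc(bytes);        // host (CPU) buffer
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        float *d_in = nullptr;
        cudaMalloc(&d_in, bytes);                   // device (GPU) buffer

        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
        // ... launch kernels that work on d_in ...
        cudaMemcpy(h_in, d_in, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU

        cudaFree(d_in);
        free(h_in);
        return 0;
    }
    ```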

  • Sum Array

    Let's transfer and sum two arrays on the GPU, measure the elapsed time using clock(), and handle CUDA errors by creating a macro that wraps all the CUDA calls (sketched below).
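    A sketch of the idea, assuming a CUDA_CHECK-style macro and coarse CPU-side timing with clock() (names and sizes are illustrative, not the repository's actual code):

    ```cuda
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <cuda_runtime.h>

    // Wrap CUDA calls and abort with a readable message on failure.
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    __global__ void sum_arrays(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        CUDA_CHECK(cudaMalloc(&d_a, bytes));
        CUDA_CHECK(cudaMalloc(&d_b, bytes));
        CUDA_CHECK(cudaMalloc(&d_c, bytes));

        clock_t start = clock();  // coarse CPU-side timing
        CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));
        sum_arrays<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
        CUDA_CHECK(cudaDeviceSynchronize());
        CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));
        printf("Transfer + sum took %.3f ms\n", 1000.0 * (clock() - start) / CLOCKS_PER_SEC);

        CUDA_CHECK(cudaFree(d_a)); CUDA_CHECK(cudaFree(d_b)); CUDA_CHECK(cudaFree(d_c));
        free(h_a); free(h_b); free(h_c);
        return 0;
    }
    ```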

  • Device Query

    Here is a simple script to query our device on the fly and get its properties.
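    A minimal version of such a query using cudaGetDeviceProperties (the set of printed fields is just a selection):

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s\n", dev, prop.name);
            printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
            printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
            printf("  Global memory:      %.1f GB\n", prop.totalGlobalMem / 1e9);
            printf("  Warp size:          %d\n", prop.warpSize);
        }
        return 0;
    }
    ```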

  • Intro to Warps

    We should consider how software parallelism maps onto the hardware. An SM executes threads in groups of 32 called warps, so the number of threads per block should be a multiple of 32. If we put a single thread in each block, the hardware still assigns a full warp with resources for 32 threads, but 31 of them will be inactive, which is a waste of resources; see the toy example below.
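    A toy illustration (kernel and sizes are made up for this example): both launch configurations run 1024 threads, but the commented-out one leaves 31 of every 32 warp lanes idle.

    ```cuda
    #include <cuda_runtime.h>

    __global__ void scale(float *x) {
        x[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
    }

    int main() {
        const int n = 1024;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));

        // Wasteful: 1024 blocks of 1 thread each. Every block still occupies a
        // full 32-lane warp, so 31 of the 32 lanes per warp stay inactive.
        // scale<<<n, 1>>>(d_x);

        // Better: the block size is a multiple of the warp size (32), so every
        // warp launched for these 1024 threads is fully populated.
        scale<<<n / 128, 128>>>(d_x);

        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }
    ```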

  • Warp Divergence

    Warp divergence is an issue for parallel computing: part of the warp, and therefore part of the NVIDIA SM, can be disabled, wasting resources. Pay attention to if-else statements. You can check the branch_efficiency metric by compiling with nvcc and profiling with nvprof; a small example follows.
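    A sketch of a divergent branch versus a warp-uniform branch (kernel names are illustrative); on toolkits that still ship nvprof, the metric can be inspected with nvprof --metrics branch_efficiency:

    ```cuda
    // Divergent: even and odd lanes of the same warp take different branches,
    // so the warp executes both paths one after the other with lanes masked off.
    __global__ void divergent_branch(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0) x[i] += 1.0f;
        else                      x[i] -= 1.0f;
    }

    // Warp-uniform: the condition only depends on the warp index, so all 32
    // threads of a warp take the same path and no lanes are disabled.
    __global__ void uniform_branch(float *x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0) x[i] += 1.0f;
        else                             x[i] -= 1.0f;
    }
    ```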
