Sean Baxter edited this page May 15, 2016 · 4 revisions

moderngpu 2.0 is designed for CUDA 7.5 and later. It runs on all generations of NVIDIA GPUs from sm_20 (Fermi) onward.

The reference environment is 64-bit Linux with g++ 4.9.3 and nvcc 7.5.17. This library and code using it might not compile on Microsoft Visual Studio 2013 because of symbol length limitations in that compiler. Windows users should try upgrading to CUDA 8.0 and Visual Studio 2015.

Cloning the source

Clone the source from the moderngpu GitHub repository.

From the command line, run

git clone https://www.github.com/moderngpu/moderngpu [your_directory]

This clones the master branch of the repository into your_directory.

Compiling

moderngpu uses some advanced opt-in features of the CUDA compiler. You may need to set these flags when building your application:

  • -I [your_directory]/src adds moderngpu to your project's include path. This provides access to the library's header files under the moderngpu/ path, e.g. #include <moderngpu/transform.hxx>.

  • -std=c++11 enables C++11, whose features are used extensively.

  • --expt-extended-lambda enables device-tagged lambdas.

  • -use_fast_math enables the fast CUDA math library. Results won't be bit-identical to the same arithmetic run on your host processor, but numerical apps are greatly accelerated by its inclusion.

  • -Xptxas="-v" makes ptxas report per-kernel resource usage, such as registers and local memory. If your kernel uses more than 0 bytes of local memory, your code is probably doing something wrong.

  • -lineinfo tracks kernel line numbers. cuda-memcheck and the CUDA Visual Profiler give more intelligible results when this option is used.

  • -gencode arch=compute_xx,code=compute_xx generates PTX for virtual architecture compute_xx. PTX is forward compatible, but the CUDA runtime must JIT compile it to SASS before device code is launched. -gencode may be specified multiple times to target different architectures and take advantage of architecture-specific tunings and intrinsics.

  • -gencode arch=compute_xx,code=sm_xx generates SASS for architecture sm_xx. SASS is more space-efficient than PTX and doesn't require JIT compilation, but it's only forward-compatible within the same major architecture. That is, sm_35 devices can execute sm_30 SASS, but sm_5x devices cannot.
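The compatibility rules in the last two bullets can be written down as two small predicates. This is only a sketch of the rules as stated above: capability_t, can_run_sass and can_run_ptx are hypothetical names made up for illustration, not part of moderngpu or the CUDA toolkit.

```cpp
// Hypothetical type for illustration: a compute capability such as 3.5 or 5.2.
struct capability_t {
  int major_version;
  int minor_version;
};

// SASS built for sm_XY is assumed to run only on devices with the same major
// version and an equal or higher minor version.
constexpr bool can_run_sass(capability_t device, capability_t sass) {
  return device.major_version == sass.major_version &&
    device.minor_version >= sass.minor_version;
}

// PTX built for compute_XY is assumed to be JIT-compilable for any device
// whose capability is equal or higher.
constexpr bool can_run_ptx(capability_t device, capability_t ptx) {
  return device.major_version > ptx.major_version ||
    (device.major_version == ptx.major_version &&
     device.minor_version >= ptx.minor_version);
}
```

Under these assumptions, can_run_sass({3, 5}, {3, 0}) is true and can_run_sass({5, 2}, {3, 0}) is false, matching the sm_35 and sm_5x cases above, while can_run_ptx({5, 2}, {3, 5}) is true because PTX is forward compatible.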

Testing the library

Test your installation by compiling this simple program.

hello.cu

#include <cstdio>
#include <moderngpu/transform.hxx>

using namespace mgpu;

int main(int argc, char** argv) {
  // The context encapsulates things like an allocator and a stream.
  // By default it prints device info to the console.
  standard_context_t context;

  // Launch five threads to greet us.
  transform([]MGPU_DEVICE(int index) {
    printf("Hello GPU from thread %d\n", index);
  }, 5, context);

  // Synchronize on the context's stream to send the output to the console.
  context.synchronize();

  return 0;
}

If the library is installed at ../moderngpu, compile with this line:

$ nvcc \
  -std=c++11 \
  --expt-extended-lambda \
  -gencode arch=compute_20,code=compute_20 \
  -I ../moderngpu/src \
  -o hello \
  hello.cu

If all goes well, the program should produce output similar to this:

$ ./hello
GeForce GTX 980 Ti : 1190.000 Mhz   (Ordinal 0)
22 SMs enabled. Compute Capability sm_52
FreeMem:   5837MB   TotalMem:   6140MB   64-bit pointers.
Mem Clock: 3505.000 Mhz x 384 bits   (336.5 GB/s)
ECC Disabled


Hello GPU from thread 0
Hello GPU from thread 1
Hello GPU from thread 2
Hello GPU from thread 3
Hello GPU from thread 4

If you see output like this:

$ ./hello
terminate called after throwing an instance of 'mgpu::cuda_exception_t'
  what():  invalid device function
Aborted

then you likely built your executable with options that are incompatible with the architecture of your device. For instance, building with only -gencode arch=compute_20,code=sm_20 generates a binary that produces this output when run on a Maxwell device, because Maxwell cannot run Fermi SASS. To make a Maxwell-compatible binary, either generate Maxwell SASS with -gencode arch=compute_52,code=sm_52 or generate PTX for an earlier architecture, for example with -gencode arch=compute_35,code=compute_35.
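As a rough illustration of how the two fixes map onto flags, the sketch below builds both -gencode options for a given compute capability, such as the sm_52 reported in the device info above. gencode_flags is a hypothetical helper written for this page, not something provided by moderngpu or nvcc. Passing both flags together relies on -gencode being accepted multiple times.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns -gencode flags for a device whose compute
// capability is major_version.minor_version (e.g. 5 and 2 for sm_52).
std::string gencode_flags(int major_version, int minor_version) {
  // moderngpu 2.0 targets sm_20 and up; minor versions are a single digit.
  assert(major_version >= 2 && minor_version >= 0 && minor_version <= 9);

  const std::string xy =
    std::to_string(major_version) + std::to_string(minor_version);

  // SASS for this exact architecture, so no JIT step is needed on it.
  const std::string sass = "-gencode arch=compute_" + xy + ",code=sm_" + xy;

  // PTX for the same architecture, so newer devices can JIT compile it.
  const std::string ptx = "-gencode arch=compute_" + xy + ",code=compute_" + xy;

  return sass + " " + ptx;
}
```

For a Maxwell GTX 980 Ti, gencode_flags(5, 2) yields the sm_52 SASS flag followed by the compute_52 PTX flag, which can be pasted into the nvcc line in place of the compute_20 option.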