Sphinx Doc: overtake Sergei's and Jan's suggestions
SimeonEhrig committed Jul 29, 2020
1 parent b0b17f7 commit 11ef7ff
Showing 7 changed files with 81 additions and 72 deletions.
2 changes: 1 addition & 1 deletion docs/source/advanced/rationale.rst
@@ -272,7 +272,7 @@ The constant memory is a fast, cached, read-only memory that is beneficial when
In this case it is as fast as a read from a register.


- Access to Accelerator Dependent Functionality
+ Access to Accelerator-Dependent Functionality
+++++++++++++++++++++++++++++++++++++++++++++

There are two possible ways to implement access to accelerator dependent functionality inside a kernel:
4 changes: 2 additions & 2 deletions docs/source/basic/abstraction.rst
@@ -5,11 +5,11 @@ Abstraction

Objective of the abstraction is to separate the parallelization strategy from the algorithm itself.
Algorithm code written by users should not depend on any parallelization library or specific strategy.
- This would allow to exchange the parallelization back-end without any changes to the algorithm itself.
+ This would enable exchanging the parallelization back-end without any changes to the algorithm itself.
Besides allowing to test different parallelization strategies this also makes it possible to port algorithms to new, yet unsupported, platforms.
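
A minimal sketch of what this separation looks like in user code, assuming alpaka's usual pattern of a kernel written as a function object templated on the accelerator type. The kernel name and element type are illustrative, and helper names such as ``ALPAKA_FN_ACC`` and ``alpaka::idx::getIdx`` may differ between alpaka versions:

    #include <alpaka/alpaka.hpp>
    #include <cstddef>

    struct VectorAddKernel  // illustrative name
    {
        // The accelerator is only a template parameter: the algorithm below
        // names no concrete back-end and can be compiled for any of them.
        template<typename TAcc>
        ALPAKA_FN_ACC void operator()(
            TAcc const & acc,
            float const * a,
            float const * b,
            float * c,
            std::size_t n) const
        {
            // How threads are created and mapped to the hardware is decided by
            // the back-end chosen at the call site, not inside the algorithm.
            auto const i = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
            if(i < n)
                c[i] = a[i] + b[i];
        }
    };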

Parallelism and memory hierarchies at all levels need to be exploited in order to achieve performance portability across various types of accelerators.
- Within this chapter an abstraction will be derivated that tries to provide a maximum of parallelism while simultaneously considering implementability and applicability in hardware.
+ Within this chapter an abstraction will be derive that tries to provide a maximum of parallelism while simultaneously considering implementability and applicability in hardware.

Looking at the current HPC hardware landscape, we often see nodes with multiple sockets/processors extended by accelerators like GPUs or Intel Xeon Phi, each with their own processing units.
Within a CPU or a Intel Xeon Phi there are cores with hyper-threads, vector units and a large caching infrastructure.
36 changes: 18 additions & 18 deletions docs/source/basic/intro.rst
@@ -3,7 +3,7 @@ Introduction

The *alpaka* library defines and implements an abstract interface for the *hierarchical redundant parallelism* model.
This model exploits task- and data-parallelism as well as memory hierarchies at all levels of current multi-core architectures.
- This allows to achieve portability of performant codes across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator.
+ This allows to achieve performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator.
All hardware types (multi- and many-core CPUs, GPUs and other accelerators) are treated and can be programmed in the same way.
The *alpaka* library provides back-ends for *CUDA*, *OpenMP*, *Boost.Fiber* and other methods.
The policy-based C++ template interface provided allows for straightforward user-defined extension of the library to support other accelerators.
@@ -38,26 +38,26 @@ If you do not install alpaka in a default path such as ``/usr/local/`` you have

The cmake configuration decides which alpaka accelerators are available during compiling. For example, if you configure your ``cmake`` build with the CUDA back-end (``-DALPAKA_ACC_GPU_CUDA_ENABLE=ON``), ``cmake`` checks, if the CUDA SDK is available and if it found, the C++ template ``alpaka::acc::AccGpuCudaRt`` is available during compiling.
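
A sketch of how that choice shows up on the C++ side, assuming the common alpaka convention that an enabled back-end also defines a matching preprocessor macro (here ``ALPAKA_ACC_GPU_CUDA_ENABLED``) and that a CPU accelerator such as ``alpaka::acc::AccCpuSerial`` is available as a fallback; both of these names are assumptions and may vary between alpaka versions:

    #include <alpaka/alpaka.hpp>
    #include <cstddef>

    using Dim = alpaka::dim::DimInt<1u>;
    using Idx = std::size_t;

    // alpaka::acc::AccGpuCudaRt only exists if the build was configured with
    // -DALPAKA_ACC_GPU_CUDA_ENABLE=ON and the CUDA SDK was found by cmake.
    #if defined(ALPAKA_ACC_GPU_CUDA_ENABLED)         // assumed macro set by the CUDA back-end
    using Acc = alpaka::acc::AccGpuCudaRt<Dim, Idx>;
    #else
    using Acc = alpaka::acc::AccCpuSerial<Dim, Idx>; // assumed CPU fallback accelerator
    #endif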

- What is alpaka
- --------------
+ About alpaka
+ ------------

alpaka is ...
~~~~~~~~~~~~~

- An Abstract Interface
- It describing parallel execution on multiple hierarchy levels. It allows to implement a mapping to various hardware architectures but is no optimal mapping itself.
+ Abstract
+ It describes parallel execution on multiple hierarchy levels. It allows to implement a mapping to various hardware architectures but is no optimal mapping itself.

- Sustainably
- *alpaka* decouple the application from the availability of different accelerator frameworks in different versions, such as OpenMP, CUDA, HIP, etc. (50% on the way to reach full performance portability).
+ Sustainable
+ *alpaka* decouples the application from the availability of different accelerator frameworks in different versions, such as OpenMP, CUDA, HIP, etc. (50% on the way to reach full performance portability).

- Heterogeneity
+ Heterogeneous
An identical algorithm / kernel can be executed on heterogeneous parallel systems by selecting the target device. This allows the best performance for each algorithm and/or a good utilization of the system without major code changes.

- Maintainability
+ Maintainable
*alpaka* allows to provide a single version of the algorithm / kernel that can be used by all back-ends. There is no need for "copy and paste" kernels with different API calls for different accelerators. All the accelerator dependent implementation details are hidden within the *alpaka* library.

- Testability
- Due to the easy back-end switch, no special hardware is required for testing the kernels. Even if the simulation itself will always use the *CUDA* back-end, the tests can completely run on a CPU. As long as the *alpaka* library is thoroughly tested for compatibility between the acceleration back-ends, the user simulation code is guaranteed to generate identical results (ignoring rounding errors / non-determinism) and is portable without any changes.
+ Testable
+ Due to the easy back-end switch, no special hardware is required for testing the kernels. Even if the simulation itself always uses the *CUDA* back-end, the tests can completely run on a CPU. As long as the *alpaka* library is thoroughly tested for compatibility between the acceleration back-ends, the user simulation code is guaranteed to generate identical results (ignoring rounding errors / non-determinism) and is portable without any changes.

Optimizable
Everything in *alpaka* can be replaced by user code to optimize for special use-cases.
@@ -68,19 +68,19 @@ Extensible
Data Structure Agnostic
The user can use and define arbitrary data structures.

- alpaka is not ...
- ~~~~~~~~~~~~~~~~~
+ alpaka does not ...
+ ~~~~~~~~~~~~~~~~~~~

- An automatically optimal mapping of algorithms / kernels to various acceleration platforms
- Except in trivial examples an optimal execution always depends on suitable selected data structure. An adaptive selection of data structures is a separate topic that has to be implemented in a distinct library.
+ Automatically provide an optimal mapping of kernels to various acceleration platforms
+ Except in trivial examples an optimal execution always depends on suitable selected data structures. An adaptive selection of data structures is a separate topic that has to be implemented in a distinct library.

- Automatically optimizing concurrent data accesses
+ Automatically optimize concurrent data access
*alpaka* does not provide feature to create optimized memory layouts.

- Handling or hiding differences in arithmetic operations
+ Handle differences in arithmetic operations
For example, due to **different rounding** or different implementations of floating point operations, results can differ slightly between accelerators.

- Guaranteeing any determinism of results
+ Guarantee determinism of results
Due to the freedom of the library to reorder or repartition the threads within the tasks it is not possible or even desired to preserve deterministic results. For example, the non-associativity of floating point operations give non-deterministic results within and across accelerators.
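
A standalone illustration of that non-associativity (plain C++, nothing alpaka-specific): summing the same values in a different order, as a different thread partitioning would, changes the result.

    #include <cstdio>

    int main()
    {
        float const a = 1.0e8f;
        float const b = -1.0e8f;
        float const c = 1.0f;

        // Same three summands, different association:
        std::printf("%f\n", (a + b) + c); // prints 1.000000
        std::printf("%f\n", a + (b + c)); // prints 0.000000 -- c is lost when added to b,
                                          // because 1.0f is below float resolution near 1.0e8
        return 0;
    }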

The *alpaka* library is aimed at parallelization on shared memory, i.e. within nodes of a cluster.