-
Notifications
You must be signed in to change notification settings - Fork 279
Patterns and behaviors
Sean Baxter edited this page May 15, 2016
·
2 revisions
moderngpu exposes functionality through templated pattern functions. The user specializes these patterns with concrete behaviors, typically written as C++ lambdas. The pattern launches a kernel, computes a context for each work-item, and invokes the behavior on each work-item, providing the corresponding context.
moderngpu provides patterns useful for many operations:
- cta_launch - Launch a grid of cooperative thread arrays (CTAs) and pass the behavior function the index of the thread (threadIdx.x) and CTA (blockIdx.x).
- cta_transform - Launch a grid of CTAs, but size them according to the number of work-items that can be processed given the architecture's specified launch box.
- transform - A non-cooperative method. Invoke the behavior once for each work-item.
- transform_reduce - Call a behavior once for each work-item and recursively reduce the return values with a user-provided reducer. This pattern enables array-wide sum, max, and min operations.
-
transform_scan - Like
transform_reduce
, but computes a reduction for each interval from the start of the array to each element. For an addition operator this is the prefix sum operation. -
transform_lbs - A vectorized and load-balanced
transform
implemented using load-balancing search. The caller specifies the geometry of the problem with a segments descriptor array and the behavior is invoked for each work-item, providing both the ID of the segment the work-item belongs to and its rank within the segment. This is the signature pattern of moderngpu. - lbs_segreduce - Fold together all values in each segment and return one reduction value per segment. The behavior is invoked with the segment ID and rank of each work-item. This pattern makes for consistent performance for simultaneous processing of many irregularly-shaped problems.
- transform_compact - An efficient two-pass pattern for conditionally selecting and compacting elements of an array.
- lbs_workcreate - Dynamic work-creation accelerated with load-balancing search. This is a two-pass pattern. On the upsweep the pattern returns the number of output work-items to stream for each input work-item. On the downsweep, the pattern encodes parameters for each work-creating segment. This pattern solves many problems that CUDA Dynamic Parallelism was intended to solve, but with exact load-balancing and requiring no special hardware mechanisms.
moderngpu also includes traditional bulk-synchronous parallel general-purpose functions. Most of these can be adapted to the pattern-behavior model with the use of lambda iterators, which wrap lambda behavior functions in the interface of pointers.
-
reduce and scan are the iterator-oriented equivalents of
transform_reduce
andtransform_scan
. - merge, bulk_remove and bulk_insert are general-purpose array construction functions.
-
mergesort is a basic array-wide sort. segmented_sort is an advanced mergesort that sorts keys and values within segments defined by a segments descriptor array. Thanks to a novel short-circuiting feature,
segmented_sort
actually improves in performance as the number of segments to sort increases. - sorted_search is a vectorized sorted search. A binary search looks for a needle in a sorted haystack. The vectorized sorted search looks for an array of sorted needles in a sorted haystack. A problem with n needles and m haystack items requires _O(n log m) operations to binary search but only _O(n + m) operations to sorted search. This is a critical function for implementing database operations.
- inner_join is a relational join operator. It's a demonstration of the power of combining vectorized sorted search with the load-balancing search and useful in its own right.
moderngpu - copyright 2021 - Sean Baxter
moderngpu
programming
segmented programming
API reference