PyTorch C++ and CUDA extension for PACE's Piecewise Polynomial Approximation (PwPA), a Transformer non-linearities acceleration engine.
This extension integrates PwPA CUDA kernels for both AoS and SoA coefficient data layouts, using a simple unrolling technique.
More details here.
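To make the layouts concrete: AoS (array-of-structures) stores each partition's polynomial coefficients contiguously, while SoA (structure-of-arrays) groups all coefficients of the same degree together, which favors coalesced reads on the GPU. The sketch below is a plain-PyTorch reference of what PwPA computes with AoS coefficients; the tensor shapes, the lowest-degree-first ordering, and the helper name are assumptions for illustration, not the extension's actual API.

```python
import torch

def pwpa_reference(x, coeffs, partition_points):
    # Hypothetical PwPA reference with AoS coefficients:
    # coeffs has shape (num_partitions, degree + 1), lowest degree
    # first (ordering assumed). Each element of x is routed to its
    # partition and evaluated with Horner's rule.
    idx = torch.bucketize(x, partition_points).clamp(0, coeffs.shape[0] - 1)
    c = coeffs[idx]                          # (N, degree + 1)
    y = c[:, -1]
    for j in range(coeffs.shape[1] - 2, -1, -1):
        y = y * x + c[:, j]                  # Horner step
    return y

# SoA is the same data transposed: coeffs.t().contiguous() stores all
# degree-j coefficients together, one contiguous run per degree.
```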
Built with PyPA/build, but you can use pip or similar.
To build:
```
python -m build -n
```
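If you prefer not to install PyPA/build, pip can produce the wheel as well; this should be equivalent, though it is untested here:

```
pip wheel . -w dist
```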
To install:
```
pip install dist\<built_extension_file.whl>
```
To test:
```
python test\extension_test.py
python test\approximation_test.py
```
To use:
```python
import torch_pace
...
# base kernel
y = torch_pace.ops._pwpa(x, coeffs, partition_points, AoS=True)
# optimized kernel
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
# AoS to SoA coefficients rearrangement
coeffs_soa = torch_pace.ops.aos2soa(coeffs, degree)
# optimized kernel with SoA coefficients' data structure
y = torch_pace.ops.pwpa(x, coeffs_soa, partition_points, AoS=False)
```
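Putting it together, here is a minimal end-to-end sketch that least-squares-fits a degree-2 polynomial per partition to torch.tanh and runs it through the kernel. The flattened AoS coefficient shape, the lowest-degree-first ordering, and torch.tanh as the target function are assumptions for illustration; adapt them to the layout the extension actually expects.

```python
import torch
import torch_pace

degree, parts = 2, 8
partition_points = torch.linspace(-4, 4, parts + 1, device="cuda")

# Fit one degree-2 polynomial per partition (AoS: each partition's
# coefficients stored contiguously -- layout assumed).
coeffs = []
for lo, hi in zip(partition_points[:-1], partition_points[1:]):
    xs = torch.linspace(lo.item(), hi.item(), 64)
    A = torch.stack([xs**j for j in range(degree + 1)], dim=1)
    c = torch.linalg.lstsq(A, torch.tanh(xs).unsqueeze(1)).solution
    coeffs.append(c.squeeze(1))
coeffs = torch.cat(coeffs).cuda()

x = torch.randn(1 << 20, device="cuda")
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
print((y - torch.tanh(x)).abs().max())  # max approximation error

# Same computation with the SoA layout
coeffs_soa = torch_pace.ops.aos2soa(coeffs, degree)
y_soa = torch_pace.ops.pwpa(x, coeffs_soa, partition_points, AoS=False)
```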
> [!IMPORTANT]
> Requirements:
> - torch>=2.4 with CUDA enabled (mine is 2.5.1+cu118)
> - CUDA toolkit (mine is 11.7)
> - Python>=3.8 (mine is 3.12.8)
This is the output of running approximation_test.py:
> [!NOTE]
> approximation_test.py uses a simple uniform partitioning, which divides the X-value range into equal parts.
> More sophisticated partitioning strategies may account for slope trends, yielding more accurate approximations where the function changes more rapidly.
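As a sketch of what such a strategy could look like (not part of the extension; the function name is hypothetical), boundaries can be placed so that each partition covers an equal share of the function's total absolute slope:

```python
import torch

def slope_aware_partitions(f, lo, hi, n, samples=1024):
    # Hypothetical alternative to uniform partitioning: put more
    # (narrower) partitions where |f'| is large, fewer where f is flat.
    xs = torch.linspace(lo, hi, samples)
    slope = torch.gradient(f(xs), spacing=(xs,))[0].abs()
    cdf = torch.cumsum(slope, dim=0)
    cdf = cdf / cdf[-1]                     # normalized slope "mass"
    targets = torch.linspace(0.0, 1.0, n + 1)
    idx = torch.searchsorted(cdf, targets).clamp(max=samples - 1)
    return xs[idx]                          # n + 1 boundary points
```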
A brief list of things to do or fix in this extension:
- PyTorch Half type support
- Extension Benchmark on non-linearities in plain CUDA code
- Extension Benchmark on PyTorch non-linearities
- ILP (Instruction-Level Parallelism) integration
- aos2soa function
- soa2aos function
- CUDA SIMD intrinsics analysis for float16 (PyTorch Half) type
- PyTorch neural net example
Extension backbone inspired by this tutorial.
Marco Sangiorgi
© 2025