torchPACE

PyTorch C++ and CUDA extension for PACE's Piecewise Polynomial Approximation (PwPA), a Transformer non-linearity acceleration engine.

Introduction

This extension integrates PwPA CUDA kernels for both AoS (Array of Structures) and SoA (Structure of Arrays) coefficient layouts, using a simple loop-unrolling technique.
More details here.
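To illustrate the idea behind PwPA, here is a pure-NumPy sketch (not the extension's CUDA implementation; the function name and coefficient layout are assumptions for illustration): each input is mapped to the partition it falls into, and that partition's polynomial is evaluated with Horner's rule.

```python
import numpy as np

def pwpa_reference(x, coeffs_aos, partition_points):
    """Evaluate a piecewise polynomial (illustrative reference only).

    coeffs_aos[p] holds [c0, c1, ..., cd] for partition p (AoS layout);
    partition_points[p] is the left edge of partition p, sorted ascending.
    """
    # locate the partition each element of x falls into
    idx = np.searchsorted(partition_points, x, side="right") - 1
    idx = np.clip(idx, 0, len(coeffs_aos) - 1)
    y = np.empty_like(x, dtype=np.float64)
    for i, p in enumerate(idx):
        acc = 0.0
        for c in reversed(coeffs_aos[p]):  # Horner: (...(cd*x + c_{d-1})*x...) + c0
            acc = acc * x[i] + c
        y[i] = acc
    return y
```

For example, with partitions [-1, 0) and [0, 1] holding coefficients (0, 0) and (0, 1), this reproduces ReLU on [-1, 1].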

Setup

Built with PyPA/build, but you can use pip or a similar tool.

To build:

python -m build -n

To install:

pip install dist\<built_extension_file.whl>

To test:

python test\extension_test.py

To use:

import torch_pace
...
# base kernel
y = torch_pace.ops._pwpa(x, coeffs, partition_points, AoS=True)
# optimized kernel
y = torch_pace.ops.pwpa(x, coeffs, partition_points, AoS=True)
# AoS to SoA coefficients rearrangement
coeffs_soa = torch_pace.ops.aos2soa(coeffs, degree)
# optimized kernel with SoA coefficients' data structure
y = torch_pace.ops.pwpa(x, coeffs_soa, partition_points, AoS=False)
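The point of the AoS-to-SoA rearrangement is memory coalescing: threads handling different partitions can read the same-degree coefficient from contiguous memory. A NumPy sketch of the transformation (assuming a flat AoS buffer ordered partition-major, which may not match the op's actual internal layout):

```python
import numpy as np

def aos2soa_sketch(coeffs_flat, degree):
    """Rearrange [p0: c0..cd, p1: c0..cd, ...] (AoS)
    into [c0 of all partitions, c1 of all partitions, ...] (SoA)."""
    aos = np.asarray(coeffs_flat).reshape(-1, degree + 1)  # [partitions, coeffs]
    return aos.T.reshape(-1)                               # [coeffs, partitions]

# Two degree-1 partitions (1, 2) and (3, 4):
# AoS  [1, 2, 3, 4]  ->  SoA  [1, 3, 2, 4]
```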

Important

Requirements:

  • torch>=2.4 with CUDA enabled (mine is 2.5.1+cu118)
  • CUDA toolkit (mine is 11.7)
  • Python>=3.8 (mine is 3.12.8)

Examples

This is the output of running approximation_test.py (output figure omitted):

Note

approximation_test.py uses a simple uniform partitioning which divides the X-value range into equal parts.
More sophisticated partitioning strategies may account for slope trends, yielding more accurate approximations where the function changes more.
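A minimal sketch of such uniform partitioning (a hypothetical helper, not approximation_test.py's actual code): split the range into equal-width partitions and least-squares fit a polynomial on each one with np.polyfit.

```python
import numpy as np

def fit_uniform_pwpa(fn, lo, hi, num_partitions, degree, samples=64):
    """Fit one degree-`degree` polynomial per equal-width partition of [lo, hi].

    Returns the left edges of the partitions and an array of coefficients,
    lowest degree first (c0..cd), one row per partition.
    """
    edges = np.linspace(lo, hi, num_partitions + 1)
    coeffs = []
    for a, b in zip(edges[:-1], edges[1:]):
        xs = np.linspace(a, b, samples)
        # np.polyfit returns highest-degree first; flip to store c0..cd
        coeffs.append(np.polyfit(xs, fn(xs), degree)[::-1])
    return edges[:-1], np.array(coeffs)

partition_points, coeffs = fit_uniform_pwpa(np.tanh, -4.0, 4.0, 16, 2)
```

A slope-aware strategy would instead place more (narrower) partitions where |fn'| is large, e.g. around zero for tanh, improving accuracy for the same partition budget.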

ToDo

A brief list of things to do or fix in this extension:

  • PyTorch Half type support
  • Extension Benchmark on non-linearities in plain CUDA code
  • Extension Benchmark on PyTorch non-linearities
  • ILP (Instruction-Level Parallelism) integration
  • aos2soa function
  • soa2aos function
  • CUDA SIMD intrinsics analysis for float16 (PyTorch Half) type
  • PyTorch neural net example

Credits

Extension backbone inspired by this tutorial.

Authors

Marco Sangiorgi
2025©
