# Fast Hadamard Transform in CUDA, with a PyTorch interface

Features:

  - Supports fp32, fp16, and bf16, for dimensions up to 32768.
  - Implicitly pads with zeros if the dimension is not a power of 2.

## How to use

```python
from fast_hadamard_transform import hadamard_transform

def hadamard_transform(x, scale=1.0):
    """
    Arguments:
        x: (..., dim)
        scale: float. Multiply the output by this number.
    Returns:
        out: (..., dim)

    Multiply each row of x by the Hadamard transform matrix.
    Equivalent to F.linear(x, torch.tensor(scipy.linalg.hadamard(dim))) * scale.
    If dim is not a power of 2, we implicitly pad x with zeros so that dim becomes the next power of 2.
    """
```

## Speed

Benchmarked on an A100 for batch sizes that are not too small, compared to memcpy (torch.clone), which is a lower bound on the time taken, since we would need to read the inputs from GPU memory and write the outputs back to GPU memory anyway.

| Data type | Dimension  | Time taken vs memcpy |
|-----------|------------|----------------------|
| fp16/bf16 | <= 512     | 1.0x                 |
| fp16/bf16 | 512 - 8192 | <= 1.2x              |
| fp16/bf16 | 16384      | 1.3x                 |
| fp16/bf16 | 32768      | 1.8x                 |
| fp32      | <= 8192    | 1.0x                 |
| fp32      | 16384      | 1.1x                 |
| fp32      | 32768      | 1.2x                 |
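
A rough way to reproduce this kind of comparison is sketched below (our own timing harness with CUDA events and an arbitrary batch size, not the authors' benchmark script):

```python
import torch

from fast_hadamard_transform import hadamard_transform

def bench_ms(fn, iters=100):
    # Warm up, then time with CUDA events (returns ms per call)
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

x = torch.randn(8192, 8192, dtype=torch.float16, device="cuda")
t_fht = bench_ms(lambda: hadamard_transform(x))
t_copy = bench_ms(lambda: x.clone())  # the memcpy lower bound
print(f"hadamard: {t_fht:.3f} ms  clone: {t_copy:.3f} ms  ratio: {t_fht / t_copy:.2f}x")
```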
