Hashed Lookup Table based Matrix Multiplication (halutmatmul) - Stella Nera accelerator


deepware-ai/halutmatmul

 
 


Stella Nera - Halutmatmul

Algorithmic CI

PyTorch Layer Test | PyTest, Python Linting, Mypy - Typechecking

ML CI

ResNet9 - 92%+ accuracy

Hardware CI

HW Synth + PAR OpenROAD, RTL Linting, HW Design Verification

Paper

Abstract

The recent Maddness method approximates Matrix Multiplication (MatMul) without the need for multiplication by using a hash-based version of product quantization (PQ). The hash function is a decision tree, allowing for efficient hardware implementation, as multiply-accumulate operations are replaced by decision tree passes and LUT lookups. Stella Nera is the first Maddness accelerator, achieving 15x higher area efficiency (GMAC/s/mm^2) and 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators in the same technology. In a commercial 14 nm technology and scaled to 3 nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of over 92.5% using ResNet9.

Algorithmic - Maddness

Maddness Animation
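
As the abstract describes, Maddness replaces multiply-accumulate operations with decision tree passes and LUT lookups. The following is a minimal NumPy sketch of that idea, not the halutmatmul implementation. The function names `learn`, `encode`, and `approx_matmul` are hypothetical. The learning step is a deliberate simplification: it splits on the column with the largest within-node variance, uses each node's median as threshold, and uses bucket means as prototypes. These are assumptions for illustration and not the procedure from the paper.

```python
import numpy as np

DEPTH = 4  # tree depth; each codebook gets K = 2**DEPTH = 16 leaves


def learn(A_train, B, C):
    """Toy offline step: one decision tree and one LUT per codebook."""
    N, D = A_train.shape
    sub = D // C  # columns per codebook (assumes C divides D)
    K = 2 ** DEPTH
    dims = np.zeros((C, DEPTH), dtype=int)     # split column per tree level
    thresholds = np.zeros((C, DEPTH, K // 2))  # one threshold per tree node
    luts = np.zeros((C, K, B.shape[1]))        # precomputed prototype @ B
    for c in range(C):
        A_c = A_train[:, c * sub:(c + 1) * sub]
        leaf = np.zeros(N, dtype=int)
        for level in range(DEPTH):
            nodes = range(2 ** level)
            # Assumed heuristic: pick the column with the largest
            # within-node variance, then split each node at its median.
            score = np.zeros(sub)
            for n in nodes:
                rows = A_c[leaf == n]
                if len(rows):
                    score += rows.var(axis=0) * len(rows)
            dim = int(np.argmax(score))
            dims[c, level] = dim
            for n in nodes:
                mask = leaf == n
                if mask.any():
                    thresholds[c, level, n] = np.median(A_c[mask, dim])
            leaf = 2 * leaf + (A_c[:, dim] > thresholds[c, level, leaf])
        for k in range(K):
            mask = leaf == k
            if mask.any():
                proto = A_c[mask].mean(axis=0)  # prototype = bucket mean
                luts[c, k] = proto @ B[c * sub:(c + 1) * sub]
    return dims, thresholds, luts


def encode(A, dims, thresholds):
    """Hash every row to one leaf index per codebook using comparisons only."""
    N, D = A.shape
    C = dims.shape[0]
    sub = D // C
    codes = np.zeros((N, C), dtype=int)
    for c in range(C):
        A_c = A[:, c * sub:(c + 1) * sub]
        leaf = np.zeros(N, dtype=int)
        for level in range(DEPTH):
            x = A_c[:, dims[c, level]]
            leaf = 2 * leaf + (x > thresholds[c, level, leaf])
        codes[:, c] = leaf
    return codes


def approx_matmul(A, dims, thresholds, luts):
    """Approximate A @ B by summing one LUT row per codebook."""
    codes = encode(A, dims, thresholds)
    C = luts.shape[0]
    return luts[np.arange(C), codes].sum(axis=1)


rng = np.random.default_rng(0)
A = rng.random((2000, 64))
B = rng.random((64, 10))
dims, thresholds, luts = learn(A[:1500], B, C=8)
C_approx = approx_matmul(A[1500:], dims, thresholds, luts)
mse = np.square(C_approx - A[1500:] @ B).mean()
print(mse)
```

In this sketch the online path (`encode` plus `approx_matmul`) performs only comparisons, table reads, and additions. All multiplications happen offline when the LUTs are built.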

Differentiable Maddness

ResNet-9 LUTs, Thresholds, Dims

Halutmatmul example

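import numpy as np
from halutmatmul.halutmatmul import HalutMatmul

# Split the input into training rows (offline learning) and test rows
A = np.random.random((10000, 512))
A_train = A[:8000]
A_test = A[8000:]
B = np.random.random((512, 10))
C = np.matmul(A_test, B)  # exact reference result

# C: number of codebooks, K: number of prototypes per codebook
hm = HalutMatmul(C=32, K=16)
hm.learn_offline(A_train, B)  # learn the approximation from A_train and B
C_halut = hm.matmul_online(A_test)  # approximate A_test @ B

# Mean squared error of the approximation against the exact product
mse = np.square(C_halut - C).mean()
print(mse)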
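
The raw MSE printed above depends on the scale of the inputs, so a normalized error can be easier to compare across matrix sizes. The sketch below is one possible way to compute it with NumPy alone. The helper name `relative_error` is hypothetical and not part of halutmatmul, and the noisy matrix is stand-in data rather than actual Halutmatmul output.

```python
import numpy as np


def relative_error(C_ref, C_approx):
    """Frobenius norm of the error, normalized by the norm of the reference."""
    return np.linalg.norm(C_approx - C_ref) / np.linalg.norm(C_ref)


# Stand-in data: an exact result and a slightly perturbed copy of it.
rng = np.random.default_rng(0)
C_ref = rng.random((2000, 10))
C_noisy = C_ref + 0.01 * rng.standard_normal(C_ref.shape)
print(relative_error(C_ref, C_noisy))
```

With the example above, the same function could be called as `relative_error(C, C_halut)`.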

Hardware - OpenROAD flow results from CI

All NanGate45 results are NOT OPTIMIZED! They are for reference only and serve to show that the flow works.


Full design (halutmatmul)

Run locally with:

git submodule update --init --recursive
cd hardware
ACC_TYPE=INT DATA_WIDTH=8 NUM_M=8 NUM_DECODER_UNITS=4 NUM_C=16 make halut-open-synth-and-pnr-halut_matmul

Full Design

halut_matmul NanGate45
Area [μm^2] 128816
Freq [MHz] 166.7
GE 161.423 kGE
Std Cell [#] 65496
Voltage [V] 1.1
Util [%] 50.4
TNS 0
Clock Net Clock_net
Routing Routing
GDS GDS Download

Encoder

halut_encoder_4 NanGate45
Area [μm^2] 46782
Freq [MHz] 166.7
GE 58.624 kGE
Std Cell [#] 23130
Voltage [V] 1.1
Util [%] 48.7
TNS 0
Clock Net Clock_net
Routing Routing
GDS GDS Download

Decoder

halut_decoder NanGate45
Area [μm^2] 24667.5
Freq [MHz] 166.7
GE 30.911 kGE
Std Cell [#] 12256
Voltage [V] 1.1
Util [%] 52.1
TNS 0
Clock Net Clock_net
Routing Routing
GDS GDS Download
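
The GE rows in the tables above appear to follow from the reported areas divided by the area of a single NAND2 cell. The sketch below reproduces that arithmetic. The value of 0.798 μm^2 per gate equivalent is an assumption inferred from the area/GE ratios in the tables, and the helper name `gate_equivalents_kge` is hypothetical.

```python
# Assumption: one gate equivalent (GE) corresponds to 0.798 um^2,
# inferred from the area/GE ratios of the three reported designs.
NAND2_AREA_UM2 = 0.798

# Areas in um^2, copied from the tables above.
designs = {
    "halut_matmul": 128816.0,
    "halut_encoder_4": 46782.0,
    "halut_decoder": 24667.5,
}


def gate_equivalents_kge(area_um2, nand2_area_um2=NAND2_AREA_UM2):
    """Convert a cell area in um^2 to thousands of gate equivalents."""
    return area_um2 / nand2_area_um2 / 1000.0


for name, area in designs.items():
    print(f"{name}: {gate_equivalents_kge(area):.3f} kGE")
```

The printed values may differ from the tables in the last digit, depending on how the reported numbers were rounded.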

Install

# Create and activate the conda environment
# (mamba is recommended for a faster install)
conda env create -f environment_gpu.yml
conda activate halutmatmul

# IIS-specific: create the environment under a custom prefix instead
conda env create -f environment_gpu.yml --prefix /scratch/janniss/conda/halutmatmul_gpu

References

Hacker News mention (comments only) and discussion

Languages

  • Python 85.7%
  • SystemVerilog 10.0%
  • Makefile 2.5%
  • Other 1.8%