Optimizations for mamba1 #1213

Open · wants to merge 4 commits into base: main
Conversation

@Goekdeniz-Guelmez (Contributor) commented on Jan 20, 2025

This PR optimizes the MambaBlock implementation to improve performance and cache handling while maintaining compatibility with the existing MambaCache interface.

  1. Batched Input Processing

    • Processes all tokens simultaneously through in_proj instead of one at a time (see the sketch after this list)
    • Reduces the number of operations and improves parallelization
    • Uses reshape to maintain batch dimensions efficiently
  2. Refactored Core Logic

    • Introduced _process_sequence method to separate main processing logic from cache handling
    • Improves code maintainability and makes future optimizations easier
    • Clearer separation between state management and computation
  3. Enhanced Cache Handling

    • Added robust type checking for different cache formats
    • Maintains backwards compatibility with list-based caches
    • Explicit handling of MambaCache objects
  4. Memory and Computation Optimizations

    • Pre-computes the A matrix once, outside the token loop
    • Better state tracking through sequence processing
    • More explicit memory management
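The sketch below illustrates points 1, 3, and 4 on a stripped-down block. It is not the PR's actual diff: the depthwise convolution is omitted, `MambaCache` here is a minimal stand-in for the real cache class in mlx_lm, and names such as `MambaBlockSketch` are illustrative.

```python
import mlx.core as mx
import mlx.nn as nn


class MambaCache:
    """Stand-in cache holding the SSM hidden state (the real class also tracks the conv state)."""

    def __init__(self):
        self.cache = [None, None]

    def __getitem__(self, idx):
        return self.cache[idx]

    def __setitem__(self, idx, value):
        self.cache[idx] = value


class MambaBlockSketch(nn.Module):
    def __init__(self, hidden_size, intermediate_size, state_size, time_step_rank):
        super().__init__()
        self.intermediate_size = intermediate_size
        self.state_size = state_size
        self.time_step_rank = time_step_rank
        self.in_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.x_proj = nn.Linear(intermediate_size, time_step_rank + 2 * state_size, bias=False)
        self.dt_proj = nn.Linear(time_step_rank, intermediate_size, bias=True)
        self.A_log = mx.zeros((intermediate_size, state_size))
        self.D = mx.ones((intermediate_size,))
        self.out_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def _process_sequence(self, x, h):
        B, T, _ = x.shape
        # (1) Batched input projection: one in_proj call over all T tokens.
        xz = self.in_proj(x)                              # (B, T, 2 * d_inner)
        xt, z = mx.split(xz, 2, axis=-1)
        # (4) Pre-compute A once, outside the per-token recurrence.
        A = -mx.exp(self.A_log)                           # (d_inner, d_state)
        ys = []
        for t in range(T):                                # sequential scan over the SSM state
            u = nn.silu(xt[:, t])                         # (B, d_inner); conv1d step omitted
            dbc = self.x_proj(u)
            delta, Bt, C = mx.split(
                dbc, [self.time_step_rank, self.time_step_rank + self.state_size], axis=-1
            )
            delta = nn.softplus(self.dt_proj(delta))      # (B, d_inner)
            # Selective state update: h <- exp(delta * A) * h + (delta * u) * B
            h = mx.exp(delta[..., None] * A) * h + (delta * u)[..., None] * Bt[:, None, :]
            y = (h * C[:, None, :]).sum(axis=-1) + self.D * u
            ys.append(y * nn.silu(z[:, t]))
        # Explicit return of both the output and the final state.
        return self.out_proj(mx.stack(ys, axis=1)), h

    def __call__(self, x, cache=None):
        B, _, _ = x.shape
        # (3) Explicit cache handling: MambaCache object, legacy list, or no cache at all.
        h = cache[0] if isinstance(cache, (MambaCache, list)) else None
        if h is None:
            h = mx.zeros((B, self.intermediate_size, self.state_size))
        y, h = self._process_sequence(x, h)
        if isinstance(cache, (MambaCache, list)):
            cache[0] = h
        return y
```

With a (1, T, hidden_size) prompt, the batched in_proj runs once over all T tokens, while single-token decode steps reuse the cached state through the same _process_sequence loop; this is consistent with the prompt-speed gains and roughly unchanged generation speed reported below.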

Hardware: M4 Mac mini

Before:

state-spaces/mamba-130m-hf

Prompt: 5 tokens, 41.293 tokens-per-sec
Generation: 100 tokens, 113.233 tokens-per-sec
Peak memory: 0.529 GB

mlx-community/Falcon3-Mamba-7B-Instruct-4bits

Prompt: 22 tokens, 14.359 tokens-per-sec
Generation: 100 tokens, 15.100 tokens-per-sec
Peak memory: 4.218 GB

After:

state-spaces/mamba-130m-hf

Prompt: 5 tokens, 129.130 tokens-per-sec
Generation: 100 tokens, 106.952 tokens-per-sec
Peak memory: 0.530 GB

mlx-community/Falcon3-Mamba-7B-Instruct-4bits

Prompt: 22 tokens, 28.364 tokens-per-sec
Generation: 100 tokens, 14.512 tokens-per-sec
Peak memory: 4.164 GB
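For context, readouts in this format are what mlx_lm prints when generating with verbose output enabled. A rough sketch of such a measurement is below; the prompt text and settings are placeholders, since the exact prompt used for the figures above is not stated here.

```python
# Hedged sketch of reproducing the benchmark readout with the mlx_lm Python API.
from mlx_lm import load, generate

model, tokenizer = load("state-spaces/mamba-130m-hf")
generate(
    model,
    tokenizer,
    prompt="hello",   # placeholder prompt; not the one behind the numbers above
    max_tokens=100,
    verbose=True,     # prints prompt/generation tokens-per-sec and peak memory
)
```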

… Pre-computed Constants, Cleaner State Management, Explicit Return Values. Before: 82.442 tokens-per-sec, after: 129.130 tokens-per-sec.