Releases · ml-explore/mlx
v0.23.0
Highlights
- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
  - Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
  - Faster Winograd convolutions (benchmarks)
  - Up to 3x faster sort (benchmarks)
  - Much faster mx.put_along_axis and mx.take_along_axis (benchmarks)
  - Faster unified CPU back-end with vector operations
- Double precision (mx.float64) support on the CPU
Core
Features
- Bitwise invert: mx.bitwise_invert
- mx.linalg.lu, mx.linalg.lu_factor, mx.linalg.solve, and mx.linalg.solve_triangular (sketch below)
- Support loading F8_E4M3 from safetensors
- mx.float64 supported on the CPU
- Matmul JVPs
- Distributed launch helper: mlx.launch
- Support non-square QR factorization with mx.linalg.qr
- Support ellipsis in mx.einsum
- Refactor and unify accelerate and common back-ends
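A minimal sketch (not from the release notes) exercising a few of the additions above: the new linear-algebra routines, CPU double precision, and ellipsis in mx.einsum. It assumes the linalg routines and float64 run on the CPU stream, as other mx.linalg ops do.

```python
import mlx.core as mx

# Solve A x = b with the new solver (CPU stream assumed for linalg ops).
A = mx.array([[3.0, 1.0], [1.0, 2.0]])
b = mx.array([[9.0], [8.0]])
x = mx.linalg.solve(A, b, stream=mx.cpu)
print(x)  # [[2.0], [3.0]]

# Double precision is available on the CPU only.
A64 = A.astype(mx.float64, stream=mx.cpu)
print(A64.dtype)  # mlx.core.float64

# Ellipsis in einsum: batched matrix multiply over leading dimensions.
u = mx.random.normal((4, 2, 3))
v = mx.random.normal((4, 3, 5))
w = mx.einsum("...ij,...jk->...ik", u, v)
print(w.shape)  # (4, 2, 5)
```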
Performance
- Faster synchronization Fence for synchronizing CPU and GPU
- Much faster mx.put_along_axis and mx.take_along_axis (benchmarks; sketch below)
- Fast Winograd convolutions (benchmarks)
- Allow dynamic ops per buffer based on dispatches and memory (benchmarks)
- Up to 3x faster sort (benchmarks)
- Faster small batch qmv (benchmarks)
- Ring distributed backend
  - Uses raw sockets for faster all reduce
- Some CPU ops are much faster with the new Simd<T, N>
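A small sketch (not from the release notes) of the take/put_along_axis ops called out above:

```python
import mlx.core as mx

a = mx.array([[10, 20, 30], [40, 50, 60]])
idx = mx.array([[2], [0]])

# Gather one element per row along axis 1.
taken = mx.take_along_axis(a, idx, axis=1)
print(taken)  # [[30], [40]]

# Scatter values back into the same positions (returns a new array).
updated = mx.put_along_axis(a, idx, mx.array([[-1], [-2]]), axis=1)
print(updated)
```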
NN
- Orthogonal initializer: nn.init.orthogonal (sketch below)
- Add dilation for conv 3d layers
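A minimal sketch (not from the release notes) of the orthogonal initializer; it assumes the same factory pattern as the other nn.init functions, which return a callable that fills an array of the given shape.

```python
import mlx.core as mx
import mlx.nn as nn

# Build the initializer and apply it to a template array.
init_fn = nn.init.orthogonal()
w = init_fn(mx.zeros((4, 4)))

# Columns should be approximately orthonormal: w.T @ w ~ identity.
print(mx.round(w.T @ w, 3))
```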
Bug fixes
- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for GPU stream async CPU work
- Fix shapeless compile on ubuntu24
- Recompile when shapeless changes
- Fix rope fallback to not upcast
- Fix metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading an empty list is ok when strict = false
- Fix split vmap
- Fix output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts
v0.22.1
v0.22.0
Highlights
- Export and import MLX functions to a file (example, bigger example); see the sketch after this list
  - Functions can be exported from Python and run in C++ and vice versa
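A minimal sketch (not from the release notes) of the export/import flow, assuming the mx.export_function / mx.import_function API shown in the linked examples: the function is traced with example inputs, written to a .mlxfn file, and the imported callable returns its outputs as a list.

```python
import mlx.core as mx

def fun(x, y):
    return mx.exp(x) + y

x = mx.array(1.0)
y = mx.array(2.0)

# Trace `fun` with example inputs and write it to disk.
mx.export_function("fun.mlxfn", fun, x, y)

# Import it back; the imported callable returns a list of outputs.
imported = mx.import_function("fun.mlxfn")
(out,) = imported(x, y)
print(out)  # ~4.718
```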
Core
- Add slice and slice_update which take arrays for starting locations
- Add an example for using MLX in C++ with CMake
- Fused attention for generation now supports boolean masking (benchmark)
- Allow array offset for mx.fast.rope
- Add mx.finfo (sketch below)
- Allow negative strides without resorting to copying for slice and as_strided
- Add Flatten, Unflatten, and ExpandDims primitives
- Enable the compilation of lambdas in C++
- Add a lot more primitives for shapeless compilation (full list)
- Fix performance regression in qvm
- Introduce separate types for Shape and Strides and switch to int64 strides from uint64
- Reduced copies for fused-attention kernel
- Recompile a function when the stream changes
- Several steps to improve the linux / x86_64 experience (#1625, #1627, #1635)
- Several steps to improve/enable the windows experience (#1628, #1660, #1662, #1661, #1672, #1663, #1664, ...)
- Update to newer Metal-cpp
- Throw when exceeding the maximum number of buffers possible
- Add mx.kron
- mx.distributed.send now implements the identity function instead of returning an empty array
- Better error reporting for mx.compile on CPU and for unrecoverable errors
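A short sketch (not from the release notes) of two of the additions above, mx.finfo and mx.kron:

```python
import mlx.core as mx

# Floating-point limits for an MLX dtype.
info = mx.finfo(mx.float16)
print(info.min, info.max)  # -65504.0 65504.0

# Kronecker product of two small matrices.
a = mx.array([[1.0, 2.0], [3.0, 4.0]])
b = mx.eye(2)
print(mx.kron(a, b).shape)  # (4, 4)
```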
NN
- Add optional bias correction in Adam/AdamW
- Enable mixed quantization via nn.quantize (sketch below)
- Remove reshapes from nn.QuantizedEmbedding
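A hedged sketch (not from the release notes) of mixed quantization through nn.quantize. It assumes the class_predicate callback may return either a bool or a dict of per-layer quantization parameters; treat that contract as an assumption rather than documented behavior.

```python
import mlx.core as mx
import mlx.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 64),
)

def predicate(path, module):
    # Quantize only Linear layers; keep the small output layer at 8 bits.
    if not isinstance(module, nn.Linear):
        return False
    if module.weight.shape[0] == 64:
        return {"group_size": 64, "bits": 8}  # per-layer override (assumed)
    return True  # use the default 4-bit setting below

nn.quantize(model, group_size=64, bits=4, class_predicate=predicate)

# The quantized model still runs as before.
x = mx.random.normal((1, 512))
print(model(x).shape)  # (1, 64)
```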
Bug fixes
- Fix qmv/qvm bug for batch size 2-5
- Fix some leaks and races (#1629)
- Fix transformer postnorm in mlx.nn
- Fix some mx.fast fallbacks
- Fix the hashing for string constants in compile
- Fix small sort in Metal
- Fix memory leak of non-evaled arrays with siblings
- Fix concatenate/slice_update vjp in edge-case where the inputs have different type
v0.21.1
v0.21.0
Highlights
- Support 3-bit and 6-bit quantization (benchmarks; sketch below)
- Much faster memory-efficient attention for head dim 64 and 80 (benchmarks)
- Much faster sdpa inference kernel for longer sequences (benchmarks)
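A minimal sketch (not from the release notes) of the new bit widths with the existing quantize/dequantize/quantized_matmul APIs:

```python
import mlx.core as mx

w = mx.random.normal((512, 512))

# Quantize to 3 bits with a group size of 64.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=3)

# Dequantize to check the round-trip error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=3)
print(mx.abs(w - w_hat).max())

# Quantized matmul with the same parameters.
x = mx.random.normal((1, 512))
y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True, group_size=64, bits=3)
print(y.shape)  # (1, 512)
```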
Core
- contiguous op (C++ only) + primitive
- BFS width limit to reduce memory consumption during eval
- Fast CPU quantization
- Faster indexing math in several kernels:
  - unary, binary, ternary, copy, compiled, reduce
- Improve dispatch threads for a few kernels:
  - conv, gemm splitk, custom kernels
- More buffer donation with no-ops to reduce memory use
- Use CMAKE_OSX_DEPLOYMENT_TARGET to pick the Metal version
- Dispatch Metal bf16 type at runtime when using the JIT
NN
- nn.AvgPool3d and nn.MaxPool3d (sketch below)
- Support groups in nn.Conv2d
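A small sketch (not from the release notes) of the new 3D pooling layers and grouped 2D convolution; it assumes MLX's channels-last layout, i.e. (N, D, H, W, C) for the pooling input and (N, H, W, C) for Conv2d.

```python
import mlx.core as mx
import mlx.nn as nn

# 3D max pooling over a (batch, depth, height, width, channels) input.
pool = nn.MaxPool3d(kernel_size=2, stride=2)
x = mx.random.normal((1, 8, 8, 8, 4))
print(pool(x).shape)  # (1, 4, 4, 4, 4)

# Grouped convolution: 16 input and 16 output channels split into 4 groups.
conv = nn.Conv2d(16, 16, kernel_size=3, groups=4)
y = mx.random.normal((1, 32, 32, 16))
print(conv(y).shape)  # (1, 30, 30, 16)
```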
Bug fixes
- Fix per-example mask + docs in sdpa
- Fix FFT synchronization bug (use dispatch method everywhere)
- Throw for invalid *fft{2,n} cases
- Fix OOB access in qmv
- Fix donation in sdpa to reduce memory use
- Allocate safetensors header on the heap to avoid stack overflow
- Fix sibling memory leak
- Fix view segfault for scalar inputs
- Fix concatenate vmap
v0.20.0
Highlights
- Even faster GEMMs
  - Peaking at 23.89 TFLOPS on M2 Ultra (benchmarks)
- BFS graph optimizations
  - Over 120 toks/sec with Mistral 7B!
- Fast batched QMV/QVM for KV-quantized attention (benchmarks)
Core
- New Features
  - mx.linalg.eigh and mx.linalg.eigvalsh (sketch below)
  - mx.nn.init.sparse
  - 64-bit type support for mx.cumprod and mx.cumsum
- Performance
  - Faster long column reductions
  - Wired buffer support for large models
  - Better Winograd dispatch condition for convs
  - Faster scatter/gather
  - Faster mx.random.uniform and mx.random.bernoulli
  - Better threadgroup sizes for large arrays
- Misc
  - Added Python 3.13 to CI
  - C++20 compatibility
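A minimal sketch (not from the release notes) of the new eigendecomposition routines; like other mx.linalg ops they are assumed to run on the CPU stream, and eigh is assumed to return an (eigenvalues, eigenvectors) pair as in NumPy.

```python
import mlx.core as mx

a = mx.array([[2.0, 1.0], [1.0, 2.0]])

# Eigenvalues and eigenvectors of a symmetric matrix.
vals, vecs = mx.linalg.eigh(a, stream=mx.cpu)
print(vals)  # [1.0, 3.0]

# Eigenvalues only.
print(mx.linalg.eigvalsh(a, stream=mx.cpu))
```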
Bugfixes
- Fix command encoder synchronization
- Fix mx.vmap with gather and constant outputs
- Fix fused sdpa with differing key and value strides
- Support mx.array.__format__ with a format spec
with spec - Fix multi output array leak
- Fix RMSNorm weight mismatch error
v0.19.3
v0.19.2
🚀🚀
v0.19.1
v0.19.0
Highlights
- Speed improvements
  - Up to 6x faster CPU indexing (benchmarks)
  - Faster Metal compiled kernels for strided inputs (benchmarks)
  - Faster generation with fused-attention kernel (benchmarks)
- Gradient for grouped convolutions
- Due to Python 3.8's end-of-life we no longer test with it on CI
Core
- New features
  - Gradient for grouped convolutions
  - mx.roll
  - mx.random.permutation
  - mx.real and mx.imag (sketch below)
- Performance
  - Up to 6x faster CPU indexing (benchmarks)
  - Faster CPU sort (benchmarks)
  - Faster Metal compiled kernels for strided inputs (benchmarks)
  - Faster generation with fused-attention kernel (benchmarks)
  - Bulk eval in safetensors to avoid unnecessary serialization of work
- Misc
  - Bump to nanobind 2.2
  - Move testing to Python 3.9 due to 3.8's end-of-life
  - Make the GPU device more thread safe
  - Fix the submodule stubs for better IDE support
  - CI-generated docs that will never be stale
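A short sketch (not from the release notes) of the new ops listed above (mx.roll, mx.random.permutation, mx.real, mx.imag):

```python
import mlx.core as mx

a = mx.arange(6)

# Circularly shift elements by two positions.
print(mx.roll(a, 2))  # [4, 5, 0, 1, 2, 3]

# A random permutation of 0..5.
print(mx.random.permutation(6))

# Real and imaginary parts of a complex array.
z = mx.array([1 + 2j, 3 - 4j])
print(mx.real(z), mx.imag(z))
```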
NN
- Add support for grouped 1D convolutions to the nn API
- Add some missing type annotations
Bugfixes
- Fix and speedup row-reduce with few rows
- Fix normalization primitive segfault with unexpected inputs
- Fix complex power on the GPU
- Fix freeing deep unevaluated graphs (details)
- Fix race with array::is_available
- Consistently handle softmax with all -inf inputs
- Fix streams in affine quantize
- Fix CPU compile preamble for some linux machines
- Stream safety in CPU compilation
- Fix CPU compile segfault at program shutdown