Releases · ml-explore/mlx
v0.23.0
Highlights
- 4-bit Mistral 7B generates at 131 toks/sec out of the box on an M2 Ultra
- More performance improvements across the board:
  - Faster small batch quantized matmuls. Speeds up speculative decoding on M1, M2
  - Faster Winograd convolutions (benchmarks)
  - Up to 3x faster sort (benchmarks)
  - Much faster mx.put_along_axis and mx.take_along_axis (benchmarks)
  - Faster unified CPU back-end with vector operations
- Double precision (mx.float64) support on the CPU
Core
Features
- Bitwise invert: mx.bitwise_invert
- mx.linalg.lu, mx.linalg.lu_factor, mx.linalg.solve, and mx.linalg.solve_triangular (sketch below)
- Support loading F8_E4M3 from safetensors
- mx.float64 supported on the CPU
- Matmul JVPs
- Distributed launch helper: mlx.launch
- Support non-square QR factorization with mx.linalg.qr
- Support ellipsis in mx.einsum
- Refactor and unify accelerate and common back-ends
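A minimal sketch (not from the release notes) exercising a few of the additions above: the new linear-algebra routines, CPU double precision, and ellipsis in mx.einsum. It assumes the linalg routines and float64 run on the CPU stream, as other mx.linalg ops do.

```python
import mlx.core as mx

# Solve A x = b with the new solver (CPU stream assumed for linalg ops).
A = mx.array([[3.0, 1.0], [1.0, 2.0]])
b = mx.array([[9.0], [8.0]])
x = mx.linalg.solve(A, b, stream=mx.cpu)
print(x)  # [[2.0], [3.0]]

# Double precision is available on the CPU only.
A64 = A.astype(mx.float64, stream=mx.cpu)
print(A64.dtype)  # mlx.core.float64

# Ellipsis in einsum: batched matrix multiply over leading dimensions.
u = mx.random.normal((4, 2, 3))
v = mx.random.normal((4, 3, 5))
w = mx.einsum("...ij,...jk->...ik", u, v)
print(w.shape)  # (4, 2, 5)
```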
Performance
- Faster synchronization Fence for synchronizing CPU and GPU
- Much faster mx.put_along_axis and mx.take_along_axis (benchmarks; sketch below)
- Fast Winograd convolutions (benchmarks)
- Allow dynamic ops per buffer based on dispatches and memory (benchmarks)
- Up to 3x faster sort (benchmarks)
- Faster small batch qmv (benchmarks)
- Ring distributed backend
  - Uses raw sockets for faster all reduce
- Some CPU ops are much faster with the new Simd<T, N>
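A small sketch (not from the release notes) of the take/put_along_axis ops called out above:

```python
import mlx.core as mx

a = mx.array([[10, 20, 30], [40, 50, 60]])
idx = mx.array([[2], [0]])

# Gather one element per row along axis 1.
taken = mx.take_along_axis(a, idx, axis=1)
print(taken)  # [[30], [40]]

# Scatter values back into the same positions (returns a new array).
updated = mx.put_along_axis(a, idx, mx.array([[-1], [-2]]), axis=1)
print(updated)
```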
NN
- Orthogonal initializer: nn.init.orthogonal (sketch below)
- Add dilation for conv 3d layers
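A minimal sketch (not from the release notes) of the orthogonal initializer; it assumes the same factory pattern as the other nn.init functions, which return a callable that fills an array of the given shape.

```python
import mlx.core as mx
import mlx.nn as nn

# Build the initializer and apply it to a template array.
init_fn = nn.init.orthogonal()
w = init_fn(mx.zeros((4, 4)))

# Columns should be approximately orthonormal: w.T @ w ~ identity.
print(mx.round(w.T @ w, 3))
```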
Bug fixes
- Limit grad recursion depth by not recursing through non-grad inputs
- Fix synchronization bug for GPU stream async CPU work
- Fix shapeless compile on ubuntu24
- Recompile when shapeless changes
- Fix rope fallback to not upcast
- Fix metal sort for certain cases
- Fix a couple of slicing bugs
- Avoid duplicate malloc with custom kernel init
- Fix compilation error on Windows
- Allow Python garbage collector to break cycles on custom objects
- Fix grad with copies
- Loading an empty list is ok when strict = false
- Fix split vmap
- Fix output donation for IO ops on the GPU
- Fix creating an array with an int64 scalar
- Catch stream errors earlier to avoid aborts
v0.22.1
v0.22.0
Highlights
- Export and import MLX functions to a file (example, bigger example); see the sketch after this list
  - Functions can be exported from Python and run in C++ and vice versa
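A minimal sketch (not from the release notes) of the export/import flow, assuming the mx.export_function / mx.import_function API shown in the linked examples: the function is traced with example inputs, written to a .mlxfn file, and the imported callable returns its outputs as a list.

```python
import mlx.core as mx

def fun(x, y):
    return mx.exp(x) + y

x = mx.array(1.0)
y = mx.array(2.0)

# Trace `fun` with example inputs and write it to disk.
mx.export_function("fun.mlxfn", fun, x, y)

# Import it back; the imported callable returns a list of outputs.
imported = mx.import_function("fun.mlxfn")
(out,) = imported(x, y)
print(out)  # ~4.718
```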
Core
- Add slice and slice_update which take arrays for starting locations
- Add an example for using MLX in C++ with CMake
- Fused attention for generation now supports boolean masking (benchmark)
- Allow array offset for mx.fast.rope
- Add mx.finfo (sketch below)
- Allow negative strides without resorting to copying for slice and as_strided
- Add Flatten, Unflatten, and ExpandDims primitives
- Enable the compilation of lambdas in C++
- Add a lot more primitives for shapeless compilation (full list)
- Fix performance regression in qvm
- Introduce separate types for Shape and Strides and switch to int64 strides from uint64
- Reduced copies for fused-attention kernel
- Recompile a function when the stream changes
- Several steps to improve the linux / x86_64 experience (#1625, #1627, #1635)
- Several steps to improve/enable the windows experience (#1628, #1660, #1662, #1661, #1672, #1663, #1664, ...)
- Update to newer Metal-cpp
- Throw when exceeding the maximum number of buffers possible
- Add mx.kron
- mx.distributed.send now implements the identity function instead of returning an empty array
- Better error reporting for mx.compile on CPU and for unrecoverable errors
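A short sketch (not from the release notes) of two of the additions above, mx.finfo and mx.kron:

```python
import mlx.core as mx

# Floating-point limits for an MLX dtype.
info = mx.finfo(mx.float16)
print(info.min, info.max)  # -65504.0 65504.0

# Kronecker product of two small matrices.
a = mx.array([[1.0, 2.0], [3.0, 4.0]])
b = mx.eye(2)
print(mx.kron(a, b).shape)  # (4, 4)
```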
NN
- Add optional bias correction in Adam/AdamW
- Enable mixed quantization via nn.quantize (sketch below)
- Remove reshapes from nn.QuantizedEmbedding
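A hedged sketch (not from the release notes) of mixed quantization through nn.quantize. It assumes the class_predicate callback may return either a bool or a dict of per-layer quantization parameters; treat that contract as an assumption rather than documented behavior.

```python
import mlx.core as mx
import mlx.nn as nn

model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 64),
)

def predicate(path, module):
    # Quantize only Linear layers; keep the small output layer at 8 bits.
    if not isinstance(module, nn.Linear):
        return False
    if module.weight.shape[0] == 64:
        return {"group_size": 64, "bits": 8}  # per-layer override (assumed)
    return True  # use the default 4-bit setting below

nn.quantize(model, group_size=64, bits=4, class_predicate=predicate)

# The quantized model still runs as before.
x = mx.random.normal((1, 512))
print(model(x).shape)  # (1, 64)
```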
Bug fixes
- Fix qmv/qvm bug for batch size 2-5
- Fix some leaks and races (#1629)
- Fix transformer postnorm in mlx.nn
- Fix some mx.fast fallbacks
- Fix the hashing for string constants in compile
- Fix small sort in Metal
- Fix memory leak of non-evaled arrays with siblings
- Fix concatenate/slice_update vjp in edge-case where the inputs have different type
v0.21.1
v0.21.0
Highlights
- Support 3-bit and 6-bit quantization (benchmarks; sketch below)
- Much faster memory-efficient attention for head dim 64 and 80 (benchmarks)
- Much faster sdpa inference kernel for longer sequences (benchmarks)
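A minimal sketch (not from the release notes) of the new bit widths with the existing quantize/dequantize/quantized_matmul APIs:

```python
import mlx.core as mx

w = mx.random.normal((512, 512))

# Quantize to 3 bits with a group size of 64.
w_q, scales, biases = mx.quantize(w, group_size=64, bits=3)

# Dequantize to check the round-trip error.
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=3)
print(mx.abs(w - w_hat).max())

# Quantized matmul with the same parameters.
x = mx.random.normal((1, 512))
y = mx.quantized_matmul(x, w_q, scales, biases, transpose=True, group_size=64, bits=3)
print(y.shape)  # (1, 512)
```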
Core
- contiguous op (C++ only) + primitive
- BFS width limit to reduce memory consumption during eval
- Fast CPU quantization
- Faster indexing math in several kernels:
  - unary, binary, ternary, copy, compiled, reduce
- Improve dispatch threads for a few kernels:
  - conv, gemm splitk, custom kernels
- More buffer donation with no-ops to reduce memory use
- Use CMAKE_OSX_DEPLOYMENT_TARGET to pick the Metal version
- Dispatch Metal bf16 type at runtime when using the JIT
NN
- nn.AvgPool3d and nn.MaxPool3d (sketch below)
- Support groups in nn.Conv2d
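A small sketch (not from the release notes) of the new 3D pooling layers and grouped 2D convolution; it assumes MLX's channels-last layout, i.e. (N, D, H, W, C) for the pooling input and (N, H, W, C) for Conv2d.

```python
import mlx.core as mx
import mlx.nn as nn

# 3D max pooling over a (batch, depth, height, width, channels) input.
pool = nn.MaxPool3d(kernel_size=2, stride=2)
x = mx.random.normal((1, 8, 8, 8, 4))
print(pool(x).shape)  # (1, 4, 4, 4, 4)

# Grouped convolution: 16 input and 16 output channels split into 4 groups.
conv = nn.Conv2d(16, 16, kernel_size=3, groups=4)
y = mx.random.normal((1, 32, 32, 16))
print(conv(y).shape)  # (1, 30, 30, 16)
```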
Bug fixes
- Fix per-example mask + docs in sdpa
- Fix FFT synchronization bug (use dispatch method everywhere)
- Throw for invalid *fft{2,n} cases
- Fix OOB access in qmv
- Fix donation in sdpa to reduce memory use
- Allocate safetensors header on the heap to avoid stack overflow
- Fix sibling memory leak
- Fix view segfault for scalar inputs
- Fix concatenate vmap
v0.20.0
Highlights
- Even faster GEMMs
  - Peaking at 23.89 TFLOPS on M2 Ultra (benchmarks)
- BFS graph optimizations
  - Over 120 toks/sec with Mistral 7B!
- Fast batched QMV/QVM for KV-quantized attention (benchmarks)
Core
- New Features
  - mx.linalg.eigh and mx.linalg.eigvalsh (sketch below)
  - mx.nn.init.sparse
  - 64-bit type support for mx.cumprod and mx.cumsum
- Performance
  - Faster long column reductions
  - Wired buffer support for large models
  - Better Winograd dispatch condition for convs
  - Faster scatter/gather
  - Faster mx.random.uniform and mx.random.bernoulli
  - Better threadgroup sizes for large arrays
- Misc
  - Added Python 3.13 to CI
  - C++20 compatibility
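A minimal sketch (not from the release notes) of the new eigendecomposition routines; like other mx.linalg ops they are assumed to run on the CPU stream, and eigh is assumed to return an (eigenvalues, eigenvectors) pair as in NumPy.

```python
import mlx.core as mx

a = mx.array([[2.0, 1.0], [1.0, 2.0]])

# Eigenvalues and eigenvectors of a symmetric matrix.
vals, vecs = mx.linalg.eigh(a, stream=mx.cpu)
print(vals)  # [1.0, 3.0]

# Eigenvalues only.
print(mx.linalg.eigvalsh(a, stream=mx.cpu))
```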
Bugfixes
- Fix command encoder synchronization
- Fix mx.vmap with gather and constant outputs
- Fix fused sdpa with differing key and value strides
- Support mx.array.__format__ with a format spec
with spec - Fix multi output array leak
- Fix RMSNorm weight mismatch error
v0.19.3
v0.19.2
🚀🚀
v0.19.1
v0.19.0
Highlights
- Speed improvements
  - Up to 6x faster CPU indexing (benchmarks)
  - Faster Metal compiled kernels for strided inputs (benchmarks)
  - Faster generation with fused-attention kernel (benchmarks)
- Gradient for grouped convolutions
- Due to Python 3.8's end-of-life we no longer test with it on CI
Core
- New features
  - Gradient for grouped convolutions
  - mx.roll
  - mx.random.permutation
  - mx.real and mx.imag (sketch below)
- Performance
  - Up to 6x faster CPU indexing (benchmarks)
  - Faster CPU sort (benchmarks)
  - Faster Metal compiled kernels for strided inputs (benchmarks)
  - Faster generation with fused-attention kernel (benchmarks)
  - Bulk eval in safetensors to avoid unnecessary serialization of work
- Misc
  - Bump to nanobind 2.2
  - Move testing to Python 3.9 due to 3.8's end-of-life
  - Make the GPU device more thread safe
  - Fix the submodule stubs for better IDE support
  - CI-generated docs that will never be stale
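A short sketch (not from the release notes) of the new ops listed above (mx.roll, mx.random.permutation, mx.real, mx.imag):

```python
import mlx.core as mx

a = mx.arange(6)

# Circularly shift elements by two positions.
print(mx.roll(a, 2))  # [4, 5, 0, 1, 2, 3]

# A random permutation of 0..5.
print(mx.random.permutation(6))

# Real and imaginary parts of a complex array.
z = mx.array([1 + 2j, 3 - 4j])
print(mx.real(z), mx.imag(z))
```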
NN
- Add support for grouped 1D convolutions to the nn API
- Add some missing type annotations
Bugfixes
- Fix and speedup row-reduce with few rows
- Fix normalization primitive segfault with unexpected inputs
- Fix complex power on the GPU
- Fix freeing deep unevaluated graphs (details)
- Fix race with array::is_available
- Consistently handle softmax with all -inf inputs
- Fix streams in affine quantize
- Fix CPU compile preamble for some linux machines
- Stream safety in CPU compilation
- Fix CPU compile segfault at program shutdown