Support pre-packing weights after model optimization
This reduces inference time for matmul operations at the cost of higher memory usage.

- Add methods to the `Operator` trait to declare which inputs can potentially be pre-packed and to prepack those inputs (see the sketch below).
- Add a `Graph::prepack_weights` method to traverse operators and prepack inputs whose values are constant nodes.
- Implement the prepacking methods for the MatMul and fused MatMul ops.

There are some caveats:

- Non-MatMul operations which use matmuls internally (Conv, ConvTranspose, LSTM, GRU etc.) currently don't prepack their weights.
- MatMul operations which turn out to be matrix-vector (gemv) products don't use the prepacked weights. This affects transformer decoders doing non-batched generation after the initial prompt encoding step.
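A minimal, hypothetical sketch of what this API shape could look like follows. The names `Operator` and `Graph::prepack_weights` come from this commit; everything else (the `Tensor` and `PackedTensor` types, the toy `Graph` layout, and all method signatures) is invented for illustration and does not reflect the actual implementation.

```rust
// Hypothetical sketch only: the names `Operator` and `Graph::prepack_weights`
// come from the commit message; all signatures and helper types below are
// assumptions made up for illustration.

use std::collections::HashMap;

/// Placeholder for a constant tensor stored in the graph.
struct Tensor(Vec<f32>);

/// A weight repacked into a layout the matmul kernel prefers.
struct PackedTensor(Vec<f32>);

trait Operator {
    /// Indices of inputs this operator can consume in pre-packed form.
    /// By default, none.
    fn prepackable_inputs(&self) -> Vec<usize> {
        Vec::new()
    }

    /// Pre-pack the given input, or return `None` if unsupported.
    fn prepack(&self, _input_index: usize, _tensor: &Tensor) -> Option<PackedTensor> {
        None
    }
}

struct MatMul;

impl Operator for MatMul {
    fn prepackable_inputs(&self) -> Vec<usize> {
        vec![1] // The RHS weight operand benefits from pre-packing.
    }

    fn prepack(&self, input_index: usize, tensor: &Tensor) -> Option<PackedTensor> {
        // A real kernel would repack into a blocked layout; a plain copy
        // stands in for that here.
        (input_index == 1).then(|| PackedTensor(tensor.0.clone()))
    }
}

/// Minimal graph: each operator records the node IDs feeding its inputs.
struct Graph {
    constants: HashMap<usize, Tensor>,
    operators: Vec<(Box<dyn Operator>, Vec<usize>)>,
    /// Packed weights cached per (operator index, input index).
    packed: HashMap<(usize, usize), PackedTensor>,
}

impl Graph {
    /// Traverse operators and pre-pack inputs backed by constant nodes.
    fn prepack_weights(&mut self) {
        for (op_idx, (op, input_ids)) in self.operators.iter().enumerate() {
            for input_idx in op.prepackable_inputs() {
                let Some(node_id) = input_ids.get(input_idx) else { continue };
                // Only constant nodes (weights) can be packed ahead of time.
                if let Some(tensor) = self.constants.get(node_id) {
                    if let Some(packed) = op.prepack(input_idx, tensor) {
                        self.packed.insert((op_idx, input_idx), packed);
                    }
                }
            }
        }
    }
}

fn main() {
    let mut constants = HashMap::new();
    constants.insert(0, Tensor(vec![1.0, 2.0, 3.0, 4.0]));

    let mut graph = Graph {
        constants,
        // One MatMul; input 0 is a runtime activation (node 99),
        // input 1 is constant node 0 (the weights).
        operators: vec![(Box::new(MatMul), vec![99, 0])],
        packed: HashMap::new(),
    };
    graph.prepack_weights();

    let packed = &graph.packed[&(0, 1)];
    println!("prepacked weight with {} elements", packed.0.len());
}
```

In this sketch the graph owns the packed buffers and caches them per (operator, input) pair, which makes the memory-for-speed trade-off explicit: each prepacked weight exists alongside the original constant tensor.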