Use "tensorised-Pauli decomposition" in SparsePauliOp.from_operator
#11133
Conversation
```diff
-if isinstance(coeffs, np.ndarray) and coeffs.dtype == object:
-    dtype = object
+if isinstance(coeffs, np.ndarray):
+    dtype = object if coeffs.dtype == object else complex
```
This rearrange prevents an unnecessary Python-space iteration over `coeffs` when it's given as a Numpy array that's not an object array; we don't need to check each element individually, but can let Numpy perform the cast (if needed) in C space.
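As a rough illustration of the difference (a minimal sketch with a hypothetical helper name, not the actual Qiskit code path):

```python
import numpy as np

# Sketch of the dtype selection discussed above: for an ndarray input, the
# target dtype can be chosen from `coeffs.dtype` alone, and Numpy performs
# any needed cast in C space instead of a per-element Python loop.
def _coeffs_dtype(coeffs):
    if isinstance(coeffs, np.ndarray):
        return object if coeffs.dtype == object else complex
    # Non-ndarray inputs (e.g. plain lists) would still need per-element
    # inspection; that slower Python-space path is elided here.
    return complex

coeffs = np.arange(4, dtype=float)  # a real-valued array input
coeffs = np.asarray(coeffs, dtype=_coeffs_dtype(coeffs))  # single C-space cast
```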
Force-pushed from b96ae78 to f016f32.
Very casual flame-graphing suggests that this implementation is bottlenecked on its very inefficient allocations, which is pretty much exactly what I'd expect. I'd prefer to rearrange everything into a scratch-reusing, potentially non-recursive version of the code in a follow-up, though, since I suspect that it will be rather harder to read.
Overall this looks great: it's a big speedup, and the algorithm is pretty straightforward. I just had a couple of questions and comments inline. Nothing major, though.
Force-pushed from 87a8967 to 1b2bea3.
LGTM, thanks for the updates
Use "tensorised-Pauli decomposition" in `SparsePauliOp.from_operator` (Qiskit#11133)

* Use "tensorised-Pauli decomposition" in `SparsePauliOp.from_operator`
* Remove unused import
* Remove superfluous `num_qubits` argument
* Export `ZXPaulis` to Python space
* Add comment on array-builder logic
* Use `num_traits::Zero` trait for zero checking
* Use custom ilog2
* Use stdlib `ilog2`
Summary
This switches the implementation of `SparsePauliOp.from_operator` to the "tensorised-Pauli decomposition" described in https://arxiv.org/abs/2310.13421.

This initial implementation is a relatively naive implementation of the algorithm as described in the paper, with a couple of engineering improvements over the paper's accompaniment at HANTLUK/PauliDecomposition@932ce3926 (i.e. constant-factor improvements, rather than complexity improvements). It makes the "zero check" on the recursions short-circuiting, which means fewer matrix elements need to be examined on average after each recursive step. Further, the constant-factor multiplication is tracked separately as a single scalar throughout the recursion, to avoid a full-matrix complex multiplication at every recursive step (each complex multiplication is approximately six real floating-point operations).
There is plenty of space in this implementation to reduce internal memory allocations by reusing swap space, and to dispatch the different components of the decomposition to threaded workers. Effective threading is more non-trivial, as a typical operator will have structure that could be exploited to dispatch the threads more optimally. These can be investigated later.
Details and comments
Timing using the script (IPython, because I'm lazy about doing manual `timeit`), I got these results:

where in the `num_qubits=10` case, this PR's time is 217ms vs the paper's 13.2s. Some of that will be Rust vs Python, but my version will also allocate fewer matrices and has drastically smaller constant factors on each recursive step by avoiding matrix-long complex multiplications. This means that this PR's version is doing $2 \times 4^{\text{num qubits}}$ floating-point additions/subtractions for the recursive step at a size of `num_qubits`, whereas the paper is doing that, plus $4 \times 4^{\text{num qubits}}$ complex multiplications, where each complex multiplication is approximately 6 real floating-point operations in non-SIMD code, though Numpy will have that optimised via its runtime-CPU-feature dispatch to be much more efficient.

edit: I made the minor modifications needed to avoid the extra complex-number multiplications, and it looks like the non-short-circuiting `np.any` and the Python-space overhead dominate the time of the paper's Python implementation, though the complex multiplication is very visible. I didn't go far down that path, though.

I intend to go further in a follow-up PR to reduce the number of matrix allocations that are done in Rust space by re-using the same scratch space all the way down, and potentially avoiding using the helper `Pauli` enum.

The algorithm is trivially naively threadable (dispatch the four top-level recursive calls to a thread each), but only when discounting the structure of real operators, which makes it unlikely that the top level would always split neatly into four, leaving quite unbalanced threads. Taking efficient advantage of that structure in threading is a little more non-trivial.
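For reference, the "approximately 6 real floating-point operations" figure is just the standard expansion of a complex product (ordinary arithmetic, not anything specific to this PR):

$$(a + bi)(c + di) = (ac - bd) + (ad + bc)i,$$

which is four real multiplications plus two real additions/subtractions per complex multiplication.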