Improve performance of `CommutationAnalysis` transpiler pass #6982
Conversation
The `CommutationAnalysis` transpiler pass for large circuits with many gates is typically one of the longer passes, along with the mapping passes. It spends most of its time deciding whether any two given operators commute, which it does by matrix multiplication and maintaining a cache. This commit maintains the same general method (as opposed to, say, maintaining a known-good lookup table), but performs the following optimisations to improve it, in approximate order from most to least impactful:

- we store the _result_ of "do these two matrices commute?" instead of storing the two matrices individually in the cache, as the previous method did. With the way the caching is written, the keys depend on both matrices, so it is never the case that one key can be a cache hit and the other a cache miss. This means that storing only the result does not cause any more cache misses than before, and saves the subsequent matrix operations (multiplication and comparison). In real-world usage, this is the most significant change.

- the matrix multiplication is slightly reorganised to halve the number of calls to `Operator.compose`. Instead of first constructing an identity operator over _all_ qubits and composing both operators onto it, we reorganise the indices of the qubit arguments so that all the "first" operator's qubits come first, and that operator is tensored with a small identity to bring it up to size. Then the other operator is composed with it. This is generally faster, since it replaces a call to `compose`, which needs to do a matmul-einsum step, with a simple `kron`, which need not concern itself with order. It also results in fewer operations, since the input matrices are smaller.

- the cache-key algorithm is changed to avoid string-ification as much as possible. This generally has a very small impact for most real-world applications, but has a _massive_ impact on circuits with large numbers of unsynthesised `UnitaryGate` elements (like quantum-volume circuits being transpiled without `basis_gates`). This is because the gate parameters were previously being converted to strings, which for `UnitaryGate` meant string-ifying the whole matrix. That was much slower than just doing the dot product, so it was defeating the purpose of the cache.

On my laptop (i7 MacBook Pro 2019), this gives a 15-35% speed increase on the `time_quantum_volume_transpile` benchmark at transpiler optimisation levels 2 and 3 (the only levels at which the `CommutationAnalysis` pass is run by default), measured over the whole transpile. The improvement in the runtime of the pass itself depends strongly on the type of circuit, but in the worst (highly unrealistic) cases it can be nearly an order of magnitude (for example, just calling `transpile(optimization_level=3)` on a `QuantumVolume(128)` circuit drops from 27s to 5s).

`asv` `time_quantum_volume_transpile` benchmarks:

This commit:

| transpiler optimization level | time |
| --- | --- |
| 0 | 3.57±0s |
| 1 | 7.16±0.03s |
| 2 | 10.8±0.02s |
| 3 | 33.3±0.08s |

Previous commit:

| transpiler optimization level | time |
| --- | --- |
| 0 | 3.56±0.02s |
| 1 | 7.24±0.05s |
| 2 | 16.8±0.04s |
| 3 | 38.9±0.1s |
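As a rough illustration of the first two points above, here is a hedged, minimal sketch rather than the pass's actual code: `check_commutation`, `_commutation_cache`, and the key layout are illustrative assumptions, while `Operator`, `tensor`, and `compose` are the real `qiskit.quantum_info` APIs being described.

```python
import numpy as np
from qiskit.quantum_info import Operator

# Cache keyed on something that identifies both gates *and* their relative
# qubit placement; only the boolean answer is stored, never the matrices.
_commutation_cache = {}


def check_commutation(key, op1, op2, qargs2, num_qubits):
    """Return True if ``op1`` and ``op2`` commute on ``num_qubits`` qubits.

    Assumes the qubit indices were relabelled so ``op1`` acts on qubits
    ``0 .. op1.num_qubits - 1`` and ``op2`` acts on the indices in ``qargs2``.
    """
    if key in _commutation_cache:
        return _commutation_cache[key]
    # Pad op1 up to full width with a small identity via a plain Kronecker
    # product instead of composing both operators onto a full-size identity.
    pad_qubits = num_qubits - op1.num_qubits
    if pad_qubits:
        pad = Operator(
            np.eye(2**pad_qubits),
            input_dims=(2,) * pad_qubits,
            output_dims=(2,) * pad_qubits,
        )
        op1 = pad.tensor(op1)  # op1 stays on the low qubits, identity above
    op12 = op1.compose(op2, qargs=qargs2)              # op1 then op2
    op21 = op1.compose(op2, qargs=qargs2, front=True)  # op2 then op1
    result = np.allclose(op12.data, op21.data)
    _commutation_cache[key] = result
    return result
```

For example, `check_commutation(("z", (0,), "cx", (0, 1)), Operator(ZGate()), Operator(CXGate()), qargs2=[0, 1], num_qubits=2)` (with the gates from `qiskit.circuit.library`) returns `True`, since a `Z` on the control of a `CX` commutes with it.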
How does the cache handle floating point round-off errors? Would the keys be unique enough by manually rounding them?
```
@@ -87,42 +87,87 @@ def run(self, dag):
                self.property_set["commutation_set"][(current_gate, wire)] = temp_len - 1


def _commute(node1, node2, cache):
_COMMUTE_ID_OP = {}
```
Does this need to be module level or can it be put inside `_commute`?
I couldn't find where it gets populated?
It can be either - realistically I can't imagine it'll ever contain more than ~3 elements (you'd need to be dealing with many-qubit gates to have more than even a single element in it), so there shouldn't be any memory problems. Having it module-level means that the cache is kept between calls; it's only ever populated with various sizes of the identity matrix, and those are complete constants.
It's populated here:
https://github.com/Qiskit/qiskit-terra/blob/07299e170532a24fbaefa6a87aa3c427b79f765e/qiskit/transpiler/passes/optimization/commutation_analysis.py#L161-L168
on line 164 (note it's a double assignment).
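The shape of that double assignment, as a hedged illustration (not the verbatim linked code; `extra_qubits` is a stand-in name):

```python
import numpy as np
from qiskit.quantum_info import Operator

_COMMUTE_ID_OP = {}  # module-level cache of identity operators, keyed by size

extra_qubits = 1  # however many padding qubits the current pair of gates needs
# "Double assignment": the module-level cache and the local name both receive
# the freshly built identity operator in a single statement.
id_op = _COMMUTE_ID_OP[extra_qubits] = Operator(np.eye(2**extra_qubits))
```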
The original code actually had far worse behaviour with regards to string-ification, since before this commit, all Numpy arrays were turned into strings, which had all the floating-point problems you mentioned. In the new form, I'm not aware of any cases where the fallback will actually be used, but I kept it in place so it won't break any extension gates doing super weird things. This commit can in theory cause more cache misses (though it certainly doesn't in the benchmark suite), if several instructions are separately decomposed into the same type of gate with exactly equivalent, but referentially unequal Numpy arrays. Even if this were to happen, the slowness of string-ifying Numpy arrays (the previous method) almost invariably outweighs the cost of just doing the commutation analysis again, especially now that I've tweaked the comparison to avoid unnecessary work.

(Sorry for the slow reply - I was on holiday!)
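A hedged sketch of how a cache key along these lines could avoid string-ifying arrays (the helper name `_hashable_parameters` and the exact fallbacks are illustrative assumptions, not the code in this commit):

```python
import numpy as np


def _hashable_parameters(params):
    """Turn gate parameters into a hashable cache key without str() on arrays."""
    out = []
    for value in params:
        if isinstance(value, (int, float, complex)):
            out.append(value)  # plain numbers hash directly
        elif isinstance(value, np.ndarray):
            # Key on object identity rather than the slow string form; equal
            # but referentially distinct arrays may then miss the cache, but
            # re-doing the check is still cheaper than string-ifying a matrix.
            out.append(("ndarray", id(value)))
        else:
            out.append(str(value))  # last-resort fallback for exotic parameters
    return tuple(out)
```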
Details and comments
There is more work that could be done here as well, which I think would give us more performance in more natural workflows that don't just involve dumping in a tonne of unitary matrices. The next obvious step would be to build a lookup table of gate types that are known to commute with each other, in order to skip the entire process of comparing the two matrix multiplications for common cases; we can always fall back to this method. I haven't done this right now, because it's a bigger task, and it doesn't need doing in one go.
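As a rough illustration of that follow-up idea (not something in this PR), a name-based fast path could sit in front of the matrix check; `gates_commute` and the gate-name sets below are assumptions, restricted to single-qubit gates whose mutual commutation does not depend on qubit placement.

```python
# Hedged sketch of a known-commuting fast path with a fall-back to the
# matrix-based check described above.  Each set only holds single-qubit gates
# that are all diagonal in the same basis, so any pair of them commutes whether
# or not they share a qubit; everything else falls back to the existing path.
_Z_BASIS_GATES = {"z", "s", "sdg", "t", "tdg", "rz", "p"}
_X_BASIS_GATES = {"x", "rx", "sx", "sxdg"}


def gates_commute(node1, node2, matrix_based_check):
    names = {node1.op.name, node2.op.name}
    if names <= _Z_BASIS_GATES or names <= _X_BASIS_GATES:
        return True
    # Anything not covered by the table still goes through the existing
    # matrix-multiplication check (with its cache).
    return matrix_based_check(node1, node2)
```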
With caches, it's important that the time taken to construct the key isn't excessive - in this case, when the parameters were arrays (even if not `UnitaryGate`), that wasn't necessarily honoured.