Transpiler yields less optimal circuits on larger topologies #10160
Comments
With these plots, did you have an initial layout set?
No initial layout was set.
This commit builds on the VF2PartialLayout pass, which was an experiment available as an external plugin here: https://github.com/mtreinish/vf2_partial_layout That pass used the VF2 algorithm in rustworkx to find the deepest partial interaction graph of a circuit that is isomorphic to the coupling graph, and used that mapping to apply an initial layout. The issue with the performance of that pass was the selection of the qubits outside the partial interaction graph. Selecting the mapping for those qubits is essentially the same heuristic layout problem that SabreLayout is trying to solve, just for a subset of qubits. In VF2PartialLayout a simple nearest-neighbor approach was used to select coupling-graph qubits for any virtual qubits outside the partial layout. In practice this ended up performing worse than SabreLayout.

To address the shortcomings of that pass, this commit combines the partial layout selection from that external plugin with SabreLayout. The Sabre layout algorithm starts by randomly selecting a layout, then progressively works forwards and backwards across the circuit, swap mapping it to find the permutation caused by inserted swaps. Those permutations are then used to modify the random layout, and eventually an initial layout that minimizes the number of swaps needed is selected. With this commit, instead of using a completely random layout for all the initial guesses, a single trial is started with the partial layout found in the same way as VF2PartialLayout. The remaining qubits are then selected at random and the SabreLayout algorithm is run in the same manner as before. This hopefully should improve the quality of the results because we're starting from a partial layout that doesn't require swaps for those qubits.

A similar (almost identical) approach was tried in Qiskit#9174, except that instead of seeding a single trial with the partial layout it used the partial layout for all the trials. In that case the results were mixed and not generally better. At the time my guess was that using the partial layout constrained the search space too much and was inducing more swaps. However, looking at the details in issue Qiskit#10160, this adapts Qiskit#9174 to see if doing the partial layout in a more limited manner has any impact there.
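As a rough sketch of the partial-layout idea described in the commit message (not the actual VF2PartialLayout implementation, and with backend and circuit chosen purely for illustration): rustworkx's VF2 search can map a circuit's two-qubit interaction graph onto a device coupling graph, and any match it finds is a swap-free placement for the qubits it covers. The real pass additionally truncates the interaction graph to the deepest matching portion of the circuit when no full match exists, which this sketch omits.

```python
import rustworkx as rx
from qiskit import QuantumCircuit
from qiskit.providers.fake_provider import FakeGuadalupeV2

backend = FakeGuadalupeV2()
# Undirected coupling graph; multigraph=False merges the two directed edges.
coupling = backend.coupling_map.graph.to_undirected(multigraph=False)

circuit = QuantumCircuit(4)
circuit.h(0)
circuit.cx(0, 1)
circuit.cx(1, 2)
circuit.cx(2, 3)

# Interaction graph: one node per virtual qubit, one edge per pair of
# qubits that share at least one 2-qubit gate.
interaction = rx.PyGraph()
interaction.add_nodes_from(range(circuit.num_qubits))
edges = {
    tuple(sorted(circuit.find_bit(q).index for q in inst.qubits))
    for inst in circuit.data
    if len(inst.qubits) == 2
}
interaction.add_edges_from([(a, b, None) for a, b in edges])

# subgraph=True searches for the interaction graph inside the coupling
# graph; induced=False allows unused device edges inside the matched region.
mappings = rx.vf2_mapping(coupling, interaction, subgraph=True, induced=False)
partial_layout = next(mappings, None)
print(partial_layout)  # {physical qubit: virtual qubit, ...} if a match exists
```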
Following up on this a bit, I looked at comparing results vs PyTket for the case of circular entanglement. Because Tket is deterministic but the Qiskit methods are not, I transpiled circuits of widths from 2->20 qubits 50 times each, for 27, 127, and 433Q devices, and looked at the mean and spread of the results vs Tket.

On 27Q, Qiskit on average performs quite well versus Tket, and in the vast majority of cases the min value found after 50 trials was better than the deterministic Tket result. Every transpilation in Qiskit is much faster than in Tket, so in this case the timing to do 50x the circuits is the same as 1 circuit in Tket.

However, because of this issue, as the topology grows the mean number of additional gates due to swaps increases, and it becomes harder to out-perform Tket. What we see is that, for identical circuits, the mean number of additional gates increases with topology size, and on average Tket begins to pull ahead. Basically, as the number of gates and the mean values increase, we are having a harder and harder time finding a better routing because we need to hit the tail end of a distribution that is moving further away. So on average, for this circuit at least, using the Qiskit transpiler is no better than Tket (but is faster). For 433Q, Tket is by far better on average in most cases.

So in situations like the Runtime, where this repeated transpilation is not carried out, the end user is getting subpar performance relative to other tools, and it becomes more and more imperative that the user transpiles locally. There is of course a cost to transpiling multiple times, and that cost grows with topology size for Qiskit; Tket seems to remain relatively constant.
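A minimal sketch of this kind of repeated-transpilation measurement (an assumed setup, not the exact benchmark script, and omitting the Tket side): transpile the same circuit many times with different seeds and look at the mean, spread, and best of the added-2Q-gate distribution.

```python
import numpy as np
from qiskit import transpile
from qiskit.circuit.library import EfficientSU2
from qiskit.providers.fake_provider import FakeWashingtonV2  # 127Q device

backend = FakeWashingtonV2()

for width in (2, 8, 14, 20):  # a few points from the 2->20 sweep
    circ = EfficientSU2(width, entanglement="circular").decompose()
    base = sum(1 for inst in circ.data if len(inst.qubits) == 2)
    added = []
    for seed in range(50):
        out = transpile(circ, backend, optimization_level=3, seed_transpiler=seed)
        added.append(sum(1 for inst in out.data if len(inst.qubits) == 2) - base)
    print(f"{width}Q: mean={np.mean(added):.1f} "
          f"std={np.std(added):.1f} min={min(added)}")
```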
Can you share the circuits and scripts that you're benchmarking here? Something seems odd about our transpile times being hundreds of seconds, because I've successfully routed circuits of 1,000+q in a fraction of that time.
Fwiw, I have a pretty good guess at what's going on here: our random sampling of start locations on entry to SabreLayout is much less likely to be something decent when we've got more qubits in the map (more chance of qubits starting miles away from each other), which means we spend most of our forwards/backwards passes of routing just bringing the "initial" qubits closer together, before we even start on the actual meat of polishing that guess. Our heuristic for choosing initial layouts to try at the moment is a uniform random sampling over the "n_total permute n_virtual" mappings, with no regard to initial distance, symmetries of the coupling map, or anything, whereas tket has deterministic routines to choose something decent, which are generally much slower (and in part is why I'm really wondering about your compile times).
Jake, wouldn't this be mitigated by Sabre's forward-backward approach of updating layouts?
My strong hunch is that it'll be somewhat mitigated, but not sufficiently - we only do a finite number of forward/backward passes, and there's nothing that stops Sabre routing towards the end from routing gates "away" from the "centre" of the populated qubits, which would need more passes to correct. Much of the back-and-forth is meant to help with settling on the best initial permutation from within a given set of initial physical qubits, but in the case of a coupling map that's much larger than the number of virtual qubits, we have to use up some of our back-and-forth passes just trying to get a decent combination from the initial set, before we can start worrying about the best permutation within that.

There's also no guarantee that Sabre routing will bring all the qubits "close" to each other at the start/end - only those that interact nearby - and that potentially stymies routing's choices later down the line, since one swap may only ever affect one gate, whereas if all the qubits are packed into a dense subspace of the coupling graph, there's the opportunity for one swap to help several gates in the lookahead layers. This is all super heuristic and a bit of me just speaking from my intuition about how Sabre achieves its results rather than a formal study, but it sounds pretty plausible to me.
I've been playing some more with #10169 and I'm not sure a "better" initial layout guess necessarily leads to better results. That PR sets a layout trial to use a partial isomorphic subgraph (it's built by traversing the DAG and returns the mapping from the deepest node in the DAG), and then if there are extra qubits needed for a full layout (it's not always the case, for example for a circular entanglement) they're selected with nearest neighbors. I'm wondering if there is more of a middle ground where we can influence the layout selection towards a better result but without constraining the search space too much.

For #10169 the difference between that and my earlier attempt in #9174 is that the newer PR only uses one partial layout trial, with nearest neighbors for the free qubits, and the other trials are all fully random, while the earlier attempt used a partial layout for all trials with random permutations of the free qubits outside the partial layout. The other thing I want to try in this vein, instead of using a partial VF2 layout as the starting point, is to try and use

That all being said, I'm wondering though if maybe it makes more sense in the meantime to just dial up the heuristic effort dials on SabreLayout and SabreSwap.
I ran a quick test to see what the impact of extended set size is on the output here, after doing more testing with different types of input layout seeding (using various forms of partial layout seeding). I ran the circuit against IBM Seattle with opt level 3 and a fixed seed, then modified the sabre code to take the extended set size from an environment variable. The results are:

[plot of output 2Q gate count vs extended set size omitted]

The best-case result found on IBM Guadalupe in my previous testing of this circuit was 168 2Q gates, which matches the graph when the extended set size was 72. There was also no measurable runtime impact from this change: even with the larger sizes the runtime of transpile() was always < 5 sec, and realistically it was bound by the optimization loop time, which becomes a function of the output size from routing.

I'm thinking the answer here is to use a larger extended set size (in the short term, while working on a more thorough lookahead heuristic) and also to increase the trial counts for level 3.
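For concreteness, the sweep might have looked something like the sketch below. The environment variable name is invented for illustration (stock Qiskit has no such knob; it would only exist in the locally patched Sabre build described above), and a 127Q fake backend stands in for IBM Seattle, which has no public fake counterpart.

```python
import os
from qiskit import transpile
from qiskit.circuit.library import QFT
from qiskit.providers.fake_provider import FakeWashingtonV2

backend = FakeWashingtonV2()  # 127Q stand-in for the 433Q Seattle device
circ = QFT(16).decompose()

for size in (20, 40, 72, 128):
    # Hypothetical variable read by the patched Sabre build, not by stock Qiskit.
    os.environ["QISKIT_SABRE_EXTENDED_SET_SIZE"] = str(size)
    out = transpile(circ, backend, optimization_level=3, seed_transpiler=42)
    two_q = sum(1 for inst in out.data if len(inst.qubits) == 2)
    print(f"extended set size {size}: {two_q} 2Q gates")
```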
This commit increases the heuristic effort for sabre layout and routing. This is done through two changes: first, the depth of the internal lookahead heuristic used in sabre swap has been increased from 20 to 72. This is just a stop-gap pending a potential reworking of the lookahead heuristic; in local testing with larger backends and deeper circuits this shows better output results than the current default of 20, without any runtime impact. The other change is that the trial counts for each optimization level > 0 are increased 5x. This has a runtime cost, but the performance of the rust sabre implementation is fast enough that even with more trials the algorithm is not a bottleneck for typical compilation. Related to Qiskit#10160
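The trial counts the commit refers to are already exposed as arguments on the pass itself, so the same "more heuristic effort" idea can be tried by hand today. A sketch (the parameter values are illustrative, not the new defaults):

```python
from qiskit.circuit.library import EfficientSU2
from qiskit.transpiler import CouplingMap, PassManager
from qiskit.transpiler.passes import SabreLayout

circ = EfficientSU2(10, entanglement="circular").decompose()
cmap = CouplingMap.from_heavy_hex(7)  # 115Q heavy-hex lattice

# Defaults are max_iterations=3 and trial counts tied to the CPU count;
# dialing these up trades runtime for output quality.
layout_pass = SabreLayout(
    cmap, max_iterations=4, swap_trials=100, layout_trials=100, seed=12345
)
mapped = PassManager([layout_pass]).run(circ)  # SabreLayout also routes by default
print(mapped.count_ops())
```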
If we are specifically interested in transpiling all-ones BV circuits, then we can do a much better job by implementing the CX-logic of these circuits adhering to linear-nearest-neighbor connectivity. More precisely, the two circuits below are equivalent (while the second circuit has a perfect layout when the connectivity map of the device contains a line of the right size):
[two equivalent circuit diagrams omitted: the canonical fan-in BV form and its linear-nearest-neighbor rewrite]
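The diagrams did not survive, but the identity they illustrate can be reconstructed from the description: the canonical all-ones BV oracle is a fan-in of CNOTs onto the ancilla, which equals an up-and-down CNOT staircase that only touches nearest neighbours on a line. A small check of that reconstruction (mine, not the original figures):

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import Operator

n = 4  # data qubits; qubit n is the ancilla/target

fan_in = QuantumCircuit(n + 1)
for i in range(n):
    fan_in.cx(i, n)  # canonical form: every data qubit talks to the ancilla

staircase = QuantumCircuit(n + 1)
for i in range(n):              # compute the running parity down the line
    staircase.cx(i, i + 1)
for i in range(n - 2, -1, -1):  # uncompute the intermediate qubits
    staircase.cx(i, i + 1)

print(Operator(fan_in).equiv(Operator(staircase)))  # True
```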
BV is just an example that requires lots of swaps if done in the canonical manner. It can be rewritten using resets as a 2Q problem. The issue here is that the swap mappers seem to be having a harder time on circuits requiring swaps as the search space grows.
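The reset-based rewrite mentioned here is the standard trick of reusing one data qubit for every bit of the secret string, so all-ones BV needs only two qubits and no swaps at all. A sketch of that version (my reconstruction, not code from this thread):

```python
from qiskit import QuantumCircuit

n = 5  # number of secret bits (all ones)
qc = QuantumCircuit(2, n)
qc.x(1)
qc.h(1)  # ancilla in |-> for phase kickback
for i in range(n):
    qc.h(0)           # prepare the single data qubit
    qc.cx(0, 1)       # oracle bit is 1, so kick back a phase
    qc.h(0)
    qc.measure(0, i)  # each measurement yields 1 for the all-ones string
    if i < n - 1:
        qc.reset(0)   # reuse the same physical qubit for the next bit
```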
Environment
I have empirically noticed for some time that compiling circuits against smaller device topologies, e.g. a 16Q fake IBM backend vs. larger ones, yields improved output circuits, as measured by lower swap counts / fewer added 2Q gates. Intuitively this does not make much sense, since the smaller systems can be viewed as sub-graphs of the larger ones, and therefore the same optimal solutions found on the smaller graphs should be found within the larger topologies.
This is quite problematic because we are in a situation where we have large numbers of qubits, but the number of effectively usable qubits is less than the full device; that is exactly where this issue pops up.
What is happening?
To see this, I have plotted the number of added CX gates when compiling all-ones BV circuits, an EfficientSU2 ansatz with circular entanglement, and QFT circuits as a function of size against various backends. Here 250 transpilations were performed using `optimization_level=3`. Bars show one standard deviation of the resulting distribution of added 2Q gates. It is clear to see that a smaller device topology yields better circuits, or at least does no worse.

[plots of added CX gates vs circuit width for each backend omitted]

How can we reproduce the issue?
Repeat the above, or use other circuits.
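One possible minimal reproduction along these lines (assumed circuit and backend choices, and fewer trials than the 250 used for the plots, to keep it quick):

```python
import numpy as np
from qiskit import transpile
from qiskit.circuit.library import EfficientSU2
from qiskit.providers.fake_provider import (
    FakeGuadalupeV2,   # 16Q
    FakeMontrealV2,    # 27Q
    FakeWashingtonV2,  # 127Q
)

circ = EfficientSU2(12, entanglement="circular").decompose()
base = sum(1 for inst in circ.data if len(inst.qubits) == 2)

for backend in (FakeGuadalupeV2(), FakeMontrealV2(), FakeWashingtonV2()):
    added = []
    for seed in range(25):
        out = transpile(circ, backend, optimization_level=3, seed_transpiler=seed)
        added.append(sum(1 for inst in out.data if len(inst.qubits) == 2) - base)
    print(f"{backend.name}: mean added 2Q = "
          f"{np.mean(added):.1f} +/- {np.std(added):.1f}")
```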
What should happen?
An optimal circuit found on a smaller machine should be discoverable on a larger device, since the smaller topologies are sub-graphs of the larger ones.
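The sub-graph premise can be checked mechanically; a sketch using rustworkx's VF2 search, with the backend pair chosen for illustration:

```python
import rustworkx as rx
from qiskit.providers.fake_provider import FakeGuadalupeV2, FakeWashingtonV2

small = FakeGuadalupeV2().coupling_map.graph.to_undirected(multigraph=False)
large = FakeWashingtonV2().coupling_map.graph.to_undirected(multigraph=False)

# Is the 16Q coupling graph (non-induced) isomorphic to a subgraph of the 127Q one?
print(rx.is_subgraph_isomorphic(large, small, induced=False))  # expected: True
```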
Any suggestions?
Given that better solutions are directly linked to topology size, my guess is that this is an issue with search space size, where somehow the transpiler is not finding better results in the larger space. @ashsaki has looked into this a bit with respect to the initial layout.