
Sabre swap mapper dominated by data copy #5197

Closed
nonhermitian opened this issue Oct 8, 2020 · 6 comments · Fixed by #5316
Labels
type: enhancement It's working, but needs polishing

Comments

@nonhermitian
Contributor

Information

  • Qiskit Terra version: master
  • Python version:
  • Operating system:

What is the current behavior?

Running Sabre on a QV circuit over 20 qubits shows that data copying is the current bottleneck:

Abbreviated profile:

         9631133 function calls (8921864 primitive calls) in 4.184 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
497066/663    0.492    0.000    1.516    0.002 copy.py:132(deepcopy)
44375/860    0.312    0.000    1.499    0.002 copy.py:220(_deepcopy_tuple)
   389939    0.232    0.000    0.326    0.000 layout.py:97(__getitem__)
   191092    0.179    0.000    0.511    0.000 sabre_swap.py:326(<genexpr>)
55118/10740    0.160    0.000    1.537    0.000 copy.py:269(_reconstruct)
  1312882    0.112    0.000    0.112    0.000 bit.py:73(__hash__)
  1032203    0.107    0.000    0.108    0.000 {method 'get' of 'dict' objects}
44177/660    0.090    0.000    1.497    0.002 copy.py:237(_deepcopy_dict)
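The profile above shows roughly a third of the runtime inside `copy.deepcopy` and its helpers (`_deepcopy_tuple`, `_deepcopy_dict`, `_reconstruct`). A small illustrative micro-benchmark (not Qiskit code) shows why: deep-copying a node-shaped structure of tuples and dicts recursively reconstructs every contained object, while a shallow copy only duplicates the top-level container.

```python
# Illustrative micro-benchmark: deep vs. shallow copy of a structure shaped
# roughly like a DAG node (an op object, a qargs tuple, a params dict).
# The structure and field names are made up for illustration.
import copy
import timeit

node = {
    "op": object(),
    "qargs": tuple(range(5)),
    "params": {i: float(i) for i in range(10)},
}

deep = timeit.timeit(lambda: copy.deepcopy(node), number=10_000)
shallow = timeit.timeit(lambda: copy.copy(node), number=10_000)
print(f"deepcopy: {deep:.3f}s  shallow copy: {shallow:.3f}s")
```

On a typical interpreter the deep copy is an order of magnitude or more slower, which is consistent with `copy.py` dominating the profile when it is invoked once per gate.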

Steps to reproduce the problem

What is the expected behavior?

Suggested solutions

@nonhermitian nonhermitian added the type: enhancement It's working, but needs polishing label Oct 8, 2020
@mtreinish
Member

It actually depends on how large you scale things. For ~100 qubits or more it's typically the calls to bfs_successors that take the most time; at 1k qubits (because I had the profile flame graph handy) it gets bogged down in the coupling map's distance() (which is why I opened #5183). There's a lot in the sabre passes that needs to be tuned; it's not really performance optimized and scales pretty poorly.

@nonhermitian
Contributor Author

Hmm, BFS should be linear runtime if done right. Also, the distance calculation should probably be done only for those qubits that actually need evaluating at some point; most distances are usually not needed, and a dense million+ element array is getting a bit hefty.
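The on-demand alternative suggested here can be sketched as computing one row of the distance matrix at a time with a BFS from the source qubit, caching rows so repeated queries stay cheap. This is a hypothetical illustration, not the Qiskit implementation; the adjacency structure and function names are made up.

```python
# Hypothetical sketch: lazy per-source distance lookup instead of a dense
# n x n matrix. Rows are cached with lru_cache so each BFS runs once.
from collections import deque
from functools import lru_cache

# Adjacency list for a 5-qubit line coupling map: 0-1-2-3-4 (illustrative).
ADJ = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

@lru_cache(maxsize=None)
def distance_row(source):
    """BFS from `source`; returns a dict mapping qubit -> hop distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        q = queue.popleft()
        for nbr in ADJ[q]:
            if nbr not in dist:
                dist[nbr] = dist[q] + 1
                queue.append(nbr)
    return dist

def distance(a, b):
    return distance_row(a)[b]

print(distance(0, 4))  # 4 hops along the line
```

Only the rows actually queried are ever computed, so memory and time scale with the qubits the router touches rather than with n squared.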

@mtreinish
Member

IIRC, the bfs_successors one was because of a retworkx limitation: right now it computes the entire list of nodes and then returns that as a Python list (there is an issue open on this, Qiskit/rustworkx#71, but it's blocked on the Python binding lib). For modest-sized graphs that is still an order of magnitude faster, but as the graph grows sufficiently large it can become a bottleneck. Looking at the sabre code:

https://github.com/Qiskit/qiskit-terra/blob/master/qiskit/transpiler/passes/routing/sabre_swap.py#L264-L285

it is designed around an iterator like the one networkx returns. But with retworkx it spends an excessive amount of time waiting for the full list to be generated over many calls to bfs_successors, one for each node, which becomes a bottleneck as the circuit grows.
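The iterator-vs-list distinction matters because a consumer that stops early never pays for the rest of the traversal. A minimal sketch of a generator-style bfs_successors (hypothetical Python graph code, not retworkx or networkx internals):

```python
# Sketch: yielding BFS successor groups lazily, so a caller that only needs
# the first few layers never forces the full traversal. The graph and API
# shape here are illustrative.
from collections import deque

def bfs_successors(adj, start):
    """Yield (node, [successors]) pairs in BFS order from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        succ = [n for n in adj[node] if n not in seen]
        if succ:
            seen.update(succ)
            queue.extend(succ)
            yield node, succ

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
gen = bfs_successors(adj, 0)
print(next(gen))  # (0, [1, 2]) -- only the first layer has been computed
```

An API that materializes the whole result list up front does the same work as exhausting this generator on every call, even when the caller only inspects the first item.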

As for distance(), it's not actually the matrix generation where it was getting bogged down. Looking at the profile flame graph I have locally (it's an interactive svg which github doesn't understand, otherwise I'd upload it here), it's spending all its time in https://github.com/Qiskit/qiskit-terra/blob/master/qiskit/transpiler/coupling.py#L173 and https://github.com/Qiskit/qiskit-terra/blob/master/qiskit/transpiler/coupling.py#L175. That makes sense because physical_qubits is a list, not a set, so lookups like that traverse the list, which for >=1k qubits gets expensive, especially when it's called a lot. That should be easy to fix by just changing it to a set so it's a fixed-time lookup, or, even better, by assuming a contiguous set of indices and just checking that the index is < the length of the coupling map.

But I agree that after a certain size we'll have to switch to doing the distance calculation on the graph directly, on demand, instead of all at once in a big matrix. (Although I am quite proud that the parallel implementation of the distance matrix I wrote in Qiskit/rustworkx#166 can generate the 100k x 100k array for a 1000 x 100 grid coupling map in ~50 sec on my desktop.)
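The three membership checks discussed above have very different costs, which a small illustrative micro-benchmark (not Qiskit code) makes concrete: `qubit in list` is O(n) per call, while a set lookup or a plain bounds check is O(1).

```python
# Micro-benchmark sketch: list membership vs. set membership vs. bounds
# check for validating a qubit index. Names and sizes are illustrative.
import timeit

n = 1000
physical_list = list(range(n))
physical_set = set(physical_list)
qubit = n - 1  # worst case for the list: a full traversal

t_list = timeit.timeit(lambda: qubit in physical_list, number=100_000)
t_set = timeit.timeit(lambda: qubit in physical_set, number=100_000)
t_bounds = timeit.timeit(lambda: 0 <= qubit < n, number=100_000)
print(f"list: {t_list:.3f}s  set: {t_set:.3f}s  bounds: {t_bounds:.3f}s")
```

The bounds check is the cheapest and is valid whenever the indices are a contiguous range from 0 to n-1, which is the assumption the fix below relies on.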

@nonhermitian
Contributor Author

even better would be to assume a contiguous set of indices and just check that it's < length

Yes, this would be good. Also another reason to avoid going to the faulty qubits model in Qiskit. This should be possible already I think.

although I am quite proud that the parallel implementation of the distance matrix

The ability to set the parallel cutoff heuristic that you have there is also quite cool.

The bfs code also looks to be the usual one that scales well, so that's not the issue.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Oct 27, 2020
Previously, the CouplingMap.distance() method was checking whether the
qubit number was in the coupling graph by searching over a list for the
coupling map indices. This search becomes a bottleneck at large qubit
numbers because in the worst case it has to traverse the entire list to
find a match. This is also unnecessary because the coupling map indices
are always an ordered range from 0 to n for n qubits. This commit fixes
this bottleneck by updating the check to just error if the index is out
of bounds for the size of the graph.

Related to Qiskit#5197
@mtreinish
Member

For bfs_successors I just pushed up Qiskit/rustworkx#185, which improves things for that potential bottleneck. When I was looking at the profile of a QV 150x10 model circuit with sabre, I realized that most of the slowdown for bfs_successors was actually the rust -> python type conversion, because converting the list of (object, list(object)) pairs is expensive. So I implemented a custom return class that doesn't try to do that conversion in a single shot. The profile from before that commit was:

[profile screenshot before the commit: Screenshot_2020-10-28_12-13-49]

and after that commit it becomes:

[profile screenshot after the commit: Screenshot_2020-10-28_12-18-38]

It basically eliminates bfs_successors from being picked up by the sampling profile. The run time also decreases from 8 min 41 sec with just #5294 (that was just what my local branch was running when I was testing; I forgot to check out master) to 7 min 11 sec with Qiskit/rustworkx#185 as well.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Oct 29, 2020
This commit changes the deepcopy usage in the sabre_swap pass to a
shallow copy. In sabre_swap the deepcopy was run on the DAGCircuit node
when remapping it; however, this was not necessary because the node is
not actually used anywhere that shared references matter. The output
from the remapper is not used directly; instead, a copy of the DAGNode
object is just used as a container for the required args in
apply_operation_back(), where a new DAGNode is always created from the
op, qargs, etc. The remapping just replaces the qargs parameter with the
remapped one. So this commit changes the deepcopy to a shallow copy,
which won't have the performance overhead.

Fixes Qiskit#5197
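The pattern this commit describes can be sketched as follows. This is a hedged illustration with made-up class and function names, not the actual Qiskit internals: when the copied node only serves as a container for the arguments later handed to apply_operation_back(), a shallow copy with the qargs field replaced is sufficient, and nothing else in the node needs duplicating.

```python
# Sketch of deepcopy -> shallow copy for node remapping. `Node` and
# `remap` are illustrative stand-ins, not Qiskit's DAGNode API.
import copy
from dataclasses import dataclass

@dataclass
class Node:
    op: str        # shared by reference after a shallow copy
    qargs: tuple   # replaced wholesale on the copy

def remap(node, layout):
    new_node = copy.copy(node)  # shallow: `op` is shared, not rebuilt
    new_node.qargs = tuple(layout[q] for q in node.qargs)
    return new_node

node = Node(op="cx", qargs=(0, 1))
remapped = remap(node, {0: 2, 1: 3})
print(remapped.qargs)  # (2, 3); the original node is untouched
```

Because the copy's qargs tuple is replaced rather than mutated, the shared reference to the original op is harmless, which is exactly the condition the commit message says makes the deepcopy unnecessary.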
@mtreinish
Member

I pushed up #5316, which is tagged as fixing this issue. I did that because it fixes the copy bottleneck for small numbers of qubits outlined in this issue. But even with that there are still significant bottlenecks in sabre_swap, especially as things scale up. I've started to address those in #5294, #5183, and Qiskit/rustworkx#185. I'll need to evaluate where things stand after all of those are applied, but I'll open up a separate issue for that.

@mergify mergify bot closed this as completed in #5316 Nov 2, 2020
mergify bot added a commit that referenced this issue Nov 2, 2020

Fixes #5197

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
mergify bot pushed a commit that referenced this issue Nov 2, 2020

Fixes #5197

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a506ba9)
mergify bot added a commit that referenced this issue Nov 3, 2020

Fixes #5197

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a506ba9)

Co-authored-by: Matthew Treinish <[email protected]>
Co-authored-by: Luciano Bello <[email protected]>
mergify bot added a commit that referenced this issue Dec 4, 2020
* Remove list searches in CouplingMap.distance()

Related to #5197

* Cache size instead of calling each time

* Fix lint

Co-authored-by: Luciano Bello <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Kevin Krsulich <[email protected]>