
Sabre swap mapper dominated by data copy #5197

Closed
nonhermitian opened this issue Oct 8, 2020 · 6 comments · Fixed by #5316
Labels
type: enhancement It's working, but needs polishing

Comments

@nonhermitian
Contributor

Information

  • Qiskit Terra version: master
  • Python version:
  • Operating system:

What is the current behavior?

Running Sabre on a QV circuit over 20 qubits shows that data copying is the current bottleneck:

Abbreviated profile:

         9631133 function calls (8921864 primitive calls) in 4.184 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
497066/663    0.492    0.000    1.516    0.002 copy.py:132(deepcopy)
44375/860    0.312    0.000    1.499    0.002 copy.py:220(_deepcopy_tuple)
   389939    0.232    0.000    0.326    0.000 layout.py:97(__getitem__)
   191092    0.179    0.000    0.511    0.000 sabre_swap.py:326(<genexpr>)
55118/10740    0.160    0.000    1.537    0.000 copy.py:269(_reconstruct)
  1312882    0.112    0.000    0.112    0.000 bit.py:73(__hash__)
  1032203    0.107    0.000    0.108    0.000 {method 'get' of 'dict' objects}
44177/660    0.090    0.000    1.497    0.002 copy.py:237(_deepcopy_dict)
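The profile above shows roughly a third of the runtime inside `copy.deepcopy` and its helpers (`_deepcopy_tuple`, `_deepcopy_dict`, `_reconstruct`). A small illustrative micro-benchmark (not Qiskit code) shows why: deep-copying a node-shaped structure of tuples and dicts recursively reconstructs every contained object, while a shallow copy only duplicates the top-level container.

```python
# Illustrative micro-benchmark: deep vs. shallow copy of a structure shaped
# roughly like a DAG node (an op object, a qargs tuple, a params dict).
# The structure and field names are made up for illustration.
import copy
import timeit

node = {
    "op": object(),
    "qargs": tuple(range(5)),
    "params": {i: float(i) for i in range(10)},
}

deep = timeit.timeit(lambda: copy.deepcopy(node), number=10_000)
shallow = timeit.timeit(lambda: copy.copy(node), number=10_000)
print(f"deepcopy: {deep:.3f}s  shallow copy: {shallow:.3f}s")
```

On a typical interpreter the deep copy is an order of magnitude or more slower, which is consistent with `copy.py` dominating the profile when it is invoked once per gate.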

Steps to reproduce the problem

What is the expected behavior?

Suggested solutions

@nonhermitian nonhermitian added the type: enhancement It's working, but needs polishing label Oct 8, 2020
@mtreinish
Member

It actually depends on how large you scale things. For ~100 qubits or more it's typically the calls to bfs_successors that take the most time; at 1k qubits (because I had the profile flame graph handy) it gets bogged down in the coupling map's distance() (which is why I opened #5183). There's a lot in the sabre passes that needs to be tuned; it's not really performance optimized and scales pretty poorly.

@nonhermitian
Contributor Author

Hmm, BFS should be linear runtime if done right. Also, the distance calculation should probably be done only for those qubits that actually need evaluating at some point; most distances are usually not needed, and a dense million+ element array is getting a bit hefty.
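The on-demand alternative suggested here can be sketched as computing one row of the distance matrix at a time with a BFS from the source qubit, caching rows so repeated queries stay cheap. This is a hypothetical illustration, not the Qiskit implementation; the adjacency structure and function names are made up.

```python
# Hypothetical sketch: lazy per-source distance lookup instead of a dense
# n x n matrix. Rows are cached with lru_cache so each BFS runs once.
from collections import deque
from functools import lru_cache

# Adjacency list for a 5-qubit line coupling map: 0-1-2-3-4 (illustrative).
ADJ = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

@lru_cache(maxsize=None)
def distance_row(source):
    """BFS from `source`; returns a dict mapping qubit -> hop distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        q = queue.popleft()
        for nbr in ADJ[q]:
            if nbr not in dist:
                dist[nbr] = dist[q] + 1
                queue.append(nbr)
    return dist

def distance(a, b):
    return distance_row(a)[b]

print(distance(0, 4))  # 4 hops along the line
```

Only the rows actually queried are ever computed, so memory and time scale with the qubits the router touches rather than with n squared.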

@mtreinish
Member

IIRC, the bfs_successors one was because of a retworkx limitation: right now it computes the entire list of nodes and then returns that as a Python list (there is an issue open on this, Qiskit/rustworkx#71, but it's blocked on the Python binding lib). For modest-sized graphs that is still an order of magnitude faster, but as the graph grows sufficiently large it can become a bottleneck. Looking at the sabre code:

https://github.com/Qiskit/qiskit-terra/blob/master/qiskit/transpiler/passes/routing/sabre_swap.py#L264-L285

it is designed around an iterator like the one networkx returns. But with retworkx it spends an excessive amount of time waiting for the full list to be generated over many calls to bfs_successors, one for each node, which becomes a bottleneck as the circuit grows.
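The iterator-vs-list distinction matters because a consumer that stops early never pays for the rest of the traversal. A minimal sketch of a generator-style bfs_successors (hypothetical Python graph code, not retworkx or networkx internals):

```python
# Sketch: yielding BFS successor groups lazily, so a caller that only needs
# the first few layers never forces the full traversal. The graph and API
# shape here are illustrative.
from collections import deque

def bfs_successors(adj, start):
    """Yield (node, [successors]) pairs in BFS order from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        succ = [n for n in adj[node] if n not in seen]
        if succ:
            seen.update(succ)
            queue.extend(succ)
            yield node, succ

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
gen = bfs_successors(adj, 0)
print(next(gen))  # (0, [1, 2]) -- only the first layer has been computed
```

An API that materializes the whole result list up front does the same work as exhausting this generator on every call, even when the caller only inspects the first item.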

As for distance(), it's not actually the matrix generation where it was getting bogged down. Looking at the profile flame graph I have locally (it's an interactive svg which github doesn't understand, otherwise I'd upload it here), it's spending all its time in https://github.com/Qiskit/qiskit-terra/blob/master/qiskit/transpiler/coupling.py#L173 and https://github.com/Qiskit/qiskit-terra/blob/master/qiskit/transpiler/coupling.py#L175. That makes sense because physical_qubits is a list, not a set, so lookups like that traverse the list, which for >=1k qubits gets expensive, especially when it's called a lot. That should be easy to fix by just changing it to a set so it's a fixed-time lookup, or, even better, by assuming a contiguous set of indices and just checking that the index is < the length of the coupling map.

But I agree that after a certain size we'll have to switch to doing the distance calculation on the graph directly, on demand, instead of all at once in a big matrix. (Although I am quite proud that the parallel implementation of the distance matrix I wrote in Qiskit/rustworkx#166 can generate the 100k x 100k array for a 1000 x 100 grid coupling map in ~50 sec on my desktop.)
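The three membership checks discussed above have very different costs, which a small illustrative micro-benchmark (not Qiskit code) makes concrete: `qubit in list` is O(n) per call, while a set lookup or a plain bounds check is O(1).

```python
# Micro-benchmark sketch: list membership vs. set membership vs. bounds
# check for validating a qubit index. Names and sizes are illustrative.
import timeit

n = 1000
physical_list = list(range(n))
physical_set = set(physical_list)
qubit = n - 1  # worst case for the list: a full traversal

t_list = timeit.timeit(lambda: qubit in physical_list, number=100_000)
t_set = timeit.timeit(lambda: qubit in physical_set, number=100_000)
t_bounds = timeit.timeit(lambda: 0 <= qubit < n, number=100_000)
print(f"list: {t_list:.3f}s  set: {t_set:.3f}s  bounds: {t_bounds:.3f}s")
```

The bounds check is the cheapest and is valid whenever the indices are a contiguous range from 0 to n-1, which is the assumption the fix below relies on.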

@nonhermitian
Contributor Author

even better would be to assume a contiguous set of indices and just check that it's < length

Yes, this would be good. Also another reason to avoid going to the faulty qubits model in Qiskit. This should be possible already I think.

although I am quite proud that the parallel implementation of the distance matrix

The ability to set the parallel cutoff heuristic that you have there is also quite cool.

The bfs code also looks to be the usual one that scales well, so that's not the issue.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Oct 27, 2020
Previously, the CouplingMap.distance() method was checking whether the
qubit number was in the coupling graph by searching over a list for the
coupling map indices. This search becomes a bottleneck at large qubit
numbers because in the worst case it has to traverse the entire list to
find a match. This is also unnecessary because the coupling map indices
are always an ordered range from 0 to n for n qubits. This commit fixes
this bottleneck by updating the check to just error if the index is out
of bounds for the size of the graph.

Related to Qiskit#5197
@mtreinish
Member

For bfs_successors I just pushed up Qiskit/rustworkx#185, which improves things for that potential bottleneck. When I was looking at the profile of a QV 150x10 model circuit with sabre, I realized that most of the slowdown for bfs_successors was actually the rust -> python type conversion, because converting the list of (object, list(object)) pairs is expensive. So I implemented a custom return class that doesn't try to do that conversion in a single shot. The profile from before that commit was:

[profile screenshot before the commit: Screenshot_2020-10-28_12-13-49]

and after that commit it becomes:

[profile screenshot after the commit: Screenshot_2020-10-28_12-18-38]

It basically eliminates bfs_successors from being picked up by the sampling profile. The run time also decreases from 8 min 41 sec with just #5294 (that was just what my local branch was running when I was testing; I forgot to check out master) to 7 min 11 sec with Qiskit/rustworkx#185 as well.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Oct 29, 2020
This commit changes the deepcopy usage in the sabre_swap pass to a
shallow copy. In sabre_swap the deepcopy was run on the DAGCircuit node
when remapping it; however, this was not necessary because the node is
not actually used anywhere that shared references matter. The output
from the remapper is not used directly; instead, a copy of the DAGNode
object is just used as a container for the required args in
apply_operation_back(), where a new DAGNode is always created from the
op, qargs, etc. The remapping just replaces the qargs parameter with the
remapped one. So this commit changes the deepcopy to a shallow copy,
which won't have the performance overhead.

Fixes Qiskit#5197
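The pattern this commit describes can be sketched as follows. This is a hedged illustration with made-up class and function names, not the actual Qiskit internals: when the copied node only serves as a container for the arguments later handed to apply_operation_back(), a shallow copy with the qargs field replaced is sufficient, and nothing else in the node needs duplicating.

```python
# Sketch of deepcopy -> shallow copy for node remapping. `Node` and
# `remap` are illustrative stand-ins, not Qiskit's DAGNode API.
import copy
from dataclasses import dataclass

@dataclass
class Node:
    op: str        # shared by reference after a shallow copy
    qargs: tuple   # replaced wholesale on the copy

def remap(node, layout):
    new_node = copy.copy(node)  # shallow: `op` is shared, not rebuilt
    new_node.qargs = tuple(layout[q] for q in node.qargs)
    return new_node

node = Node(op="cx", qargs=(0, 1))
remapped = remap(node, {0: 2, 1: 3})
print(remapped.qargs)  # (2, 3); the original node is untouched
```

Because the copy's qargs tuple is replaced rather than mutated, the shared reference to the original op is harmless, which is exactly the condition the commit message says makes the deepcopy unnecessary.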
@mtreinish
Member

I pushed up #5316, which is tagged as fixing this issue. I did that because it fixes the copy bottleneck for small numbers of qubits outlined in this issue. But even with that there are still significant bottlenecks in sabre_swap, especially as things scale up. I've started to address those in #5294, #5183, and Qiskit/rustworkx#185. I'll need to evaluate where things stand after all of those are applied, but I'll open up a separate issue for that.

@mergify mergify bot closed this as completed in #5316 Nov 2, 2020
mergify bot added a commit that referenced this issue Nov 2, 2020

Fixes #5197

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
mergify bot pushed a commit that referenced this issue Nov 2, 2020

Fixes #5197

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a506ba9)
mergify bot added a commit that referenced this issue Nov 3, 2020

Fixes #5197

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
(cherry picked from commit a506ba9)

Co-authored-by: Matthew Treinish <[email protected]>
Co-authored-by: Luciano Bello <[email protected]>
mergify bot added a commit that referenced this issue Dec 4, 2020
* Remove list searches in CouplingMap.distance()

Related to #5197

* Cache size instead of calling each time

* Fix lint

Co-authored-by: Luciano Bello <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Kevin Krsulich <[email protected]>