Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further oxidize sabre #8388

Merged
merged 29 commits into from
Aug 22, 2022
Merged

Further oxidize sabre #8388

merged 29 commits into from
Aug 22, 2022

Conversation

mtreinish
Copy link
Member

@mtreinish mtreinish commented Jul 21, 2022

Summary

In #7977 we started the process of oxidizing SabreSwap by replacing the
inner-most scoring heuristic loop with a rust routine. This greatly
improved the overall performance and scaling of the transpiler pass.
Continuing from where that started this commit migrates more of the pass
into the Rust domain so that almost all the pass's operations are done
inside a rust module and all that is returned is a list of swaps to run
prior to each 2q gate. This should further improve the runtime
performance of the pass and scaling because the only steps performed in
Python are generating the input data structures and then replaying the
circuit with SWAPs inserted at the appropriate points.

While we could have stuck with #7977 as the performance of the pass was
more than sufficient after it. What this commit really enables by moving
most of the pass to the rust domain is to expand with improvements and
expansion of the sabre algorithm which will require multithreaded to be
efficiently implemented. So while this will have some modest performance
improvements this is more about setting the stage for introducing
variants of SabreSwap that do more thorough analysis in the future
(which were previously precluded by the parallelism limitations of python).

Details and comments

TODO:

  • Fix test failures
  • Benchmark and tune performance

In Qiskit#7977 we started the process of oxidizing SabreSwap by replacing the
inner-most scoring heuristic loop with a rust routine. This greatly
improved the overall performance and scaling of the transpiler pass.
Continuing from where that started this commit migrates more of the pass
into the Rust domain so that almost all the pass's operations are done
inside a rust module and all that is returned is a list of swaps to run
prior to each 2q gate. This should further improve the runtime
performance of the pass and scaling because the only steps performed in
Python are generating the input data structures and then replaying the
circuit with SWAPs inserted at the appropriate points.

While we could have stuck with Qiskit#7977 as the performance of the pass was
more than sufficient after it. What this commit really enables by moving
most of the pass to the rust domain is to expand with improvments and
expansion of the sabre algorithm which will require multithreaded to be
efficiently implemented. So while this will have some modest performance
improvements this is more about setting the stage for introducing
variants of SabreSwap that do more thorough analysis in the future
(which were previously preculded by the parallelism limitations of python).
@mtreinish mtreinish added on hold Can not fix yet performance Rust This PR or issue is related to Rust code in the repository labels Jul 21, 2022
@mtreinish mtreinish requested a review from a team as a code owner July 21, 2022 16:29
@qiskit-bot
Copy link
Collaborator

Thank you for opening a new pull request.

Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient.

While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone.

One or more of the the following people are requested to review this:

This commit fixes a small typo/logic error in the algorithm
implementation that was preventing sabre from making forward progress
because it wasn't correctly identifying successors for the next layer.
By fixing this all the hard errors in the SabreSwap tests are fixed. The
only failures left seem to be related to a different layout which
hopefully is not a correctness issue but just caused by different
ordering.
In some tests there were subtle differences in the relative positioning
of the 1q gates relative to inserted swaps (i.e. a 1q gate which was
before the swap previously could move to after it). This was caused by
different topological ordering being used between the hybrid python sabre
implementation and the mostly rust sabre implementations. To ensure a
consistent ordering fater moving mostly to rust this changes the
swap insertion loop to iterate over the circuit layers which mirrors how
the old sabre implementation worked.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Aug 16, 2022
This commit updates the preset pass manager construction to use the
SabreLayout and SabreSwap passes by default for optimization level 1 and
level 2. With the recently merged Qiskit#7977 the performance of the sabre
swap pass has improved significantly enough to be considered for use by
default with optimization levels 1 and 2. While for small numbers of
target device qubits (< 30) the SabreLayout/SabreSwap pass doesn't quite
match the runtime performance of DenseLayout/StochasticSwap it typically
has better runtime performance for larger target devices. Additionally,
the runtime performance of Sabre should also improve further after Qiskit#8388
is finished. However, the output quality from the sabre passes is
typically better resulting in fewer swap gates being inserted. With the
combination of better quality and comparable runtime performance it
makes sense to use sabre as the default for optimization levels 1 and 2.
For optimization level 0 stochastic swap is still used there because we
want to continue to leverage TrivialLayout for that level and to get
the full quality advantages SabreSwap and SabreLayout should be used
together.
This commit fixes an issue where in some cases the topological order the
DAGCircuit is traversed is different from the topological order that
sabre uses internally. The build_swap_map sabre swap function is only
valid if the 2q gates are replayed in the same exact order when
rebuilding the DAGCircuit. If a 2q gate gets replayed in a different
order the layout mapping will cause the circuit to diverge and
potentially be invalid. This commit updates the replay logic in the
python side to detect when the topological order over the dagcircuit
differs from the sabre traversal order and attempts to correct it.
@mtreinish
Copy link
Member Author

mtreinish commented Aug 16, 2022

This is still failing 5 tests locally, I'm thinking I might want to revert 2da3050 and just insert the non-2q gates into the sabre dag and handle an arbitrary number of qargs and just return the full traversal order instead of trying to recreate it with just the 2q component of the circuit. But I need to do some more digging into the last 5 failures to see what is going on there.

Edit: This was done in 45dc04c

Previously we attempted to just have the rust component of sabre deal
solely with the 2q component of the input circuit. However, while this
works for ~80% of the cases it fails to account ordering and
interactions between non-2q gates or instructions with classical bits.
To address this the sabre dag structure is modified to contain all
isntructions in the input circuit and structurally match the
DAGCircuit's edges. This fixes most of the issues related to gate
ordering the previous implementation was encountering. It also
simplifies the swap insertion/replay of the circuit in the python side
as we now get an exact application order from the rust code.
@mtreinish
Copy link
Member Author

mtreinish commented Aug 17, 2022

I ran some preliminary asv comparison benchmarks with 45dc04c against main. The results below are showing quite a promising speed up especially for smaller number of qubits. Where there is a regression I think most of the difference can be attributed to differing (but still valid) results (which is also the cause of last 3 test failures) and sabre doing more work by going down a worse path. I'd like to get to the bottom of where the divergence is happening to try and make this PR not effect the output vs main to make A/B comparisons easier.

Benchmarks that have improved:

       before           after         ratio
     [74c27f83]       [45dc04cc]
     <main>       <rusty-sabre>
-              61               55     0.90  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'sabre')
-         557±2ms          495±4ms     0.89  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'sabre')
-         519±3ms        461±0.9ms     0.89  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'sabre')
-         490±2ms          434±4ms     0.89  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'sabre')
-       354±0.3ms          313±1ms     0.88  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'sabre')
-             582              513     0.88  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(3, 'sabre')
-         598±2ms          527±3ms     0.88  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')
-         361±1ms          315±1ms     0.87  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'sabre')
-         436±3ms          377±5ms     0.87  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'sabre')
-             343              296     0.86  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(0, 'sabre')
-              54               46     0.85  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'sabre')
-         673±7ms        573±0.8ms     0.85  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'sabre')
-         448±1ms          373±2ms     0.83  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'sabre')
-             807              663     0.82  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(1, 'sabre')
-         712±5ms          579±5ms     0.81  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre')
-             124               88     0.71  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(1, 'sabre')
-              54               38     0.70  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'sabre')
-             731              514     0.70  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(2, 'sabre')
-         1.40±0s          970±3ms     0.70  queko.QUEKOTranspilerBench.time_transpile_bss(2, 'sabre')
-      49.5±0.5ms         33.0±4ms     0.67  queko.QUEKOTranspilerBench.time_transpile_bigd(1, 'sabre')
-             120               78     0.65  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(2, 'sabre')
-         1.18±0s          749±4ms     0.64  queko.QUEKOTranspilerBench.time_transpile_bss(1, 'sabre')
-             142               87     0.61  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(3, 'sabre')
-         639±3ms          378±4ms     0.59  queko.QUEKOTranspilerBench.time_transpile_bss(0, 'sabre')
-      1.16±0.02s          490±4ms     0.42  mapping_passes.PassBenchmarks.time_sabre_layout(5, 1024)
-         164±4ms       53.8±0.5ms     0.33  mapping_passes.PassBenchmarks.time_sabre_swap(5, 1024)
-      4.52±0.01s       1.45±0.01s     0.32  mapping_passes.PassBenchmarks.time_sabre_layout(14, 1024)
-      7.76±0.02s       2.29±0.01s     0.29  mapping_passes.PassBenchmarks.time_sabre_layout(20, 1024)
-      1.30±0.01s         383±20ms     0.29  mapping_passes.PassBenchmarks.time_sabre_swap(20, 1024)
-         727±1ms         14.1±7ms     0.02  mapping_passes.PassBenchmarks.time_sabre_swap(14, 1024)

Benchmarks that have stayed the same:

       before           after         ratio
     [74c27f83]       [45dc04cc]
     <main>       <rusty-sabre>
          648±2ms          710±7ms     1.10  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'noise_adaptive')
              524              574     1.10  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(2, 'sabre')
              275              297     1.08  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'noise_adaptive')
              278              299     1.08  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'dense')
               55               59     1.07  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'dense')
               55               59     1.07  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'dense')
               69               73     1.06  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'sabre')
              210              222     1.06  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'sabre')
              226              238     1.05  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'sabre')
              478              502     1.05  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'sabre')
          839±4ms          880±3ms     1.05  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre')
              217              227     1.05  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'noise_adaptive')
              218              228     1.05  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'dense')
              223              233     1.04  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'noise_adaptive')
              224              234     1.04  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'dense')
              506              526     1.04  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'sabre')
              519              538     1.04  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'dense')
          1.57±0s       1.62±0.01s     1.03  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'noise_adaptive')
              281              289     1.03  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'sabre')
               73               75     1.03  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'noise_adaptive')
              220              226     1.03  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'sabre')
              524              537     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'noise_adaptive')
              531              544     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'dense')
              214              218     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'sabre')
              542              551     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'noise_adaptive')
          693±7ms          700±7ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')
              640              644     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'sabre')
              683              687     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'noise_adaptive')
          710±2ms          714±5ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')
              639              642     1.00  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(3, 'sabre')
       1.20±0.01s        1.20±0.2s     1.00  queko.QUEKOTranspilerBench.time_transpile_bntf(2, 'sabre')
          336±2ms          337±3ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')
          291±3ms        291±0.9ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'noise_adaptive')
                7                7     1.00  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)
                7                7     1.00  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3)
              262              262     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'dense')
              273              273     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'noise_adaptive')
              210              210     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'sabre')
               68               68     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'dense')
               60               60     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'noise_adaptive')
               56               56     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'sabre')
              610              610     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'dense')
              622              622     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'noise_adaptive')
              626              626     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'sabre')
              563              563     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'sabre')
              521              521     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'sabre')
              502              502     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'sabre')
              754              754     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'sabre')
          398±2ms          398±1ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'dense')
          273±2ms          273±4ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'noise_adaptive')
        335±0.5ms          335±1ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'noise_adaptive')
        290±0.7ms        290±0.4ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'dense')
          303±2ms          301±4ms     0.99  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3)
          275±2ms          273±3ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'dense')
          288±2ms          284±4ms     0.99  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)
              309              305     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'sabre')
          337±2ms          332±2ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'dense')
       1.38±0.01s       1.36±0.01s     0.98  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'sabre')
             1630             1600     0.98  transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)
          335±3ms          328±7ms     0.98  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'noise_adaptive')
          412±3ms        403±0.9ms     0.98  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'noise_adaptive')
       1.54±0.01s       1.50±0.01s     0.98  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')
              672              655     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'dense')
               75               73     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'sabre')
          429±2ms          417±4ms     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'noise_adaptive')
       2.12±0.02s       2.05±0.01s     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'sabre')
          343±1ms        332±0.3ms     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'dense')
          297±1ms          287±1ms     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'noise_adaptive')
          433±1ms          419±3ms     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')
               58               56     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'noise_adaptive')
          347±3ms          335±3ms     0.96  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'noise_adaptive')
        301±0.4ms          290±1ms     0.96  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'dense')
          333±4ms          319±1ms     0.96  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'sabre')
          308±2ms          295±2ms     0.96  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'dense')
          424±2ms          403±2ms     0.95  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'sabre')
       1.75±0.01s       1.66±0.01s     0.95  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')
          312±1ms        296±0.4ms     0.95  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'noise_adaptive')
               56               53     0.95  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'sabre')
       27.1±0.1ms         25.6±7ms     0.94  queko.QUEKOTranspilerBench.time_transpile_bigd(0, 'sabre')
              583              550     0.94  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(1, 'sabre')
          298±2ms          281±2ms     0.94  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'sabre')
       9.25±0.04s       8.68±0.02s     0.94  transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)
          376±2ms          352±2ms     0.93  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'sabre')
          469±3ms          437±4ms     0.93  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'sabre')
               58               54     0.93  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'noise_adaptive')
          375±1ms          349±1ms     0.93  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'sabre')
        333±0.6ms          309±2ms     0.93  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'sabre')
              294              271     0.92  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'sabre')
          437±2ms          402±2ms     0.92  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'sabre')
       4.27±0.02s       3.93±0.09s     0.92  queko.QUEKOTranspilerBench.time_transpile_bntf(3, 'sabre')
               56               51     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'sabre')
       6.00±0.01s       5.22±0.01s    ~0.87  queko.QUEKOTranspilerBench.time_transpile_bss(3, 'sabre')
       65.3±0.2ms        48.9±10ms    ~0.75  queko.QUEKOTranspilerBench.time_transpile_bigd(2, 'sabre')
          333±4ms         248±60ms    ~0.74  queko.QUEKOTranspilerBench.time_transpile_bntf(0, 'sabre')
          273±2ms         200±40ms    ~0.73  queko.QUEKOTranspilerBench.time_transpile_bigd(3, 'sabre')

Benchmarks that have got worse:

       before           after         ratio
     [74c27f83]       [45dc04cc]
     <main>       <rusty-sabre>
+        811±10ms        1.07±0.1s     1.32  queko.QUEKOTranspilerBench.time_transpile_bntf(1, 'sabre')
+             385              499     1.30  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)
+             776              931     1.20  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(0, 'sabre')
+         1.27±0s          1.50±0s     1.18  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'dense')
+              64               75     1.17  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'dense')
+      1.79±0.01s       2.09±0.01s     1.17  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre')
+             927             1060     1.14  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(0, 'sabre')
+         1.14±0s       1.30±0.01s     1.14  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'noise_adaptive')
+         1.86±0s       2.07±0.01s     1.11  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)

@mtreinish mtreinish changed the title [WIP] Further oxidize sabre Further oxidize sabre Aug 17, 2022
The edge weights in the SabreDAG struct were set to the qubit indices
from the input DAGCircuit because the edges represent the flow of data
on the qubit. However, we never actually inspect the edge weights and
all having them present does is use extra memory. This commit changes
SabreDAG to just not set any weight for edges as all we need is the
source and target nodes for the algorithm to work.
kevinhartman
kevinhartman previously approved these changes Aug 19, 2022
Copy link
Contributor

@kevinhartman kevinhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I may have some follow-on comments on code simplicity if I get a chance to look further, but I don't see any glaring issues, and tests are passing so I'm confident this is a good start / something we can iterate on.

Copy link
Member

@jakelishman jakelishman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to look more closely at this at some point when I had time, but since Kevin's also looking, here's the couple of things I spotted before I realised it was too big for me to look at immediately.

qiskit/transpiler/passes/routing/sabre_swap.py Outdated Show resolved Hide resolved
Comment on lines 432 to 440
#[pymodule]
pub fn sabre_swap(_py: Python, m: &PyModule) -> PyResult<()> {
m.add_wrapped(wrap_pyfunction!(sabre_score_heuristic))?;
m.add_wrapped(wrap_pyfunction!(build_swap_map))?;
m.add_class::<Heuristic>()?;
m.add_class::<EdgeList>()?;
m.add_class::<QubitsDecay>()?;
m.add_class::<NeighborTable>()?;
m.add_class::<SabreRng>()?;
m.add_class::<SabreDAG>()?;
m.add_class::<SwapMap>()?;
Ok(())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm: since _accelerate is private itself, we're not making any guarantees about the stability of the interface, right? I agree with that approach, but just want to make sure we're on the same page.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we're on the same page, it's explicitly private and the intent is to not treat this as a stable interface, it's just for internal terra usage (although in this case the removals also haven't been released yet). I think if for some reason we ever did want to treat anything as stable from rust directly we'd re-export it under a different subpackage (thinking maybe for some quantum_info thing in the future)

src/sabre_swap/swap_map.rs Outdated Show resolved Hide resolved
src/sabre_swap/sabre_dag.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@kevinhartman kevinhartman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mergify mergify bot merged commit 8ce6e0a into Qiskit:main Aug 22, 2022
@mtreinish mtreinish deleted the rusty-sabre branch August 22, 2022 15:46
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Aug 22, 2022
The SabreSwap algorithm's output is quite linked to the random seed used
to run the algorithm. Typically to get the best result a user will run
the pass (or the full transpilation) multiple times with different seeds
and pick the best output to get a better result. Since Qiskit#8388 the
SabreSwap pass has moved mostly the domain of Rust. This enables us to
leverage multithreading easily to run parallel sabre over multiple seeds
and pick the best result. This commit adds a new argument trials to the
SabreSwap pass which is used to specify the number of random seed trials
to run sabre with. Each trial will perform a complete run of the sabre
algorithm and compute the swaps necessary for the algorithm. Then the
result with the least number of swaps will be selected and used as the
swap mapping for the pass.
mergify bot added a commit that referenced this pull request Sep 1, 2022
* Enable multiple parallel seed trials for SabreSwap

The SabreSwap algorithm's output is quite linked to the random seed used
to run the algorithm. Typically to get the best result a user will run
the pass (or the full transpilation) multiple times with different seeds
and pick the best output to get a better result. Since #8388 the
SabreSwap pass has moved mostly the domain of Rust. This enables us to
leverage multithreading easily to run parallel sabre over multiple seeds
and pick the best result. This commit adds a new argument trials to the
SabreSwap pass which is used to specify the number of random seed trials
to run sabre with. Each trial will perform a complete run of the sabre
algorithm and compute the swaps necessary for the algorithm. Then the
result with the least number of swaps will be selected and used as the
swap mapping for the pass.

* Make parallel case fully deterministic

The parallel trial code was potentially non-deterministic in it's
execution because the way the parallel trials were compared was
dependent on execution speed of the pass. This could result in a
different output if results with equal number of swaps finished
executing in differing amounts of time between runs. This commit
addresses this by first collecting the results into an ordered Vec
first which is then iterated over serially to find the minimum swap
count. This will make the output independent of execution speed of the
individual trials.

* Fix test failures

This commit updates tests which started to fail because of the different
RNG behavior used by the parallel SabreSwap seed trials. For the most
part these are just mechanical changes that either changed the expected
layout with a fixed seed or updating a fixed seed so the output matches
the expected result. The one interesting case was the
TestTranspileLevelsSwap change which was caused by different swaps being
inserted that for optimization level enabled the 2q block optimization
passes to synthesize away the swap as part of its optimization. This was
fixed by changing the seed, but it was a different case than the other
failures.

* Add swap_trials argument to SabreLayout

This commit adds a swap_trials argument to the SabreLayout pass so that
users can control how many trials to run in SabreSwap internally. This
is necessary for reproducibility between systems for the same reason
it's required on SabreSwap.

* Add comment explaining the intermediate Vec usage

* Update layout in new test

* Update releasenotes/notes/multiple-parallel-rusty-sabres-32bc93f79ae48a1f.yaml

Co-authored-by: Kevin Hartman <[email protected]>

* Remove intermediate Vec for parallel trials

In an earlier commit we switched the parallel iterator to collect into
an intermediate `Vec` to ensure the output result was deterministic. The
min_by_key() will have a degree of non-determinism for equal entries as
the parallel iterator's threads finish. However, collecting to a Vec
isn't necessary as we can use the index as an element in a 2 element
array we can get the deterministic evaluation and avoid the overhead of
collecting into a `Vec`. This commit makes this change to improve the
performance of the parallel execution path.

Co-authored-by: Kevin Hartman <[email protected]>

Co-authored-by: Kevin Hartman <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Sep 12, 2022
In Qiskit#8388 we moved all of the SabreSwap pass's processing into rust. To
facilitate that we had to create a rust DAG data structure to represent
the data flow graph solely in Rust to enable analyzing the DAG structure
in rust code without having to callback to python. To do this the
DAGCircuit (which has a nearly identical representation in python) would
export a list of nodes and the bits (both quantum and classical) that
they operated on. This information is then used to create the SabreDAG
used in the rust code. However, in this process an edge case was missed
with classical condtions. If a condition was specified solely as a
property of the operation and not in cargs list the SabreDAG would not
know about the classical bit dependency between any subsequent
operations involving that classical bit. This would cause incorrect
output because the ndoes would not get processed as they were in the
circuit. This commit fixes this issue by explicitly checking if there is
a condition on the operation and there are no cargs and if so adding the
carg bits to the SabreDAG directly. This fixes the incorrect
representation in SabreDAG.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Sep 12, 2022
In Qiskit#8388 we moved all of the SabreSwap pass's processing into rust. To
facilitate that we had to create a rust DAG data structure to represent
the data flow graph solely in Rust to enable analyzing the DAG structure
in rust code without having to callback to python. To do this the
DAGCircuit (which has a nearly identical representation in python) would
export a list of nodes and the bits (both quantum and classical) that
they operated on. This information is then used to create the SabreDAG
used in the rust code. However, in this process an edge case was missed
with classical condtions. If a condition was specified solely as a
property of the operation and not in cargs list the SabreDAG would not
know about the classical bit dependency between any subsequent
operations involving that classical bit. This would cause incorrect
output because the ndoes would not get processed as they were in the
circuit. This commit fixes this issue by explicitly checking if there is
a condition on the operation and there are no cargs and if so adding the
carg bits to the SabreDAG directly. This fixes the incorrect
representation in SabreDAG.

Fixes Qiskit#8675
mergify bot added a commit that referenced this pull request Sep 16, 2022
* Fix handling of conditions in SabreDAG creation

In #8388 we moved all of the SabreSwap pass's processing into rust. To
facilitate that we had to create a rust DAG data structure to represent
the data flow graph solely in Rust to enable analyzing the DAG structure
in rust code without having to callback to python. To do this the
DAGCircuit (which has a nearly identical representation in python) would
export a list of nodes and the bits (both quantum and classical) that
they operated on. This information is then used to create the SabreDAG
used in the rust code. However, in this process an edge case was missed
with classical condtions. If a condition was specified solely as a
property of the operation and not in cargs list the SabreDAG would not
know about the classical bit dependency between any subsequent
operations involving that classical bit. This would cause incorrect
output because the ndoes would not get processed as they were in the
circuit. This commit fixes this issue by explicitly checking if there is
a condition on the operation and there are no cargs and if so adding the
carg bits to the SabreDAG directly. This fixes the incorrect
representation in SabreDAG.

Fixes #8675

* Fix handling of instructions with condition and cargs

This commit fixes the logic for handling instructions that have both
cargs and conditions set. The previous commit fixed the behavior only if
cargs was mutually exclusive with having classical arguments. However,
the same bug this PR is fixing would persist for instructions with clbit
arguments and a condition. This fixes the behavior to correctly
represent the data dependency in SabreDAG for these cases.

* Improve efficiency of condition handling

This commit improves the efficiency of the condition handling on
SabreDAG creation that was added in the previous commit. Previously, it
was potentially expensive to loop over the list of cargs and condition
bits as for instructions with a large number of both the repeated
duplicate searches would get quite expensive. This commit fixes that by
converting the cargs data structure to be a set to prevent adding
duplicates. Since the order isn't significant for the cargs in sabre as
it's only used to represent data dependency between instructions we can
convert it to internally use a set in python and a HashSet in rust. This
also removes storing of cargs for each node in the SabreDAG as this was
never used and just wasted memory by storing the list and never using
it.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
mergify bot added a commit that referenced this pull request Sep 29, 2022
* Use Sabre by default for optimization levels 1 and 2

This commit updates the preset pass manager construction to use the
SabreLayout and SabreSwap passes by default for optimization level 1 and
level 2. With the recently merged #7977 the performance of the sabre
swap pass has improved significantly enough to be considered for use by
default with optimization levels 1 and 2. While for small numbers of
target device qubits (< 30) the SabreLayout/SabreSwap pass doesn't quite
match the runtime performance of DenseLayout/StochasticSwap it typically
has better runtime performance for larger target devices. Additionally,
the runtime performance of Sabre should also improve further after #8388
is finished. However, the output quality from the sabre passes is
typically better resulting in fewer swap gates being inserted. With the
combination of better quality and comparable runtime performance it
makes sense to use sabre as the default for optimization levels 1 and 2.
For optimization level 0 stochastic swap is still used there because we
want to continue to leverage TrivialLayout for that level and to get
the full quality advantages SabreSwap and SabreLayout should be used
together.

* Fix pickling of SabreSwap object

In #7977 we moved to use compiled objects for part of the SabreSwap
compiler pass. However an unintended side effect of that PR was the use
of Rust objects stored in instance level variables which weren't
pickleable. This breaks multiprocessing at the PassManager level which
expects to be able to pickle and send a SabreSwap object to the
subprocess running on a circuit. This commit fixes this by making the
Rust NeighborTable object pickleable and switching to storing the
heuristic string at the instance level instead of the heuristic enum.

* Update layout tests to match new default

This commit updates a failing layout test which was assuming that level
1 and level 2 where still running DenseLayout. The test has been updated
to reflect the new default of SabreLayout.

* Fix stochastic swap specific test to use that routing method

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Nov 9, 2022
This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in Qiskit#7977, Qiskit#8388,
and Qiskit#8572. Internally, when the routing_pass argument is not set
SabreLayout will perform the full sabre algorithm both layout selection
and final swap mapping in rust and return the selected initial layout,
the final layout, the toplogical sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage to the
preset pass managers remains intact, it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit is
already matching the connectivity constraints.

Besides just operating more quickly because the heavy lifting of the
algorithm operates more efficiently in a compiled language, doing this
in rust also lets change our parallelization model for running multiple
seed in Sabre. Just as in Qiskit#8572 we added support for SabreSwap to run
multiple parallel trials with different seeds this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination and the total minimal swap count seed is used.
So for example if you set swap_trials=5 and layout_trails=5 that will run
5 tasks in the threadpool with 5 different seeds for the outer layout run.
Inside that every time sabre swap is run (which will be multiple times
as part of layout plus the final routing run) it tries 5 different seeds
for each execution serially inside that parallel task. This should
hopefully further improve the quality of the transpiler output and better
match expectations for users who were previously calling transpile()
multiple times to emulate this behavior.

Implements Qiskit#9090
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Nov 10, 2022
This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in Qiskit#7977, Qiskit#8388,
and Qiskit#8572. Internally, when the routing_pass argument is not set
SabreLayout will perform the full sabre algorithm both layout selection
and final swap mapping in rust and return the selected initial layout,
the final layout, the toplogical sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage to the
preset pass managers remains intact, it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit is
already matching the connectivity constraints.

Besides just operating more quickly because the heavy lifting of the
algorithm operates more efficiently in a compiled language, doing this
in rust also lets change our parallelization model for running multiple
seed in Sabre. Just as in Qiskit#8572 we added support for SabreSwap to run
multiple parallel trials with different seeds this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination and the total minimal swap count seed is used.
So for example if you set swap_trials=5 and layout_trails=5 that will run
5 tasks in the threadpool with 5 different seeds for the outer layout run.
Inside that every time sabre swap is run (which will be multiple times
as part of layout plus the final routing run) it tries 5 different seeds
for each execution serially inside that parallel task. This should
hopefully further improve the quality of the transpiler output and better
match expectations for users who were previously calling transpile()
multiple times to emulate this behavior.

Implements Qiskit#9090
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Nov 10, 2022
This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in Qiskit#7977, Qiskit#8388,
and Qiskit#8572. Internally, when the routing_pass argument is not set
SabreLayout will perform the full sabre algorithm both layout selection
and final swap mapping in rust and return the selected initial layout,
the final layout, the toplogical sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage to the
preset pass managers remains intact, it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit is
already matching the connectivity constraints.

Besides just operating more quickly because the heavy lifting of the
algorithm operates more efficiently in a compiled language, doing this
in rust also lets change our parallelization model for running multiple
seed in Sabre. Just as in Qiskit#8572 we added support for SabreSwap to run
multiple parallel trials with different seeds this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination and the total minimal swap count seed is used.
So for example if you set swap_trials=5 and layout_trails=5 that will run
5 tasks in the threadpool with 5 different seeds for the outer layout run.
Inside that every time sabre swap is run (which will be multiple times
as part of layout plus the final routing run) it tries 5 different seeds
for each execution serially inside that parallel task. This should
hopefully further improve the quality of the transpiler output and better
match expectations for users who were previously calling transpile()
multiple times to emulate this behavior.

Implements Qiskit#9090
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Nov 10, 2022
This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in Qiskit#7977, Qiskit#8388,
and Qiskit#8572. Internally, when the routing_pass argument is not set
SabreLayout will perform the full sabre algorithm both layout selection
and final swap mapping in rust and return the selected initial layout,
the final layout, the toplogical sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage to the
preset pass managers remains intact, it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit is
already matching the connectivity constraints.

Besides just operating more quickly because the heavy lifting of the
algorithm operates more efficiently in a compiled language, doing this
in rust also lets change our parallelization model for running multiple
seed in Sabre. Just as in Qiskit#8572 we added support for SabreSwap to run
multiple parallel trials with different seeds this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination and the total minimal swap count seed is used.
So for example if you set swap_trials=5 and layout_trails=5 that will run
5 tasks in the threadpool with 5 different seeds for the outer layout run.
Inside that every time sabre swap is run (which will be multiple times
as part of layout plus the final routing run) it tries 5 different seeds
for each execution serially inside that parallel task. This should
hopefully further improve the quality of the transpiler output and better
match expectations for users who were previously calling transpile()
multiple times to emulate this behavior.

Implements Qiskit#9090
@mtreinish mtreinish mentioned this pull request Nov 10, 2022
4 tasks
mergify bot added a commit that referenced this pull request Dec 8, 2022
* Oxidize SabreLayout pass

This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in #7977, #8388,
and #8572. Internally, when the routing_pass argument is not set
SabreLayout will perform the full sabre algorithm both layout selection
and final swap mapping in rust and return the selected initial layout,
the final layout, the toplogical sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage to the
preset pass managers remains intact, it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit is
already matching the connectivity constraints.

Besides just operating more quickly because the heavy lifting of the
algorithm operates more efficiently in a compiled language, doing this
in rust also lets change our parallelization model for running multiple
seed in Sabre. Just as in #8572 we added support for SabreSwap to run
multiple parallel trials with different seeds this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination and the total minimal swap count seed is used.
So for example if you set swap_trials=5 and layout_trails=5 that will run
5 tasks in the threadpool with 5 different seeds for the outer layout run.
Inside that every time sabre swap is run (which will be multiple times
as part of layout plus the final routing run) it tries 5 different seeds
for each execution serially inside that parallel task. This should
hopefully further improve the quality of the transpiler output and better
match expectations for users who were previously calling transpile()
multiple times to emulate this behavior.

Implements #9090

* Use deepcopy for coupling map copy

Previously this PR was using copy() to copy the coupling map before we
mutated it to be symmetric (a requirement for the sabre algorithm).
However, this modification of the object was leaking out causing test
failures. This commit switches it to a deepcopy to ensure there are no
shared references (and a comment added to explain it's needed).

* Fix failing unitary synthesis tests

This PR branch modifies the default behavior of the SabreLayout pass so
it is now a transformation pass that computes a layout, applies it, and
then performs routing. This means when using sabre layout in a custom
pass manager we no longer need to embed a layout after computing the
layout. The failing unitary synthesis tests were using a custom pass
manager and trying to apply the layout again after SabreLayout already
did. This commit just removes this now unecessary steps from the test
code.

* Add release note

* Run BarrierBeforeMeasurement before new SabreLayout

Now that the routing stage is integrated into the SabreLayout pass we
should be running the BarrierBeforeMeasurement pass prior to layout in
the preset pass managers instead of before routing. The goal of the pass
is to prevent the routing algorithms for accidentally reusing a qubit
after a final measurement which would be invalid by inserting a barrier
before the measurements to ensure all qubits are swap mapped prior to
adding the measurements during routing. While this might not strictly be
necessary (it didn't affect any test output) it feels like best practice
to ensure we're doing this prior to potentially routing to prevent
issues.

* Improve docstrings

* Set a fixed number of layout trials in preset pass managers

For reproducible results with a fixed seed this commit sets a fixed
number of layout_trials for the SabreLayout pass in the preset pass
managers. If we did not set a fixed value than the output of the
transpiler with a fixed seed will vary based on the number of
physical cores that is running the compilation. To start
optimization levels 0 and 1 use 5, level 2 uses 10, and level
3 uses 20 which matches the swap_trials argument we used. This is just a
starting point, we can adjust these values later if needed.

* Update tests for layout changes

This commit updates the tests which are checking exact layouts with a
fixed seed when running SabreLayout. The changes to SabreLayout breaks
exact seed reproducibility from the earlier version of the pass. So we
need to update these tests for their new layout assignment from the
improved pass. One exception is a test which was trying to assert that
transpile() preserves a swap if it's in the basis set. However, the new
layout and routing output from SabreLayout for that test was resulting
in all the swaps getting optimized away at optimization level 3
(resulting in 13 cx gates instead of ~4 cx gates and 5 swaps before,
which would be more efficient on real hardware). So the test was removed
and only run at lower optimziation levels.

* Set a fixed number of layout trials in SabreLayout tests

The dedicated tests for SabreLayout were not running a fixed number of
trials. This was causing a different layout to be returned in tests when
run across multiple systems as the number of trials defaults to the
number of physical CPUs. This commit fixes the trial count to the number
of cores on the local system where the layout was updated. This should
fix the non-determinism in the tests causing failures in CI and on
different local systems.

* Run SabreSwap in parallel if only a single layout trial

If there is only a single layout trial being run we don't have to worry
about trying to do too much work in parallel at once by parallelizing
the inner sabre swap execution. This commit updates the threading logic
to enable running the inner sabre swap trials in parallel if there is
only a single layout trial.

* Remove duplicated SabreDAG creation

* Correctly apply selected layout on dag nodes

This commit corrects a bug in the PR branch that was caused by applying
the selected initial layout in a trial to the swapped order node list.
This was causing unexpected results when applying the circuit because
the intent was to apply it only to the original input not the reversed
input.

* Remove unnecessary clone from serial layout trials

In the case we're evaluating the layout trials serially instead of in a
parallel iterator we don't need to clone the dag nodes list. This is
because nothing will be modifying it in parallel, so we don't need a
thread local copy. Each call to layout_trial() will keep the dag nodes
vector intact (see previous commit for fixing this) so it can just be
passed by reference if there are no parallel threads involved.

* Fix seed setup when no user seed specified

This commit fixes an issue prevent seed randomization when no seed is
specified. On subsequent uses of a pass SabreLayout would not randomize
the seed between runs because it was setting the seed to instance state.
This commit fixes this issue by relying on initializing the RNG from
entropy each time run() is called if no user specified seed is provided.

* Start from trivial layout for routing stage

This commit fixes the routing run to run from a trivial layout instead
of the initial layout. By the time we do final routing for a trial we've
already applied the selected initial layout to the SabreDAG. So the
correct layout to use for running final swap mapping is a trivial layout
where logical bit 0 is at physical bit 0. Using initial layout twice
means we end up mapping more than is needed resulting in incorrect
results.

* Revert "Correctly apply selected layout on dag nodes"

This change was incorrect, the output was already in the correct order
and this was causing the behavior it strived to fix. This commit reverts
the addition of the extra mem::swap() call to fix things.

This reverts commit d98ef6c.

* Deduplicate NLayout trivial layout creation

This commit deduplicates the trivial layout generation for the NLayout
class. Previously there were a few places both in rust and python that
sabre layout was manually generating a trivial NLayout object. THis
commit adds a static method to the NLayout class that allows both Python
and Rust to easily create a new trivial NLayout object instead of
manually creating the object.

* Fix fixed layout tests after updates

Since more recent commits fixed a few bugs in the behavior of the
SabreLayout pass, the previously updated fixed layout tests were no
longer correct. This commit updates the tests which were now failing
because the layout changed again after fixing bugs in the new pass code.

* Try nesting parallelism in the sabres

Looking at profiles for running the new SabreLayout pass, as expected
the runtime of the rust SabreSwap routines is dominating. This is
because we've basically serialized the sabre swap routines and are
running multiple seed trials. As an experiment this commit sets the
inner SabreSwap routines to run in parallel too. Since the rayon
algorithm uses a work stealing algorithm this hopefully shouldn't cause
too much extra overhead, especially because the layout trials are quite
fast. This ideally means we're just scheduling each sabre swap trial in
a big parallel work queue and rayon does the rest of the magic to figure
out how to execute things. Initial testing is showing an improvement for
large circuits and a more modest improvement for more modest circuits.

* Add skip_routing argument to preserve custom user provided routing

This commit adds a new argument, skip_routing, to the SabreLayout
constructor. The intent of this new option is to enable mixing custom
routing_method user arguments with SabreLayout in it's new accelerated
mode of operation. In the earlier commits no matter what users specified
the preset pass manager construction would use sabreswap for routing as
it was run internally as part of layout. This meant doing something
like:

transpile(qc, backend, routing_method='stochastic')

would really run SabreSwap which is clearly not the user intent. To
provide the layout benefits with multiple seed trials this new argument
allows disabling the application of the routing found. This comes with a
runtime penalty because effectively we end up running routing twice and
only using one of the results. But for custom user provided methods or
plugins this seems like a reasonable tradeoff.

* Fix typo in docstring

* Update random seed usage in rust code

In #9132 we updated the random seed parameters in the rust code for
sabre swap to make the seed optional and default to initializing from
entropy if it's not specified. This commit updates the usage to account
for this change on main.

* s/retworkx/rustworkx/g

* Add alternate constructor for NLayout from a logic_to_phys vec

This commit adds a new constructor method to the NLayout class that
builds an NLayout object from just a logic_to_phys Vec. This constructor
can be accessed from either rust or python (although it's not as
efficient from Python). This is used to simplify some of the SabreLayout
rust code that was doing this inline manually.

* Move layout embedding into a method

This commit moves the code the optimized SabreLayout pass was using to
embed the found layout from the Rust code into a method. This will make
it easier to refactor later if a more efficient pass manager path is
added.

* Simplify pass logic and update comments

This commit removes an unnecessary else branch in the SabreLayout.run()
code to make it slightly easier to read. At the same time some comments
are updated to better explain the logic of the code.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Cryoris pushed a commit to Cryoris/qiskit-terra that referenced this pull request Jan 12, 2023
* Oxidize SabreLayout pass

This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in Qiskit#7977, Qiskit#8388,
and Qiskit#8572. Internally, when the routing_pass argument is not set
SabreLayout will perform the full sabre algorithm both layout selection
and final swap mapping in rust and return the selected initial layout,
the final layout, the toplogical sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage to the
preset pass managers remains intact, it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit is
already matching the connectivity constraints.

Besides just operating more quickly because the heavy lifting of the
algorithm operates more efficiently in a compiled language, doing this
in rust also lets change our parallelization model for running multiple
seed in Sabre. Just as in Qiskit#8572 we added support for SabreSwap to run
multiple parallel trials with different seeds this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination and the total minimal swap count seed is used.
So for example if you set swap_trials=5 and layout_trails=5 that will run
5 tasks in the threadpool with 5 different seeds for the outer layout run.
Inside that every time sabre swap is run (which will be multiple times
as part of layout plus the final routing run) it tries 5 different seeds
for each execution serially inside that parallel task. This should
hopefully further improve the quality of the transpiler output and better
match expectations for users who were previously calling transpile()
multiple times to emulate this behavior.

Implements Qiskit#9090

* Use deepcopy for coupling map copy

Previously this PR was using copy() to copy the coupling map before we
mutated it to be symmetric (a requirement for the sabre algorithm).
However, this modification of the object was leaking out causing test
failures. This commit switches it to a deepcopy to ensure there are no
shared references (and a comment added to explain it's needed).

* Fix failing unitary synthesis tests

This PR branch modifies the default behavior of the SabreLayout pass so
it is now a transformation pass that computes a layout, applies it, and
then performs routing. This means when using sabre layout in a custom
pass manager we no longer need to embed a layout after computing the
layout. The failing unitary synthesis tests were using a custom pass
manager and trying to apply the layout again after SabreLayout already
did. This commit just removes this now unecessary steps from the test
code.

* Add release note

* Run BarrierBeforeMeasurement before new SabreLayout

Now that the routing stage is integrated into the SabreLayout pass we
should be running the BarrierBeforeMeasurement pass prior to layout in
the preset pass managers instead of before routing. The goal of the pass
is to prevent the routing algorithms for accidentally reusing a qubit
after a final measurement which would be invalid by inserting a barrier
before the measurements to ensure all qubits are swap mapped prior to
adding the measurements during routing. While this might not strictly be
necessary (it didn't affect any test output) it feels like best practice
to ensure we're doing this prior to potentially routing to prevent
issues.

* Improve docstrings

* Set a fixed number of layout trials in preset pass managers

For reproducible results with a fixed seed this commit sets a fixed
number of layout_trials for the SabreLayout pass in the preset pass
managers. If we did not set a fixed value than the output of the
transpiler with a fixed seed will vary based on the number of
physical cores that is running the compilation. To start
optimization levels 0 and 1 use 5, level 2 uses 10, and level
3 uses 20 which matches the swap_trials argument we used. This is just a
starting point, we can adjust these values later if needed.

* Update tests for layout changes

This commit updates the tests which are checking exact layouts with a
fixed seed when running SabreLayout. The changes to SabreLayout breaks
exact seed reproducibility from the earlier version of the pass. So we
need to update these tests for their new layout assignment from the
improved pass. One exception is a test which was trying to assert that
transpile() preserves a swap if it's in the basis set. However, the new
layout and routing output from SabreLayout for that test was resulting
in all the swaps getting optimized away at optimization level 3
(resulting in 13 cx gates instead of ~4 cx gates and 5 swaps before,
which would be more efficient on real hardware). So the test was removed
and only run at lower optimziation levels.

* Set a fixed number of layout trials in SabreLayout tests

The dedicated tests for SabreLayout were not running a fixed number of
trials. This was causing a different layout to be returned in tests when
run across multiple systems as the number of trials defaults to the
number of physical CPUs. This commit fixes the trial count to the number
of cores on the local system where the layout was updated. This should
fix the non-determinism in the tests causing failures in CI and on
different local systems.

* Run SabreSwap in parallel if only a single layout trial

If there is only a single layout trial being run we don't have to worry
about trying to do too much work in parallel at once by parallelizing
the inner sabre swap execution. This commit updates the threading logic
to enable running the inner sabre swap trials in parallel if there is
only a single layout trial.

* Remove duplicated SabreDAG creation

* Correctly apply selected layout on dag nodes

This commit corrects a bug in the PR branch that was caused by applying
the selected initial layout in a trial to the swapped order node list.
This was causing unexpected results when applying the circuit because
the intent was to apply it only to the original input not the reversed
input.

* Remove unnecessary clone from serial layout trials

In the case we're evaluating the layout trials serially instead of in a
parallel iterator we don't need to clone the dag nodes list. This is
because nothing will be modifying it in parallel, so we don't need a
thread local copy. Each call to layout_trial() will keep the dag nodes
vector intact (see previous commit for fixing this) so it can just be
passed by reference if there are no parallel threads involved.

* Fix seed setup when no user seed specified

This commit fixes an issue prevent seed randomization when no seed is
specified. On subsequent uses of a pass SabreLayout would not randomize
the seed between runs because it was setting the seed to instance state.
This commit fixes this issue by relying on initializing the RNG from
entropy each time run() is called if no user specified seed is provided.

* Start from trivial layout for routing stage

This commit fixes the routing run to run from a trivial layout instead
of the initial layout. By the time we do final routing for a trial we've
already applied the selected initial layout to the SabreDAG. So the
correct layout to use for running final swap mapping is a trivial layout
where logical bit 0 is at physical bit 0. Using initial layout twice
means we end up mapping more than is needed resulting in incorrect
results.

* Revert "Correctly apply selected layout on dag nodes"

This change was incorrect, the output was already in the correct order
and this was causing the behavior it strived to fix. This commit reverts
the addition of the extra mem::swap() call to fix things.

This reverts commit d98ef6c.

* Deduplicate NLayout trivial layout creation

This commit deduplicates the trivial layout generation for the NLayout
class. Previously there were a few places both in rust and python that
sabre layout was manually generating a trivial NLayout object. THis
commit adds a static method to the NLayout class that allows both Python
and Rust to easily create a new trivial NLayout object instead of
manually creating the object.

* Fix fixed layout tests after updates

Since more recent commits fixed a few bugs in the behavior of the
SabreLayout pass, the previously updated fixed layout tests were no
longer correct. This commit updates the tests which were now failing
because the layout changed again after fixing bugs in the new pass code.

* Try nesting parallelism in the sabres

Looking at profiles for running the new SabreLayout pass, as expected
the runtime of the rust SabreSwap routines is dominating. This is
because we've basically serialized the sabre swap routines and are
running multiple seed trials. As an experiment this commit sets the
inner SabreSwap routines to run in parallel too. Since the rayon
algorithm uses a work stealing algorithm this hopefully shouldn't cause
too much extra overhead, especially because the layout trials are quite
fast. This ideally means we're just scheduling each sabre swap trial in
a big parallel work queue and rayon does the rest of the magic to figure
out how to execute things. Initial testing is showing an improvement for
large circuits and a more modest improvement for more modest circuits.

* Add skip_routing argument to preserve custom user provided routing

This commit adds a new argument, skip_routing, to the SabreLayout
constructor. The intent of this new option is to enable mixing custom
routing_method user arguments with SabreLayout in it's new accelerated
mode of operation. In the earlier commits no matter what users specified
the preset pass manager construction would use sabreswap for routing as
it was run internally as part of layout. This meant doing something
like:

transpile(qc, backend, routing_method='stochastic')

would really run SabreSwap which is clearly not the user intent. To
provide the layout benefits with multiple seed trials this new argument
allows disabling the application of the routing found. This comes with a
runtime penalty because effectively we end up running routing twice and
only using one of the results. But for custom user provided methods or
plugins this seems like a reasonable tradeoff.

* Fix typo in docstring

* Update random seed usage in rust code

In Qiskit#9132 we updated the random seed parameters in the rust code for
sabre swap to make the seed optional and default to initializing from
entropy if it's not specified. This commit updates the usage to account
for this change on main.

* s/retworkx/rustworkx/g

* Add alternate constructor for NLayout from a logic_to_phys vec

This commit adds a new constructor method to the NLayout class that
builds an NLayout object from just a logic_to_phys Vec. This constructor
can be accessed from either rust or python (although it's not as
efficient from Python). This is used to simplify some of the SabreLayout
rust code that was doing this inline manually.

* Move layout embedding into a method

This commit moves the code the optimized SabreLayout pass was using to
embed the found layout from the Rust code into a method. This will make
it easier to refactor later if a more efficient pass manager path is
added.

* Simplify pass logic and update comments

This commit removes an unnecessary else branch in the SabreLayout.run()
code to make it slightly easier to read. At the same time some comments
are updated to better explain the logic of the code.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: None Do not include in changelog performance Rust This PR or issue is related to Rust code in the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants