
Enable multiple parallel seed trials for SabreSwap #8572

Merged
merged 12 commits into Qiskit:main from mtreinish:parallel-rusty-sabres
Sep 1, 2022

Conversation

mtreinish
Member

@mtreinish mtreinish commented Aug 17, 2022

Summary

The SabreSwap algorithm's output is strongly tied to the random seed used
to run the algorithm. Typically, to get the best result, a user will run
the pass (or the full transpilation) multiple times with different seeds
and pick the best output. Since #8388 the SabreSwap pass has moved mostly
into the domain of Rust. This makes it easy to leverage multithreading to
run sabre over multiple seeds in parallel and pick the best result. This
commit adds a new argument, trials, to the SabreSwap pass which specifies
the number of random seed trials to run sabre with. Each trial performs a
complete run of the sabre algorithm and computes the swaps it requires.
The result with the fewest swaps is then selected and used as the swap
mapping for the pass.
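
As a rough usage sketch of the new argument (the circuit, coupling map, seed, and trial count below are arbitrary placeholders, not taken from this PR):

from qiskit import QuantumCircuit
from qiskit.transpiler import CouplingMap, PassManager
from qiskit.transpiler.passes import SabreSwap

# Small example circuit on a 5-qubit line device (placeholder values).
qc = QuantumCircuit(5)
qc.h(0)
for i in range(4):
    qc.cx(i, i + 1)
qc.cx(0, 4)

cmap = CouplingMap.from_line(5)

# Run 10 independent seed trials in parallel and keep the mapping with the
# fewest inserted swaps; fixing seed keeps the per-trial seeds reproducible.
routed = PassManager(SabreSwap(cmap, heuristic="decay", seed=42, trials=10)).run(qc)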

Details and comments

See: 914b22a for the diff of just this PR (this is based on top of #8388)

TODO:

@mtreinish mtreinish added the on hold, performance, Changelog: New Feature, Rust, and mod: transpiler labels Aug 17, 2022
@mtreinish mtreinish added this to the 0.22 milestone Aug 17, 2022
@mtreinish mtreinish requested a review from a team as a code owner August 17, 2022 22:28
@qiskit-bot
Copy link
Collaborator

Thank you for opening a new pull request.

Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient.

While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone.

One or more of the following people are requested to review this:

Comment on lines 168 to 197
(0..num_trials)
    .into_par_iter()
    .map(|trial_num| {
        swap_map_trial(
            num_qubits,
            dag,
            neighbor_table,
            &dist,
            &coupling_graph,
            heuristic,
            seed_vec[trial_num],
            layout.clone(),
        )
    })
    .min_by_key(|(out_map, _gate_order, _layout)| {
        out_map.values().map(|x| x.len()).sum::<usize>()
    })
Member Author

@mtreinish mtreinish Aug 18, 2022


I'm debating changing this block to something like:

Suggested change

(0..num_trials)
    .into_par_iter()
    .map(|trial_num| {
        swap_map_trial(
            num_qubits,
            dag,
            neighbor_table,
            &dist,
            &coupling_graph,
            heuristic,
            seed_vec[trial_num],
            layout.clone(),
        )
    })
    .min_by_key(|(out_map, _gate_order, _layout)| {
        out_map.values().map(|x| x.len()).sum::<usize>()
    })

let trial_results: Vec<(HashMap<usize, Vec<[usize; 2]>>, Vec<usize>, NLayout)> = (0..num_trials)
    .into_par_iter()
    .map(|trial_num| {
        swap_map_trial(
            num_qubits,
            dag,
            neighbor_table,
            &dist,
            &coupling_graph,
            heuristic,
            seed_vec[trial_num],
            layout.clone(),
        )
    })
    .collect();
trial_results
    .into_iter()
    .min_by_key(|(out_map, _gate_order, _layout)| {
        out_map.values().map(|x| x.len()).sum::<usize>()
    })

It is less efficient because it collects into an intermediate Vec, but I'm a bit concerned about potential non-determinism in the current form: min_by_key on the parallel iterator processes results as they finish, meaning the output may depend on execution speed.

Member Author


Done in: 3d0429a

@coveralls

coveralls commented Aug 18, 2022

Pull Request Test Coverage Report for Build 2974836781

  • 18 of 19 (94.74%) changed or added relevant lines in 7 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.0004%) to 84.192%

Changes missing coverage:
  • qiskit/transpiler/passes/layout/sabre_layout.py — 5 of 6 changed/added lines covered (83.33%)

Totals coverage status:
  • Change from base Build 2973164045: +0.0004%
  • Covered Lines: 57014
  • Relevant Lines: 67719

💛 - Coveralls

@mtreinish
Member Author

I've been playing with asv benchmarks to see how this changes things. Comparing this PR to #8388, which it is based on, yielded:

Benchmarks that have improved:

       before           after         ratio
     [f940b296]       [31df7962]
     <rusty-sabre>       <parallel-rusty-sabres>
-              73               66     0.90  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'noise_adaptive')
-     24.1±0.09ms       21.7±0.1ms     0.90  queko.QUEKOTranspilerBench.time_transpile_bigd(0, 'sabre')
-      1.04±0.01s         936±10ms     0.90  queko.QUEKOTranspilerBench.time_transpile_bss(2, 'sabre')
-             433              391     0.90  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)
-            1598             1432     0.90  transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)
-         646±4ms          578±7ms     0.89  queko.QUEKOTranspilerBench.time_transpile_bntf(1, 'sabre')
-              58               51     0.88  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'noise_adaptive')
-             785              687     0.88  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(0, 'sabre')
-              69               58     0.84  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'sabre')
-             294              241     0.82  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'sabre')
-             343              281     0.82  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(0, 'sabre')
-             339              277     0.82  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(3, 'sabre')
-             677              512     0.76  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(3, 'sabre')
-             672              480     0.71  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(2, 'sabre')
-      4.24±0.04s       2.90±0.01s     0.68  queko.QUEKOTranspilerBench.time_transpile_bntf(3, 'sabre')
-             565              371     0.66  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(2, 'sabre')
-      1.25±0.01s         741±10ms     0.59  queko.QUEKOTranspilerBench.time_transpile_bntf(2, 'sabre')

Benchmarks that have stayed the same:

       before           after         ratio
     [f940b296]       [31df7962]
     <rusty-sabre>       <parallel-rusty-sabres>
       61.2±0.3ms       66.8±0.3ms     1.09  queko.QUEKOTranspilerBench.time_transpile_bigd(2, 'sabre')
       15.0±0.2ms       16.4±0.3ms     1.09  mapping_passes.PassBenchmarks.time_sabre_swap(14, 1024)
          395±4ms          427±3ms     1.08  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'sabre')
          368±1ms          398±3ms     1.08  mapping_passes.PassBenchmarks.time_sabre_swap(20, 1024)
          647±2ms          699±1ms     1.08  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'noise_adaptive')
          498±9ms          537±7ms     1.08  mapping_passes.PassBenchmarks.time_sabre_layout(5, 1024)
       2.35±0.02s       2.53±0.01s     1.07  mapping_passes.PassBenchmarks.time_sabre_layout(20, 1024)
       54.1±0.3ms       58.0±0.5ms     1.07  mapping_passes.PassBenchmarks.time_sabre_swap(5, 1024)
          475±5ms          505±1ms     1.06  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'sabre')
       45.6±0.8ms       48.3±0.4ms     1.06  queko.QUEKOTranspilerBench.time_transpile_bigd(1, 'sabre')
          830±8ms          878±3ms     1.06  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre')
         656±10ms          682±6ms     1.04  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre')
              478              497     1.04  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(1, 'sabre')
       1.48±0.01s       1.54±0.01s     1.04  mapping_passes.PassBenchmarks.time_sabre_layout(14, 1024)
               54               56     1.04  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'sabre')
               54               56     1.04  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'sabre')
              142              147     1.04  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(3, 'sabre')
          416±4ms          430±2ms     1.03  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'noise_adaptive')
               61               63     1.03  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'sabre')
       1.06±0.01s       1.10±0.01s     1.03  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'noise_adaptive')
        329±0.7ms          339±1ms     1.03  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'noise_adaptive')
               75               77     1.03  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'sabre')
          689±6ms          706±8ms     1.02  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')
              610              624     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'dense')
              262              268     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'dense')
        271±0.4ms          277±1ms     1.02  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'dense')
          521±2ms          532±1ms     1.02  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')
          504±4ms          513±5ms     1.02  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'sabre')
              217              221     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'noise_adaptive')
          1.68±0s       1.71±0.04s     1.02  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre')
        303±0.9ms          307±1ms     1.02  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'sabre')
              220              223     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'sabre')
          276±1ms          280±2ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'noise_adaptive')
              223              225     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'noise_adaptive')
          351±2ms        353±0.7ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'sabre')
          392±2ms          395±3ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'sabre')
              626              630     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'sabre')
         483±10ms          486±2ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')
          334±3ms          336±3ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')
        290±0.3ms          291±1ms     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'noise_adaptive')
          375±4ms        377±0.9ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'sabre')
        336±0.5ms          337±2ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'noise_adaptive')
        297±0.3ms          298±1ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'noise_adaptive')
        331±0.7ms          332±2ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'dense')
        335±0.5ms          335±2ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'dense')
          322±3ms          323±2ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'sabre')
        293±0.7ms        294±0.7ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'dense')
       1.54±0.01s       1.54±0.01s     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')
          688±6ms          688±7ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')
                7                7     1.00  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)
                7                7     1.00  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3)
              226              226     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'sabre')
              224              224     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'dense')
              210              210     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'sabre')
              218              218     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'dense')
              214              214     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'sabre')
              210              210     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'sabre')
              278              278     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'dense')
              281              281     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'sabre')
              309              309     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'sabre')
               56               56     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'sabre')
               55               55     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'dense')
               55               55     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'dense')
               64               64     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'dense')
              563              563     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'sabre')
              542              542     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'noise_adaptive')
              521              521     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'sabre')
              524              524     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'noise_adaptive')
              502              502     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'sabre')
              683              683     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'noise_adaptive')
              754              754     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'sabre')
          292±1ms          292±1ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'dense')
          283±2ms        283±0.3ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'sabre')
        291±0.8ms          291±2ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'dense')
              640              639     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'sabre')
        316±0.6ms          315±1ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'sabre')
          378±2ms          376±2ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'sabre')
          444±1ms          442±3ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'sabre')
        332±0.5ms        330±0.4ms     1.00  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'noise_adaptive')
       1.29±0.01s       1.28±0.02s     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'dense')
          319±2ms        317±0.9ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'sabre')
          349±2ms          347±2ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'sabre')
          431±1ms        428±0.5ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'sabre')
          392±1ms          389±4ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'dense')
       1.62±0.01s       1.61±0.01s     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')
              519              514     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'dense')
          308±6ms          305±1ms     0.99  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3)
              506              501     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'sabre')
          292±2ms          289±1ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'noise_adaptive')
              275              272     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'noise_adaptive')
              531              524     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'dense')
              478              471     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'sabre')
       8.66±0.01s       8.51±0.01s     0.98  transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)
          392±3ms          382±1ms     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'noise_adaptive')
              621              604     0.97  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(1, 'sabre')
          226±3ms          219±2ms     0.97  queko.QUEKOTranspilerBench.time_transpile_bntf(0, 'sabre')
              672              652     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'dense')
               60               58     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'noise_adaptive')
          294±6ms          284±2ms     0.96  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)
          266±2ms          256±1ms     0.96  queko.QUEKOTranspilerBench.time_transpile_bigd(3, 'sabre')
              622              597     0.96  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'noise_adaptive')
       1.91±0.01s       1.81±0.01s     0.95  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)
              273              258     0.95  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'noise_adaptive')
        372±0.8ms          351±5ms     0.94  queko.QUEKOTranspilerBench.time_transpile_bss(0, 'sabre')
          799±4ms          751±4ms     0.94  queko.QUEKOTranspilerBench.time_transpile_bss(1, 'sabre')
               68               63     0.93  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'dense')
          1.29±0s       1.19±0.01s     0.92  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'sabre')
               58               53     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'noise_adaptive')
               56               51     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'sabre')
               56               51     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'sabre')
       6.35±0.03s       4.80±0.06s    ~0.75  queko.QUEKOTranspilerBench.time_transpile_bss(3, 'sabre')

Benchmarks that have got worse:

       before           after         ratio
     [f940b296]       [31df7962]
     <rusty-sabre>       <parallel-rusty-sabres>
+             124              155     1.25  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(1, 'sabre')
+             120              148     1.23  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(2, 'sabre')
+         568±3ms          696±7ms     1.23  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'sabre')
+         1.33±0s       1.56±0.02s     1.17  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'noise_adaptive')
+             756              879     1.16  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(0, 'sabre')
+      1.98±0.03s       2.27±0.01s     1.15  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'sabre')

And then, to level-set, since this is basically the end of a 3-PR series (well, second to last if you count #8552), I also compared this PR to 0.21.1 with the same benchmarks:

Benchmarks that have improved:

       before           after         ratio
     [01a7aa6f]       [31df7962]
     <0.21.1>       <parallel-rusty-sabres>
-             162              147     0.91  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(3, 'sabre')
-         317±1ms          288±1ms     0.91  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'noise_adaptive')
-              73               66     0.90  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'noise_adaptive')
-         358±3ms        322±0.6ms     0.90  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'stochastic', 'sabre')
-         967±6ms          869±6ms     0.90  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'stochastic', 'sabre')
-             289              258     0.89  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'noise_adaptive')
-         744±1ms          663±2ms     0.89  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'stochastic', 'sabre')
-       326±0.4ms        290±0.8ms     0.89  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'dense')
-            1627             1432     0.88  transpiler_levels.TranspilerLevelBenchmarks.track_depth_quantum_volume_transpile_50_x_20(3)
-              58               51     0.88  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'noise_adaptive')
-             546              480     0.88  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(2, 'sabre')
-              67               58     0.87  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'sabre')
-      2.05±0.03s       1.77±0.01s     0.86  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_qv_14_x_14(3)
-             804              687     0.85  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(0, 'sabre')
-       331±0.5ms        281±0.4ms     0.85  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'sabre')
-         814±2ms          685±2ms     0.84  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'sabre')
-         514±5ms          432±3ms     0.84  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'stochastic', 'sabre')
-      2.01±0.01s       1.69±0.01s     0.84  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'stochastic', 'sabre')
-         419±3ms          348±2ms     0.83  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'stochastic', 'sabre')
-        58.2±1ms       48.3±0.2ms     0.83  queko.QUEKOTranspilerBench.time_transpile_bigd(1, 'sabre')
-         512±5ms        421±0.8ms     0.82  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'stochastic', 'sabre')
-         478±3ms          390±2ms     0.82  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'stochastic', 'sabre')
-         437±2ms        354±0.5ms     0.81  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'sabre')
-         392±3ms          311±4ms     0.79  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'sabre')
-      27.6±0.1ms       21.7±0.1ms     0.78  queko.QUEKOTranspilerBench.time_transpile_bigd(0, 'sabre')
-         648±6ms          507±6ms     0.78  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'stochastic', 'sabre')
-         644±4ms         499±10ms     0.77  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'stochastic', 'sabre')
-         1.53±0s          1.19±0s     0.77  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'sabre')
-             360              277     0.77  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(3, 'sabre')
-         333±2ms          256±2ms     0.77  queko.QUEKOTranspilerBench.time_transpile_bigd(3, 'sabre')
-         419±1ms          319±2ms     0.76  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'sabre')
-         695±1ms          529±2ms     0.76  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'sabre')
-            1239              879     0.71  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(0, 'sabre')
-             523              371     0.71  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(2, 'sabre')
-       445±0.8ms          311±1ms     0.70  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(0, 'sabre', 'sabre')
-         548±2ms        375±0.4ms     0.68  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'sabre')
-         633±5ms          429±3ms     0.68  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'sabre')
-       575±0.7ms          375±1ms     0.65  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'sabre')
-         7.94±0s       4.74±0.03s     0.60  queko.QUEKOTranspilerBench.time_transpile_bss(3, 'sabre')
-      20.0±0.04s          8.41±0s     0.42  transpiler_levels.TranspilerLevelBenchmarks.time_quantum_volume_transpile_50_x_20(3)
-         2.02±0s          731±6ms     0.36  queko.QUEKOTranspilerBench.time_transpile_bntf(2, 'sabre')
-         1.70±0s          571±4ms     0.34  queko.QUEKOTranspilerBench.time_transpile_bntf(1, 'sabre')
-         2.82±0s          929±7ms     0.33  queko.QUEKOTranspilerBench.time_transpile_bss(2, 'sabre')
-      1.74±0.01s          517±3ms     0.30  mapping_passes.PassBenchmarks.time_sabre_layout(5, 1024)
-         2.64±0s          746±3ms     0.28  queko.QUEKOTranspilerBench.time_transpile_bss(1, 'sabre')
-         247±1ms       56.9±0.3ms     0.23  mapping_passes.PassBenchmarks.time_sabre_swap(5, 1024)
-         997±1ms          217±2ms     0.22  queko.QUEKOTranspilerBench.time_transpile_bntf(0, 'sabre')
-      1.73±0.01s          347±1ms     0.20  queko.QUEKOTranspilerBench.time_transpile_bss(0, 'sabre')
-      11.4±0.01s       1.53±0.01s     0.13  mapping_passes.PassBenchmarks.time_sabre_layout(14, 1024)
-         3.27±0s          394±3ms     0.12  mapping_passes.PassBenchmarks.time_sabre_swap(20, 1024)
-         22.4±0s       2.51±0.01s     0.11  mapping_passes.PassBenchmarks.time_sabre_layout(20, 1024)
-         1.54±0s       16.6±0.3ms     0.01  mapping_passes.PassBenchmarks.time_sabre_swap(14, 1024)

Benchmarks that have stayed the same:

       before           after         ratio
     [01a7aa6f]       [31df7962]
     <0.21.1>       <parallel-rusty-sabres>
              141              155     1.10  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(1, 'sabre')
              470              512     1.09  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(3, 'sabre')
          273±2ms          286±3ms     1.05  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm(3)
          285±2ms          298±3ms     1.04  transpiler_levels.TranspilerLevelBenchmarks.time_transpile_from_large_qasm_backend_with_prop(3)
              274              281     1.03  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'sabre')
              218              223     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'sabre')
              627              639     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'sabre')
               54               55     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'dense')
               54               55     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'dense')
               57               58     1.02  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'noise_adaptive')
              305              309     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'stochastic', 'sabre')
              212              214     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'sabre')
              219              221     1.01  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'noise_adaptive')
       1.53±0.01s       1.54±0.01s     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'noise_adaptive')
          1.08±0s          1.09±0s     1.01  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'noise_adaptive')
              651              652     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'dense')
                7                7     1.00  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm(3)
                7                7     1.00  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_from_large_qasm_backend_with_prop(3)
              225              225     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'noise_adaptive')
              210              210     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'stochastic', 'sabre')
               56               56     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'sabre')
               51               51     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'sabre')
               51               51     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'sabre')
              524              524     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'dense')
              542              542     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'noise_adaptive')
              514              514     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'dense')
              524              524     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'noise_adaptive')
              686              683     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'sabre', 'noise_adaptive')
              226              224     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'sabre', 'dense')
              220              218     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(2, 'sabre', 'dense')
          1.53±0s       1.51±0.02s     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'dense')
          697±3ms          688±3ms     0.99  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'dense')
              508              501     0.99  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'sabre', 'sabre')
              529              521     0.98  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(1, 'stochastic', 'sabre')
              510              502     0.98  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'stochastic', 'sabre')
              572              563     0.98  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'stochastic', 'sabre')
              245              241     0.98  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'sabre')
              642              630     0.98  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'sabre')
          718±4ms          701±3ms     0.98  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(3, 'sabre', 'noise_adaptive')
              285              278     0.98  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'dense')
              774              754     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(3, 'stochastic', 'sabre')
          715±4ms          696±5ms     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'stochastic', 'sabre')
               66               64     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'dense')
               65               63     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'sabre')
        345±0.4ms          335±2ms     0.97  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'dense')
              488              471     0.97  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(2, 'sabre', 'sabre')
          283±2ms          272±2ms     0.96  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'noise_adaptive')
              283              272     0.96  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(3, 'sabre', 'noise_adaptive')
              650              624     0.96  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'dense')
          285±3ms        272±0.6ms     0.96  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(0, 'sabre', 'dense')
          348±2ms          333±1ms     0.96  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(2, 'sabre', 'noise_adaptive')
       1.67±0.01s       1.59±0.01s     0.95  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(3, 'sabre', 'sabre')
        305±0.6ms        290±0.6ms     0.95  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'noise_adaptive')
              523              497     0.95  queko.QUEKOTranspilerBench.track_depth_bntf_optimal_depth_25(1, 'sabre')
        304±0.6ms          289±1ms     0.95  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_179(1, 'sabre', 'dense')
              222              210     0.95  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(1, 'stochastic', 'sabre')
       2.43±0.01s       2.28±0.01s     0.94  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'stochastic', 'sabre')
          451±3ms          422±2ms     0.94  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'noise_adaptive')
       1.35±0.01s          1.26±0s     0.93  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(3, 'sabre', 'dense')
        358±0.8ms          334±1ms     0.93  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'noise_adaptive')
        414±0.9ms          383±1ms     0.93  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'noise_adaptive')
              646              597     0.92  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_mod8_10_178(0, 'sabre', 'noise_adaptive')
        363±0.6ms          335±1ms     0.92  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'noise_adaptive')
          422±3ms          389±1ms     0.92  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(2, 'sabre', 'dense')
        320±0.5ms          294±1ms     0.92  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'dense')
              427              391     0.92  transpiler_levels.TranspilerLevelBenchmarks.track_depth_transpile_qv_14_x_14(3)
        323±0.4ms          295±1ms     0.91  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(0, 'sabre', 'noise_adaptive')
       73.8±0.9ms       67.5±0.9ms     0.91  queko.QUEKOTranspilerBench.time_transpile_bigd(2, 'sabre')
               58               53     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'noise_adaptive')
          361±2ms          330±2ms     0.91  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(1, 'sabre', 'dense')
               69               63     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'dense')
              294              268     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'sabre', 'dense')
              248              226     0.91  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4gt10_v1_81(0, 'stochastic', 'sabre')
        364±0.6ms          332±1ms     0.91  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_qft_16(1, 'sabre', 'dense')
       4.97±0.03s       2.94±0.01s    ~0.59  queko.QUEKOTranspilerBench.time_transpile_bntf(3, 'sabre')

Benchmarks that have got worse:

       before           after         ratio
     [01a7aa6f]       [31df7962]
     <0.21.1>       <parallel-rusty-sabres>
+              45               56     1.24  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'sabre')
+             231              281     1.22  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(0, 'sabre')
+              64               77     1.20  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'sabre')
+              48               56     1.17  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'sabre')
+             526              604     1.15  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(1, 'sabre')
+             134              148     1.10  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(2, 'sabre')
+         439±2ms          483±2ms     1.10  transpiler_qualitative.TranspilerQualitativeBench.time_transpile_time_cnt3_5_180(2, 'sabre', 'dense')

It's hard to do exact comparisons because of the nature of the algorithm and the differences in RNG between 0.21.1 and this PR, but in general I think it's fair to say that the overall quality and performance of sabre improve significantly with this patch series.

@mtreinish
Member Author

I also tested bumping all the trial counts to 250 just to confirm it doesn't get worse (which is something we saw with stochastic swap, but the trials there are per layer so it's not an exact analogy, see #4094 for more details). Running just the depth benchmarks and concentrating on those that showed regressions against 0.21.1 with 250 trials yielded:

Benchmarks that have improved:

       before           after         ratio
     [31df7962]       [0869ee00]
     <parallel-rusty-sabres>       <test-more-trials>
-              63               56     0.89  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'sabre')
-             480              411     0.86  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(2, 'sabre')
-             512              398     0.78  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(3, 'sabre')
-             687              498     0.72  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(0, 'sabre')
-             281              189     0.67  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(0, 'sabre')
-             604              390     0.65  queko.QUEKOTranspilerBench.track_depth_bss_optimal_depth_100(1, 'sabre')

Benchmarks that have stayed the same:

       before           after         ratio
     [31df7962]       [0869ee00]
     <parallel-rusty-sabres>       <test-more-trials>
              155              155     1.00  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(1, 'sabre')
              148              148     1.00  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(2, 'sabre')
              147              147     1.00  queko.QUEKOTranspilerBench.track_depth_bigd_optimal_depth_45(3, 'sabre')
               56               56     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'stochastic', 'sabre')
               55               55     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'dense')
               53               53     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'noise_adaptive')
               51               51     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'sabre', 'sabre')
               56               56     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(1, 'stochastic', 'sabre')
               55               55     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'dense')
               51               51     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'noise_adaptive')
               51               51     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'sabre', 'sabre')
               56               56     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(2, 'stochastic', 'sabre')
               64               64     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'dense')
               66               66     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'noise_adaptive')
               58               58     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'sabre', 'sabre')
               77               77     1.00  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(3, 'stochastic', 'sabre')
               63               62     0.98  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'dense')
               58               54     0.93  transpiler_qualitative.TranspilerQualitativeBench.track_depth_transpile_4mod5_v0_19(0, 'sabre', 'noise_adaptive')

This matches my expectations: in the worst case the output quality is the same as without this PR, and in some cases the total depth decreases.

The SabreSwap algorithm's output is strongly tied to the random seed used
to run the algorithm. Typically, to get the best result, a user will run
the pass (or the full transpilation) multiple times with different seeds
and pick the best output. Since Qiskit#8388 the SabreSwap pass has moved
mostly into the domain of Rust. This makes it easy to leverage
multithreading to run sabre over multiple seeds in parallel and pick the
best result. This commit adds a new argument, trials, to the SabreSwap
pass which specifies the number of random seed trials to run sabre with.
Each trial performs a complete run of the sabre algorithm and computes
the swaps it requires. The result with the fewest swaps is then selected
and used as the swap mapping for the pass.
The parallel trial code was potentially non-deterministic in its
execution because the parallel trials were compared in an order that
depended on the execution speed of each trial. This could produce
different output if results with an equal number of swaps finished in
differing amounts of time between runs. This commit addresses this by
first collecting the results into an ordered Vec, which is then iterated
over serially to find the minimum swap count. This makes the output
independent of the execution speed of the individual trials.
This commit updates tests which started to fail because of the different
RNG behavior used by the parallel SabreSwap seed trials. For the most
part these are just mechanical changes that either change the expected
layout for a fixed seed or update a fixed seed so the output matches the
expected result. The one interesting case was the TestTranspileLevelsSwap
change, which was caused by different swaps being inserted that, for that
optimization level, enabled the 2q block optimization passes to
synthesize away the swap as part of their optimization. This was fixed by
changing the seed, but it was a different case than the other failures.
This commit adds a swap_trials argument to the SabreLayout pass so that
users can control how many trials SabreSwap runs internally. This is
necessary for reproducibility between systems, for the same reason it is
required on SabreSwap.
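
A minimal sketch of how this exposed knob could be used (the coupling map, seed, and trial count are arbitrary placeholders):

from qiskit.transpiler import CouplingMap, PassManager
from qiskit.transpiler.passes import SabreLayout

cmap = CouplingMap.from_heavy_hex(3)  # placeholder device topology

# Pin the internal SabreSwap runs to 4 seed trials so the chosen layout is
# reproducible regardless of how many CPUs the host machine has.
pm = PassManager(SabreLayout(cmap, max_iterations=4, seed=11, swap_trials=4))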
@mtreinish mtreinish force-pushed the parallel-rusty-sabres branch from 31df796 to 7b60168 August 22, 2022 15:48
@mtreinish mtreinish removed the on hold label Aug 22, 2022
@mtreinish mtreinish changed the title [WIP] Enable multiple parallel seed trials for SabreSwap Enable multiple parallel seed trials for SabreSwap Aug 22, 2022
@mtreinish
Member Author

I ran the benchmarks that I've been using for #8552 with this PR applied and compared them against sabre on main and against DenseLayout + StochasticSwap. For the comparison of this vs sabre on main (the axes say "with #8388", which has already merged; I was using data from a prior run before it merged):

mapping_bench_kdk_view_depth_parallel_vs_serial
mapping_bench_kdk_view_parallel_vs_serial
mapping_bench_kdk_view_parallel_vs_serial_time

For the most part this PR adds minimal runtime overhead and generally improves cx counts, while depth is a mixed bag. I think the outliers where this gets worse are caused by different RNG behavior between this PR and main.

For the comparison against DenseLayout + StochasticSwap:

mapping_bench_kdk_view_depth_parallel
mapping_bench_kdk_view_parallel
mapping_bench_kdk_view_time_parallel

it doesn't look drastically different from what I ran for it serially in #8552 (comment).

For the parallel runs it was using SabreLayout(coupling_map, routing_pass=_swap[0], max_iterations=5, seed=seed_transpiler) and SabreSwap(coupling_map, heuristic="decay", seed=seed_transpiler, trials=20) on a CPU with 8 cores and 16 threads (so SabreLayout will be running SabreSwap internally with 8 trials).
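
Roughly, that configuration corresponds to the following sketch (coupling_map and seed_transpiler stand in for the benchmark harness's own objects, and _swap[0] is assumed to be a SabreSwap instance with no explicit trials, which per the note above ran with 8 trials on this 8-core machine):

from qiskit.transpiler import CouplingMap
from qiskit.transpiler.passes import SabreLayout, SabreSwap

seed_transpiler = 2022                        # placeholder seed
coupling_map = CouplingMap.from_heavy_hex(5)  # placeholder topology

# Routing pass handed to SabreLayout; no explicit trials here, so the trial
# count falls back to its default (8 trials on the benchmark machine).
layout_routing = SabreSwap(coupling_map, heuristic="decay", seed=seed_transpiler)
layout_pass = SabreLayout(
    coupling_map, routing_pass=layout_routing, max_iterations=5, seed=seed_transpiler
)

# Final routing pass with 20 explicit seed trials.
routing_pass = SabreSwap(coupling_map, heuristic="decay", seed=seed_transpiler, trials=20)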

@kdk
Member

kdk commented Aug 30, 2022

Is there any intuition for why the above doesn't show an improvement in quality when running across multiple trials? In some prior discussions it looked like there was considerable improvement for QV circuits and the benchmarks we have in asv, so is this limited to red-queen's mapping benchmarks?

@mtreinish
Member Author

My guess is that it's probably a function of the specific circuits in the red-queen misc mapping benchmarks. I'm not really too familiar with the circuits there, as @boschmitt added them all at once. But I could see there being diminishing returns for multiple seed trials: if sabre only finds a few (or one) best swap candidates per layer, the RNG seed won't have much influence on the result because there isn't much to randomly choose from.

mtreinish and others added 2 commits September 1, 2022 15:14
In an earlier commit we switched the parallel iterator to collect into
an intermediate `Vec` to ensure the output result was deterministic,
because min_by_key() has a degree of non-determinism for equal entries
depending on when the parallel iterator's threads finish. However,
collecting into a Vec isn't necessary: by using the trial index as an
element of a 2-element array key we get deterministic evaluation and
avoid the overhead of collecting into a `Vec`. This commit makes this
change to improve the performance of the parallel execution path.

Co-authored-by: Kevin Hartman <[email protected]>
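
To illustrate the tie-breaking idea in isolation (a plain Python sketch of the scheme, not the Rust implementation): including the trial index in the comparison key makes the minimum selection deterministic even when several trials produce the same swap count.

# Hypothetical (trial_index, swap_count) results, possibly produced out of
# order by parallel workers.
results = [(0, 12), (1, 9), (2, 9), (3, 15)]

# Keying on (swap_count, trial_index) breaks ties by the stable trial index,
# so the winner no longer depends on which worker finished first.
best_index, best_count = min(results, key=lambda r: (r[1], r[0]))
assert (best_index, best_count) == (1, 9)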
Contributor

@kevinhartman kevinhartman left a comment


LGTM!

@mergify mergify bot merged commit 2ab0ae5 into Qiskit:main Sep 1, 2022
@mtreinish mtreinish deleted the parallel-rusty-sabres branch September 2, 2022 22:33
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request Nov 9, 2022
This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in Qiskit#7977, Qiskit#8388,
and Qiskit#8572. Internally, when the routing_pass argument is not set,
SabreLayout will perform the full sabre algorithm, both layout selection
and final swap mapping, in Rust and return the selected initial layout,
the final layout, the topological sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage of the
preset pass managers remains intact; it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit
already matches the connectivity constraints.

Besides operating more quickly because the heavy lifting of the
algorithm runs more efficiently in a compiled language, doing this in
Rust also lets us change our parallelization model for running multiple
seeds in Sabre. Just as Qiskit#8572 added support for SabreSwap to run
multiple parallel trials with different seeds, this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination, and the seed with the minimal total swap
count is used. So, for example, if you set swap_trials=5 and
layout_trials=5 that will run 5 tasks in the threadpool with 5 different
seeds for the outer layout run. Inside that, every time sabre swap is
run (which will be multiple times as part of layout plus the final
routing run) it tries 5 different seeds for each execution serially
inside that parallel task. This should hopefully further improve the
quality of the transpiler output and better match expectations for users
who were previously calling transpile() multiple times to emulate this
behavior.

Implements Qiskit#9090
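
A hedged sketch of the trial model described above, assuming the SabreLayout interface from the referenced commit (the topology, seed, and trial counts are arbitrary):

from qiskit.transpiler import CouplingMap
from qiskit.transpiler.passes import SabreLayout

cmap = CouplingMap.from_heavy_hex(3)  # placeholder topology

# 5 layout seeds run in parallel; inside each parallel task every internal
# routing run tries 5 swap seeds serially, and the overall minimum-swap
# result is kept.
layout_pass = SabreLayout(cmap, max_iterations=4, seed=7, swap_trials=5, layout_trials=5)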
@mtreinish mtreinish mentioned this pull request Nov 10, 2022
mergify bot added a commit that referenced this pull request Dec 8, 2022
* Oxidize SabreLayout pass

This commit modifies the SabreLayout pass when run without the
routing_pass argument to run primarily in Rust. This builds on top of
the rust version of SabreSwap previously added in #7977, #8388,
and #8572. Internally, when the routing_pass argument is not set
SabreLayout will perform the full sabre algorithm, both layout selection
and final swap mapping, in Rust and return the selected initial layout,
the final layout, the topological sorting used to traverse the circuit,
and a SwapMap for any swaps inserted. This is then used to build the
output circuit in place of running separate layout and routing passes.
The preset pass managers are updated to handle the new combined layout
and routing mode of operation for SabreLayout. The routing stage of the
preset pass managers remains intact; it will just operate as if a
perfect layout was selected and skip SabreSwap because the circuit
already matches the connectivity constraints.

Besides operating more quickly because the heavy lifting of the
algorithm runs more efficiently in a compiled language, doing this
in Rust also lets us change our parallelization model for running multiple
seeds in Sabre. Just as #8572 added support for SabreSwap to run
multiple parallel trials with different seeds, this commit adds a
layout_trials argument to SabreLayout to try multiple seeds in parallel.
When this is used it parallelizes at the outer layer for each
layout/routing combination, and the seed with the minimal total swap count is used.
So, for example, if you set swap_trials=5 and layout_trials=5 that will run
5 tasks in the threadpool with 5 different seeds for the outer layout run.
Inside each of those tasks, every time sabre swap is run (which will be multiple
times, as part of layout plus the final routing run) it tries 5 different seeds
serially. This should hopefully further improve the quality of the transpiler
output and better match expectations for users who were previously calling
transpile() multiple times to emulate this behavior.

Implements #9090

* Use deepcopy for coupling map copy

Previously this PR was using copy() to copy the coupling map before we
mutated it to be symmetric (a requirement for the sabre algorithm).
However, this modification of the object was leaking out, causing test
failures. This commit switches it to a deepcopy to ensure there are no
shared references (and a comment added to explain it's needed).
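As a hedged illustration of why a deepcopy is needed here: making a coupling map symmetric mutates its underlying graph, so a shallow copy that shares that graph can leak the mutation back to the caller's object. The sketch below only uses the public CouplingMap API, and the exact sharing behavior is an assumption about how the object is copied, not a statement about this PR's internals:

```python
import copy

from qiskit.transpiler import CouplingMap

directed = CouplingMap([(0, 1), (1, 2)])  # one-way edges only

shallow = copy.copy(directed)
shallow.make_symmetric()  # may mutate graph state shared with `directed`

isolated = copy.deepcopy(CouplingMap([(0, 1), (1, 2)]))
isolated.make_symmetric()  # safe: no references shared with the original
print(isolated.get_edges())
```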

* Fix failing unitary synthesis tests

This PR branch modifies the default behavior of the SabreLayout pass so
it is now a transformation pass that computes a layout, applies it, and
then performs routing. This means when using sabre layout in a custom
pass manager we no longer need to embed a layout after computing the
layout. The failing unitary synthesis tests were using a custom pass
manager and trying to apply the layout again after SabreLayout already
did. This commit just removes these now-unnecessary steps from the test
code.

* Add release note

* Run BarrierBeforeMeasurement before new SabreLayout

Now that the routing stage is integrated into the SabreLayout pass we
should be running the BarrierBeforeMeasurement pass prior to layout in
the preset pass managers instead of before routing. The goal of the pass
is to prevent the routing algorithms for accidentally reusing a qubit
after a final measurement which would be invalid by inserting a barrier
before the measurements to ensure all qubits are swap mapped prior to
adding the measurements during routing. While this might not strictly be
necessary (it didn't affect any test output) it feels like best practice
to ensure we're doing this prior to potentially routing to prevent
issues.

* Improve docstrings

* Set a fixed number of layout trials in preset pass managers

For reproducible results with a fixed seed this commit sets a fixed
number of layout_trials for the SabreLayout pass in the preset pass
managers. If we did not set a fixed value, then the output of the
transpiler with a fixed seed would vary based on the number of
physical cores running the compilation. To start,
optimization levels 0 and 1 use 5, level 2 uses 10, and level
3 uses 20 which matches the swap_trials argument we used. This is just a
starting point, we can adjust these values later if needed.
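The practical consequence of pinning these counts is that a fixed seed now gives the same result on a 4-core laptop and a 64-core server, because the number of layout trials no longer depends on the CPU count. A minimal sketch, assuming a qiskit-terra version with the behavior described above (the coupling map and circuit are arbitrary placeholders):

```python
from qiskit import QuantumCircuit, transpile
from qiskit.transpiler import CouplingMap

coupling_map = CouplingMap.from_heavy_hex(3)  # any target topology works here

qc = QuantumCircuit(5)
qc.h(0)
for target in range(1, 5):
    qc.cx(0, target)
qc.measure_all()

# Same seed and optimization level -> same output, independent of core count,
# because the layout trial count is fixed per optimization level.
common = dict(coupling_map=coupling_map, basis_gates=["rz", "sx", "x", "cx"],
              optimization_level=3, seed_transpiler=1234)
first = transpile(qc, **common)
second = transpile(qc, **common)
assert first == second
```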

* Update tests for layout changes

This commit updates the tests which are checking exact layouts with a
fixed seed when running SabreLayout. The changes to SabreLayout break
exact seed reproducibility from the earlier version of the pass, so we
need to update these tests for their new layout assignment from the
improved pass. One exception is a test which was trying to assert that
transpile() preserves a swap if it's in the basis set. However, the new
layout and routing output from SabreLayout for that test was resulting
in all the swaps getting optimized away at optimization level 3
(resulting in 13 cx gates instead of ~4 cx gates and 5 swaps before,
which would be more efficient on real hardware). So the test was removed
and only run at lower optimization levels.

* Set a fixed number of layout trials in SabreLayout tests

The dedicated tests for SabreLayout were not running a fixed number of
trials. This was causing a different layout to be returned in tests when
run across multiple systems as the number of trials defaults to the
number of physical CPUs. This commit fixes the trial count to the number
of cores on the local system where the layout was updated. This should
fix the non-determinism in the tests causing failures in CI and on
different local systems.

* Run SabreSwap in parallel if only a single layout trial

If there is only a single layout trial being run we don't have to worry
about trying to do too much work in parallel at once by parallelizing
the inner sabre swap execution. This commit updates the threading logic
to enable running the inner sabre swap trials in parallel if there is
only a single layout trial.

* Remove duplicated SabreDAG creation

* Correctly apply selected layout on dag nodes

This commit corrects a bug in the PR branch that was caused by applying
the selected initial layout in a trial to the swapped order node list.
This was causing unexpected results when applying the circuit because
the intent was to apply it only to the original input not the reversed
input.

* Remove unnecessary clone from serial layout trials

In the case we're evaluating the layout trials serially instead of in a
parallel iterator we don't need to clone the dag nodes list. This is
because nothing will be modifying it in parallel, so we don't need a
thread local copy. Each call to layout_trial() will keep the dag nodes
vector intact (see previous commit for fixing this) so it can just be
passed by reference if there are no parallel threads involved.

* Fix seed setup when no user seed specified

This commit fixes an issue preventing seed randomization when no seed is
specified. On subsequent uses of a pass SabreLayout would not randomize
the seed between runs because it was setting the seed to instance state.
This commit fixes this issue by relying on initializing the RNG from
entropy each time run() is called if no user specified seed is provided.
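A tiny sketch of the seeding rule described above (illustrative only; the actual RNG plumbing lives in the pass and its Rust core): when no seed is supplied, draw fresh entropy on every run() call instead of caching a seed on the pass instance, so repeated runs stay randomized.

```python
import numpy as np

def make_rng(seed=None):
    # seed=None seeds from OS entropy each call, so back-to-back runs differ;
    # an explicit seed keeps runs reproducible.
    return np.random.default_rng(seed)

assert make_rng(42).integers(1 << 30) == make_rng(42).integers(1 << 30)
```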

* Start from trivial layout for routing stage

This commit fixes the routing run to run from a trivial layout instead
of the initial layout. By the time we do final routing for a trial we've
already applied the selected initial layout to the SabreDAG. So the
correct layout to use for running final swap mapping is a trivial layout
where logical bit 0 is at physical bit 0. Using the initial layout twice
means we end up mapping more than is needed, resulting in incorrect
results.

* Revert "Correctly apply selected layout on dag nodes"

This change was incorrect, the output was already in the correct order
and this was causing the behavior it strived to fix. This commit reverts
the addition of the extra mem::swap() call to fix things.

This reverts commit d98ef6c.

* Deduplicate NLayout trivial layout creation

This commit deduplicates the trivial layout generation for the NLayout
class. Previously there were a few places both in rust and python that
sabre layout was manually generating a trivial NLayout object. This
commit adds a static method to the NLayout class that allows both Python
and Rust to easily create a new trivial NLayout object instead of
manually creating the object.

* Fix fixed layout tests after updates

Since more recent commits fixed a few bugs in the behavior of the
SabreLayout pass, the previously updated fixed layout tests were no
longer correct. This commit updates the tests which were now failing
because the layout changed again after fixing bugs in the new pass code.

* Try nesting parallelism in the sabres

Looking at profiles for running the new SabreLayout pass, as expected
the runtime of the rust SabreSwap routines is dominating. This is
because we've basically serialized the sabre swap routines and are
running multiple seed trials. As an experiment this commit sets the
inner SabreSwap routines to run in parallel too. Since rayon
uses a work-stealing scheduler, this hopefully shouldn't cause
too much extra overhead, especially because the layout trials are quite
fast. This ideally means we're just scheduling each sabre swap trial in
a big parallel work queue and rayon does the rest of the magic to figure
out how to execute things. Initial testing shows an improvement for
large circuits and a more modest improvement for smaller circuits.

* Add skip_routing argument to preserve custom user provided routing

This commit adds a new argument, skip_routing, to the SabreLayout
constructor. The intent of this new option is to enable mixing custom
routing_method user arguments with SabreLayout in it's new accelerated
mode of operation. In the earlier commits no matter what users specified
the preset pass manager construction would use sabreswap for routing as
it was run internally as part of layout. This meant doing something
like:

transpile(qc, backend, routing_method='stochastic')

would really run SabreSwap, which is clearly not the user intent. To
provide the layout benefits with multiple seed trials this new argument
allows disabling the application of the routing found. This comes with a
runtime penalty because effectively we end up running routing twice and
only using one of the results. But for custom user provided methods or
plugins this seems like a reasonable tradeoff.
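As a hedged sketch of how this plays out at the transpile() level (assuming a qiskit-terra version where the 'stochastic' routing method from the example above is still available): requesting a non-sabre router means SabreLayout only contributes the layout, and the requested router inserts the swaps, at the cost of effectively routing twice.

```python
from qiskit import QuantumCircuit, transpile
from qiskit.transpiler import CouplingMap

coupling_map = CouplingMap.from_line(6)

qc = QuantumCircuit(4)
qc.h(0)
qc.cx(0, 3)
qc.cx(1, 2)
qc.measure_all()

# SabreLayout picks the layout (its internal routing is used only for
# scoring trials); the stochastic router performs the routing that is kept.
routed = transpile(
    qc,
    coupling_map=coupling_map,
    layout_method="sabre",
    routing_method="stochastic",
    seed_transpiler=7,
)
print(routed.count_ops())
```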

* Fix typo in docstring

* Update random seed usage in rust code

In #9132 we updated the random seed parameters in the rust code for
sabre swap to make the seed optional and default to initializing from
entropy if it's not specified. This commit updates the usage to account
for this change on main.

* s/retworkx/rustworkx/g

* Add alternate constructor for NLayout from a logic_to_phys vec

This commit adds a new constructor method to the NLayout class that
builds an NLayout object from just a logic_to_phys Vec. This constructor
can be accessed from either rust or python (although it's not as
efficient from Python). This is used to simplify some of the SabreLayout
rust code that was doing this inline manually.

* Move layout embedding into a method

This commit moves the code the optimized SabreLayout pass was using to
embed the found layout from the Rust code into a method. This will make
it easier to refactor later if a more efficient pass manager path is
added.

* Simplify pass logic and update comments

This commit removes an unnecessary else branch in the SabreLayout.run()
code to make it slightly easier to read. At the same time some comments
are updated to better explain the logic of the code.

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Cryoris pushed a commit to Cryoris/qiskit-terra that referenced this pull request Jan 12, 2023