Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize qubit order for better data locality #26

Open
totikom opened this issue Feb 13, 2024 · 4 comments
Open

Optimize qubit order for better data locality #26

totikom opened this issue Feb 13, 2024 · 4 comments
Labels
performance Poor performance or performance regression

Comments

@totikom
Copy link

totikom commented Feb 13, 2024

(This may be completely unnecessary/inapplicable, so feel free to just close it)

The performance of simulating gates on big states drastically depends on its position (i.e. it is much faster to apply H to 0th qubit of the 30-qubit state that to the 29th)

So, for "unbalanced" circuits it is very beneficial to move "the most used" qubits to lower indices.

@totikom totikom changed the title Optimize qubit order for batter data locality Optimize qubit order for better data locality Feb 13, 2024
@smu160 smu160 added the performance Poor performance or performance regression label Feb 13, 2024
@smu160
Copy link
Member

smu160 commented Feb 13, 2024

Hi @totikom,

Thank you for your insight! This would be a great initial start to a "circuit optimizer" that can be run prior to executing the actual circuit. If you're interested in implementing your suggestion, you are more than welcome to submit a PR.

For the sake of completeness and posterity, I will include some other findings/insights:

In my recent tests, I found that the single-threaded performance difference is negligible whether applying H to the 0th qubit, or to the 29th qubit; however, once you throw more threads at the problem, the story changes. Applying H to the 0th qubit is 2x faster than applying H to the 29th qubit. In fact, applying H to the 29th qubit with more than a single thread degrades performance. Hence, your suggestion may be a way to fix this issue in the case where we only apply a gate to a subset of the qubits.

Note that applying a gate to the kth qubit has the same exact memory access pattern as the kth stage in the decimation-in-time FFT. I've been looking into leveraging strategies from high performance FFT implementations in order to improve data locality. In principle, we can bring all pairs right next to each-other by looking at how the FFT interleaves data-points at each stage.

This may be a way to tackle the implementation of your suggestion.

Best,
Saveliy

@totikom
Copy link
Author

totikom commented Feb 13, 2024

Note that applying a gate to the kth qubit has the same exact memory access pattern as the kth stage in the decimation-in-time FFT. I've been looking into leveraging strategies from high performance FFT implementations in order to improve data locality. In principle, we can bring all pairs right next to each-other by looking at how the FFT interleaves data-points at each stage.

Yes! I've also though about it (I was doing a toy QC simulator for my curriculum two years ago), but I haven't found a data structure, which will "remain local" for all qubits, so I decided that it would be beneficial to reorder circuits.

Unfortunately, that course has ended and I was busy working on my Master's, so that project was abandoned.

I'll try to implement a naive optimizer for circuits and will make a PR.)

@totikom
Copy link
Author

totikom commented Feb 13, 2024

The optimizer can be based on a cost model, which will justify, whenever transformations are beneficial. I'm going to use LLVM VPlan as an example.

(Well, it is going to be not so naive optimizer)

@smu160
Copy link
Member

smu160 commented Feb 13, 2024

@totikom Thank you!

I found a great resource that discusses the same data locality issue, but in the context of FFT. Please see chapter 7 in Construction of a High-Performance FFT.

Best,
Saveliy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Poor performance or performance regression
Projects
None yet
Development

No branches or pull requests

2 participants