-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for BLAS SVD functions in MPS simulation #1897
Conversation
Could you fix the errors so that all checks will be passed |
With the last commit I think I have solved the Windows-build problem. However, for macOS the error is: EDIT: No, I didn't fix everything... |
Finally, all ok :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very nice performance improvement for MPS. In answer to your question why we didn't implement this, it was in plan, but never actually got to it.
I reviewed your code, but I am not very familiar with Lapack, so could not say anything on that part.
A few general comments:
- Most important - I saw the performance comparison which looks very nice. However, did you compare results for deep and large circuits? In your test, the circuit is shallow, which is fine for regression, but before merge, please check results. In particular, I know the previous version used
long double
precision in some places, but Lapack useslong
so this might make a difference. - Also compare results when using approximation.
- I think it would be a good idea to turn on the
validate_SVD_result
function for the near future after this is merged. @doichanj - you should probably turn this off again after a few months of usage. - Add to the documentation the actual sizes of the matrices where one algorithm becomes better than the other.
- Also, best to automatically choose between the two algorithms depending on matrix size.
- It is not a good idea to include the change in
print_to_log
here. Best to separate to a different PR. In any case, I am not sure whether this change is needed after my comment above.
releasenotes/notes/compute-svd-with-lapack-3ee992d371d653d1.yaml
Outdated
Show resolved
Hide resolved
releasenotes/notes/compute-svd-with-lapack-3ee992d371d653d1.yaml
Outdated
Show resolved
Hide resolved
src/simulators/matrix_product_state/matrix_product_state_internal.cpp
Outdated
Show resolved
Hide resolved
src/simulators/matrix_product_state/matrix_product_state_internal.cpp
Outdated
Show resolved
Hide resolved
src/simulators/matrix_product_state/matrix_product_state_internal.cpp
Outdated
Show resolved
Hide resolved
Sorry for the delay, and thanks for the comments. I will try to fix and change the things mentioned. However, this PR was part of my work in an internship and it already finished, so I have no longer access to the "big" server and it will take me a while to replicate everything. I will come back to you with the things done. |
Hello, as mentioned before I have no longer access to the nice server I was using, and I didn't manage to get access to something similar. So, now (sadly) I am using my laptop. Nevertheless, I can simulate up to 16-18 qubits and for the remaining tests I think it's enough. Here I left you the plot for deeper circuits, from 10 to 200 with 16 qubits, with the automatic selector between the QR or Divide and Conquer SVD functions in LAPACK. Time in seconds
I still have more things to do, like the approximation (but I tested it a little back then and it was good, but I didn't generate speed-up graphs), code style and documentation. |
Hi @Patataman - nice work! I am not sure I understand your graphs. In the table above, it seems the performance improvement is ~x2000 or so. In your current graphs the improvement seems to be x3 at most. What am I missing? |
The main differences are:
Execution times are highly related with the matrix size of the SVD function. During my original evaluation I saw that this size is highly related with the number of qubits and circuit's entanglement. Therefore, as now I am using 16 qubits instead, matrix sizes and speed-up gains are smaller. Also, in the original results, maximum speed-up was x20, not ~x2000, maybe if you extrapolate it seems the potential speed up was x2000, but as mentioned before, it does not depend so much on the depth of the circuit, but rather the number of qubits. If I remember well, I think the maximum matrix size for a fully entangled circuit was something like 2^(n-1) with n as number of qubits. I will edit the previous post to include the execution times with approximation |
@@ -140,9 +141,28 @@ double reduce_zeros(cmatrix_t &U, rvector_t &S, cmatrix_t &V, | |||
return discarded_value; | |||
} | |||
|
|||
void validate_SVdD_result(const cmatrix_t &A, const cmatrix_t &U, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this function to avoid applying AER::Utils::dagger to V all the time in lapack_csvd_wrapper
Summary
Hello, in this PR I add support for using the OpenBLAS/LAPACK SVD functions to replace the Qiskit's SVD (sequential) implementation in https://github.com/Qiskit/qiskit-aer/blob/main/src/simulators/matrix_product_state/svd.cpp#L148 for the MPS simulation.
Details and comments
Some points about the implementation:
QISKIT_LAPACK_SVD
to "activate/deactivate" the replacement to simplify the testing and benchmarking. From the code and comments I understood that it is based in this implementation (https://dl.acm.org/doi/10.1145/363235.363249), therefore, LAPACK zgesvd function should be the same. If you are agree with the replacement, it is just as simple as removing the old code and leaving the new.QISKIT_LAPACK_SVD=DC
print_to_log
functions. While profiling, I saw that these functions where quite expensive, and if you are not in "debug" usually you don't care about the logs, so I added#ifdef DEBUG / #endif
around them to improve performance. If I am wrong, say it and I will just undo it.Finally, you want to see if this improve performance. For that I have used Random Quantum Circuits (https://arxiv.org/pdf/2207.14280.pdf) and this server configuration:
And, the average time (in seconds) from 5 different executions:
As you can see, in the deepest circuits we are talking in (several) hours with the current implementation, but in minutes with the D&C approach.