Unexpected Scaling of Runtime in Repeated Circuit Execution #2654
Edit: please disregard the rest of this comment and see the next comment in the thread instead.

Hi @vinitX - I think that most of your execution time is unrelated to compilation time. (Compilation should only happen once in your example, and it will be cached for future iterations.) The reason your code is taking longer and longer to run is that you are asking it to run deeper and deeper kernels each time through the loop.

If you can rewrite your CUDA-Q kernel to take the previous state as an input and only apply the new portion of the circuit, each call stays the same depth. For example, take a look at https://github.com/NVIDIA/cuda-quantum/blob/main/docs/sphinx/applications/python/trotter.ipynb. The key line to focus on is this one:

```python
state = cudaq.get_state(trotter, state, coefficients, words, dt)
```

Note how - in that example - the `state` returned by one call is passed back in as the input to the next, so each call only simulates a single time step.

Let me know if you have any questions about that.
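The state-passing pattern from the Trotter notebook can be illustrated with a toy one-qubit simulator in plain Python (an analogy only, not the CUDA-Q API; every name here is illustrative): feeding each call's output state back in as the next call's input means each call simulates one step rather than the whole history, so per-call cost stays constant.

```python
import math

def apply_step(state, theta):
    """Advance a one-qubit state by a single RY(theta) step.

    `state` is an (amp0, amp1) pair of complex amplitudes; it stands in for
    the state object that cudaq.get_state returns in the Trotter example.
    """
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    a0, a1 = state
    return (c * a0 - s * a1, s * a0 + c * a1)

# Start in |0> and evolve step by step, feeding each result back in --
# analogous to `state = cudaq.get_state(trotter, state, ...)` in the notebook.
state = (1.0 + 0j, 0.0 + 0j)
for _ in range(10):
    # Each call simulates ONE step, not all steps accumulated so far.
    state = apply_step(state, 0.1)
```

After ten RY(0.1) steps the state is the same as a single RY(1.0) rotation, but no individual call ever had to replay the earlier steps.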
Oops, sorry, I now see that my explanation above doesn't apply to your example. However, I am not sure I agree with this comment from the original issue:
I agree that the circuit does not need to be recompiled on iterations 2-10, so they may be slightly faster. But if you are seeing that the first invocation takes ~9 seconds, I imagine that compilation is a small fraction of that time. Most of the time is circuit execution time, which should be roughly the same for each call.
Hi @bmhowe23, thanks for your reply! You're probably right that the small amount of compilation time is already accounted for in the first iteration. After that, the execution time is dominated by running the kernel, which explains the linear scaling as the number of runs increases.

I had expected that, since we're working with a parameterized quantum circuit with a fixed architecture, re-running the circuit after compilation would be much faster. This is something I've observed in other libraries, like TensorCircuit with the JAX backend, where compiling a parameterized circuit creates a function that evaluates extremely quickly on subsequent runs. [See the code on my GitHub for reference – Link]

It would be great to have a similar feature in CUDA-Q, where the compiled circuit is efficiently reused to speed up iterations. This kind of optimization would be especially valuable when working with parameterized circuits that require repeated function evaluations.
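The compile-once / evaluate-many behavior described here can be sketched in plain Python (a toy analogy, not the CUDA-Q or TensorCircuit API; all names are illustrative): the expensive setup runs exactly once, and the returned closure reuses the cached result on every subsequent call.

```python
def build_evaluator(n):
    """Do the expensive one-time 'compilation' (stand-in: a precomputed
    table), then return a cheap function that can be called many times."""
    table = [i * i for i in range(n)]  # expensive work, done once

    def evaluate(params):
        # Cheap per-call work that reuses the cached `table`; only the
        # parameters change between calls, never the compiled structure.
        return sum(table[i % n] * p for i, p in enumerate(params))

    return evaluate

evaluate = build_evaluator(5)     # "compile" once (slow)
r1 = evaluate([0.1, 0.2, 0.3])    # subsequent calls are fast
r2 = evaluate([0.4, 0.5, 0.6])    # ...and reuse the same compiled object
```

This is the same shape as `jax.jit`: the first call pays the tracing/compilation cost, and later calls with new parameter values hit the cache.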
Required prerequisites
Describe the bug
I'm experiencing unexpected performance behavior when running a parameterized circuit multiple times with different input parameters using CUDA-Q. I expected the first execution to take longer due to compilation, and subsequent runs to be faster thanks to caching of the compiled circuit. However, I observed that the execution time scales linearly with the number of runs, which is significantly slower than other circuit-simulation libraries.
Steps to reproduce the bug
Steps to Reproduce:

1. Run the script with the arguments `N` and `--sample_size`.
2. Observe that the total runtime scales linearly with `sample_size`, indicating a lack of caching for the compiled circuit.

E.g.:

```
python3 sampling.py 16 --sample_size 10
```

Output: `Sampling Time: 84.00134873390198`

```
python3 sampling.py 16 --sample_size 1
```

Output: `Sampling Time: 9.460504531860352`
The runtime for 10 samples is roughly 10x the runtime for one sample; I expect the runtime for subsequent samples to be faster.
Code to Reproduce:
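The original `sampling.py` is not included in this capture. Based on the invocations above, its command-line structure could look roughly like the following sketch, with the actual CUDA-Q kernel replaced by a hypothetical placeholder:

```python
import argparse
import time

def sample_circuit(n, params):
    """Placeholder for the CUDA-Q kernel invocation (the real kernel is not
    shown in this issue); any per-sample work would go here."""
    return sum(p * p for p in params) * n

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("N", type=int)
    parser.add_argument("--sample_size", type=int, default=1)
    args = parser.parse_args(argv)

    params = [0.1] * args.N
    start = time.time()
    # The reported runtime scales linearly with this loop, which is the
    # behavior the issue describes.
    for _ in range(args.sample_size):
        sample_circuit(args.N, params)
    print(f"Sampling Time: {time.time() - start}")

if __name__ == "__main__":
    main()
```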
Expected behavior
Expected Behavior:
Observed Behavior:
Is this a regression? If it is, put the last known working version (or commit) here.
Not a regression
Environment
The code is run on the CPU. No GPU is involved.
Suggestions
No response