
Unexpected Scaling of Runtime in Repeated Circuit Execution #2654

Open
3 of 4 tasks
vinitX opened this issue Feb 24, 2025 · 3 comments


vinitX commented Feb 24, 2025

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

I'm experiencing unexpected performance behavior when running a parameterized circuit multiple times with different input parameters using CUDA-Q. I expected the first execution to take longer due to compilation and subsequent runs to be faster thanks to caching of the compiled circuit. Instead, the execution time scales linearly with the number of runs, and each run is significantly slower than in other circuit-simulation libraries.

Steps to reproduce the bug


  1. Run the provided script with different values of N and sample_size.
  2. Observe that the runtime increases linearly with sample_size, indicating a lack of caching for the compiled circuit.

E.g.: python3 sampling.py 16 --sample_size 10
Output: Sampling Time: 84.00134873390198

E.g.: python3 sampling.py 16 --sample_size 1
Output: Sampling Time: 9.460504531860352

The runtime for 10 samples is roughly 10x the runtime for one sample; I expect the runtime for subsequent samples to be faster.

Code to Reproduce:

import cudaq
import numpy as np
import time

@cudaq.kernel
def two_qubit_gate(angle: float, qubit_1: cudaq.qubit, qubit_2: cudaq.qubit):
    x.ctrl(qubit_1, qubit_2)
    rz(angle, qubit_2)
    x.ctrl(qubit_1, qubit_2)

@cudaq.kernel
def Trotter_circuit(N: int, k: int, angles_ry: np.ndarray, angles_u3: np.ndarray, angles_2q: np.ndarray):
    qreg = cudaq.qvector(N)

    for i in range(N):
        ry(angles_ry[i], qreg[i])

    for _ in range(k-1):
        for i in range(N):
            u3(angles_u3[i*3], angles_u3[i*3+1], angles_u3[i*3+2], qreg[i])

        for i in range(N):
            for j in range(i + 1, N): 
                two_qubit_gate(angles_2q[i*N+j], qreg[i], qreg[j])

    for i in range(N):
        u3(angles_u3[i*3], angles_u3[i*3+1], angles_u3[i*3+2], qreg[i])

def dict_to_res(counts):
    # With shots_count=1, exactly one bitstring has a count of 1.
    final_config = ""
    for key, value in counts.items():
        if value == 1:
            final_config = key
    return np.array([1.0 if s == '1' else -1.0 for s in final_config])

def main(N, sample_size):
    k = 24
    s = np.random.choice([1., -1.], size=N)

    angles_u3 = np.random.uniform(0, 2*np.pi, 3*N)
    angles_2q = np.random.uniform(0, 2*np.pi, (N, N))

    start_time = time.time()

    for _ in range(sample_size):
        angles_ry = np.pi * (s + 1) / 2
        counts = cudaq.sample(Trotter_circuit, N, k, angles_ry, angles_u3, np.reshape(angles_2q, -1), shots_count=1)
        s = dict_to_res(counts)

    print("Sampling Time: ", time.time() - start_time)

if __name__ == "__main__":
    Trotter_circuit.compile()  # compile the kernel once up front
    
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('N', type=int, help='The system size')
    parser.add_argument('--sample_size', type=int, default=100)
    args = parser.parse_args()
    
    main(args.N, args.sample_size)

Expected behavior


  • The first execution should take longer due to compilation.
  • Subsequent runs should reuse the compiled circuit and execute faster.

Observed Behavior:

  • Each execution appears to take roughly the same amount of time, leading to a linear increase in total runtime.
  • This behavior suggests that caching might not be working as expected.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA-Q version: cu12-latest
  • Python version: 3.10.12
  • C++ compiler: NA
  • Operating system: Windows 11

The code is run on the CPU. No GPU is involved.

Suggestions

No response

@vinitX vinitX changed the title Unexpected Linear Scaling of Runtime in Repeated Circuit Execution Unexpected Scaling of Runtime in Repeated Circuit Execution Feb 24, 2025
bmhowe23 (Collaborator) commented Feb 24, 2025

Edit: please disregard the rest of this comment and see the next comment in the thread instead.

Hi @vinitX - I think that most of your execution time is unrelated to compilation time. (Compilation should only happen once in your example, and it will be cached for future iterations.) The reason your code takes longer and longer to run is that you are asking it to run deeper and deeper kernels each time through the loop (because N gets bigger each time you invoke Trotter_circuit).

If you can rewrite your CUDA-Q kernel (Trotter_circuit) to accept a cudaq.State as an input parameter, then you can save the state after each iteration and use it as input to your next iteration. This will save execution time, and this is the recommended way to do Trotterizations in CUDA-Q.

For example, take a look at https://github.com/NVIDIA/cuda-quantum/blob/main/docs/sphinx/applications/python/trotter.ipynb. The key line to focus on is this one:

state = cudaq.get_state(trotter, state, coefficients, words, dt)

Note how - in that example - the trotter kernel takes a cudaq.State as input, and it essentially runs one iteration of quantum operations rather than all N iterations.

Let me know if you have any questions about that.

bmhowe23 (Collaborator) commented


Oops, sorry, I now see that N is the same for each invocation, so please disregard my prior comment.

However, I am not sure I agree with this comment from the original issue:

The runtime for 10 samples is roughly 10x the runtime for one sample; I expect the runtime for subsequent samples to be faster.

I agree that the circuit does not need to be recompiled on iterations 2-10, so they may be slightly faster. But if the first invocation takes ~9 seconds, compilation is likely a small fraction of that time. Most of the time is circuit execution, which should be roughly the same for each call to cudaq.sample. Therefore, I expect 10 sample calls to take approximately 10x the runtime of a single sample.
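[Editor's note: the cost structure described above can be sketched with a plain-Python stand-in. All names here (compile_kernel, sample) are hypothetical placeholders, not the CUDA-Q API: compilation is cached after the first call, but every call still pays the full simulation cost, so total runtime grows linearly with the number of samples.]

```python
import functools

compile_count = 0
execute_count = 0

@functools.lru_cache(maxsize=None)
def compile_kernel(source: str) -> str:
    """Stand-in for one-time kernel compilation (cached by source)."""
    global compile_count
    compile_count += 1
    return f"compiled({source})"

def sample(source: str, param: float) -> float:
    """Stand-in for a sample call: reuses the compiled kernel,
    but still performs a full circuit simulation every time."""
    global execute_count
    compile_kernel(source)   # cache hit after the first call
    execute_count += 1       # per-call simulation cost
    return param             # placeholder "result"

for i in range(10):
    sample("Trotter_circuit", float(i))

print(compile_count, execute_count)  # → 1 10: compiled once, executed 10 times
```

Caching eliminates only the one-time compilation term; the per-call execution term still dominates, which is consistent with the observed ~10x runtime for 10 samples.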

vinitX (Author) commented Feb 27, 2025

Hi, @bmhowe23, thanks for your reply! You're probably right that the small amount of compilation time is already accounted for in the first iteration. After that, the execution time is dominated by running the kernel, which explains the linear scaling as the number of runs increases.

I had expected that, since we're working with a parameterized quantum circuit with a fixed architecture, re-running the circuit after compilation would be much faster. This is something I’ve observed in other libraries, like TensorCircuit with the JAX backend, where compiling a parameterized circuit creates a function that evaluates extremely quickly on subsequent runs.

[See the code on my GitHub for reference – Link]

It would be great to have a similar feature in CUDA-Q, where the compiled circuit is efficiently reused to speed up iterations. This kind of optimization would be especially valuable for parameterized circuits that require repeated function evaluations.
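[Editor's note: the TensorCircuit/JAX pattern referred to above can be sketched in plain Python. This is an illustrative analogy only, with hypothetical names (build_evaluator, jit_like); neither tensorcircuit nor jax is used: the fixed circuit structure is "traced" once into a reusable closure, and subsequent calls with new parameter values skip the tracing step entirely.]

```python
import numpy as np

trace_count = 0

def build_evaluator(n: int):
    """Stand-in for jit tracing: analyze the fixed circuit structure once
    and return a fast closure that depends only on the parameters."""
    global trace_count
    trace_count += 1
    structure = np.arange(n)  # placeholder for the traced circuit structure
    def evaluate(params: np.ndarray) -> float:
        # Cheap per-call work; the traced structure is reused, not rebuilt.
        return float(np.dot(structure, params))
    return evaluate

_cache = {}

def jit_like(n: int):
    """Return a cached evaluator keyed on the static circuit shape."""
    if n not in _cache:
        _cache[n] = build_evaluator(n)
    return _cache[n]

f = jit_like(4)
results = [f(np.random.uniform(size=4)) for _ in range(10)]
print(trace_count)  # → 1: traced once, evaluated 10 times
```

The key design choice is that the expensive step is keyed on the static shape of the computation, while the parameters remain ordinary runtime inputs - which is the reuse behavior being requested for parameterized kernels here.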
