Improve handling of gRPC failures #471

Open
erichulburd opened this issue May 14, 2024 · 0 comments

Over the past few months, I've seen a variety of gRPC status failures that should be retryable on the client side. The most recent example:

QpuApiError                               Traceback (most recent call last)
...
    212 """Execute a job and return the shots."""
    213 job_id = submit(
    214     program=executable.program,
    215     patch_values=patch,
   (...)
    218     execution_options=self.execution_options,
    219 )
--> 220 return retrieve_results(
    221     job_id=job_id,
    222     quantum_processor_id=self.device_name,
    223     client=self.qcs_client,
    224     execution_options=self.execution_options,
    225 )

QpuApiError: Call failed during gRPC request: status: Unavailable, message: "error trying to connect: Unsuccessful reply: TtlExpired", details: [], metadata: MetadataMap { headers: {} }

It's difficult to diagnose and handle errors of this nature in Python as the QCS SDK is currently structured. I'd advocate considering the following:

  1. Supporting retry configuration on all gRPC calls: translation, execution (i.e. submit), and result retrieval (retrieve_results). This should support retrying based on gRPC status code, along with a configurable backoff strategy (linear, exponential, max retries, etc.).
  2. Surfacing gRPC exceptions to Python in a structured way. At a minimum, this should include the status code. Request ID and timing data would also be nice.
  3. Configurable gRPC logging. The gRPC C API uses environment variables in a well-structured and documented way: https://github.com/grpc/grpc/blob/15850972ddba9c1262a9d51341da03bc607bd934/doc/environment_variables.md
  4. A persistent handle to the gRPC channel. As the client is currently structured, each call to translate, execute, and retrieve results instantiates a new channel (see for instance
    let mut controller_client = execution_options
    and then https://github.com/rigetti/qcs-sdk-rust/blob/main/crates/lib/src/qpu/api.rs#L525). This both adds latency and makes connections more failure-prone, which is contrary to the design of gRPC. If necessary, this should be achievable with some once_cell utilities: https://docs.rs/once_cell/latest/once_cell/sync/struct.Lazy.html.
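To make items 1 and 2 concrete, here is a minimal Python sketch of what a status-code-aware retry wrapper with exponential backoff might look like. Everything in it is hypothetical: `GrpcCallError`, `RetryPolicy`, and `call_with_retry` are illustrative names, not part of the QCS SDK, and the sketch assumes the SDK would raise an exception exposing the gRPC status code as a string.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, TypeVar

T = TypeVar("T")


class GrpcCallError(Exception):
    """Hypothetical structured error; the QCS SDK does not expose this today."""

    def __init__(self, status: str, message: str, request_id: Optional[str] = None):
        super().__init__(f"{status}: {message}")
        self.status = status          # e.g. "UNAVAILABLE"
        self.message = message
        self.request_id = request_id  # optional, for correlating with server logs


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry configuration: retryable codes plus exponential backoff."""

    max_attempts: int = 5
    initial_backoff_s: float = 0.1
    multiplier: float = 2.0
    max_backoff_s: float = 5.0
    retryable: FrozenSet[str] = frozenset({"UNAVAILABLE"})

    def backoff(self, attempt: int) -> float:
        """Upper bound on the delay after failed attempt `attempt` (1-based)."""
        return min(self.max_backoff_s,
                   self.initial_backoff_s * self.multiplier ** (attempt - 1))


def call_with_retry(fn: Callable[[], T],
                    policy: RetryPolicy = RetryPolicy(),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fn`, retrying only on status codes listed in `policy.retryable`."""
    attempt = 1
    while True:
        try:
            return fn()
        except GrpcCallError as err:
            # Give up on non-retryable codes or once the attempt budget is spent.
            if err.status not in policy.retryable or attempt >= policy.max_attempts:
                raise
            # Full jitter: sleep a random duration up to the current backoff cap.
            sleep(random.uniform(0, policy.backoff(attempt)))
            attempt += 1
```

With something like this, the failing call above could be wrapped as `call_with_retry(lambda: retrieve_results(...))`, provided the underlying error carried a machine-readable status rather than only a formatted message.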

If these options present inordinate technical challenges, I wonder whether an alternative approach would be to interface with existing Python gRPC tooling, i.e. expose functions that convert Python-based gRPC message objects to QCS SDK structs.
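For reference, this is roughly what the existing Python gRPC tooling offers for retries: `grpcio` accepts a JSON service config through channel options, including a per-method `retryPolicy`. The sketch below only builds those options using the standard library. The helper name `retry_channel_options` is made up for illustration, and whether QCS endpoints could be called through a plain `grpcio` channel like this is an assumption.

```python
import json
from typing import List, Sequence, Tuple


def retry_channel_options(
    retryable_codes: Sequence[str] = ("UNAVAILABLE",),
    max_attempts: int = 5,
) -> List[Tuple[str, object]]:
    """Build grpcio channel options that enable client-side retries.

    Hypothetical helper; the option keys and service config fields are
    those defined by gRPC itself.
    """
    service_config = {
        "methodConfig": [
            {
                # An empty name entry applies this config to all methods.
                "name": [{}],
                "retryPolicy": {
                    "maxAttempts": max_attempts,
                    "initialBackoff": "0.1s",
                    "maxBackoff": "5s",
                    "backoffMultiplier": 2,
                    "retryableStatusCodes": list(retryable_codes),
                },
            }
        ]
    }
    return [
        ("grpc.enable_retries", 1),
        ("grpc.service_config", json.dumps(service_config)),
    ]


# Usage, assuming grpcio is installed and `target`/`credentials` are defined:
#   channel = grpc.secure_channel(target, credentials,
#                                 options=retry_channel_options())
```

Keeping one such channel alive and reusing it across calls would also address the persistent-handle point in item 4.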

erichulburd changed the title from "Improve handling of retryable error" to "Improve handling of gRPC failures" on May 14, 2024