Surface "could not get secrets" (and others) as errors #1959

Closed
alecthomas opened this issue Jul 3, 2024 · 1 comment · Fixed by #1969

@alecthomas
Collaborator

alecthomas commented Jul 3, 2024

Change the retry code for pulling ModuleContext such that it differentiates between retryable errors and fatal errors. The easiest solution is probably to use gRPC status codes to differentiate.

debug:controller0: Streaming RPC failed: internal: could not get secrets: did_web_portable_did: no keychain entry for "did_web_portable_did": not found: /xyz.block.ftl.v1.VerbService/GetModuleContext

Currently these errors are logged only at debug level, and the plugin just fails to start with no message.

@alecthomas alecthomas added next Work that will be picked up next P1 labels Jul 3, 2024
@github-actions github-actions bot added the triage Issue needs triaging label Jul 3, 2024
@ftl-robot ftl-robot mentioned this issue Jul 3, 2024
@github-actions github-actions bot removed triage Issue needs triaging next Work that will be picked up next labels Jul 3, 2024
@alecthomas alecthomas changed the title Surface "could not get secrets" (and other ModuleContext issues) to users (blocked by #1906) Surface "could not get secrets" (and others) as errors Jul 3, 2024
@jonathanj-square
Contributor

jonathanj-square commented Jul 4, 2024

Based on our conversations, there appear to be two goals for resolving this issue:

  • provide error visibility
  • terminate the runner if the error is deemed fatal (to avoid log spam)

What I am seeing is that when the server is successfully terminated, the controller attempts to respawn it, because the controller aims to reach the target replica count. Naturally, this complicates reaching the second goal without impacting deployment resilience.

jonathanj-square added a commit that referenced this issue Jul 4, 2024
controller startup failure troubleshooting was complicated by a failure to surface error messages from the `GetModuleContext` streaming end-point.

errors encountered from that end-point are now made visible and if those errors are deemed fatal the runner will terminate.