-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling failures? #24
Comments
I ended up writing my own little 'worker manager' defmodule MyApp.OpenIdConnectWorkerManager do
use GenServer
require Logger
def start_link(init_arg) do
GenServer.start_link(__MODULE__, init_arg)
end
# `GenServer` callbacks
@impl true
def init(init_arg)
def init(:ignore) do
:ignore
end
def init(_init_arg) do
state = try_to_start_worker(nil)
{:ok, state}
end
@impl true
def handle_info(msg, state)
def handle_info({:DOWN, _ref, :process, object, reason}, state) do
{worker_pid, _} = state
if object === worker_pid do
Logger.error("OpenID Connect worker failed; reason: #{inspect reason}")
delay_ms = next_delay_ms(nil)
Process.send_after(self(), :try_to_start_worker, delay_ms)
{:noreply, build_state(nil, delay_ms)}
else
Logger.info("Down signal received NOT matching the OpenID Connect worker process")
{:noreply, state}
end
end
def handle_info(:try_to_start_worker, state) do
{_, previous_delay_ms} = state
state = try_to_start_worker(previous_delay_ms)
{:noreply, state}
end
# Other functions
def build_state(worker_pid, delay_ms) do
{worker_pid, delay_ms}
end
def try_to_start_worker(previous_delay_ms) do
Logger.info("Starting OpenID Connect worker ...")
worker_config = Application.get_env(:my_app, :openid_connect_providers)
case GenServer.start(OpenIDConnect.Worker, worker_config, name: :openid_connect) do
{:ok, worker_pid} ->
Logger.info("OpenID Connect worker started")
Process.monitor(worker_pid)
build_state(worker_pid, nil)
{:error, reason} ->
Logger.error("OpenID Connect worker failed to start; reason: #{inspect reason}")
delay_ms = next_delay_ms(previous_delay_ms)
Process.send_after(self(), :try_to_start_worker, delay_ms)
build_state(nil, delay_ms)
end
end
def next_delay_ms(nil), do: 225_000
def next_delay_ms(previous_delay_ms) when previous_delay_ms >= 3600_000, do: 3600_000
def next_delay_ms(previous_delay_ms) do
previous_delay_ms * 2
end
end The |
There could be a better way to handle the worker starting, this would be a good start to discuss. |
Minimally, it seems like all of the changes could be restricted to the defp update_documents(provider, config) do
{:ok, %{remaining_lifetime: remaining_lifetime}} =
{:ok, documents} = OpenIDConnect.update_documents(config)
refresh_time = time_until_next_refresh(remaining_lifetime)
Process.send_after(self(), {:update_documents, provider}, refresh_time)
documents
end The |
Alternatively, it could simply be spelled out in the README that users should consider implementing their own workers if they need something more robust. For one, I could imagine that users might want to handle failures differently depending on the provider. |
I ran into the same issue and I think this should be fixed within the Implementing an exponential backoff could be a strategy, but using a delay of 225 seconds and up to an hour like the code snippet that @kenny-evitt posted would be too long IMO because my users would be unable to login until that call passes. In my case, ideally, it should be retried after a few seconds and then exponentially backoff until we reach 5 or 10 minutes (maybe make a configuration for that?). This mechanism would apply only to the failing provider while other providers could still be working. Another point that I think should be discussed is that getting WDYT? |
@kouno I'd be open to different strategies, or even a way to customize them. Our use cases didn't require the more complex scenarios that you're describing which is why we haven't been in a good position to implement. WRT refetching the discovery documents, according to the spec everything should be refreshed when the TTL window for the resource expires. The overhead for fetching is extremely minimal |
I'm not sure what expiry you are speaking of and I can't seem to find it in the documentation (https://openid.net/specs/openid-connect-discovery-1_0.html#ProviderConfigurationResponse). Do you have a reference that would explain what we are supposed to check? Are you speaking about the |
It's been a while, I'll have to dig through the document but when I was implementing this library I recall that all resources with an expiration should be refetched after expiration |
Seems like it's all in the HTTP headers: openid_connect/lib/openid_connect.ex Lines 384 to 396 in 5292a1b
But we are ignoring the remaining lifetime of the discovery document and use the remaining lifetime of the JWKS URI instead: openid_connect/lib/openid_connect.ex Lines 195 to 203 in 5292a1b
However, for me that isn't even used since my provider does not allow caching (i.e. no-cache) which would mean that it uses the hard coded value:
|
I don't believe the discovery document is in any way time-sensitive, for probably the vast majority of providers. I have reviewed its contents for the providers I work with (Google, and a small company in the insurance industry), and neither contain any autogenerated values. It should change only rarely. The spec for OpenID Connect Discovery is here, and it does not say anything about how cache should be handled: https://openid.net/specs/openid-connect-discovery-1_0.html So I think the most reliable solution is that a failure to re-retrieve the discovery document, after it has already been fetched and cached, should simply be ignored and can be attempted again at the next regular interval. I'm happy to submit a PR for such a fix. A failure of initial retrieval is another story, and will inevitably cause errors until the doc can be successfully retrieved, unless we build a mechanism to store its known content by other means. But except for ensuring that condition doesn't cause the entire application to exit, I'm probably going to treat such additional caching as beyond the scope of a PR. |
Google most certainly does issue a new discovery document at the end of each cache period. Just because the values aren't changing it doesn't mean they never will. |
Correct. That's why the PR keeps retrying. |
PR is here: #30 |
DockYard has assigned a new engineer to address this issue. @mcrumm will be taking a look at the library probably starting tomorrow (maybe?) and attempting to resole the issue. Sorry for the pain 😞 |
Hello @bcardarella, Thank you. I will be waiting. Sincerely, |
@danielweil That error – specifically the But the general problem of this worker crashing on errors can be solved in several ways mentioned or described earlier in the comments here on this issue. But you're going to have to fork this repo, make the changes yourself, and update your Mix dependency for this package to use your customized version. You could probably also setup a suitable supervisor that would just restart the worker indefinitely too. But it looks like you could also modify |
@kenny-evitt Hey, thanks for the reply. Its is very strange, because the authentication with the SSO provider is working perfectly while running the project in my machine. To make it work, I have had to fork the dependency and alter the code as told in this thread. But it still gives me the error when I try in the production machine. I also tried to add an redirect url in the code instead of using authorization_url to generate the provider address for me and when it comes back with the session information, it gives the same error about "No match of right hand side value". Can you think of why the OpenID URI I'm using/trying couldn't be resolved by DNS? Thank you |
Hello, I am the author of the PR mentioned in @danielweil's comment. The intention of the PR is to prevent the app from crashing on init due to a transient DNS resolution failure such as :nxdomain. If such a failure persists after the app has started, the first attempt to resolve the domain of the OpenID URI will indeed fail, and there is nothing that can be done in such a case. The MatchError occurs because of widespread error-handling issues in the original codebase. That problem is that many functions call a subroutine and take its results blindly, without checking for That said, my use case is pretty simple, I think we only call |
Guys, finally figured out what was my issue. With the advice you told me, about failed DNS resolution, I looked up this post about failing DNS resolution within kubernetes: It told me that there is an issue with alpine images. With that, I changed the bitwalker/alpine-elixir-phoenix with the elixir image, and added npm though the Dockerfile, and it worked! Thanks for all those who helped. |
@danielweil Great! I had started writing this in response:
|
@nbryant42 Hi – thanks for the PR! It's still not clear the best way for this package to handle errors. It was relatively straightforward to setup a supervisor ( Otherwise, your changes look okay. I added a comment for some of them that I had some questions about. |
Hi @kenny-evitt , thanks for the feedback. The first cut of this PR was definitely a quick-and-dirty minimal fix. I've made some updates to the PR that should bring it in line with style guidelines; feel free to have another look! |
Just rebased the PR on top of master, as well. |
Just my 2¢: For ease of library adoption, rather than having to implement some fancy incremental back-off supervisor, ideally the worker process should just be set up as documented here, e.g. children = [
{OpenIDConnect.Worker, Application.get_env(:my_app, :oidc_config)}
]
Supervisor.start_link(children, opts) Ideally the worker shouldn't die in One approach would be include a retry interval and a maximum time limit in the configuration. The worker would periodically attempt to read it's initial configuration until the maximum time limit is reached, after which it could terminate. For the general (rare?) case of multiple providers, it's probably worth considering starting a child process for each provider., so that a DNS failure on one provider doesn't bring down the others. |
This is still an issue currently. If we pass the wrong value to ** (Mix) Could not start application app: App.Application.start(:normal, []) returned an error: shutdown: failed to start child: OpenIDConnect.Worker
** (EXIT) an exception was raised:
** (MatchError) no match of right hand side value: {:error, :update_documents, %HTTPoison.Error{id: nil, reason: :nxdomain}}
(openid_connect 0.2.2) lib/openid_connect/worker.ex:55: OpenIDConnect.Worker.update_documents/2
(openid_connect 0.2.2) lib/openid_connect/worker.ex:23: anonymous fn/1 in OpenIDConnect.Worker.init/1
(elixir 1.13.4) lib/enum.ex:1593: Enum."-map/2-lists^map/1-0-"/2 |
@anildigital are you pulling from HEAD or from Hex? |
@bcardarella hex https://hex.pm/packages/openid_connect It's |
You may want to try pulling from this repo and see if the problem still exists |
- "openid_connect": {:hex, :openid_connect, "0.2.2", "c05055363330deab39ffd89e609db6b37752f255a93802006d83b45596189c0b", [:mix], [{:httpoison, "~> 1.2", [hex: :httpoison, repo: "hexpm", optional: false]}, {:jason, ">= 1.0.0", [hex: :jason, repo: "hexpm", optional: false]}, {:jose, "~> 1.8", [hex: :jose, repo: "hexpm", optional: false]}], "hexpm", "735769b6d592124b58edd0582554ce638524c0214cd783d8903d33357d74cc13"},
+ "openid_connect": {:git, "https://github.com/DockYard/openid_connect.git", "4cef83893f4dcb4ac44ed50aace83dc597ca98aa", []}, Using the git version also doesn't fix the issue. |
The app I'm maintaining references the
OpenIDConnect.Worker
directly in the app supervisor. Recently, I was working on the app while my internet was down. I was very surprised that my app failed to start and it seemed to be because the worker itself was failing to start (because theupdate_documents
call was failing).I also tested an 'intermittent' failure by using a network monitoring tool to block connections to intuit.com and then manually triggering the worker 'update documents' action (by sending a suitable message to the worker process with my app running under
iex
). I was again surprised that the (expected) failure of the worker caused my app itself to shutdown (and almost immediately too).I'm still pretty new to Elixir and Erlang/OTP so I'm not sure whether this is 'reasonable' and that I should expect to handle failures myself, or whether the
Worker
module itself would (at least ideally) gracefully handle errors connecting to the configured OpenID Connect providers.The text was updated successfully, but these errors were encountered: