About once per month, many dbt deps invocations fail at the same time. This appears to be caused by an intermittent error with the hub site host. While it may be worth investigating and improving the uptime of the hub site, it's also a good idea to add retries to these requests.
In the registry._get method, dbt should retry any request that fails either 1) without producing a response code or 2) with a 5xx response code.
Since many dbt jobs run at specific wall clock times (like midnight UTC), we should randomize the wait between retries to avoid a thundering herd scenario.
After the first failure, dbt should wait 5-10 seconds before retrying.
If the request fails again, dbt should wait 5-10 seconds again.
If that request fails, then dbt should raise the resulting exception.
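To make the proposal concrete, here is a minimal sketch of the retry loop described above. It is not dbt's actual implementation: `get_with_retries`, `ServerError`, and the `fetch` callable (which stands in for the request made inside registry._get and returns a `(status_code, body)` pair) are all hypothetical names. It assumes that a request without a response code surfaces as Python's built-in `ConnectionError` or `TimeoutError`; with the requests library you would likely catch `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout` instead.

```python
import random
import time


class ServerError(Exception):
    """Hypothetical error representing a 5xx response."""

    def __init__(self, status_code):
        super().__init__(f"server returned {status_code}")
        self.status_code = status_code


def get_with_retries(fetch, max_attempts=3, min_wait=5.0, max_wait=10.0,
                     sleep=time.sleep):
    """Call fetch() up to max_attempts times, waiting a random interval
    between attempts.

    fetch is a hypothetical callable returning (status_code, body).
    sleep is injectable so the waits can be observed in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status_code, body = fetch()
            if 500 <= status_code < 600:
                # Treat 5xx as retryable, per the proposal.
                raise ServerError(status_code)
            # Any non-5xx response is returned to the caller unchanged.
            return body
        except (ConnectionError, TimeoutError, ServerError):
            if attempt == max_attempts:
                # Out of attempts: raise the resulting exception.
                raise
            # Random 5-10 second wait so jobs that all start at the same
            # wall clock time do not retry in lockstep.
            sleep(random.uniform(min_wait, max_wait))
```

With the defaults this makes at most three requests and waits at most twice, matching the steps above. A fixed 5-10 second window is the simplest option; exponential backoff with jitter would be a common alternative if the hub outages tend to last longer.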
@beckjake @cmcarthur you guys have more experience with this class of problem than I do -- is this a reasonable solution? Would you recommend a different approach for the timeouts?