[CT-472] postgres: Retry n times on connection timeout #5022
Comments
Thanks @barberscott! I think this will be the place for the change, for both Postgres and dbt-labs/dbt-redshift#96, since they both use dbt-postgres's connection handling. There are related recommendations in #3303.

At this point, given precedent in other adapters (Snowflake + Spark), I'm not opposed to a naive retry. I sense we'd need to add both complementary configs.
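For illustration only, here is a rough sketch of the kind of naive retry being discussed: a fixed number of attempts around connection opening. The standalone helper and its arguments are hypothetical, not dbt's actual connection manager API.

```python
import time

import psycopg2


def open_with_retries(credentials, connect_retries=3, retry_delay=1):
    """Naively retry psycopg2.connect a fixed number of times.

    `credentials` is assumed to expose the usual connection fields; the real
    adapter would wire this through its own connection manager instead.
    """
    attempt = 0
    while True:
        try:
            return psycopg2.connect(
                dbname=credentials.database,
                user=credentials.user,
                host=credentials.host,
                password=credentials.password,
                port=credentials.port,
                connect_timeout=credentials.connect_timeout,
            )
        except psycopg2.OperationalError:
            attempt += 1
            if attempt >= connect_retries:
                raise
            time.sleep(retry_delay)  # fixed pause between attempts
```

In a scheme like this, `connect_timeout` bounds each individual attempt while `connect_retries` bounds how many attempts are made, which is why the two settings complement each other.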
Hey all, since this issue is significantly hindering us, I went ahead and modified the Postgres adapter to take a stab at fixing it. The basic premise was modifying `PostgresCredentials` to add retry configuration:

```python
@dataclass
class PostgresCredentials(Credentials):
    host: str
    user: str
    port: Port
    password: str  # on postgres the password is mandatory
    connect_retries: int = 3  # 3 seemed like a sensible default number of retries
    connect_timeout: int = 10
    role: Optional[str] = None
    search_path: Optional[str] = None
    keepalives_idle: int = 0  # 0 means to use the default value
    sslmode: Optional[str] = None
    sslcert: Optional[str] = None
    sslkey: Optional[str] = None
    sslrootcert: Optional[str] = None
    application_name: Optional[str] = "dbt"
    # a mutable default needs default_factory (requires `from dataclasses import field`)
    retry_on_errors: List[str] = field(
        default_factory=lambda: [
            "08006",  # ConnectionFailure
        ]
    )
    ...
```

Users can set SQLSTATE codes that they wish to retry on. I'm uncertain whether taking codes is the right approach instead of exception class names (or allowing both).

Afterwards, I added retry handling to the `open` classmethod:
```python
@classmethod
def open(cls, connection):
    ...  # credentials and kwargs (ssl/keepalive options) are set up here, elided for brevity
    retryable_errors = tuple(
        psycopg2.errors.lookup(error_code) for error_code in credentials.retry_on_errors
    )
    attempt = 0
    while True:
        try:
            handle = psycopg2.connect(
                dbname=credentials.database,
                user=credentials.user,
                host=credentials.host,
                password=credentials.password,
                port=credentials.port,
                connect_timeout=credentials.connect_timeout,
                **kwargs,
            )
            if credentials.role:
                handle.cursor().execute("set role {}".format(credentials.role))
            connection.handle = handle
            connection.state = "open"
            break  # exit the retry loop on a successful connection
        except retryable_errors as e:
            logger.debug(
                "Got a retryable error on attempt {} to open a postgres "
                "connection: '{}'".format(attempt, e)
            )
            attempt += 1
            if attempt < credentials.connect_retries:
                continue
            logger.debug(
                "Reached or exceeded {} retries when opening a postgres connection".format(
                    credentials.connect_retries
                )
            )
            connection.handle = None
            connection.state = "fail"
            raise dbt.exceptions.FailedToConnectException(str(e))
        except psycopg2.Error as e:
            logger.debug(
                "Got an unknown error on attempt {} to open a postgres "
                "connection: '{}'".format(attempt, e)
            )
            connection.handle = None
            connection.state = "fail"
            raise dbt.exceptions.FailedToConnectException(str(e))
    return connection
```

Some iteration over the code is most likely needed (as well as documentation), but would this work as a PR? I'd be glad to add the necessary tests and documentation and contribute this. Thanks.
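One note on the `retry_on_errors` resolution in the snippet above: `psycopg2.errors.lookup()` maps a SQLSTATE code string to the corresponding psycopg2 exception class, which is what lets the `except retryable_errors` clause catch by code. A minimal illustration:

```python
import psycopg2.errors

# "08006" is the SQLSTATE code for connection_failure; lookup() returns the
# matching psycopg2 exception class and raises KeyError for unknown codes.
ConnectionFailure = psycopg2.errors.lookup("08006")

# The same resolution used by the open() method above.
retryable_errors = tuple(psycopg2.errors.lookup(code) for code in ["08006"])
```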
@tomasfarias An approach in this vein makes a lot of sense to me! Thanks for sharing your code. SQLState errors offer a concise way of specifying retryable errors (if a bit arcane). I think we need to establish a reasonable default set, so that 99% of users never need to override (or even know about) that config. We could also consider a brute force "retry all" option.
cc @nathaniel-may — I know retry is a thing you know / care a lot about, so I'd be curious to hear any thoughts you might have here.
Thanks for tagging me in, @jtcohen6. This is one of those cases where I would lean on "correctness above all else" rather than ask users to know enough about warehouse internals to list retryable error codes. Thankfully, most dbt operations are retryable because they're idempotent in nature, but we do have a few places with stateful operations, such as incremental models, where we need to be abundantly careful about retrying. My preference is to write our own retry logic unless a third-party library solves a particularly difficult problem for us. Personally, I don't see much value-add in a general-purpose "retry library," but I could be convinced if I'm missing something.
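To make the "write our own retry logic" idea concrete, a minimal hand-rolled helper restricted to idempotent operations (such as opening a connection, where no statements have run yet) might look like the sketch below; the names are illustrative, not an existing dbt API.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_call(
    fn: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    retries: int,
) -> T:
    """Call fn(), retrying up to `retries` additional times on the given exceptions.

    Intended only for idempotent operations; stateful work such as an
    incremental model's merge should not be blindly retried.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == retries:
                raise
```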
@tomasfarias I'd be very supportive of a PR for this, if you're open to submitting one! My main recommendation, in keeping with Nate's suggestion: rather than ask users to list error codes that are retryable, let's try to develop a good default set to start with. I'm particularly interested in connection errors / timeouts.
I'll submit a PR for this within the week. I've also been looking into dbt-bigquery and dbt-snowflake to ensure the retry implementation can be reused across adapters, and isn't just a dbt-postgres/redshift thing. Thanks for your patience; I've been on vacation for the last couple of weeks.
Is there an existing feature request for this?
Describe the Feature
Runs/builds/tests can create hundreds of independent database connections depending on the size of the project, and a single connection timeout caused by transient network conditions, EC2 load (e.g. the RDS case when connecting through EC2), or Postgres load can cause an entire run to fail. Connection timeouts are most often transient and will usually succeed on a retry.
Describe alternatives you've considered
No response
Who will this benefit?
Anyone using the postgres connector.
Are you interested in contributing this feature?
No response
Anything else?
This would be similar to connect_retries on Snowflake.
See: https://github.com/dbt-labs/dbt-snowflake/pull/6/files