[CT-472] postgres: Retry n times on connection timeout #5022

Closed
barberscott opened this issue Apr 9, 2022 · 6 comments · Fixed by #5432
Labels: enhancement, postgres, Team:Adapters
@barberscott commented Apr 9, 2022

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Runs/builds/tests can create hundreds of independent database connections depending on the size of the project, and a single connection timeout, whether caused by transient network issues, EC2 load (e.g. the RDS case when connecting through EC2), or postgres load, can cause an entire run to fail. Connection timeouts can be, and most often are, transient, and a retry will often succeed.

Describe alternatives you've considered

No response

Who will this benefit?

Anyone using the postgres connector.

Are you interested in contributing this feature?

No response

Anything else?

This would be similar to connect_retries on Snowflake.
See: https://github.com/dbt-labs/dbt-snowflake/pull/6/files

@barberscott barberscott added enhancement New feature or request triage labels Apr 9, 2022
@github-actions github-actions bot changed the title Retry n times on connection timeout [CT-472] Retry n times on connection timeout Apr 9, 2022
@barberscott barberscott changed the title [CT-472] Retry n times on connection timeout [CT-472] postgres: Retry n times on connection timeout Apr 10, 2022
@jtcohen6 jtcohen6 added Team:Adapters Issues designated for the adapter area of the code postgres and removed triage labels Apr 11, 2022
@jtcohen6 (Contributor)

Thanks @barberscott! I think this will be the place for the change, for both Postgres and dbt-labs/dbt-redshift#96, since they both use dbt-postgres's psycopg2 connection.

There are related recommendations in #3303 around:

  • Specific error codes/messages to capture as transient, and likely to succeed on retry
  • A proposal for user-specified "retryable" errors, as another config in profiles.yml

At this point, given precedent in other adapters (Snowflake + Spark), I'm not opposed to a naive retry_all parameter. It makes sense to turn it on in production (best possible chance of query success) and leave it off in development (quicker iteration on errors).

I sense we'd need to add both complementary configs (connect_retries + connect_timeout), with reasonable defaults. Perhaps we also want to add our attempt at retry_on_database_errors, based on the psycopg2 errors we strongly believe to be transient.

@tomasfarias (Contributor)

Hey all, since this issue is significantly hindering us, I went ahead and modified the Postgres adapter to take a stab at fixing it.

The basic premise was modifying PostgresCredentials to take in the new connect_retries and retry_on_errors parameters:

from dataclasses import dataclass, field
from typing import List, Optional

# Credentials and Port come from the surrounding dbt adapter module

@dataclass
class PostgresCredentials(Credentials):
    host: str
    user: str
    port: Port
    password: str  # on postgres the password is mandatory
    connect_retries: int = 3  # 3 seemed like a sensible default number of retries
    connect_timeout: int = 10
    role: Optional[str] = None
    search_path: Optional[str] = None
    keepalives_idle: int = 0  # 0 means to use the default value
    sslmode: Optional[str] = None
    sslcert: Optional[str] = None
    sslkey: Optional[str] = None
    sslrootcert: Optional[str] = None
    application_name: Optional[str] = "dbt"
    # a mutable default must go through default_factory, or the dataclass raises ValueError
    retry_on_errors: List[str] = field(
        default_factory=lambda: [
            "08006",  # ConnectionFailure
        ]
    )
...

Users can set SQLSTATE codes that they wish to retry on. I'm uncertain whether taking codes is the right approach instead of exception class names (or allowing both).
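For illustration, the two options resolve to the same psycopg2 exception classes, so the choice is mostly about ergonomics; a minimal sketch (both names below are real psycopg2.errors attributes):

import psycopg2.errors

# Option 1: look up by SQLSTATE code
connection_failure = psycopg2.errors.lookup("08006")

# Option 2: look up by exception class name, e.g. if the profile listed "ConnectionFailure"
by_name = getattr(psycopg2.errors, "ConnectionFailure")

assert connection_failure is by_name  # both resolve to the same class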

Afterwards, I added an except clause for these errors to a connection retry loop in PostgresConnectionManager.open:

    @classmethod
    def open(cls, connection):
        ...
        # credentials and kwargs are built in the elided section above

        retryable_errors = tuple(
            psycopg2.errors.lookup(error_code) for error_code in credentials.retry_on_errors
        )

        attempt = 0
        while True:
            try:
                handle = psycopg2.connect(
                    dbname=credentials.database,
                    user=credentials.user,
                    host=credentials.host,
                    password=credentials.password,
                    port=credentials.port,
                    connect_timeout=credentials.connect_timeout,
                    **kwargs,
                )

                if credentials.role:
                    handle.cursor().execute("set role {}".format(credentials.role))

                connection.handle = handle
                connection.state = "open"
                break  # connected successfully, stop retrying

            except retryable_errors as e:
                logger.debug(
                    "Got a retryable error on attempt {} to open a postgres connection: '{}'".format(attempt, e)
                )

                attempt += 1
                if attempt < credentials.connect_retries:
                    continue

                logger.debug(
                    "Reached or exceeded {} retries when opening a postgres connection".format(
                        credentials.connect_retries
                    )
                )

                connection.handle = None
                connection.state = "fail"

                raise dbt.exceptions.FailedToConnectException(str(e))

            except psycopg2.Error as e:
                logger.debug(
                    "Got an unknown error on attempt {} to open a postgres connection: '{}'".format(attempt, e)
                )

                connection.handle = None
                connection.state = "fail"

                raise dbt.exceptions.FailedToConnectException(str(e))

        return connection

Some iteration on the code is most likely needed (as well as documentation). Would this work as a PR? I'd be glad to add the necessary tests and documentation and contribute this.

Thanks.

@jtcohen6 (Contributor) commented Jun 2, 2022

@tomasfarias An approach in this vein makes a lot of sense to me! Thanks for sharing your code.

SQLState errors offer a concise way of specifying retryable errors (if a bit arcane). I think we need to establish a reasonable default set, so that 99% of users never need to override (or even know about) that config. We could also consider a brute force "retry all" option.

psycopg2 doesn't offer built-in retry. Would it be better to look for an OSS retry library (e.g. tenacity) rather than rolling our own? It's not very complex code, but we now have it in a few places (snowflake, spark, bigquery sorta), and I think there's a lot of benefit in standardizing this in core / across adapters, both for users and for us as maintainers.
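For comparison, a minimal sketch of what a tenacity-based retry around the connect call could look like; the retryable exception set, attempt count, and wait are illustrative assumptions rather than a settled design:

import psycopg2
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed

# Illustrative only: which exceptions count as retryable is exactly the open question above.
RETRYABLE_ERRORS = (psycopg2.OperationalError,)

@retry(
    retry=retry_if_exception_type(RETRYABLE_ERRORS),  # only retry errors believed to be transient
    stop=stop_after_attempt(3),  # assumed default of 3 attempts
    wait=wait_fixed(2),  # wait 2 seconds between attempts
    reraise=True,  # re-raise the original psycopg2 error after the final attempt
)
def connect_with_retry(**connect_kwargs):
    return psycopg2.connect(**connect_kwargs)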

cc @nathaniel-may — I know retry is a thing you know / care a lot about, so I'd be curious to hear any thoughts you might have here.

@nathaniel-may (Contributor)

Thanks for tagging me, @jtcohen6. This is one of those cases where I would lean toward "correctness above all else" rather than ask users to know enough about warehouse internals to list retryable error codes. Thankfully most dbt operations are retryable because they're idempotent in nature, but we do have a few places with stateful operations, such as incremental models, where we may need to be abundantly careful about retrying.

My preference is to write our own retry logic unless this library solves a particularly difficult problem for us. Personally, I don't see much value-add in a general purpose "retry library," but I could be convinced if I'm missing something.
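As a point of comparison for that trade-off, a hand-rolled equivalent of the tenacity sketch above is only a few lines (again a sketch; the helper name and defaults are hypothetical):

import time
import psycopg2

def connect_with_retry(retryable_errors, connect_retries=3, retry_delay=2, **connect_kwargs):
    # Hypothetical helper: retry psycopg2.connect on the given exception classes.
    attempt = 0
    while True:
        try:
            return psycopg2.connect(**connect_kwargs)
        except retryable_errors:
            attempt += 1
            if attempt >= connect_retries:
                raise  # out of retries, surface the original error
            time.sleep(retry_delay)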

@jtcohen6 (Contributor)

@tomasfarias I'd be very supportive of a PR for this, if you're open to submitting one! My main recommendation would be, in keeping with Nate's suggestion — rather than ask users to list error codes that are retryable, let's try to develop a good set for starting out. I'm particularly interested in connection errors / timeouts.
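One possible starting set would be the standard PostgreSQL class-08 (connection exception) SQLSTATE codes plus 57P03; treating these as the default retryable set is an assumption here, not a decision from this thread:

# Assumed default set of connection-related SQLSTATE codes (illustrative, not final)
DEFAULT_RETRY_ON_ERRORS = [
    "08000",  # connection_exception
    "08001",  # sqlclient_unable_to_establish_sqlconnection
    "08003",  # connection_does_not_exist
    "08004",  # sqlserver_rejected_establishment_of_sqlconnection
    "08006",  # connection_failure
    "57P03",  # cannot_connect_now (server starting up or shutting down)
]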

@tomasfarias (Contributor)

I'll submit a PR for this within the week. I've also been looking into dbt-bigquery and dbt-snowflake to ensure the retry implementation can be re-used across adapters, and it's not just a dbt-postgres/dbt-redshift thing.

Thanks for the patience, I've been on vacation for the last couple of weeks.
