
RM-39 update to sqlalchemy 1.4 #208

Merged: 41 commits into master from RM-39-Bump-SQLAlchemy-to-version-1.4 on Feb 24, 2023

Commits
b7ed499
RM-39 update to sqlalchemy 1.4
ryantimjohn Feb 6, 2023
f208d22
RM-39 update to sqlalchemy 1.4 syntax
ryantimjohn Feb 6, 2023
3c51771
RM-39 revert sqlalchemy version change
ryantimjohn Feb 6, 2023
0d13460
RM-39 bump to 1.4 less than 2.0
ryantimjohn Feb 6, 2023
8383ba3
RM-39 ratchet mypy
ryantimjohn Feb 6, 2023
2ced342
RM-39 update expected postgres data types
ryantimjohn Feb 6, 2023
955d99c
Merge branch 'master' into RM-39-Bump-SQLAlchemy-to-version-1.4
ryantimjohn Feb 17, 2023
10aed1b
RM-39 try mysqlclient
ryantimjohn Feb 21, 2023
979378b
RM-39 try mysqldb
ryantimjohn Feb 21, 2023
9b5681e
Revert "RM-39 try mysqldb"
ryantimjohn Feb 21, 2023
8f94c3a
Revert "RM-39 try mysqlclient"
ryantimjohn Feb 21, 2023
07cda90
RM-39 create connection with mysql instead of engine
Feb 21, 2023
14c8caf
RM-39 update mysql df datatypes
ryantimjohn Feb 21, 2023
653864b
RM-39 update mysql table2table dtypes
ryantimjohn Feb 21, 2023
8511c04
RM-39 records_numeric to conn instead of engine
ryantimjohn Feb 21, 2023
5144093
RM-39 update engine syntax
ryantimjohn Feb 21, 2023
d5dda88
RM-39 update syntax
ryantimjohn Feb 21, 2023
b95ed71
RM-39 update mysql syntax
ryantimjohn Feb 21, 2023
cfae5fe
RM-39 update mysql syntax
ryantimjohn Feb 21, 2023
9d18882
RM-39 update mysql numeric expectations
ryantimjohn Feb 21, 2023
216f137
RM-39 add testing for pymysql
ryantimjohn Feb 21, 2023
6ed364b
RM-39 add patches
ryantimjohn Feb 21, 2023
486ceea
RM-39 update mock references
ryantimjohn Feb 21, 2023
1612393
RM-39 remove test, lower water mark
ryantimjohn Feb 21, 2023
140e1f9
RM-39 make sure db is Engine
ryantimjohn Feb 21, 2023
d98a0bd
RM-39 update redshift datatypes
ryantimjohn Feb 21, 2023
7cb6795
RM-39 ignore type
ryantimjohn Feb 21, 2023
4908108
RM-39 remove type ignore
ryantimjohn Feb 21, 2023
2228613
RM-39 update table2table dtypes
ryantimjohn Feb 21, 2023
04ecb8f
RM-39 lower mypy high watermark
ryantimjohn Feb 21, 2023
cad91c6
RM-39 update to new dtypes
ryantimjohn Feb 21, 2023
ad38d17
Merge branch 'master' into RM-39-Bump-SQLAlchemy-to-version-1.4
ryantimjohn Feb 21, 2023
6e27b05
RM-39 update .get_columns to always use connection
ryantimjohn Feb 22, 2023
c1e5d0f
RM-39 add VSCode settings to .gitignore
ryantimjohn Feb 22, 2023
6d0f65c
Merge branch 'RM-39-Bump-SQLAlchemy-to-version-1.4' of https://github…
ryantimjohn Feb 22, 2023
bfa4a0c
RM-39 allow newest sqlalchemy
ryantimjohn Feb 22, 2023
75a56e7
Revert "RM-39 allow newest sqlalchemy"
ryantimjohn Feb 22, 2023
cf7cf2c
RM-39 update code to close connection
ryantimjohn Feb 22, 2023
7111705
RM-39 lower coverage watermark
ryantimjohn Feb 22, 2023
a37a086
RM-39 fix flake8 problems
ryantimjohn Feb 22, 2023
5393016
Merge branch 'master' into RM-39-Bump-SQLAlchemy-to-version-1.4
ryantimjohn Feb 24, 2023
2 changes: 1 addition & 1 deletion metrics/coverage_high_water_mark
@@ -1 +1 @@
-93.5400
+93.5100
2 changes: 1 addition & 1 deletion metrics/mypy_high_water_mark
@@ -1 +1 @@
-92.3700
+92.3500
12 changes: 9 additions & 3 deletions records_mover/records/sources/table.py
@@ -77,9 +77,15 @@ def to_dataframes_source(self,
         db = self.driver.db
         records_schema = self.pull_records_schema()

-        columns = db.dialect.get_columns(db,
-                                         self.table_name,
-                                         schema=self.schema_name)
+        if isinstance(db, Engine) and db.driver == 'pymysql':
+            conn = db.connect()
+            columns = db.dialect.get_columns(conn,
+                                             self.table_name,
+                                             schema=self.schema_name)
+        else:
+            columns = db.dialect.get_columns(db,
+                                             self.table_name,
+                                             schema=self.schema_name)

         num_columns = len(columns)
         if num_columns == 0:
2 changes: 1 addition & 1 deletion setup.py
@@ -144,7 +144,7 @@ def initialize_options(self) -> None:
 ]

 db_dependencies = [
-    'sqlalchemy>=1.3.18,<1.4',
+    'sqlalchemy>=1.4,<2.0',
 ]

 smart_open_dependencies = [
58 changes: 29 additions & 29 deletions tests/integration/records/expected_column_types.py
@@ -9,32 +9,32 @@
     'redshift': [
         'INTEGER', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)',
         'VARCHAR(3)', 'VARCHAR(111)', 'DATE', 'VARCHAR(8)',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMPTZ'
+        'TIMESTAMP', 'TIMESTAMPTZ'
     ],
     'postgresql': [
         'INTEGER', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)',
-        'VARCHAR(3)', 'VARCHAR(111)', 'DATE', 'TIME WITHOUT TIME ZONE',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMP WITH TIME ZONE'
+        'VARCHAR(3)', 'VARCHAR(111)', 'DATE', 'TIME',
+        'TIMESTAMP', 'TIMESTAMP'
     ],
     'bigquery': [
         'INTEGER', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)', 'VARCHAR(3)',
         'VARCHAR(111)', 'DATE', 'TIME', 'DATETIME', 'TIMESTAMP'
     ],
     'mysql': [
-        'INTEGER(11)', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)', 'VARCHAR(3)',
-        'VARCHAR(111)', 'DATE', 'TIME', 'DATETIME(6)', 'DATETIME(6)'
+        'INTEGER', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)', 'VARCHAR(3)',
+        'VARCHAR(111)', 'DATE', 'TIME', 'DATETIME', 'DATETIME'
     ],
 }

 expected_df_loaded_database_column_types = {
     'postgresql': [
         'BIGINT', 'VARCHAR(12)', 'VARCHAR(12)', 'VARCHAR(4)', 'VARCHAR(4)',
-        'VARCHAR(12)', 'VARCHAR(444)', 'DATE', 'TIME WITHOUT TIME ZONE',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMP WITH TIME ZONE'
+        'VARCHAR(12)', 'VARCHAR(444)', 'DATE', 'TIME',
+        'TIMESTAMP', 'TIMESTAMP'
     ],
     'mysql': [
-        'BIGINT(20)', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)', 'VARCHAR(3)',
-        'VARCHAR(111)', 'DATE', 'TIME', 'DATETIME(6)', 'DATETIME(6)'
+        'BIGINT', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)', 'VARCHAR(3)',
+        'VARCHAR(111)', 'DATE', 'TIME', 'DATETIME', 'DATETIME'
     ],
     'vertica': [
         'INTEGER', 'VARCHAR(12)', 'VARCHAR(12)', 'VARCHAR(4)', 'VARCHAR(4)',
@@ -44,7 +44,7 @@
     'redshift': [
         'BIGINT', 'VARCHAR(12)', 'VARCHAR(12)', 'VARCHAR(4)', 'VARCHAR(4)',
         'VARCHAR(12)', 'VARCHAR(444)', 'DATE', 'VARCHAR(8)',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMPTZ'
+        'TIMESTAMP', 'TIMESTAMPTZ'
     ],
     'bigquery': [
         'INTEGER', 'VARCHAR(12)', 'VARCHAR(12)', 'VARCHAR(4)', 'VARCHAR(4)',
@@ -116,8 +116,8 @@
     # date/time/timestamp/timestamptz all get turned into strings.
     ('postgresql', 'postgresql'): [
         'INTEGER', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
-        'VARCHAR(256)', 'VARCHAR(256)', 'DATE', 'TIME WITHOUT TIME ZONE',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMP WITH TIME ZONE'
+        'VARCHAR(256)', 'VARCHAR(256)', 'DATE', 'TIME',
+        'TIMESTAMP', 'TIMESTAMP'
     ],
     ('postgresql', 'vertica'): [
         'INTEGER', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
@@ -127,7 +127,7 @@
     ('postgresql', 'redshift'): [
         'INTEGER', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
         'VARCHAR(256)', 'VARCHAR(256)', 'DATE', 'VARCHAR(8)',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMPTZ'
+        'TIMESTAMP', 'TIMESTAMPTZ'
     ],
     ('postgresql', 'bigquery'): [
         'INTEGER', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
@@ -141,7 +141,7 @@
     ('redshift', 'postgresql'): [
         'INTEGER', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)',
         'VARCHAR(3)', 'VARCHAR(111)', 'DATE', 'VARCHAR(8)',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMP WITH TIME ZONE'
+        'TIMESTAMP', 'TIMESTAMP'
     ],
     ('bigquery', 'redshift'): [
         'BIGINT', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
@@ -154,8 +154,8 @@
     ],
     ('bigquery', 'postgresql'): [
         'BIGINT', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
-        'VARCHAR(256)', 'VARCHAR(256)', 'DATE', 'TIME WITHOUT TIME ZONE',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMP WITH TIME ZONE'
+        'VARCHAR(256)', 'VARCHAR(256)', 'DATE', 'TIME',
+        'TIMESTAMP', 'TIMESTAMP'
     ],
     ('bigquery', 'vertica'): [
         'INTEGER', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
@@ -170,37 +170,37 @@
         'VARCHAR(444)', 'DATE', 'TIME', 'DATETIME', 'DATETIME',
     ],
     ('redshift', 'mysql'): [
-        'INTEGER(11)', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)', 'VARCHAR(3)',
-        'VARCHAR(111)', 'DATE', 'VARCHAR(8)', 'DATETIME(6)', 'DATETIME(6)'
+        'INTEGER', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)', 'VARCHAR(3)',
+        'VARCHAR(111)', 'DATE', 'VARCHAR(8)', 'DATETIME', 'DATETIME'
     ],
     ('postgresql', 'mysql'): [
-        'INTEGER(11)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
+        'INTEGER', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
         'VARCHAR(256)',
-        'VARCHAR(256)', 'DATE', 'TIME', 'DATETIME(6)', 'DATETIME(6)'
+        'VARCHAR(256)', 'DATE', 'TIME', 'DATETIME', 'DATETIME'
     ],
     ('bigquery', 'mysql'): [
-        'BIGINT(20)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
+        'BIGINT', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)', 'VARCHAR(256)',
         'VARCHAR(256)',
-        'VARCHAR(256)', 'DATE', 'TIME', 'DATETIME(6)', 'DATETIME(6)'
+        'VARCHAR(256)', 'DATE', 'TIME', 'DATETIME', 'DATETIME'
     ],
     ('mysql', 'postgresql'): [
         'INTEGER', 'VARCHAR(12)', 'VARCHAR(12)', 'VARCHAR(4)', 'VARCHAR(4)', 'VARCHAR(12)',
-        'VARCHAR(444)', 'DATE', 'TIME WITHOUT TIME ZONE', 'TIMESTAMP WITHOUT TIME ZONE',
-        'TIMESTAMP WITHOUT TIME ZONE'
+        'VARCHAR(444)', 'DATE', 'TIME', 'TIMESTAMP',
+        'TIMESTAMP'
     ],
     ('mysql', 'redshift'): [
         'INTEGER', 'VARCHAR(12)', 'VARCHAR(12)', 'VARCHAR(4)', 'VARCHAR(4)', 'VARCHAR(12)',
-        'VARCHAR(444)', 'DATE', 'VARCHAR(8)', 'TIMESTAMP WITHOUT TIME ZONE',
-        'TIMESTAMP WITHOUT TIME ZONE'
+        'VARCHAR(444)', 'DATE', 'VARCHAR(8)', 'TIMESTAMP',
+        'TIMESTAMP'
     ],
     ('vertica', 'postgresql'): [
         'BIGINT', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)',
-        'VARCHAR(3)', 'VARCHAR(111)', 'DATE', 'TIME WITHOUT TIME ZONE',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMP WITH TIME ZONE'
+        'VARCHAR(3)', 'VARCHAR(111)', 'DATE', 'TIME',
+        'TIMESTAMP', 'TIMESTAMP'
     ],
     ('vertica', 'redshift'): [
         'BIGINT', 'VARCHAR(3)', 'VARCHAR(3)', 'VARCHAR(1)', 'VARCHAR(1)',
         'VARCHAR(3)', 'VARCHAR(111)', 'DATE', 'VARCHAR(8)',
-        'TIMESTAMP WITHOUT TIME ZONE', 'TIMESTAMPTZ'
+        'TIMESTAMP', 'TIMESTAMPTZ'
     ],
 }
28 changes: 14 additions & 14 deletions tests/integration/records/single_db/numeric_expectations.py
@@ -120,11 +120,11 @@
         'uint64': 'NUMERIC(20, 0)',
         'float16': 'REAL',
         'float32': 'REAL',
-        'float64': 'DOUBLE PRECISION',
-        'float128': 'DOUBLE PRECISION',  # Redshift doesn't support >float64
+        'float64': 'DOUBLE_PRECISION',
+        'float128': 'DOUBLE_PRECISION',  # Redshift doesn't support >float64
         'fixed_6_2': 'NUMERIC(6, 2)',
         'fixed_38_9': 'NUMERIC(38, 9)',
-        'fixed_100_4': 'DOUBLE PRECISION'  # Redshift doesn't support fixed precision > 38
+        'fixed_100_4': 'DOUBLE_PRECISION'  # Redshift doesn't support fixed precision > 38
     },
     'vertica': {
         'int8': 'INTEGER',
@@ -180,8 +180,8 @@
         'uint64': 'NUMERIC(20, 0)',
         'float16': 'REAL',
         'float32': 'REAL',
-        'float64': 'DOUBLE PRECISION',
-        'float128': 'DOUBLE PRECISION',  # Postgres doesn't support >float64
+        'float64': 'DOUBLE_PRECISION',
+        'float128': 'DOUBLE_PRECISION',  # Postgres doesn't support >float64
         'fixed_6_2': 'NUMERIC(6, 2)',
         'fixed_38_9': 'NUMERIC(38, 9)',
         'fixed_100_4': 'NUMERIC(100, 4)',
@@ -191,15 +191,15 @@
     # relevant and records-mover just uses the defaults which end up
     # as the below.
     'mysql': {
-        'int8': 'TINYINT(4)',
-        'int16': 'SMALLINT(6)',
-        'int32': 'INTEGER(11)',
-        'int64': 'BIGINT(20)',
-        'ubyte': 'TINYINT(3) UNSIGNED',
-        'uint8': 'TINYINT(3) UNSIGNED',
-        'uint16': 'SMALLINT(5) UNSIGNED',
-        'uint32': 'INTEGER(10) UNSIGNED',
-        'uint64': 'BIGINT(20) UNSIGNED',
+        'int8': 'TINYINT',
+        'int16': 'SMALLINT',
+        'int32': 'INTEGER',
+        'int64': 'BIGINT',
+        'ubyte': 'TINYINT',
+        'uint8': 'TINYINT',
+        'uint16': 'SMALLINT',
+        'uint32': 'INTEGER',
+        'uint64': 'BIGINT',
         'float16': 'FLOAT',
         'float32': 'FLOAT',
         'float64': 'DOUBLE',
11 changes: 9 additions & 2 deletions tests/integration/records/single_db/test_records_numeric.py
@@ -67,8 +67,15 @@ def test_numeric_schema_fields_created(self) -> None:
         self.validate_records_schema(tempdir)

     def validate_table(self):
-        columns = self.engine.dialect.get_columns(self.engine, self.table_name,
-                                                  schema=self.schema_name)
+        if self.engine.driver == 'pymysql':
+            conn = self.engine.connect()
+            columns = self.engine.dialect.get_columns(conn,
+                                                      self.table_name,
+                                                      schema=self.schema_name)
+        else:
+            columns = self.engine.dialect.get_columns(self.engine,
+                                                      self.table_name,
+                                                      schema=self.schema_name)
         # Note that Redshift doesn't support TIME type:
         # https://docs.aws.amazon.com/redshift/latest/dg/r_Datetime_types.html
         actual_column_types = {
12 changes: 9 additions & 3 deletions tests/integration/records/table_validator.py
@@ -62,9 +62,15 @@ def validate(self,
         self.validate_data_values(schema_name, table_name)

     def validate_data_types(self, schema_name: str, table_name: str) -> None:
-        columns = self.target_db_engine.dialect.get_columns(self.target_db_engine,
-                                                            table_name,
-                                                            schema=schema_name)
+        if self.target_db_engine.driver == 'pymysql':
+            conn = self.target_db_engine.connect()
+            columns = self.target_db_engine.dialect.get_columns(conn,
+                                                                table_name,
+                                                                schema=schema_name)
+        else:
+            columns = self.target_db_engine.dialect.get_columns(self.target_db_engine,
+                                                                table_name,
+                                                                schema=schema_name)

Collaborator:
This is so weird to me. Does the get_columns method take different types of objects depending on what the driver is?

Collaborator:
From the sqlalchemy documentation, get_columns says it always takes a connection as the first arg.
https://docs.sqlalchemy.org/en/14/core/internals.html#sqlalchemy.engine.Dialect.get_columns.

I'm sure you've already tried this, but can't we just pass the conn object in both cases?

Is this records-mover's fault for storing inconsistent types in its wrappers around sqlalchemy?
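
For reference, here's the calling convention those docs describe, as a minimal sketch (the URL and table name are made up):

import sqlalchemy as sa

engine = sa.create_engine("mysql+pymysql://user:pass@localhost/mydb")  # hypothetical URL
with engine.connect() as conn:
    # Per the 1.4 docs, Dialect.get_columns takes a Connection first,
    # then the table name, with schema as a keyword argument.
    columns = engine.dialect.get_columns(conn, "my_table", schema="my_schema")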

Collaborator:
class DBDriver(metaclass=ABCMeta):
    def __init__(self,
                 db: Union[sqlalchemy.engine.Engine,
                           sqlalchemy.engine.Connection], **kwargs) -> None:

well this is weird

Collaborator:
Alright so I think the reason for this is that in some cases we want to instantiate a db_driver in the context of a transaction, as in this example from records_mover/records/prep_and_load.py

    with tbl.db_engine.begin() as db:
        # This second transaction ensures the table has been created
        # before non-transactional statements like Redshift's COPY
        # take place.  Otherwise you'll get an error like:
        #
        #  Cannot COPY into nonexistent table
        driver = tbl.db_driver(db)
        try:
            import_count = load(driver)
        except load_exception_type:
            if not tbl.drop_and_recreate_on_load_error:
                raise
            reset_before_reload()
            with tbl.db_engine.begin() as db:
                driver = tbl.db_driver(db)
                prep.prep(schema_sql=schema_sql,
                          driver=driver,
                          existing_table_handling=ExistingTableHandling.DROP_AND_RECREATE)
                import_count = load(driver)
        return MoveResult(move_count=import_count, output_urls=None)

begin() starts a transaction and returns a connection object - naming it a "db" is pretty confusing.

Anyway, other times we don't care about transactions and don't create the connection object beforehand. I'm kind of confused about this, but sqlalchemy had this feature "connection-less execution" which basically let you execute stuff on an engine directly without needing a connection first, but they've been phasing it out: https://docs.sqlalchemy.org/en/14/core/connections.html#connectionless-execution-implicit-execution. Maybe this is why the code used to work.
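
To illustrate the difference, a quick sketch (the URL and query are made up):

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/mydb")  # hypothetical URL

# "Connectionless" execution: allowed in 1.3, deprecated in 1.4, removed in 2.0.
result = engine.execute("SELECT 1")

# The explicit style 1.4 pushes you toward: acquire a Connection first.
with engine.connect() as conn:
    result = conn.execute(sa.text("SELECT 1"))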

I think the proper solution is to split the union from DBDriver into an optional connection argument and an Engine. If no connection is given it can create one from the Engine. Then, when a caller wants to do something like get_columns, they always use the connection attribute of DBDriver, not the Engine. This solves a bunch of confusion caused by examples such as above, where target_db_engine isn't actually an engine, it's a connection.
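
Roughly this shape, as a sketch (the attribute names here are just for illustration):

from abc import ABCMeta
from typing import Optional

import sqlalchemy


class DBDriver(metaclass=ABCMeta):
    def __init__(self,
                 db_engine: sqlalchemy.engine.Engine,
                 db_conn: Optional[sqlalchemy.engine.Connection] = None,
                 **kwargs) -> None:
        self.db_engine = db_engine
        # Callers inside a transaction hand us their Connection;
        # otherwise we create one from the Engine.
        self.db_conn = db_conn if db_conn is not None else db_engine.connect()

Then anything that needs introspection (like get_columns) always goes through db_conn and never has to guess which type it was handed.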

Should we track this work separately or do it as part of this ticket?

If we choose to put this work off until later, in the short term I think we should change the above code that I commented on (as well as the other similar occurrences) to read something like

if isinstance(self.target_db_engine, sqlalchemy.engine.Connection):
    connection = self.target_db_engine
else:
    connection = self.target_db_engine.connect()
columns = self.target_db_engine.dialect.get_columns(connection,
                                                    table_name,
                                                    schema=schema_name)

Contributor Author:
Implemented this approach.

         expected_column_names = [
             'num', 'numstr', 'str', 'comma', 'doublequote', 'quotecommaquote',
             'newlinestr', 'date', 'time', 'timestamp', 'timestamptz'