
Also downcast constraints and statistics when downcasting field types #103

Merged: 4 commits merged into master on Oct 2, 2020

Conversation

@vinceatbluelabs (Contributor) commented on Sep 15, 2020

A scenario involving a Pandas dataframe with an object-dtype column that actually contains integers was causing an issue; it looks like the code meant to handle that situation wasn't complete. The rest of the code expects constraints and statistics that are consistent with the field type, but this code wasn't adjusting them to match when downcasting.

Here's the scenario:

#!/usr/bin/env python3

from records_mover import Session
from records_mover.records import existing_table_handling
import pandas as pd

# Specify which file to upload to which table
database = ...
survey_schema = ...
table_name = "dummy_population_vmb"


def upload_file(db, survey_schema, table_name):
    data = {'Population': [11190846, 1303171035, 207847528]}
    df = pd.DataFrame(data,
                      columns=['Population'])

    df['Population'] = df['Population'].astype("Int64")
    df['Population'] = df['Population'].astype("object")

    session = Session()
    session.set_stream_logging()
    records = session.records

    db_engine = session.get_db_engine(db)

    table_handling = existing_table_handling.ExistingTableHandling.APPEND

    source = session.records.sources.dataframe(df=df)
    target = session.records.targets.table(
        schema_name=survey_schema,
        table_name=table_name,
        db_engine=db_engine,
        existing_table_handling=table_handling,
    )
    results = records.move(source, target)
    print(results)


upload_file(database, survey_schema, table_name)

and what was happening:

Traceback (most recent call last):
  File "./upload_issue_3.py", line 43, in <module>
    upload_file(database, survey_schema, table_name)
  File "./upload_issue_3.py", line 39, in upload_file
    results = records.move(source, target)
  File "/Users/broz/src/records-mover/records_mover/records/mover.py", line 124, in move
    return move(fileobjs_source, records_target, processing_instructions)
  File "/Users/broz/src/records-mover/records_mover/records/mover.py", line 84, in move
    return records_target.move_from_fileobjs_source(records_source,
  File "/Users/broz/src/records-mover/records_mover/records/targets/table/target.py", line 77, in move_from_fileobjs_source
    return DoMoveFromFileobjsSource(self.prep,
  File "/Users/broz/src/records-mover/records_mover/records/targets/table/move_from_fileobjs_source.py", line 59, in move
    schema_sql = self.schema_sql_for_load(schema_obj, self.records_format, driver)
  File "/Users/broz/src/records-mover/records_mover/records/targets/table/base.py", line 27, in schema_sql_for_load
    return tweaked_records_schema.to_schema_sql(driver,
  File "/Users/broz/src/records-mover/records_mover/records/schema/schema/__init__.py", line 126, in to_schema_sql
    return schema_to_schema_sql(records_schema=self,
  File "/Users/broz/src/records-mover/records_mover/records/schema/schema/sqlalchemy.py", line 20, in schema_to_schema_sql
    columns = [f.to_sqlalchemy_column(driver) for f in records_schema.fields]
  File "/Users/broz/src/records-mover/records_mover/records/schema/schema/sqlalchemy.py", line 20, in <listcomp>
    columns = [f.to_sqlalchemy_column(driver) for f in records_schema.fields]
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/__init__.py", line 157, in to_sqlalchemy_column
    return field_to_sqlalchemy_column(self, driver)
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/sqlalchemy.py", line 200, in field_to_sqlalchemy_column
    type_=field.to_sqlalchemy_type(driver),
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/__init__.py", line 152, in to_sqlalchemy_type
    return field_to_sqlalchemy_type(self, driver)
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/sqlalchemy.py", line 132, in field_to_sqlalchemy_type
    raise SyntaxError(f"Incorrect constraint type in {field.name}: {field.constraints}")
SyntaxError: Incorrect constraint type in Population: RecordsSchemaFieldStringConstraints({'required': False, 'unique': False})

field = RecordsSchemaField(name=field.name,
                           field_type=field_type,
                           constraints=constraints,
                           # TODO: Can I get this


Uncompleted punchlist item detected - consider resolving or moving this to your issue tracker.

@vinceatbluelabs changed the title from "Clear string constraints and statistics when downcasting field types" to "Also downcast constraints and statistics when downcasting field types" on Oct 1, 2020
@@ -1 +1 @@
-1015
+1036
Contributor Author:

This is from records_mover/records/schema/field/__init__.py growing in size. That's a known issue - there's a lot of stuff in there: #110

Given that this bugfix already involved splitting apart our test suites, I'm inclined to leave that in the backlog for now and ask forgiveness for the bigfiles slip.

@@ -64,9 +64,9 @@ def __init__(self,
     def refine_from_series(self,
                            series: 'Series',
                            total_rows: int,
-                           rows_sampled: int) -> None:
+                           rows_sampled: int) -> 'RecordsSchemaField':
Contributor Author:

This previously mutated the field in place; it now either returns the original object or a new one. Other methods in this chain of calls are changed similarly.
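For illustration, here's a minimal sketch of the mutate-in-place versus return-a-new-object pattern being described; SimpleField and refine_type are made-up stand-ins rather than the real RecordsSchemaField API:

from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class SimpleField:
    name: str
    field_type: str
    statistics: Optional[dict] = None

    def refine_type(self, inferred_type: str) -> 'SimpleField':
        # Instead of assigning to self.field_type, hand back either the
        # original object or a fresh one with the refined type (and with
        # statistics cleared, since they may no longer apply).
        if inferred_type == self.field_type:
            return self
        return replace(self, field_type=inferred_type, statistics=None)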

Contributor:

I love immutability.


datetime.datetime: 'datetime',

pd.Timestamp: 'datetime',
Contributor Author:

The component test I added revealed a couple of types we could handle a little better when downcasting dataframes.
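As a rough sketch of the kind of lookup that mapping feeds (the names below are made up; the real downcasting logic in records-mover is more involved):

import datetime

import pandas as pd

# Map concrete Python/pandas value types found inside an object column to
# records field types; unrecognized types fall back to 'string'.
PYTHON_TYPE_TO_FIELD_TYPE = {
    int: 'integer',
    datetime.datetime: 'datetime',
    pd.Timestamp: 'datetime',
}

def field_type_for_value(value) -> str:
    return PYTHON_TYPE_TO_FIELD_TYPE.get(type(value), 'string')

print(field_type_for_value(pd.Timestamp("2020-10-02")))  # datetime
print(field_type_for_value(11190846))                    # integer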

'datetimetz']

# Be sure to add new things below in FieldType, too
RECORDS_FIELD_TYPES: List[str] = list(get_args(FieldType)) # type: ignore
Contributor Author:

I renamed this from types.py, since mypy doesn't support a module with that name when it's run in the mode we use for single-file IDE support.

I also dropped the guards, as we now include both typing_inspect and typing_extensions as dependencies.
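For context, a minimal sketch of that pattern; the exact imports and the full FieldType membership in records_mover may differ:

from typing import List

from typing_extensions import Literal, get_args

FieldType = Literal['integer', 'decimal', 'string', 'boolean',
                    'date', 'time', 'timetz', 'datetime', 'datetimetz']

# Derive the runtime list from the single canonical Literal definition
# instead of maintaining a parallel hand-written list of strings.
RECORDS_FIELD_TYPES: List[str] = list(get_args(FieldType))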

-raise SyntaxError("Did not expect to see existing statistics "
-                  f"for string type: {field.statistics}")
+raise ValueError("Did not expect to see existing statistics "
+                 f"for string type: {field.statistics}")
Contributor Author:

SyntaxError has some special rules around it: it's documented that the 'filename' field must be set. As a result, when it's raised from this code, the unit tests blow up when they try to read that field.

ValueError works fine here, though.
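A quick illustration of the difference, outside of records-mover:

# A bare SyntaxError carries empty location attributes such as 'filename',
# which is what the unit tests stumbled over when inspecting the exception.
err = SyntaxError("Did not expect to see existing statistics")
print(err.filename)  # None

# ValueError has no such expectations; the message is all it carries.
err = ValueError("Did not expect to see existing statistics")
print(err)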

if field.constraints and \
        not isinstance(field.constraints, RecordsSchemaFieldIntegerConstraints):
    raise ValueError(f"Incorrect constraint type in {field.name}: {field.constraints}")

Contributor Author:

We were missing the runtime check before doing a cast() below, which is why folks received a type error in our internal use.
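A minimal sketch of why the runtime guard matters before cast(); the classes and the helper below are simplified stand-ins, not the records-mover API:

from typing import Optional, cast

class IntegerConstraints:
    pass

class StringConstraints:
    pass

def as_integer_constraints(name: str, constraints) -> Optional[IntegerConstraints]:
    if constraints and not isinstance(constraints, IntegerConstraints):
        raise ValueError(f"Incorrect constraint type in {name}: {constraints}")
    # cast() is only a hint to the type checker and does nothing at runtime,
    # so without the isinstance() check above a StringConstraints instance
    # would flow straight into integer-specific handling.
    return cast(Optional[IntegerConstraints], constraints)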

'statistics_type': type(None),
}
}
for field_type in RECORDS_FIELD_TYPES:
Contributor Author:

Pulling this list from the canonical Python type ensures we're testing each of the valid types.
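As a sketch of that testing pattern, with a hypothetical expectations table keyed by field type (the real component test checks much more than presence):

import unittest

# In the real code this list is derived from the FieldType Literal via get_args();
# it's trimmed to two members here to keep the sketch short.
RECORDS_FIELD_TYPES = ['integer', 'string']

# Hypothetical per-type expectations.
EXPECTATIONS = {
    'integer': {'statistics_type': type(None)},
    'string': {'statistics_type': type(None)},
}

class TestFieldTypeCoverage(unittest.TestCase):
    def test_every_field_type_is_covered(self) -> None:
        # Driving the loop from the canonical list means a newly added field
        # type fails this test until expectations are written for it.
        for field_type in RECORDS_FIELD_TYPES:
            with self.subTest(field_type=field_type):
                self.assertIn(field_type, EXPECTATIONS)

if __name__ == '__main__':
    unittest.main()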

rows_sampled=mock_rows_sampled,
total_rows=mock_total_rows,)
self.assertEqual(mock_field.statistics, mock_statistics)
self.assertEqual(mock_field.field_type, 'string')
Contributor Author:

These tests were actually nonsensical - you'd never see a more specific type determined from this data.

@vinceatbluelabs (Contributor, Author):

The build failure is awaiting a merge of #109.

@vinceatbluelabs marked this pull request as ready for review on October 1, 2020 at 21:42
@cwegrzyn (Contributor) left a comment:

This makes sense to me! :shipit:


total_rows=total_rows,
rows_sampled=rows_sampled)
for field in records_schema.fields
]
return RecordsSchema(fields=fields,
Contributor:

So much immutability. This PR has made my evening.

from pandas import DataFrame


class TestDataframeSchemaSqlCreation(unittest.TestCase):
Contributor:

Or is this a regression test? 🤔 😁

Contributor Author:

Re: regression - let me know if you want to talk more about test suite classification!

Contributor:

Noooooope, just making a joke :)

@vinceatbluelabs force-pushed the clear_string_constraints_and_stats branch 3 times, most recently from 768b954 to 675b784 on October 2, 2020 at 14:30
vinceatbluelabs and others added 2 commits October 2, 2020 10:34
* Document test suite differences

* Refine definition of component test

* Drop unneeded patch

* Move tests to component suite

* Split up test file between suites

* Add __init__.py in new directories

* Move test to component suite

* Fix missing resources

* Add missing __init__.py files

* Combine coverage

Co-authored-by: Vince Broz <[email protected]>
DRY up types now that we depend on typing_inspect and typing_extensions

Add initial unit test

SyntaxError has required fields like 'filename'

Unit tests blow up when a SyntaxError without that is raised

Fix existing unit test

Add another missing import

Make RecordsSchema#refine_from_dataframe create a new schema

Fix formatting

Implement rest of types

Drop unit test

This scenario was non-sensical, as we wouldn't downcast a string field
type to string

Drop debugging prints

Fix unneeded imports

Fix a flake8 issue because I can't find the one I introduced

Add TODOs

Drop debugging print

Drop debugging print

Add datetime downcast support

Factor out field downcast

Make if/else typesafe

Factor out method to statistics hierarchy

Factor out overall cast of fields

Ratchet flake8

Ratchet coverage

Unratchet bigfiles

351: tests/integration/itest
344: setup.py
341: records_mover/records/schema/field/__init__.py

Reduce total number of bigfiles violations to 1015 or below!

Rename types.py to field_types.py to keep mypy happy when used via editor

Unratchet mypy_high_water_mark

The added code relies on Pandas, which is not well covered by stubs.
There aren't a lot of easy opportunities to improve other type
coverage that I can see.

Ratchet mypy

Ratchet mypy coverage

Fix use of refine_from_dataframe in integration tests
@vinceatbluelabs force-pushed the clear_string_constraints_and_stats branch from 3c68c7c to 894c053 on October 2, 2020 at 14:37
@vinceatbluelabs merged commit 34c3b03 into master on Oct 2, 2020
@vinceatbluelabs deleted the clear_string_constraints_and_stats branch on October 2, 2020 at 14:59