
Also downcast constraints and statistics when downcasting field types #103

Merged: 4 commits merged into master on Oct 2, 2020

Conversation

@vinceatbluelabs (Contributor) commented on Sep 15, 2020

A scenario involving a Pandas dataframe with an object-dtype column that actually contains integers was causing an issue; it looks like the code meant to handle that situation wasn't complete. The rest of the code expects constraints and statistics that are consistent with the field type, but this code wasn't adjusting them to match when downcasting.

Here's the scenario:

#!/usr/bin/env python3

from records_mover import Session
from records_mover.records import existing_table_handling
import pandas as pd

# Specify which file to upload to which table
database = ...
survey_schema = ...
table_name = "dummy_population_vmb"


def upload_file(db, survey_schema, table_name):
    data = {'Population': [11190846, 1303171035, 207847528]}
    df = pd.DataFrame(data,
                      columns=['Population'])

    df['Population'] = df['Population'].astype("Int64")
    df['Population'] = df['Population'].astype("object")

    session = Session()
    session.set_stream_logging()
    records = session.records

    db_engine = session.get_db_engine(db)

    table_handling = existing_table_handling.ExistingTableHandling.APPEND

    source = session.records.sources.dataframe(df=df)
    target = session.records.targets.table(
        schema_name=survey_schema,
        table_name=table_name,
        db_engine=db_engine,
        existing_table_handling=table_handling,
    )
    results = records.move(source, target)
    print(results)


upload_file(database, survey_schema, table_name)

and what was happening:

Traceback (most recent call last):
  File "./upload_issue_3.py", line 43, in <module>
    upload_file(database, survey_schema, table_name)
  File "./upload_issue_3.py", line 39, in upload_file
    results = records.move(source, target)
  File "/Users/broz/src/records-mover/records_mover/records/mover.py", line 124, in move
    return move(fileobjs_source, records_target, processing_instructions)
  File "/Users/broz/src/records-mover/records_mover/records/mover.py", line 84, in move
    return records_target.move_from_fileobjs_source(records_source,
  File "/Users/broz/src/records-mover/records_mover/records/targets/table/target.py", line 77, in move_from_fileobjs_source
    return DoMoveFromFileobjsSource(self.prep,
  File "/Users/broz/src/records-mover/records_mover/records/targets/table/move_from_fileobjs_source.py", line 59, in move
    schema_sql = self.schema_sql_for_load(schema_obj, self.records_format, driver)
  File "/Users/broz/src/records-mover/records_mover/records/targets/table/base.py", line 27, in schema_sql_for_load
    return tweaked_records_schema.to_schema_sql(driver,
  File "/Users/broz/src/records-mover/records_mover/records/schema/schema/__init__.py", line 126, in to_schema_sql
    return schema_to_schema_sql(records_schema=self,
  File "/Users/broz/src/records-mover/records_mover/records/schema/schema/sqlalchemy.py", line 20, in schema_to_schema_sql
    columns = [f.to_sqlalchemy_column(driver) for f in records_schema.fields]
  File "/Users/broz/src/records-mover/records_mover/records/schema/schema/sqlalchemy.py", line 20, in <listcomp>
    columns = [f.to_sqlalchemy_column(driver) for f in records_schema.fields]
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/__init__.py", line 157, in to_sqlalchemy_column
    return field_to_sqlalchemy_column(self, driver)
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/sqlalchemy.py", line 200, in field_to_sqlalchemy_column
    type_=field.to_sqlalchemy_type(driver),
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/__init__.py", line 152, in to_sqlalchemy_type
    return field_to_sqlalchemy_type(self, driver)
  File "/Users/broz/src/records-mover/records_mover/records/schema/field/sqlalchemy.py", line 132, in field_to_sqlalchemy_type
    raise SyntaxError(f"Incorrect constraint type in {field.name}: {field.constraints}")
SyntaxError: Incorrect constraint type in Population: RecordsSchemaFieldStringConstraints({'required': False, 'unique': False})

field = RecordsSchemaField(name=field.name,
                           field_type=field_type,
                           constraints=constraints,
                           # TODO: Can I get this


Uncompleted punchlist item detected - consider resolving or moving this to your issue tracker.

@vinceatbluelabs changed the title from "Clear string constraints and statistics when downcasting field types" to "Also downcast constraints and statistics when downcasting field types" on Oct 1, 2020
@@ -1 +1 @@
-1015
+1036
Contributor Author:

This is from records_mover/records/schema/field/__init__.py growing in size. That's a known issue - there's a lot of stuff in there: #110

Given that this bugfix already involved splitting apart our test suites, I'm inclined to leave that in the backlog for now and ask forgiveness for the bigfiles slip.

@@ -64,9 +64,9 @@ def __init__(self,
     def refine_from_series(self,
                            series: 'Series',
                            total_rows: int,
-                           rows_sampled: int) -> None:
+                           rows_sampled: int) -> 'RecordsSchemaField':
Contributor Author:

This previously mutated the field in place; it now either returns the original object or a new one. Other methods in this chain of calls are changed similarly.
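For illustration, here's a minimal sketch of the mutate-in-place versus return-a-new-object pattern being described; SimpleField and refine_type are made-up stand-ins rather than the real RecordsSchemaField API:

from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class SimpleField:
    name: str
    field_type: str
    statistics: Optional[dict] = None

    def refine_type(self, inferred_type: str) -> 'SimpleField':
        # Instead of assigning to self.field_type, hand back either the
        # original object or a fresh one with the refined type (and with
        # statistics cleared, since they may no longer apply).
        if inferred_type == self.field_type:
            return self
        return replace(self, field_type=inferred_type, statistics=None)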

Contributor:

I love immutability.


datetime.datetime: 'datetime',

pd.Timestamp: 'datetime',
Contributor Author:

The component test I added revealed a couple of types we could handle a little better when downcasting dataframes.
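As a rough sketch of the kind of lookup that mapping feeds (the names below are made up; the real downcasting logic in records-mover is more involved):

import datetime

import pandas as pd

# Map concrete Python/pandas value types found inside an object column to
# records field types; unrecognized types fall back to 'string'.
PYTHON_TYPE_TO_FIELD_TYPE = {
    int: 'integer',
    datetime.datetime: 'datetime',
    pd.Timestamp: 'datetime',
}

def field_type_for_value(value) -> str:
    return PYTHON_TYPE_TO_FIELD_TYPE.get(type(value), 'string')

print(field_type_for_value(pd.Timestamp("2020-10-02")))  # datetime
print(field_type_for_value(11190846))                    # integer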

'datetimetz']

# Be sure to add new things below in FieldType, too
RECORDS_FIELD_TYPES: List[str] = list(get_args(FieldType)) # type: ignore
Contributor Author:

I renamed this from types.py, since mypy doesn't support a module with that name when it's run in the mode we use for single-file IDE support.

I also dropped the guards, as we now include both typing_inspect and typing_extensions as dependencies.
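For context, a minimal sketch of that pattern; the exact imports and the full FieldType membership in records_mover may differ:

from typing import List

from typing_extensions import Literal, get_args

FieldType = Literal['integer', 'decimal', 'string', 'boolean',
                    'date', 'time', 'timetz', 'datetime', 'datetimetz']

# Derive the runtime list from the single canonical Literal definition
# instead of maintaining a parallel hand-written list of strings.
RECORDS_FIELD_TYPES: List[str] = list(get_args(FieldType))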

-raise SyntaxError("Did not expect to see existing statistics "
-                  f"for string type: {field.statistics}")
+raise ValueError("Did not expect to see existing statistics "
+                 f"for string type: {field.statistics}")
Contributor Author:

SyntaxError has some special rules around it: it's documented that the 'filename' field must be set. As a result, when it's raised from this code, the unit tests blow up when they try to read that field.

ValueError works fine here, though.
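A quick illustration of the difference, outside of records-mover:

# A bare SyntaxError carries empty location attributes such as 'filename',
# which is what the unit tests stumbled over when inspecting the exception.
err = SyntaxError("Did not expect to see existing statistics")
print(err.filename)  # None

# ValueError has no such expectations; the message is all it carries.
err = ValueError("Did not expect to see existing statistics")
print(err)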

if field.constraints and \
        not isinstance(field.constraints, RecordsSchemaFieldIntegerConstraints):
    raise ValueError(f"Incorrect constraint type in {field.name}: {field.constraints}")

Contributor Author:

We were missing the runtime check before doing a cast() below, which is why folks received a type error in our internal use.
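A minimal sketch of why the runtime guard matters before cast(); the classes and the helper below are simplified stand-ins, not the records-mover API:

from typing import Optional, cast

class IntegerConstraints:
    pass

class StringConstraints:
    pass

def as_integer_constraints(name: str, constraints) -> Optional[IntegerConstraints]:
    if constraints and not isinstance(constraints, IntegerConstraints):
        raise ValueError(f"Incorrect constraint type in {name}: {constraints}")
    # cast() is only a hint to the type checker and does nothing at runtime,
    # so without the isinstance() check above a StringConstraints instance
    # would flow straight into integer-specific handling.
    return cast(Optional[IntegerConstraints], constraints)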

'statistics_type': type(None),
}
}
for field_type in RECORDS_FIELD_TYPES:
Contributor Author:

Pulling this list from the canonical Python type ensures we're testing each of the valid types.
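As a sketch of that testing pattern, with a hypothetical expectations table keyed by field type (the real component test checks much more than presence):

import unittest

# In the real code this list is derived from the FieldType Literal via get_args();
# it's trimmed to two members here to keep the sketch short.
RECORDS_FIELD_TYPES = ['integer', 'string']

# Hypothetical per-type expectations.
EXPECTATIONS = {
    'integer': {'statistics_type': type(None)},
    'string': {'statistics_type': type(None)},
}

class TestFieldTypeCoverage(unittest.TestCase):
    def test_every_field_type_is_covered(self) -> None:
        # Driving the loop from the canonical list means a newly added field
        # type fails this test until expectations are written for it.
        for field_type in RECORDS_FIELD_TYPES:
            with self.subTest(field_type=field_type):
                self.assertIn(field_type, EXPECTATIONS)

if __name__ == '__main__':
    unittest.main()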

rows_sampled=mock_rows_sampled,
total_rows=mock_total_rows,)
self.assertEqual(mock_field.statistics, mock_statistics)
self.assertEqual(mock_field.field_type, 'string')
Contributor Author:

These tests were actually nonsensical - you'd never see a more specific type determined from this data.

@vinceatbluelabs (Contributor, Author):

The build failure is awaiting a merge of #109.

@vinceatbluelabs marked this pull request as ready for review on October 1, 2020 at 21:42
@cwegrzyn (Contributor) left a comment:

This makes sense to me! :shipit:


total_rows=total_rows,
rows_sampled=rows_sampled)
for field in records_schema.fields
]
return RecordsSchema(fields=fields,
Contributor:

So much immutability. This PR has made my evening.

from pandas import DataFrame


class TestDataframeSchemaSqlCreation(unittest.TestCase):
Contributor:

Or is this a regression test? 🤔 😁

Contributor Author:

Re: regression - let me know if you want to talk more about test suite classification!

Contributor:

Noooooope, just making a joke :)

@vinceatbluelabs force-pushed the clear_string_constraints_and_stats branch 3 times, most recently from 768b954 to 675b784 on October 2, 2020 at 14:30
vinceatbluelabs and others added 2 commits October 2, 2020 10:34
* Document test suite differences

* Refine definition of component test

* Drop unneeded patch

* Move tests to component suite

* Split up test file between suites

* Add __init__.py in new directories

* Move test to component suite

* Fix missing resources

* Add missing __init__.py files

* Combine coverage

Co-authored-by: Vince Broz <[email protected]>
DRY up types now that we depend on typing_inspect and typing_extensions

Add initial unit test

SyntaxError has required fields like 'filename'

Unit tests blow up when a SyntaxError without that is raised

Fix existing unit test

Add another missing import

Make RecordsSchema#refine_from_dataframe create a new schema

Fix formatting

Implement rest of types

Drop unit test

This scenario was non-sensical, as we wouldn't downcast a string field
type to string

Drop debugging prints

Fix unneeded imports

Fix a flake8 issue because I can't find the one I introduced

Add TODOs

Drop debugging print

Drop debugging print

Add datetime downcast support

Factor out field downcast

Make if/else typesafe

Factor out method to statistics hierarchy

Factor out overall cast of fields

Ratchet flake8

Ratchet coverage

Unratchet bigfiles

351: tests/integration/itest
344: setup.py
341: records_mover/records/schema/field/__init__.py

Reduce total number of bigfiles violations to 1015 or below!

Rename types.py to field_types.py to keep mypy happy when used via editor

Unratchet mypy_high_water_mark

The added code relies on Pandas, which is not well covered by stubs.
There aren't a lot of easy opportunities to improve other type
coverage that I can see.

Ratchet mypy

Ratchet mypy coverage

Fix use of refine_from_dataframe in integration tests
@vinceatbluelabs force-pushed the clear_string_constraints_and_stats branch from 3c68c7c to 894c053 on October 2, 2020 at 14:37
@vinceatbluelabs merged commit 34c3b03 into master on Oct 2, 2020
@vinceatbluelabs deleted the clear_string_constraints_and_stats branch on October 2, 2020 at 14:59