-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Also downcast constraints and statistics when downcasting field types #103
Conversation
field = RecordsSchemaField(name=field.name, | ||
field_type=field_type, | ||
constraints=constraints, | ||
# TODO: Can I get this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uncompleted punchlist item detected--consider resolving or moving this to your issue tracker
metrics/bigfiles_high_water_mark
Outdated
@@ -1 +1 @@ | |||
1015 | |||
1036 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from records_mover/records/schema/field/__init__.py
increasing in size. This is a known issue - there's a lot of stuff in there: #110
Given that this bugfix already involved splitting apart our test suites, I'm inclined to leave that in the backlog for now and ask forgiveness for the bigfiles slip.
@@ -64,9 +64,9 @@ def __init__(self, | |||
def refine_from_series(self, | |||
series: 'Series', | |||
total_rows: int, | |||
rows_sampled: int) -> None: | |||
rows_sampled: int) -> 'RecordsSchemaField': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This previously mutated the series passed in; it now either returns the original or a new object. Other methods in this chain of calls are changed similarly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love immutability.
|
||
datetime.datetime: 'datetime', | ||
|
||
pd.Timestamp: 'datetime', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The component test I added revealed a couple of types we could handle a little better when downcasting dataframes.
'datetimetz'] | ||
|
||
# Be sure to add new things below in FieldType, too | ||
RECORDS_FIELD_TYPES: List[str] = list(get_args(FieldType)) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed this from types.py
as mypy doesn't support having a file named that when it's working in the mode it's used in for single-file IDE support.
I also dropped the guards, as we now include both typing_inspect and typing_extensions as dependencies.
raise SyntaxError("Did not expect to see existing statistics " | ||
f"for string type: {field.statistics}") | ||
raise ValueError("Did not expect to see existing statistics " | ||
f"for string type: {field.statistics}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SyntaxError has some special rules around it; it's documented that the 'filename' field must be set. As a result, when it's thrown out of code, the unit tests blow up when they try to read that field
ValueError works fine here, though.
if field.constraints and\ | ||
not isinstance(field.constraints, RecordsSchemaFieldIntegerConstraints): | ||
raise ValueError(f"Incorrect constraint type in {field.name}: {field.constraints}") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were missing the runtime check before doing a cast() below, which is why folks received a type error in our internal use.
'statistics_type': type(None), | ||
} | ||
} | ||
for field_type in RECORDS_FIELD_TYPES: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling this list from the canonical Python type ensures we're testing each of the valid types.
rows_sampled=mock_rows_sampled, | ||
total_rows=mock_total_rows,) | ||
self.assertEqual(mock_field.statistics, mock_statistics) | ||
self.assertEqual(mock_field.field_type, 'string') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests were actually non-sensical - you'd never see a more specific type determined from this data.
Build failure is awaiting a merge of #109 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me!
@@ -64,9 +64,9 @@ def __init__(self, | |||
def refine_from_series(self, | |||
series: 'Series', | |||
total_rows: int, | |||
rows_sampled: int) -> None: | |||
rows_sampled: int) -> 'RecordsSchemaField': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love immutability.
total_rows=total_rows, | ||
rows_sampled=rows_sampled) | ||
for field in records_schema.fields | ||
] | ||
return RecordsSchema(fields=fields, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So much immutability. This PR has made my evening.
from pandas import DataFrame | ||
|
||
|
||
class TestDataframeSchemaSqlCreation(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is this a regression test? 🤔 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: regression - let me know if you want to talk more about test suite classification!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noooooope, just making a joke :)
768b954
to
675b784
Compare
* Document test suite differences * Refine definition of component test * Drop unneeded patch * Move tests to component suite * Split up test file between suites * Add __init__.py in new directories * Move test to component suite * Fix missing resources * Add missing __init__.py files * Combine coverage Co-authored-by: Vince Broz <[email protected]>
DRY up types now that we depend on typing_inspect and typing_extensions Add initial unit test SyntaxError has required fields like 'filename' Unit tests blow up when a SyntaxError without that is raised Fix existing unit test Add another missing import Make RecordsSchema#refine_from_dataframe create a new schema Fix formatting Implement rest of types Drop unit test This scenario was non-sensical, as we wouldn't downcast a string field type to string Drop debugging prints Fix unneeded imports Fix a flake8 issue because I can't find the one I introduced Add TODOs Drop debugging print Drop debugging print Add datetime downcast support Factor out field downcast Make if/else typesafe Factor out method to statistics hierarchy Factor out overall cast of fields Ratchet flake8 Ratchet coverage Unratchet bigfiles 351: tests/integration/itest 344: setup.py 341: records_mover/records/schema/field/__init__.py Reduce total number of bigfiles violations to 1015 or below! Rename types.py to field_types.py to keep mypy happy when used via editor Unratchet mypy_high_water_mark The added code relies on Pandas, which is not well covered by stubs. There aren't a lot of easy opportunities to improve other type coverage that I can see.. Ratchet mypy Ratchet mypy coverage Fix use of refine_from_dataframe in integration tests
3c68c7c
to
894c053
Compare
A scenario involving a Pandas dataframe with an object dtype but with integers inside was causing an issue - it looks like the code meant to handle situations like that wasn't complete. The rest of the code is expecting type-consistent constraints and statistics, but this code wasn't adjusting those to match.
Here's the scenario:
and what was happening: