WIP: Allow nullable integer columns to be used from Pandas 1.0+ #138
Conversation
-        basename = dtype.base.name
+        basename = str(dtype)
+        if 'base' in dir(dtype) and 'name' in dir(dtype.base):
+            basename = dtype.base.name
The new nullable Int64 type in pandas is not actually a type that has a base type, so there's no 'base' attribute on it.
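For context, a minimal sketch of the fallback this diff introduces (`dtype_basename` is an illustrative wrapper name, not records_mover's API):

```python
import numpy as np
import pandas as pd

def dtype_basename(dtype) -> str:
    # numpy dtypes expose a .base with a .name (e.g. 'int64'); pandas
    # extension dtypes like Int64 don't, so fall back to str(dtype).
    basename = str(dtype)
    if 'base' in dir(dtype) and 'name' in dir(getattr(dtype, 'base')):
        basename = dtype.base.name
    return basename

print(dtype_basename(np.dtype('int64')))  # 'int64'
print(dtype_basename(pd.Int64Dtype()))    # 'Int64'
```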
from records_mover.records.schema.schema.known_representation import (
    RecordsSchemaKnownRepresentation
)
from records_mover.records.schema.errors import UnsupportedSchemaError
Unrelated change. Generally I'd like to start moving away from relative import addressing, as it gets confusing and painful to rebaseline when I move code around.
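As a small illustration of the contrast (the relative form shown is an assumption about the old code, not a quote from it):

```python
# Relative addressing breaks when the importing module moves:
#   from ..errors import UnsupportedSchemaError

# Absolute addressing stays valid regardless of where the importer lives:
from records_mover.records.schema.errors import UnsupportedSchemaError
```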
This seems reasonable!
Can you think of any downsides to applying convert_dtypes here? I can't, but I'm wondering whether this is ultimately something that we'd need to make controllable.
Good question! I did not apply it on the path where folks pass their own DataFrame to us, so they can control their own behavior there. On the CSV side, the only thing that gives me a little pause is that someone might send us a CSV along with an existing Records Schema that specifies a column as floating point even though pandas.read_csv().convert_dtypes() would make it into an integer. However, I think that's a special case of a more general existing concern (what if we interpret a float but they specified a string?). We haven't seen any evidence of that being an issue, because databases like Redshift are robust to it and will load an integer-formatted CSV column into a float column. We also haven't yet had interoperability situations where we're reading the records schema data from another tool that might interpret data differently. If it becomes a priority, the way to address that longer term is probably to reformat the dataframe types based on the RecordsSchema object. We could replace the call to convert_dtypes() in fileobjs.py with a call into the records schema object, maybe a new method like …
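To make the float-vs-integer point concrete, here's a small self-contained example of the behavior under discussion (standard pandas 1.0+ API, nothing records_mover-specific):

```python
import io
import pandas as pd

csv = io.StringIO("id,score\n1,10\n2,\n3,30\n")

# pandas.read_csv() infers float64 for 'score' because of the missing
# value, representing the null as NaN.
df = pd.read_csv(csv)
print(df.dtypes['score'])  # float64

# convert_dtypes() (pandas 1.0+) promotes it to the nullable Int64
# extension dtype, representing the null as pd.NA instead.
print(df.convert_dtypes().dtypes['score'])  # Int64
```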
I did a little more research, and we have a method that might be appropriate: …. We use it currently after creating dataframes from tables, but it might also be appropriate when creating dataframes from files. It would also need to be tweaked to call …. This might also handle the case of selecting out from a table, which it's not clear to me that this PR as-is handles. Given that, I think I'll hold off on merging this PR as-is.
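A hypothetical sketch of what a schema-driven cast could look like; the function name, the type mapping, and the `field_types` shape are all invented for illustration and are not the records_mover method referred to above:

```python
import pandas as pd

# Assumed mapping from records schema field types to pandas dtypes.
SCHEMA_TO_PANDAS = {
    'integer': 'Int64',    # nullable integer extension dtype
    'decimal': 'float64',
    'string': 'string',
    'boolean': 'boolean',
}

def cast_dataframe_types(df: pd.DataFrame,
                         field_types: dict) -> pd.DataFrame:
    """Cast each column to the dtype its schema field type dictates,
    so typing follows the Records Schema rather than pandas inference."""
    out = df.copy()
    for column, field_type in field_types.items():
        pandas_dtype = SCHEMA_TO_PANDAS.get(field_type)
        if pandas_dtype is not None and column in out.columns:
            out[column] = out[column].astype(pandas_dtype)
    return out
```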
Allow nullable integer columns to be used in Pandas 1.0+. Prior to that, Pandas would use, e.g., a numpy floating point type for integer columns, representing nulls as NaN.
If Pandas <1.0 is in use, this logs a warning message and proceeds with the raw dtypes.
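A minimal sketch of that version gate, assuming a simple major-version check (the actual records_mover code may detect this differently):

```python
import logging
import pandas as pd

logger = logging.getLogger(__name__)

def maybe_convert_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    major = int(pd.__version__.split('.')[0])
    if major >= 1:
        # Pandas 1.0+: promote to nullable extension dtypes (e.g. Int64).
        return df.convert_dtypes()
    logger.warning("Pandas <1.0 does not support nullable extension "
                   "dtypes; proceeding with the raw dtypes.")
    return df
```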
Manual test I ran to verify we now create nullable integer columns and load correctly into them: