Prefer pandas nullable integers for int fields #159
Conversation
@apiology: A few questions for you on this one
Hi @cwegrzyn!
It certainly seems like it'd be necessary for at least some cases! I'm not entirely sure it's sufficient (but not sure it's not!), but it sounds like you're on the road to doing the testing to verify that.
Yep! Generally if you can write it without feeling too much pain from introducing the real objects, I think migrating to a component test would be great!
The multi-db aspect could potentially be added in tests/integration/records/multi_db/test_records_table2table.py. I could see two ways of doing it:
As for testing the type inference on a CSV (which uses Pandas and would presumably suffer here), that seems squarely in the component test arena - there's no benefit to test that for each database type. If the system can be trusted to be handed a CSV with no schema and produce the right schema, the existing tests should handle the rest. Good luck with it!
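To make the failure mode concrete, here's a minimal sketch (not code from this PR) of what default pandas CSV inference does to an integer column that contains a missing value:

```python
import io
import pandas as pd

csv_data = "id,count\n1,10\n2,\n3,30\n"

# Default inference: the missing value in 'count' forces the column
# to float64, since plain numpy integers cannot represent NULLs.
df = pd.read_csv(io.StringIO(csv_data))
print(df["count"].dtype)  # float64
```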
Ah! Yep, you are right. This solved my original problem (MySQL -> Redshift, which was using pandas as an intermediary). But it doesn't deal with what your original PR does-- having loads from CSV use the nullable types. I'm in the midst of a rabbit-hole of applying the
It's possible that number 2 ought to be solved by running convert_dtypes during type inference and leaning in on pandas type inference (although you've noted the challenges with dates/times there already). For number 1, I feel like I need to go deeper into the rabbit hole of dates/times anyway!
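As an illustration of that option (a minimal sketch, not code from this PR): convert_dtypes() will upgrade integer-like and string columns to pandas' nullable extension types, but it won't parse date-like strings into datetimes, which is the date/time wrinkle mentioned above.

```python
import pandas as pd

# Columns as they might arrive from naive CSV inference.
df = pd.DataFrame({
    "count": [1.0, None, 3.0],                        # float64 because of the missing value
    "name": ["a", None, "c"],                         # object
    "created": ["2021-01-01", None, "2021-03-01"],    # object, looks like a date
})

converted = df.convert_dtypes()
# 'count' becomes nullable Int64; 'name' and 'created' become the string
# extension dtype. Note 'created' is still a string: dates need explicit
# parsing (e.g. pd.to_datetime), which convert_dtypes does not attempt.
print(converted.dtypes)
```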
Great! Going to do that, especially since I suspect the logic is going to get even more complicated in there...
Thanks for those pointers. I attempted to add a column to the standard fixture. But then I ran up against the fact that we generally assume NULL == empty string, but MySQL LOAD DATA INFILE only does NULL == \N.
Makes sense! I think I may get a better grasp on this as I noodle on the date/time situation.
Thanks! And I hope I expressed adequate appreciation for the complexity of the challenge you were taking on in building this... but in any event, I sure do have that appreciation now 😁
Ah, interesting. I guess in theory if you wanted to go further down the rabbit hole, you could start with a separate PR that introduces full support for the 'null_as' hint, describe to Records Mover what null is described as in the file, and have Records Mover rewrite the file to whatever MySQL requires by reading and writing using Pandas. Or maybe just forget the integration test for now and focus on a component test that locks in the behavior you want, and yeah, work around in practice for now until there's time to write the 'null_as' PR :) Half the game here is keeping to one PR at a time and not ending up with six features in at once...
Hahaha, I appreciate it! Figuring out how to provide monotonically increasing value with every PR was a challenge indeed. I don't have much to say about the timestamp vs date issues you referenced - that stuff isn't stuck in my head anymore, sadly, so I'd probably need more detail to be of any use to you.
f604976 to be21401 (Compare)
@apiology: okay-- finally had a bit of time to polish this one off, sorta. This PR very narrowly deals with the question of what pandas types we use for a schema, and says "use the nullable pandas ones when available". I moved the unit tests to component tests and validated the behavior there. As written, this ONLY fixes the problem of exporting from a database that has NULL values via pandas; it does nothing for the problem of using the right pandas types when loading from a CSV file-- that's more complex and I think deserves its own PR. I also couldn't come up with a good test of that behavior alone, because I think it depends on addressing a lot more NULL handling issues if it's going to live in an integration test, which seemed too complex for this PR.
This makes sense to me, and I like the approach of keeping PRs small and focused!
As you note, the result is often a chain of dependencies to get to the high-level functionality you actually want to tweak - keeping track of that (and being handy with the workarounds while you work through the process!) is a challenge, for sure, but given how foundational these features are, it's very helpful as a reviewer to have them kept separate.
return np.uint64
typ = integer_type_for_range(min_, max_, has_extension_types)
if typ:
    return typ
Small thing, but I'd recommend keeping to one single style for avoiding keywords (either dropping letters or adding an underscore after) - there are two used in this function.
(UINT32_MIN, UINT32_MAX, 'uint32', 'UInt32'),
(INT64_MIN, INT64_MAX, 'int64', 'Int64'),
(UINT64_MIN, UINT64_MAX, 'uint64', 'UInt64'),
]
This is feeling like it may be evolving into a class / enum or the like.
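To make that concrete, here's a hedged sketch of what an enum-style version could look like; the class name, member set, and lookup method are hypothetical illustrations, not code from this PR.

```python
from enum import Enum
from typing import Optional


class IntegerType(Enum):
    # (min, max, numpy dtype name, pandas nullable dtype name)
    INT8 = (-2**7, 2**7 - 1, 'int8', 'Int8')
    UINT8 = (0, 2**8 - 1, 'uint8', 'UInt8')
    INT16 = (-2**15, 2**15 - 1, 'int16', 'Int16')
    UINT16 = (0, 2**16 - 1, 'uint16', 'UInt16')
    INT32 = (-2**31, 2**31 - 1, 'int32', 'Int32')
    UINT32 = (0, 2**32 - 1, 'uint32', 'UInt32')
    INT64 = (-2**63, 2**63 - 1, 'int64', 'Int64')
    UINT64 = (0, 2**64 - 1, 'uint64', 'UInt64')

    def __init__(self, min_: int, max_: int, numpy_name: str, pandas_name: str):
        self.min = min_
        self.max = max_
        self.numpy_name = numpy_name
        self.pandas_name = pandas_name

    @classmethod
    def for_range(cls, min_: int, max_: int,
                  has_extension_types: bool) -> Optional[str]:
        # Return the first member (in declaration order) whose bounds
        # cover the requested range, preferring the nullable pandas name
        # when extension types are available.
        for member in cls:
            if member.min <= min_ and max_ <= member.max:
                return member.pandas_name if has_extension_types else member.numpy_name
        return None
```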
Yeah, I spent a ridiculous amount of time on writing and rewriting this style-wise, and searching for public numpy or pandas functions related to types. And then I finally gave up and did this. I didn't feel good about it, but it got the job done without the branching complexity that flake8 complained about.
I feel like I want to see how other schema->pandas conversion stuff goes, see what patterns arise, and then revisit this.
Well, I went for it.
@@ -7,7 +7,7 @@
 },
 "numstr": {
   "type": "string",
-  "index": 1,
+  "index": 2,
I'll bet there was a story behind finding this!
Hah, I think it might have just been red herrings while I was originally working on a NULL integration test, but figured it might as well get fixed before I forget again...
set_stream_logging(level=logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.INFO)
logging.getLogger('boto3').setLevel(logging.INFO)
logging.getLogger('urllib3').setLevel(logging.INFO)
Looks like you did some grunt work to find the noisier external libraries so you could chase down DEBUG-level messages in Records Mover. It might be worth rolling control of those into a set_stream_logging() parameter at some point, or at least adding a comment describing the getLogger() syntax. Not everyone using this is an experienced Pythonista, and this is a good signal that they might have this need as well.
(also just a reminder that the volume is still turned up on these integration tests)
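A hedged sketch of what such a parameter could look like; the parameter name, the library list, and this simplified body are hypothetical, not the current records_mover set_stream_logging() API.

```python
import logging
import sys

# Hypothetical list of chatty third-party libraries to quiet down.
NOISY_LIBRARIES = ('botocore', 'boto3', 'urllib3')


def set_stream_logging(level: int = logging.INFO,
                       noisy_library_level: int = logging.INFO) -> None:
    # Configure the records_mover logger at the requested level
    # (ignoring the richer handler/format setup the real function does).
    handler = logging.StreamHandler(sys.stdout)
    logger = logging.getLogger('records_mover')
    logger.setLevel(level)
    logger.addHandler(handler)
    # Pin noisy libraries to a quieter level so DEBUG runs surface
    # records_mover messages rather than HTTP wire noise.
    for name in NOISY_LIBRARIES:
        logging.getLogger(name).setLevel(noisy_library_level)
```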
In this case the normal behavior of set_stream_logging appears to be correct-- it's just setting the records_mover logger. I couldn't quite figure out what was making the root logger DEBUG-- maybe something in nose itself? Whatever it is, I couldn't find it-- but this worked to make it less noisy.
-    def to_numpy_dtype(self) -> Union[Type[Any], str]:
+    def to_pandas_dtype(self) -> 'Dtype':
C901 'RecordsSchemaField.to_pandas_dtype' is too complex (22)
6ac2a00 to 0ba11c3 (Compare)
When inferring a pandas dtype from schema metadata, we previously used the most specific standard numpy/pandas integer type available. Unfortunately, these types do not support null values, so some data cannot be encoded correctly. This would cause exceptions when pandas was part of the pipeline (i.e., when reading from a file or sourcing from MySQL).
This adjusts the logic so that, when running on a recent enough pandas version, a nullable int type is used UNLESS the field is explicitly designated as required (i.e., non-null).
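As a rough illustration of that rule (the function name and feature check below are mine, not the PR's actual code):

```python
import pandas as pd


def integer_dtype_for_field(required: bool) -> str:
    """Pick a pandas dtype name for an integer field in the schema.

    Nullable extension integers (e.g. 'Int64') arrived in pandas 0.24;
    prefer them whenever the field may contain NULLs, and fall back to
    the plain numpy dtype otherwise.
    """
    has_extension_types = hasattr(pd, 'Int64Dtype')
    if has_extension_types and not required:
        return 'Int64'   # capital-I: pandas nullable integer
    return 'int64'       # plain numpy integer, cannot hold NULLs
```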
Replaces #138