
Prefer pandas nullable integers for int fields #159

Merged
merged 1 commit into master from use-nullable-int-type on Apr 2, 2021

Conversation

cwegrzyn
Contributor

@cwegrzyn cwegrzyn commented Jan 20, 2021

When inferring a pandas format from schema metadata, we previously used the most specific possible standard numpy/pandas integer type. Unfortunately, these do not support null values, so some data cannot be encoded correctly. This would cause exceptions when pandas was part of the pipeline (i.e., when reading from a file or sourcing from MySQL).

This adjusts the logic so that, if on a recent enough pandas, a nullable int type is used UNLESS the field is explicitly designated as required (i.e., non-null).
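
Roughly, the selection boils down to something like this (an illustrative sketch only, not the actual helper in this change):

```python
import pandas as pd

def dtype_for_int_field(required: bool) -> object:
    """Illustrative only: pick a pandas dtype for an integer field."""
    if required:
        # Field can never be null, so the plain numpy dtype is safe.
        return 'int64'
    # Nullable extension type (pandas >= 0.24) can represent missing values,
    # unlike plain numpy integer dtypes.
    return pd.Int64Dtype()
```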

Replaces #138

@cwegrzyn cwegrzyn self-assigned this Jan 20, 2021
@cwegrzyn
Contributor Author

@apiology: A few questions for you on this one

  1. Does this look basically right? I started from your last comment in the other PR-- and it did seem like a compelling and straightforward place to add this behavior.
  2. Do you think it would fit within the overall idea of the introduction of component tests to migrate some of the tests from test_field.py here into a component test? Generally, looks like the mocks are on schema classes, and it feels like it would be more natural to just write these using the schema classes themselves.
  3. I'm trying to figure out how to introduce an integration test that would fail without this behavior. Basically, something that replicates the two places I've seen this: in loading a plain text file and in copying from mysql to redshift. I can't quite see the right way to insert this (basically, we'd be loading a file that has an integer column with nulls in it). Any pointers?

@apiology
Collaborator

Hi @cwegrzyn!

Does this look basically right? I started from your last comment in the other PR-- and it did seem like a compelling and straightforward place to add this behavior.

It certainly seems like it'd be necessary for at least some cases! I'm not entirely sure it's sufficient (but not sure it's not!), but it sounds like you're on the road to doing the testing to verify that.

Do you think it would fit within the overall idea of the introduction of component tests to migrate some of the tests from test_field.py here into a component test? Generally, looks like the mocks are on schema classes, and it feels like it would be more natural to just write these using the schema classes themselves.

Yep! Generally if you can write it without feeling too much pain from introducing the real objects, I think migrating to a component test would be great!

I'm trying to figure out how to introduce an integration test that would fail without this behavior. Basically, something that replicates the two places I've seen this: in loading a plain text file and in copying from mysql to redshift. I can't quite see the right way to insert this (basically, we'd be loading a file that has an integer column with nulls in it). Any pointers?

The multi-db aspect could potentially be added in tests/integration/records/multi_db/test_records_table2table.py. I could see a few ways of doing it:

  1. You could add nullable cases to RecordsDatabaseFixture and RecordsTableValidator, including this one. The bummer is that you'd also need to enhance everything in tests/integration/resources/, as these three things are shared with the tests in single_db/
  2. You could create a new and improved version of the fixture and validator that are used just in multi_db.
  3. You could leave that fixture and validator alone and create a new fixture, validator, and test that covers the nullable columns. That'd have the downside of more test overhead: twice as many CREATE/DROP/SELECT statements going on, whereas the overhead of an additional column would be much smaller.

As for testing the type inference on a CSV (which uses Pandas and would presumably suffer here), that seems squarely in the component test arena - there's no benefit to test that for each database type. If the system can be trusted to be handed a CSV with no schema and produce the right schema, the existing tests should handle the rest.

Good luck with it!

@cwegrzyn
Contributor Author

It certainly seems like it'd be necessary for at least some cases! I'm not entirely sure it's sufficient (but not sure it's not!), but it sounds like you're on the road to doing the testing to verify that.

Ah! Yep, you are right. This solved my original problem (MySQL -> Redshift, which was using pandas as an intermediary). But it doesn't deal with what your original PR does-- having loads from CSV use the nullable types.

I'm in the midst of a rabbit hole of applying the cast_series_type transformations on CSV loads. But this is turning out to be non-trivial for two reasons:

  1. As you no doubt know, pandas.read_csv loads dates as timestamps, which is not what our cast_series_type expects, so it's causing errors.
  2. I don't think it solves the type inference problem-- we still need to convert to the right types on type inference.

It's possible that number 2 ought to be solved by running convert_dtypes during type inference and leaning on pandas' own type inference (although you've noted the challenges with dates/times there already).
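
Concretely, that's just this pandas behavior (nothing records-mover-specific; needs pandas >= 1.0):

```python
import io
import pandas as pd

csv = io.StringIO("id,score\n1,10\n2,\n3,30\n")
df = pd.read_csv(csv)

print(df['score'].dtype)                   # float64 -- the null forces the ints to float
print(df.convert_dtypes()['score'].dtype)  # Int64 -- nullable extension type
```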

For number 1, I feel like I need to go deeper into the rabbit hole of dates/times anyway!

Yep! Generally if you can write it without feeling too much pain from introducing the real objects, I think migrating to a component test would be great!

Great! Going to do that, especially since I suspect the logic is going to get even more complicated in there...

The multi-db aspect could potentially be added in tests/integration/records/multi_db/test_records_table2table.py. I could see a few ways of doing it:

  1. You could add nullable cases to RecordsDatabaseFixture and RecordsTableValidator, including this one. The bummer is that you'd also need to enhance everything in tests/integration/resources/, as these three things are shared with the tests in single_db/
  2. You could create a new and improved version of the fixture and validator that are used just in multi_db.
  3. You could leave that fixture and validator alone and create a new fixture, validator, and test that covers the nullable columns. That'd have the downside of more test overhead: twice as many CREATE/DROP/SELECT statements going on, whereas the overhead of an additional column would be much smaller.

Thanks for those pointers. I attempted to add a column to the standard fixture. But then I ran up against the fact that we generally assume NULL == empty string, whereas MySQL's LOAD DATA INFILE only treats \N as NULL. MySQL reads an empty string in an int column as the value 0 instead of NULL, no matter what. I guess I could special-case that in the table validator... currently, that's the reality of the MySQL load support until we do some file pre-processing. Maybe that's the answer: accept that nulls will load as 0 for now, and add an issue to pre-process files to fix that? What do you think?

As for testing the type inference on a CSV (which uses Pandas and would presumably suffer here), that seems squarely in the component test arena - there's no benefit to test that for each database type. If the system can be trusted to be handed a CSV with no schema and produce the right schema, the existing tests should handle the rest.

Makes sense! I think I may get a better grasp on this as I noodle on the date/time situation.

Good luck with it!

Thanks! And I hope I expressed adequate appreciation for the complexity of the challenge you were taking on in building this... but in any event, I sure do have that appreciation now 😁

@apiology
Collaborator

Thanks for those pointers. I attempted to add a column to the standard fixture. But then I ran up against the fact that we generally assume NULL == empty string, whereas MySQL's LOAD DATA INFILE only treats \N as NULL. MySQL reads an empty string in an int column as the value 0 instead of NULL, no matter what. I guess I could special-case that in the table validator... currently, that's the reality of the MySQL load support until we do some file pre-processing. Maybe that's the answer: accept that nulls will load as 0 for now, and add an issue to pre-process files to fix that? What do you think?

Ah, interesting. I guess in theory, if you wanted to go further down the rabbit hole, you could start with a separate PR that introduces full support for the 'null_as' hint: describe to Records Mover how null is represented in the file, and have Records Mover rewrite the file to whatever MySQL requires by reading and writing it with pandas.

Or maybe just forget the integration test for now and focus on a component test that locks in the behavior you want, and yeah, work around it in practice until there's time to write the 'null_as' PR :)
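
If someone does pick up that 'null_as' work later, the rewrite step could plausibly be a pandas round-trip along these lines (a sketch only, assuming empty-string nulls on input and \N on output for MySQL; the function name is made up):

```python
import pandas as pd

def rewrite_nulls_for_mysql(src_path: str, dest_path: str) -> None:
    # Treat empty fields as NULL on the way in; convert_dtypes() keeps
    # integer columns as nullable Int64 instead of falling back to float.
    df = pd.read_csv(src_path, keep_default_na=False,
                     na_values=['']).convert_dtypes()
    # Write nulls back out as \N, which MySQL's LOAD DATA INFILE understands.
    df.to_csv(dest_path, index=False, na_rep='\\N')
```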

Half the game here is keeping to one PR at a time and not ending up with six features in at once...

Thanks! And I hope I expressed adequate appreciation for the complexity of the challenge you were taking on in building this... but in any event, I sure do have that appreciation now 😁

Hahaha, I appreciate it! Figuring out how to provide monotonically increasing value with every PR was a challenge indeed.

I don't have much to say about the timestamp vs date issues you referenced - that stuff isn't stuck in my head anymore, sadly, so I'd probably need more detail to be of any use to you.

@cwegrzyn cwegrzyn force-pushed the use-nullable-int-type branch from f604976 to be21401 Compare January 30, 2021 18:36
@cwegrzyn cwegrzyn marked this pull request as ready for review February 12, 2021 20:34
@cwegrzyn cwegrzyn requested a review from apiology February 12, 2021 20:34
@cwegrzyn
Contributor Author

@apiology: okay-- finally had a bit of time to polish this one off, sorta. This PR deals very narrowly with the question of which pandas types we use for a schema, and says "use the nullable pandas ones when available". I moved the unit tests to component tests and validated the behavior there. As written, this ONLY fixes the problem of exporting, via pandas, from a database that has NULL values; it does nothing for the problem of using the right pandas types when loading from a CSV file-- that's more complex and I think deserves its own PR. I also couldn't come up with a good test of that behavior alone, because I think it depends on addressing a lot more NULL handling issues if it's going to live in an integration test, which seemed too complex for this PR.

@cwegrzyn cwegrzyn changed the title Prefer pandas.Int64Dtype for non-required int fields Prefer pandas nullable integers for non-required int fields Feb 12, 2021
Collaborator

@apiology apiology left a comment


This makes sense to me, and I like the approach of keeping PRs small and focused!

As you note, the result is often a chain of dependencies to get to the high-level functionality you actually want to tweak - keeping track of that (and being handy with the workarounds while you work through the process!) is a challenge, for sure, but given how foundational these features are, it's very helpful as a reviewer to keep them separate.

        return np.uint64
    typ = integer_type_for_range(min_, max_, has_extension_types)
    if typ:
        return typ
Collaborator


Small thing, but I'd recommend keeping to one single style for avoiding keyword clashes (either dropping letters or adding a trailing underscore) - there are two styles used in this function.

(UINT32_MIN, UINT32_MAX, 'uint32', 'UInt32'),
(INT64_MIN, INT64_MAX, 'int64', 'Int64'),
(UINT64_MIN, UINT64_MAX, 'uint64', 'UInt64'),
]
Collaborator


This is feeling like it may be evolving into a class / enum or the like.
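
Something like a small dataclass over those rows, maybe (purely a sketch with made-up names, not code from this PR):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IntegerRange:
    min_: int
    max_: int
    numpy_name: str    # e.g. 'int64'
    pandas_name: str   # nullable extension type, e.g. 'Int64'

    def covers(self, lo: int, hi: int) -> bool:
        return self.min_ <= lo and hi <= self.max_

INTEGER_RANGES = [
    IntegerRange(0, 2**32 - 1, 'uint32', 'UInt32'),
    IntegerRange(-2**63, 2**63 - 1, 'int64', 'Int64'),
    IntegerRange(0, 2**64 - 1, 'uint64', 'UInt64'),
]

def dtype_name_for_range(lo: int, hi: int, nullable: bool) -> Optional[str]:
    # The first range wide enough to hold [lo, hi] wins.
    for r in INTEGER_RANGES:
        if r.covers(lo, hi):
            return r.pandas_name if nullable else r.numpy_name
    return None
```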

Contributor Author


Yeah, I spent a ridiculous amount of time writing and rewriting this style-wise, and searching for public numpy or pandas functions related to types. And then I finally gave up and did this. It didn't feel good, but it got the job done without the branching complexity that flake8 complained about.

I feel like I want to see how other schema->pandas conversion stuff goes, see what patterns arise, and then revisit this.

Contributor Author


Well, I went for it.

@@ -7,7 +7,7 @@
   },
   "numstr": {
     "type": "string",
-    "index": 1,
+    "index": 2,
Collaborator


I'll bet there was a story behind finding this!

Contributor Author


Hah, I think those might have just been red herrings I ran into while I was originally working on a NULL integration test, but I figured it might as well get fixed before I forget again...

set_stream_logging(level=logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.INFO)
logging.getLogger('boto3').setLevel(logging.INFO)
logging.getLogger('urllib3').setLevel(logging.INFO)
Collaborator


Looks like you did some grunt work to find the noisier external libraries so you could chase down DEBUG-level messages in Records Mover. It might be worth rolling control of those into a set_stream_logging() parameter at some point, or at least adding a comment describing the getLogger() syntax. Not everyone using this is an experienced Pythonista, and this is a good signal that they might have the same need.

(also just a reminder that the volume is still turned up on these integration tests)
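
Something along these lines, maybe (hypothetical signature -- the real set_stream_logging() doesn't take such a parameter today):

```python
import logging

NOISY_EXTERNAL_LOGGERS = ('botocore', 'boto3', 'urllib3')

def set_stream_logging(level: int = logging.INFO,
                       external_level: int = logging.INFO) -> None:
    # Turn up records_mover's own logging without also turning up
    # chatty third-party libraries.
    logging.getLogger('records_mover').setLevel(level)
    for name in NOISY_EXTERNAL_LOGGERS:
        logging.getLogger(name).setLevel(external_level)
```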

Contributor Author


In this case the normal behavior of set_stream_logging() appears to be correct-- it's just setting the records_mover logger. I couldn't quite figure out what was making the root logger DEBUG-- maybe something in nose itself? Whatever it is, I couldn't find it-- but this worked to make things less noisy.


-    def to_numpy_dtype(self) -> Union[Type[Any], str]:
+    def to_pandas_dtype(self) -> 'Dtype':


C901 'RecordsSchemaField.to_pandas_dtype' is too complex (22)

@cwegrzyn cwegrzyn requested a review from apiology February 14, 2021 16:03
@cwegrzyn cwegrzyn changed the title Prefer pandas nullable integers for non-required int fields Prefer pandas nullable integers for int fields Feb 14, 2021
@cwegrzyn cwegrzyn force-pushed the use-nullable-int-type branch from 6ac2a00 to 0ba11c3 Compare February 14, 2021 18:39
@cwegrzyn cwegrzyn merged commit 3fc894c into master Apr 2, 2021
@cwegrzyn cwegrzyn deleted the use-nullable-int-type branch April 2, 2021 14:34