Prefer pandas nullable integers for int fields #159
Conversation
@apiology: A few questions for you on this one
Hi @cwegrzyn!
It certainly seems like it'd be necessary for at least some cases! I'm not entirely sure it's sufficient (but not sure it's not!), but it sounds like you're on the road to doing the testing to verify that.
Yep! Generally if you can write it without feeling too much pain from introducing the real objects, I think migrating to a component test would be great!
The multi-db aspect could potentially be added in tests/integration/records/multi_db/test_records_table2table.py. I could see two ways of doing it:
As for testing the type inference on a CSV (which uses Pandas and would presumably suffer here), that seems squarely in the component test arena - there's no benefit to test that for each database type. If the system can be trusted to be handed a CSV with no schema and produce the right schema, the existing tests should handle the rest. Good luck with it!
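To make the failure mode concrete, here's a minimal sketch (not code from this PR) of what default pandas CSV inference does to an integer column that contains a missing value:

```python
import io
import pandas as pd

csv_data = "id,count\n1,10\n2,\n3,30\n"

# Default inference: the missing value in 'count' forces the column
# to float64, since plain numpy integers cannot represent NULLs.
df = pd.read_csv(io.StringIO(csv_data))
print(df["count"].dtype)  # float64
```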
Ah! Yep, you are right. This solved my original problem (MySQL -> Redshift, which was using pandas as an intermediary). But it doesn't deal with what your original PR does-- having loads from CSV use the nullable types. I'm in the midst of a rabbit-hole of applying the
It's possible that number 2 ought to be solved by running convert_dtypes during type inference and leaning in on pandas type inference (although you've noted the challenges with dates/times there already). For number 1, I feel like I need to go deeper into the rabbit hole of dates/times anyway!
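As an illustration of that option (a minimal sketch, not code from this PR): convert_dtypes() will upgrade integer-like and string columns to pandas' nullable extension types, but it won't parse date-like strings into datetimes, which is the date/time wrinkle mentioned above.

```python
import pandas as pd

# Columns as they might arrive from naive CSV inference.
df = pd.DataFrame({
    "count": [1.0, None, 3.0],                        # float64 because of the missing value
    "name": ["a", None, "c"],                         # object
    "created": ["2021-01-01", None, "2021-03-01"],    # object, looks like a date
})

converted = df.convert_dtypes()
# 'count' becomes nullable Int64; 'name' and 'created' become the string
# extension dtype. Note 'created' is still a string: dates need explicit
# parsing (e.g. pd.to_datetime), which convert_dtypes does not attempt.
print(converted.dtypes)
```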
Great! Going to do that, especially since I suspect the logic is going to get even more complicated in there...
Thanks for those pointers. I attempted to add a column to the standard fixture. But then I ran up against the fact that we generally assume NULL == empty string, but MySQL LOAD DATA INFILE only does NULL == \N.
Makes sense! I think I may get a better grasp on this as I noodle on the date/time situation.
Thanks! And I hope I expressed adequate appreciation for the complexity of the challenge you were taking on in building this... but in any event, I sure do have that appreciation now 😁
Ah, interesting. I guess in theory if you wanted to go further down the rabbit hole, you could start with a separate PR that introduces full support for the 'null_as' hint, describe to Records Mover what null is described as in the file, and have Records Mover rewrite the file to whatever MySQL requires by reading and writing using Pandas. Or maybe just forget the integration test for now and focus on a component test that locks in the behavior you want, and yeah, work around in practice for now until there's time to write the 'null_as' PR :) Half the game here is keeping to one PR at a time and not ending up with six features in at once...
Hahaha, I appreciate it! Figuring out how to provide monotonically increasing value with every PR was a challenge indeed. I don't have much to say about the timestamp vs date issues you referenced - that stuff isn't stuck in my head anymore, sadly, so I'd probably need more detail to be of any use to you.
f604976 to be21401 (Compare)
@apiology: okay-- finally had a bit of time to polish this one off, sorta. This PR very narrowly deals with the question of what pandas types we use for a schema, and says "use the nullable pandas ones when available". I moved the unit tests to component tests and validated the behavior there. As written, this ONLY fixes the problem of exporting from a database that has NULL values via pandas; it does nothing for the problem of using the right pandas types when loading from a CSV file-- that's more complex and I think deserves its own PR. I also couldn't come up with a good test of that behavior alone, because I think it depends on addressing a lot more NULL handling issues if it's going to live in an integration test, which seemed too complex for this PR.
This makes sense to me, and I like the approach of keeping PRs small and focused!
As you note, the result is often a chain of dependencies to get to the high-level functionality you actually want to tweak - keeping track of that (and being handy with the workarounds while you work through the process!) is a challenge, for sure, but given how foundational these features are, it's very helpful as a reviewer to have them kept separate.
return np.uint64
typ = integer_type_for_range(min_, max_, has_extension_types)
if typ:
    return typ
Small thing, but I'd recommend keeping to one single style for avoiding keywords (either dropping letters or adding an underscore after) - there are two used in this function.
(UINT32_MIN, UINT32_MAX, 'uint32', 'UInt32'),
(INT64_MIN, INT64_MAX, 'int64', 'Int64'),
(UINT64_MIN, UINT64_MAX, 'uint64', 'UInt64'),
]
This is feeling like it may be evolving into a class / enum or the like.
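To make that concrete, here's a hedged sketch of what an enum-style version could look like; the class name, member set, and lookup method are hypothetical illustrations, not code from this PR.

```python
from enum import Enum
from typing import Optional


class IntegerType(Enum):
    # (min, max, numpy dtype name, pandas nullable dtype name)
    INT8 = (-2**7, 2**7 - 1, 'int8', 'Int8')
    UINT8 = (0, 2**8 - 1, 'uint8', 'UInt8')
    INT16 = (-2**15, 2**15 - 1, 'int16', 'Int16')
    UINT16 = (0, 2**16 - 1, 'uint16', 'UInt16')
    INT32 = (-2**31, 2**31 - 1, 'int32', 'Int32')
    UINT32 = (0, 2**32 - 1, 'uint32', 'UInt32')
    INT64 = (-2**63, 2**63 - 1, 'int64', 'Int64')
    UINT64 = (0, 2**64 - 1, 'uint64', 'UInt64')

    def __init__(self, min_: int, max_: int, numpy_name: str, pandas_name: str):
        self.min = min_
        self.max = max_
        self.numpy_name = numpy_name
        self.pandas_name = pandas_name

    @classmethod
    def for_range(cls, min_: int, max_: int,
                  has_extension_types: bool) -> Optional[str]:
        # Return the first member (in declaration order) whose bounds
        # cover the requested range, preferring the nullable pandas name
        # when extension types are available.
        for member in cls:
            if member.min <= min_ and max_ <= member.max:
                return member.pandas_name if has_extension_types else member.numpy_name
        return None
```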
Yeah, I spent a ridiculous amount of time on writing and rewriting this style-wise, and searching for public numpy or pandas functions related to types. And then I finally gave up and did this. I didn't feel good about it, but it got the job done without the branching complexity that flake8 complained about.
I feel like I want to see how other schema->pandas conversion stuff goes, see what patterns arise, and then revisit this.
Well, I went for it.
@@ -7,7 +7,7 @@
 },
 "numstr": {
   "type": "string",
-  "index": 1,
+  "index": 2,
I'll bet there was a story behind finding this!
Hah, I think it might have just been red herrings while I was originally working on a NULL integration test, but figured it might as well get fixed before I forget again...
set_stream_logging(level=logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.INFO)
logging.getLogger('boto3').setLevel(logging.INFO)
logging.getLogger('urllib3').setLevel(logging.INFO)
Looks like you did some grunt work to find the noisier external libraries so you could chase down DEBUG-level messages in Records Mover. It might be worth rolling control of those into a set_stream_logging() parameter at some point, or at least adding a comment describing the getLogger() syntax. Not everyone using this is an experienced Pythonista, and this is a good signal that they might have this need as well.
(also just a reminder that the volume is still turned up on these integration tests)
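A hedged sketch of what such a parameter could look like; the parameter name, the library list, and this simplified body are hypothetical, not the current records_mover set_stream_logging() API.

```python
import logging
import sys

# Hypothetical list of chatty third-party libraries to quiet down.
NOISY_LIBRARIES = ('botocore', 'boto3', 'urllib3')


def set_stream_logging(level: int = logging.INFO,
                       noisy_library_level: int = logging.INFO) -> None:
    # Configure the records_mover logger at the requested level
    # (ignoring the richer handler/format setup the real function does).
    handler = logging.StreamHandler(sys.stdout)
    logger = logging.getLogger('records_mover')
    logger.setLevel(level)
    logger.addHandler(handler)
    # Pin noisy libraries to a quieter level so DEBUG runs surface
    # records_mover messages rather than HTTP wire noise.
    for name in NOISY_LIBRARIES:
        logging.getLogger(name).setLevel(noisy_library_level)
```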
In this case the normal behavior of set_stream_logging appears to be correct-- it's just setting the records_mover logger. I couldn't quite figure out what was making the root logger DEBUG-- maybe something in nose itself? Whatever it is, I couldn't find it-- but this worked to make it less noisy.
-    def to_numpy_dtype(self) -> Union[Type[Any], str]:
+    def to_pandas_dtype(self) -> 'Dtype':
C901 'RecordsSchemaField.to_pandas_dtype' is too complex (22)
6ac2a00 to 0ba11c3 (Compare)
When inferring a pandas dtype from schema metadata, we previously used the most specific standard numpy/pandas integer type available. Unfortunately, these types do not support null values, so some data cannot be encoded correctly. This would cause exceptions when pandas was part of the pipeline (i.e., when reading from a file or sourcing from MySQL).
This adjusts the logic so that, when running on a recent enough pandas version, a nullable int type is used UNLESS the field is explicitly designated as required (i.e., non-null).
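As a rough illustration of that rule (the function name and feature check below are mine, not the PR's actual code):

```python
import pandas as pd


def integer_dtype_for_field(required: bool) -> str:
    """Pick a pandas dtype name for an integer field in the schema.

    Nullable extension integers (e.g. 'Int64') arrived in pandas 0.24;
    prefer them whenever the field may contain NULLs, and fall back to
    the plain numpy dtype otherwise.
    """
    has_extension_types = hasattr(pd, 'Int64Dtype')
    if has_extension_types and not required:
        return 'Int64'   # capital-I: pandas nullable integer
    return 'int64'       # plain numpy integer, cannot hold NULLs
```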
Replaces #138