
WIP: Allow nullable integer columns to be used from Pandas 1.0+ #138

Closed
wants to merge 13 commits

Conversation


@vinceatbluelabs vinceatbluelabs commented Dec 3, 2020

Allow nullable integer columns to be used in Pandas 1.0+. Prior to 1.0, Pandas would represent an integer column containing nulls with a numpy floating-point type, using NaN for the nulls.

If Pandas <1.0 is in use, the code logs a warning message and proceeds with the raw dtypes.

Manual test I ran to verify we now create nullable integer columns and load correctly into them:
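(null-int.csv itself isn't shown in the PR; judging from the table contents queried below, it presumably looks something like:

foo,bar
1,baz
,bing
)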

(records-mover-3.8.5) broz@bigbookpro:~/src/records-mover$ mvrec file2table --source.header_row null-int.csv --target.existing_table drop_and_recreate redshift vbroz nullints
10:11:40 - Using session_type=lpass from config file
10:11:40 - Starting...
10:11:50 - Mover: copying from DataUrlRecordsSource(None) to TableRecordsTarget(redshift) by first writing DataUrlRecordsSource(None) to DelimitedRecordsFormat(csv - {'dateformat': 'YYYY-MM-DD', 'datetimeformat': 'YYYY-MM-DD HH:MI:SS', 'datetimeformattz': 'YYYY-MM-DD HH:MI:SSOF'}) records format (if easy to rewrite)...
10:11:50 - Determining records format with initial_hints={'header-row': True, 'compression': None}
10:11:50 - Inferred record terminator as '\n'
10:11:50 - Python csv.Dialect sniffed: {'doublequote': False, 'field-delimiter': ',', 'header-row': False, 'quotechar': '"'}
10:11:50 - Attempting to parse with quoting: minimal
10:11:50 - Inferred hints from combined sources: {'compression': None, 'quoting': 'minimal', 'doublequote': False, 'field-delimiter': ',', 'header-row': True, 'quotechar': '"', 'encoding': 'UTF8', 'record-terminator': '\n'}
10:11:50 - Mover: FileobjsSource(DelimitedRecordsFormat(csv - {'compression': None, 'doublequote': False})) is known to handle [DelimitedRecordsFormat(csv - {'compression': None, 'doublequote': False})] but is not able to directly export to TableRecordsTarget(redshift), which is known to handle [DelimitedRecordsFormat(csv - {'dateformat': 'YYYY-MM-DD', 'datetimeformat': 'YYYY-MM-DD HH:MI:SS', 'datetimeformattz': 'YYYY-MM-DD HH:MI:SSOF'}), DelimitedRecordsFormat(bigquery), DelimitedRecordsFormat(csv), DelimitedRecordsFormat(bluelabs - {'quoting': 'all'}), DelimitedRecordsFormat(bluelabs), AvroRecordsFormat]
10:11:50 - Mover: copying from FileobjsSource(DelimitedRecordsFormat(csv - {'compression': None, 'doublequote': False})) to TableRecordsTarget(redshift) by converting to dataframe...
10:11:50 - Loading CSV via Pandas with options: {'delimiter': ',', 'header': 0, 'engine': 'python', 'parse_dates': [], 'dayfirst': False, 'compression': None, 'quotechar': '"', 'quoting': 0, 'doublequote': False, 'error_bad_lines': True, 'warn_bad_lines': True}
10:11:50 - Exporting to CSV with these Pandas options: {'encoding': 'UTF8', 'compression': 'gzip', 'quoting': 0, 'doublequote': True, 'quotechar': '"', 'header': True, 'date_format': '%Y-%m-%d %H:%M:%S.%f%z', 'sep': ',', 'line_terminator': '\n'}
10:11:50 - Writing CSV file to /var/folders/6m/lwg6ctb51gv7v0vxrkzjps7h0000gn/T/mover_seralized_dataframes_narfh4
10:11:50 - CSV file written
10:11:50 - Uploading s3://bluelabs-scratch/vince.broz/OyAI_0eEm4I/data001.csv.gz
10:11:51 - Storing manifest into s3://bluelabs-scratch/vince.broz/OyAI_0eEm4I/manifest
10:11:51 - Putting into s3://bluelabs-scratch/vince.broz/OyAI_0eEm4I/_schema.json
10:11:51 - Storing format info into s3://bluelabs-scratch/vince.broz/OyAI_0eEm4I/_format_delimited
10:11:52 - Renamed s3://bluelabs-scratch/vince.broz/OyAI_0eEm4I/manifest to s3://bluelabs-scratch/vince.broz/OyAI_0eEm4I/_manifest
10:11:52 - Connecting to database...
10:11:53 - Connecting to database...
10:11:53 - Looking for existing table..
10:11:53 - Table already exists.
10:11:53 - Dropping and recreating...
10:11:53 - Just ran DROP TABLE vbroz.nullints
10:11:53 - Creating table...
10:11:53 - Just ran 
CREATE TABLE vbroz.nullints (
	foo BIGINT, 
	bar VARCHAR(16)
)


10:11:53 - Table prepped
10:11:55 - Copying to Redshift with options: {'compression': <Compression.gzip: 'GZIP'>, 'date_format': 'YYYY-MM-DD', 'encoding': <Encoding.utf8: 'UTF8'>, 'quote': '"', 'format': <Format.csv: 'CSV'>, 'time_format': 'auto', 'max_error': 0, 'ignore_header': 1}
10:11:55 - Starting Redshift COPY from RecordsDirectory(s3://bluelabs-scratch/vince.broz/OyAI_0eEm4I/)...
10:11:55 - Redshift COPY complete.
(records-mover-3.8.5) broz@bigbookpro:~/src/records-mover$ rm -fr foodir; mkdir foodir
(records-mover-3.8.5) broz@bigbookpro:~/src/records-mover$ mvrec file2recordsdir --source.header_row null-int.csv file:///$(pwd)/foodir/
10:12:02 - Using session_type=lpass from config file
10:12:02 - Starting...
10:12:02 - Mover: copying from DataUrlRecordsSource(None) to DirectoryFromUrlRecordsTarget by first writing DataUrlRecordsSource(None) to None records format (if easy to rewrite)...
10:12:02 - Determining records format with initial_hints={'header-row': True, 'compression': None}
10:12:02 - Inferred record terminator as '\n'
10:12:02 - Python csv.Dialect sniffed: {'doublequote': False, 'field-delimiter': ',', 'header-row': False, 'quotechar': '"'}
10:12:02 - Attempting to parse with quoting: minimal
10:12:02 - Inferred hints from combined sources: {'compression': None, 'quoting': 'minimal', 'doublequote': False, 'field-delimiter': ',', 'header-row': True, 'quotechar': '"', 'encoding': 'UTF8', 'record-terminator': '\n'}
10:12:02 - Mover: copying from FileobjsSource(DelimitedRecordsFormat(csv - {'compression': None, 'doublequote': False})) to DirectoryFromUrlRecordsTarget by writing to file records directory
10:12:02 - Uploading file:////Users/broz/src/records-mover/foodir/null-int.csv
10:12:02 - Storing manifest into file:////Users/broz/src/records-mover/foodir/manifest
10:12:02 - Putting into file:////Users/broz/src/records-mover/foodir/_schema.json
10:12:02 - Storing format info into file:////Users/broz/src/records-mover/foodir/_format_delimited
(records-mover-3.8.5) broz@bigbookpro:~/src/records-mover$ db redshift
Connecting to database analytics on bl-int-analytics1.cxtyzogmmhiv.us-east-1.redshift.amazonaws.com:5439 as vbroz
\d+ tablename -- describe table
\dt *             -- list all tables in default schema
\dn               -- list all schemas
\dt schemaname.*  -- list all tables in schema
\dg               -- list all users and roles
psql (13.1, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

analytics=> \d+ vbroz.nullints
                                         Table "vbroz.nullints"
 Column |         Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
--------+-----------------------+-----------+----------+---------+----------+--------------+-------------
 foo    | bigint                |           |          |         | plain    |              | 
 bar    | character varying(16) |           |          |         | extended |              | 
Has OIDs: yes

analytics=> select * from nullints
analytics-> ;
 foo | bar  
-----+------
   1 | baz
     | bing
(2 rows)

analytics=> \q
(records-mover-3.8.5) broz@bigbookpro:~/src/records-mover$ jq .fields.foo < foodir/_schema.json
{
  "type": "integer",
  "constraints": {
    "required": false,
    "unique": false,
    "min": "-9223372036854775808",
    "max": "9223372036854775807"
  },
  "representations": {
    "origin": {
      "rep_type": "dataframe/pandas",
      "pd_df_dtype": {
        "base": null,
        "is_signed_integer": true,
        "itemsize": 8,
        "na_value": null,
        "names": null
      },
      "pd_df_coltype": "series"
    }
  },
  "index": 1
}
(records-mover-3.8.5) broz@bigbookpro:~/src/records-mover$

-basename = dtype.base.name
+basename = str(dtype)
+if 'base' in dir(dtype) and 'name' in dir(dtype.base):
+    basename = dtype.base.name
vinceatbluelabs (Contributor, Author) commented on the diff:

The new Int64 type in Pandas is not actually a type that has a base type, so there's no usable 'base' attribute on it.
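A minimal sketch of the guard above (dtype_basename is a made-up wrapper name for illustration; the exact attribute layout varies by pandas version):

import numpy as np
import pandas as pd

def dtype_basename(dtype) -> str:
    # Prefer the numpy base name; fall back to str() for extension
    # dtypes like pandas' nullable Int64, whose 'base' attribute is
    # absent (or None, in which case dir(None) has no 'name').
    basename = str(dtype)
    if 'base' in dir(dtype) and 'name' in dir(dtype.base):
        basename = dtype.base.name
    return basename

print(dtype_basename(np.dtype('int64')))  # expected: 'int64'
print(dtype_basename(pd.Int64Dtype()))    # expected: 'Int64'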

@vinceatbluelabs vinceatbluelabs marked this pull request as ready for review December 3, 2020 15:08
from records_mover.records.schema.schema.known_representation import (
    RecordsSchemaKnownRepresentation
)
from records_mover.records.schema.errors import UnsupportedSchemaError
vinceatbluelabs (Contributor, Author) commented on the diff:

Unrelated change. Generally I'd like to start moving away from relative import addressing, as it gets confusing and painful to rebaseline when I move code around.

@cwegrzyn cwegrzyn (Contributor) left a comment:

This seems reasonable! :shipit:

Can you think of any downsides to applying convert_dtypes here? I can't, but I'm wondering whether this is ultimately something that we'd need to make controllable.
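For reference, a minimal sketch of the conversion being discussed (pandas 1.0+; the CSV content here is illustrative):

import io
import pandas as pd

csv = io.StringIO("foo,bar\n1,baz\n,bing\n")
df = pd.read_csv(csv)
print(df.dtypes)          # foo: float64 (the null became NaN), bar: object

converted = df.convert_dtypes()
print(converted.dtypes)   # foo: Int64 (nullable), bar: string
print(converted['foo'])   # 1 and <NA> rather than 1.0 and NaN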

@vinceatbluelabs vinceatbluelabs commented Dec 3, 2020

Good question! I did not apply it on the path where folks pass their own DataFrame to us, so they can control their own behavior there.

On the CSV side, the only thing that gives me a little pause is that someone might send us a CSV and an existing Records Schema that specifies something as a floating point column even though pandas.read_csv().convert_dtypes() would make it into an integer.

However, I think that's a special case of a more general existing concern (what if we interpret a float but they specified a string?). We haven't seen any evidence of it being an issue, because databases like Redshift are robust to the mismatch and will load an integer-formatted CSV column into a float column. We also haven't yet had interoperability situations where we're reading the records schema data from another tool that might interpret the data differently.

Probably the way to address that longer term, if it becomes a priority, is to reformat the dataframe types based on the RecordsSchema object. We could replace the call to convert_dtypes() in fileobjs.py with a call into the records schema object - maybe a new method like assign_dataframe_types() that accepts and returns a DataFrame object. Maybe we even have code like that around in a less obvious place that I didn't spot?
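A rough sketch of how that hypothetical method might look (assign_dataframe_types does not exist in the codebase; the fields mapping and field_type attribute are assumed for illustration):

import pandas as pd

class RecordsSchema:
    def assign_dataframe_types(self, df: pd.DataFrame) -> pd.DataFrame:
        # Cast each column to the dtype its schema field implies,
        # e.g. pandas' nullable Int64 for integer fields, rather than
        # relying on a blanket convert_dtypes() call.
        out = df.copy()
        for name, field in self.fields.items():  # assumed mapping
            if field.field_type == 'integer':
                out[name] = out[name].astype('Int64')
        return out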

@vinceatbluelabs vinceatbluelabs commented:
I did a little more research - we have a method that might be appropriate: RecordsSchema#cast_dataframe_types()

We use it currently after creating dataframes from tables - but it might also be appropriate when creating dataframes from files.

It would also need to be tweaked to call convert_dtypes(), and ideally in a way that only applies to columns marked as integer in the RecordsSchema.
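That tweak might look roughly like this (a sketch only; the fields mapping and field_type attribute are assumptions, and astype('Int64') stands in for a column-restricted convert_dtypes):

# Convert only the columns the RecordsSchema marks as integer;
# float64 columns whose values are integral (with NaN for nulls)
# cast cleanly to the nullable Int64 dtype in pandas 1.0+.
int_cols = [name for name, field in schema.fields.items()
            if field.field_type == 'integer']
df = df.astype({name: 'Int64' for name in int_cols})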

This might also handle the case of selecting out of a table, which I'm not sure this PR as-is handles.

Given that, I think I'll hold off from merging this PR as-is.

@vinceatbluelabs vinceatbluelabs changed the title Allow nullable integer columns to be used from Pandas 1.0+ WIP: Allow nullable integer columns to be used from Pandas 1.0+ Dec 10, 2020
@ryantimjohn ryantimjohn deleted the nullable_integer_types branch April 18, 2023 14:57