Discussion: how to handle the new Int64 (nullable integer) dtype with pandas 0.24.0 #242

tswast · 2019-01-04T20:46:52Z

Currently unreleased, but pandas 0.24.0 will add an extension dtype to allow a nullable integer dtype: http://pandas-docs.github.io/pandas-docs-travis/integer_na.html#integer-na. Unfortunately, pandas-gbq won't produce it with our current logic of deferring to the DataFrame constructor for type inference. From the pandas docs:

It [Int64, nullable integer] is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series.
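
To make the inference behavior concrete, here is a minimal sketch (assuming pandas 0.24.0 or later is installed; the sample values are made up for illustration). It shows that a column with a missing value falls back to float64 unless the nullable dtype is requested explicitly.

```python
import pandas as pd

# Default inference: a missing value forces the integer column to float64.
inferred = pd.Series([1, 2, None])
print(inferred.dtype)

# The nullable integer dtype is only used when it is requested explicitly.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)

# The missing value is still reported as missing in both cases.
print(inferred.isna().sum(), nullable.isna().sum())
```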

The question is how can we support this dtype in pandas-gbq? I see a few options.

1. Use pd.Int64Dtype() by default for nullable integer columns, similar to how pandas-gbq previously defaulted to string for integer columns.
   - Con: ties new versions of pandas-gbq to pandas 0.24.0+.
2. Use pd.Int64Dtype() for nullable integer columns when pandas 0.24.0+ is installed.
   - Con: inconsistent with pandas.
   - Con: unable to turn this feature off when float is desired (perhaps for performance reasons).
3. Add an argument to read_gbq which is a map of column names to dtypes, overriding the dtype of any column present.
   - Con: float isn't the safest default for nullable integer columns, but at least it's consistent with pandas.
   - Con: will require reading rows into separate Series before constructing a DataFrame, as the DataFrame constructor only accepts a single dtype.
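
A rough sketch of what the last option could look like, building one Series per column so each can take its own dtype. The helper name `rows_to_dataframe` and its parameters are hypothetical and for illustration only; this is not the pandas-gbq implementation, and it assumes pandas 0.24.0+.

```python
import pandas as pd


def rows_to_dataframe(rows, columns, dtypes=None):
    """Hypothetical helper: build a DataFrame column by column.

    `dtypes` maps column names to dtypes. Columns missing from the map
    are left to the usual pandas type inference.
    """
    dtypes = dtypes or {}
    data = {}
    for position, name in enumerate(columns):
        values = [row[position] for row in rows]
        # dtype=None means "let pandas infer", matching the current default.
        data[name] = pd.Series(values, dtype=dtypes.get(name))
    return pd.DataFrame(data, columns=columns)


# Made-up rows standing in for query results.
rows = [(1, "a"), (None, "b"), (3, None)]

default_df = rows_to_dataframe(rows, ["id", "label"])
override_df = rows_to_dataframe(rows, ["id", "label"], dtypes={"id": "Int64"})

print(default_df.dtypes)
print(override_df.dtypes)
```

With no override, the `id` column is inferred as float64, which stays consistent with pandas. Passing `{"id": "Int64"}` opts that one column into the nullable integer dtype.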

tswast · 2019-01-04T20:49:20Z

FWIW: I'm leaning towards "Add an argument to read_gbq which is a map of column names to dtypes" because I think that'd be a useful feature in general (perhaps to support other extension dtypes such as for GEOGRAPHY) and also because I'd prefer to stay consistent with pandas in our default behavior.

tswast · 2019-01-04T21:04:31Z

Related to #149, does to_parquet support the new nullable integer type?

tswast · 2019-01-17T00:18:47Z

I just added a dtypes option to to_dataframe in google-cloud-bigquery: googleapis/google-cloud-python#7126

I think we can do the same here. We might even want to call to_dataframe to do so, per discussion in #149

max-sixty · 2019-01-17T18:21:05Z

Great! Agree with your suggestion. We can always choose to default to pd.Int64Dtype() in the future, but I would lean towards waiting until pandas uses that type by default.

(Sorry to be out of the discussion for a bit)

tswast mentioned this issue Jan 4, 2019

BigQuery: Support user-overridable dtypes in to_dataframe method. googleapis/google-cloud-python#7049

Closed

tswast mentioned this issue Jan 25, 2019

CLN: Use to_dataframe to download query results. #247

Merged

tswast mentioned this issue May 10, 2019

ENH: Use Fletcher to better support strings, structs, and arrays #274

Closed

This was referenced Oct 2, 2020

Test failure on Python 3.8 -- Integer NULL represented as NaN instead of None #332

Closed

ENH: add dtypes argument to read_gbq #333

Merged

tswast closed this as completed in #333 Oct 2, 2020
