Discussion: how to handle the new Int64 (nullable integer) dtype with pandas 0.24.0 #242

Closed
tswast opened this issue Jan 4, 2019 · 4 comments · Fixed by #333

@tswast
Collaborator

tswast commented Jan 4, 2019

pandas 0.24.0 (currently unreleased) will add an extension dtype for nullable integers: http://pandas-docs.github.io/pandas-docs-travis/integer_na.html#integer-na Unfortunately, pandas-gbq won't pick it up with our current logic of deferring to the DataFrame constructor for type inference. From the pandas docs:

It [Int64, nullable integer] is not the default dtype for integers, and will not be inferred; you must explicitly pass the dtype into array() or Series.
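
To make the quoted behavior concrete, here is a minimal sketch (assuming pandas 0.24.0 or later is installed) of what default inference does with a missing value versus what happens when the dtype is passed explicitly:

```python
import pandas as pd

# Default inference: the missing value forces the integers to float64.
inferred = pd.Series([1, 2, None])
print(inferred.dtype)  # float64

# The nullable integer dtype is only used when requested explicitly.
nullable = pd.Series([1, 2, None], dtype=pd.Int64Dtype())
print(nullable.dtype)  # Int64
```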

The question is how can we support this dtype in pandas-gbq? I see a few options.

  • Use pd.Int64Dtype() by default for nullable integer columns, similar to how previously pandas-gbq defaulted to string for integer columns.
    • Con: ties new versions of pandas-gbq to pandas 0.24.0+
  • Use pd.Int64Dtype() for nullable integer columns when pandas 0.24.0+ is installed.
    • Con: inconsistent with pandas.
    • Con: unable to turn this feature off when float is desired (perhaps for performance reasons).
  • Add an argument to read_gbq which is a map of column names to dtypes, overriding the dtype of any column present.
    • Con: float isn't the safest default for nullable integer columns, but at least it's consistent with pandas.
    • Con: will require reading rows into separate Series before constructing a DataFrame, as the DataFrame constructor only accepts a single dtype.
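
A rough sketch of what the third option could look like, building one Series per column so that each can carry its own dtype. The helper name `rows_to_dataframe` and its parameters are hypothetical, made up for illustration; this is not existing pandas-gbq code:

```python
import pandas as pd


def rows_to_dataframe(rows, columns, dtypes=None):
    """Build a DataFrame column by column so each column can get its own dtype.

    Hypothetical helper: `rows` is a list of tuples, `columns` the column
    names in order, `dtypes` an optional map of column name to dtype.
    """
    dtypes = dtypes or {}
    series = {}
    for position, name in enumerate(columns):
        values = [row[position] for row in rows]
        # Columns without an override fall back to pandas' own inference.
        series[name] = pd.Series(values, dtype=dtypes.get(name))
    return pd.DataFrame(series, columns=columns)


rows = [(1, "a"), (None, "b"), (3, None)]
df = rows_to_dataframe(rows, ["id", "label"], dtypes={"id": pd.Int64Dtype()})
print(df.dtypes)
```

Without the `dtypes` override, the `id` column falls back to float64, which keeps the default consistent with pandas.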
@tswast
Collaborator Author

tswast commented Jan 4, 2019

FWIW: I'm leaning towards "Add an argument to read_gbq which is a map of column names to dtypes" because I think that'd be a useful feature in general (perhaps to support other extension dtypes such as for GEOGRAPHY) and also because I'd prefer to stay consistent with pandas in our default behavior.

@tswast
Collaborator Author

tswast commented Jan 4, 2019

Related to #149, does to_parquet support the new nullable integer type?

@tswast
Collaborator Author

tswast commented Jan 17, 2019

Just added a dtypes option to to_dataframe in google-cloud-bigquery: googleapis/google-cloud-python#7126

I think we can do the same here. We might even want to call to_dataframe to do so, per discussion in #149

@max-sixty
Contributor

Great! Agree with your suggestion. We can always choose to default to pd.Int64Dtype() in the future, but I would lean towards waiting until pandas uses that type by default.

(Sorry to be out of the discussion for a bit)
