BigQuery: Support user-overridable dtypes in to_dataframe method. #7049

Closed
tswast opened this issue Jan 4, 2019 · 1 comment
Labels
api: bigquery Issues related to the BigQuery API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@tswast
Contributor

tswast commented Jan 4, 2019

pandas 0.24.0 (unreleased) adds a new extension dtype that supports nullable integer columns: http://pandas-docs.github.io/pandas-docs-travis/integer_na.html#integer-na The current default behavior is to convert integer columns that contain NULL values to float, which can result in data loss (#6177). The new extension dtype avoids that.
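To make the data-loss concern concrete, here is a minimal sketch in plain Python (no BigQuery or pandas calls involved). It only illustrates the general fact that a 64-bit float cannot represent every 64-bit integer; the variable names are made up for illustration and are not taken from #6177.

```python
# A 64-bit float has a 53-bit significand, so integers above 2**53
# cannot all be represented exactly.
largest_exact = 2**53
value = largest_exact + 1  # fits in INT64, but not exactly in a float

# Converting to float and back silently changes the value.
as_float = float(value)
round_tripped = int(as_float)

if round_tripped != value:
    print(f"lost precision: {value} became {round_tripped}")
```

An integer column that is cast to float because it contains NULLs is exposed to the same rounding for large values, which is the motivation for keeping such columns in an integer dtype.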

I propose we allow the user to provide a map from column names to dtypes for any columns for which they'd like to override the default behavior. This argument could be called dtype_overrides. This would also be useful for other extension dtypes in the future, such as for GEOGRAPHY columns.

See googleapis/python-bigquery-pandas#242 for additional discussion.

Alternatives

  • Make the new dtype for nullable integer the default for integer columns.
    • Con: Not compatible with older versions of pandas.
    • Con: Inconsistent with pandas's default behavior.
@tswast added the type: feature request, api: bigquery, and priority: p2 labels on Jan 4, 2019.
@tseaver removed the priority: p2 label on Jan 5, 2019.
@tswast
Contributor Author

tswast commented Jan 12, 2019

I experimented with this in master...tswast:b122674716-bqstorage-types for the BigQuery Storage API. Similar work is needed for the BigQuery API.

I think the dictionary of column names to dtypes works well. I don't see any problems with using the pandas Series constructor for type casting.
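As a rough sketch of the idea (not the code in the experimental branch), the per-column cast could look something like the following. The helper `build_dataframe` and the sample `columns` data are hypothetical; only `pandas.Series(..., dtype=...)`, `pandas.DataFrame`, and the `"Int64"` nullable integer dtype from pandas 0.24+ are real pandas features.

```python
import pandas as pd


def build_dataframe(columns, dtype_overrides=None):
    """Build a DataFrame, casting only the columns named in dtype_overrides.

    `columns` maps column names to lists of values. This helper is a
    hypothetical stand-in for whatever to_dataframe does internally.
    """
    dtype_overrides = dtype_overrides or {}
    series = {}
    for name, values in columns.items():
        # dtype=None leaves pandas's default type inference in place.
        dtype = dtype_overrides.get(name)
        series[name] = pd.Series(values, dtype=dtype)
    return pd.DataFrame(series)


# Hypothetical rows with a NULL in the integer column.
columns = {"id": [1, None, 3], "name": ["a", "b", "c"]}

default_df = build_dataframe(columns)
override_df = build_dataframe(columns, dtype_overrides={"id": "Int64"})

print(default_df.dtypes)
print(override_df.dtypes)
```

Because the override is just passed through as the `dtype` argument, the same mechanism would presumably work for other extension dtypes without further changes.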

Should it be an error if a dtype was supplied but the column isn't actually in the DataFrame? I think we might be able to parse the avro_schema to see what columns are available ahead of time.
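One possible way to do that check, sketched under the assumption that the schema is available as an Avro record schema in JSON form (where a record lists its columns under `fields`, each with a `name`). The function `unknown_override_columns` and the sample schema are hypothetical and are not part of any BigQuery client library.

```python
import json


def unknown_override_columns(avro_schema_json, dtype_overrides):
    """Return override keys that do not name a field in the Avro record schema."""
    schema = json.loads(avro_schema_json)
    available = {field["name"] for field in schema["fields"]}
    return sorted(set(dtype_overrides) - available)


# Hypothetical schema, for illustration only.
avro_schema_json = json.dumps(
    {
        "type": "record",
        "name": "row",
        "fields": [
            {"name": "id", "type": ["null", "long"]},
            {"name": "name", "type": ["null", "string"]},
        ],
    }
)

# "idd" is a deliberate typo to show what the check would catch.
missing = unknown_override_columns(avro_schema_json, {"id": "Int64", "idd": "Int64"})
if missing:
    print(f"dtype supplied for unknown columns: {missing}")
```

Whether a mismatch should raise an error or only emit a warning is left open here; the sketch just reports the unknown names so either policy could be layered on top.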
