BigQuery: Use BigQuery schema (from LoadJobConfig) if available when converting to Parquet in load_table_from_dataframe #7370
Comments
It is somewhat common in the pandas community to use the object dtype for string columns, so a column that happens to contain only None/NaN values carries no usable type information on its own. The load_table_from_dataframe method serializes the DataFrame to Parquet (via to_parquet) before loading it. As I follow the breadcrumbs through that stack, I think the problem is likely in pyarrow's schema inference. That said, in this case we know the desired BigQuery schema. As a workaround for issues with type inference, this library should probably look at the schema in the load job config passed to the load_table_from_dataframe method.
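A minimal sketch of the usage this proposal would enable, assuming the all-None column should load as a nullable STRING column; the dataset, table, and column names are placeholders, and at the time of this report the schema on the job config was not yet consulted during Parquet serialization (which is exactly what this issue asks for):

```python
# Hypothetical usage sketch: pass the desired BigQuery schema on the job config so
# the client does not have to rely on pandas/pyarrow type inference.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset("my_dataset").table("my_table")  # placeholder names

# An object column holding only None carries no type information by itself.
df = pd.DataFrame({"application": pd.Series([None, None, None], dtype="object")})

job_config = bigquery.LoadJobConfig()
job_config.schema = [bigquery.SchemaField("application", "STRING", mode="NULLABLE")]

load_job = client.load_table_from_dataframe(df, table_ref, job_config=job_config)
load_job.result()  # wait for the load job to finish
```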
The reason that this issue does not occur in pandas-gbq is that (1) pandas-gbq serializes to CSV and (2) pandas-gbq does its own dtype-to-BigQuery-schema translation.
Possibly related: https://issues.apache.org/jira/browse/ARROW-2298 and https://issues.apache.org/jira/browse/ARROW-2135 (I believe this means NaNs are/were always treated as floats in pyarrow). As a workaround, try using
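As an illustration of the inference problem described above, and of one way to force the type explicitly on the Arrow side, here is a small sketch; the column name is just an example, and the exact type pyarrow infers for an all-None column depends on the pyarrow version:

```python
# With no other information, pyarrow must guess a type for an object column that
# holds only None; depending on the version, the guess is float64 or null rather
# than string, which is what breaks the Parquet load described in this issue.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"application": pd.Series([None, None], dtype="object")})

inferred = pa.Table.from_pandas(df, preserve_index=False)
print(inferred.schema)  # inferred type is not string

# Supplying an explicit Arrow schema sidesteps the guess entirely.
explicit = pa.Table.from_pandas(
    df,
    schema=pa.schema([pa.field("application", pa.string())]),
    preserve_index=False,
)
print(explicit.schema)  # application: string
```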
Tangentially related: apache/arrow#3479. I think the best design would be to use the pandas schema to build a schema object and send that to BigQuery. That still relies on Arrow encoding the DataFrame correctly.
Except that internally, a Pandas column with NaN will probably default to FLOAT if I understand it correctly.
Setting a column to None changes the issue to "google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table. Field application has changed type from STRING to INTEGER", even after first setting the whole column to strings. This fix probably works best from our end, though not the easiest to build:
However, I am not sure if there is a monetary cost to requesting that data.
Agree on the resolution of using the schema from the LoadJobConfig when one is supplied.
To be clear, the schema that should be looked at is a Python object already provided on the LoadJobConfig, not data that has to be requested from the service.
Some thoughts on a possible implementation. When using the pyarrow engine for to_parquet, we basically convert to an Arrow table first. It's likely at this conversion that the object dtype for an all-None column gets treated as a nullable float column, since there's no other information given to pyarrow about the types besides the dtype and the actual values. Perhaps we should always use pyarrow and explicitly convert to the table ourselves, giving ourselves the chance to adjust dtypes (sketched below).

If we do end up with an intermediate pyarrow table, we could also use it to determine whether there are any array or struct columns to deal with. If not, we could serialize to CSV instead of Parquet. Anecdotally, CSV is faster to serialize and for BigQuery to parse and load, so there could be performance benefits.

If it turns out to be difficult to actually change the dtype during the conversion to a pyarrow table, then using CSV also makes the final dtype matter less, since null values serialize the same either way. That means this bug would only surface in the rarer case where structs or arrays are used in the same DataFrame as a completely null column.
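A sketch of that idea, assuming we always use pyarrow: convert the DataFrame to an Arrow table ourselves (the point at which column types can still be adjusted) and write Parquet to an in-memory buffer instead of calling DataFrame.to_parquet. The helper name here is made up for illustration.

```python
# Hypothetical helper: serialize a DataFrame to Parquet bytes, optionally forcing
# an explicit Arrow schema instead of relying on type inference.
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def dataframe_to_parquet_bytes(df: pd.DataFrame, arrow_schema: pa.Schema = None) -> bytes:
    """Convert to an Arrow table first, so column types can be overridden."""
    table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)
    buffer = io.BytesIO()
    pq.write_table(table, buffer)
    return buffer.getvalue()
```

The resulting bytes could then be uploaded with load_table_from_file.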
I've investigated this approach (manually converting to Arrow and specifying the needed column types via the BigQuery column type). I think it will work. I've come up with the Arrow types we'll need to use for a given column type.

pandas -> Arrow -> Parquet -> BigQuery

We need to be able to use the BigQuery schema to override the behavior of the pandas -> Arrow + Parquet conversion, especially when the pandas type is ambiguous (an object dtype containing all None values, for example).

What happens if the number of columns (in the LoadJobConfig and the DataFrame) doesn't match? Error! But don't forget about indexes; those could be the difference between a schema matching and not. Maybe don't support writing indexes if a schema is supplied?

Can we tell what the desired schema is if they don't provide it? Not always. If it's an append job, maybe we can make a GET call to compare schemas if we get ambiguity? Let's do that as a follow-up if explicitly passing in a schema doesn't work.

Type conversion references:
pandas -> Arrow: https://arrow.apache.org/docs/python/pandas.html#pandas-arrow-conversion
Arrow -> Parquet: https://arrow.apache.org/docs/python/parquet.html#data-type-handling
Parquet -> BigQuery: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#parquet_conversions

How should we change the Arrow types if a BigQuery schema is available in the LoadJobConfig? See the legacy SQL data types: https://cloud.google.com/bigquery/data-types#legacy_sql_data_types (a sketch of one possible mapping follows below).
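An illustrative sketch of the kind of BigQuery-to-Arrow type mapping discussed above. The exact choices (not exhaustive, and not necessarily the mapping ultimately adopted), particularly precision and timezone for TIME, DATETIME, TIMESTAMP, and NUMERIC, would need to follow the Parquet-to-BigQuery conversion rules linked above.

```python
# Assumed mapping from legacy BigQuery scalar types to Arrow types, plus a helper
# that builds an Arrow schema from a list of SchemaField objects.
import pyarrow as pa
from google.cloud import bigquery

BQ_TO_ARROW = {
    "STRING": pa.string(),
    "BYTES": pa.binary(),
    "INTEGER": pa.int64(),
    "FLOAT": pa.float64(),
    "BOOLEAN": pa.bool_(),
    "DATE": pa.date32(),
    "TIME": pa.time64("us"),
    "DATETIME": pa.timestamp("us"),
    "TIMESTAMP": pa.timestamp("us", tz="UTC"),
    "NUMERIC": pa.decimal128(38, 9),
}


def bq_to_arrow_schema(bq_schema):
    """Build an Arrow schema from BigQuery SchemaField objects (scalar types only)."""
    fields = [
        pa.field(
            field.name,
            BQ_TO_ARROW[field.field_type],
            nullable=field.mode != "REQUIRED",
        )
        for field in bq_schema
    ]
    return pa.schema(fields)


arrow_schema = bq_to_arrow_schema(
    [bigquery.SchemaField("application", "STRING", mode="NULLABLE")]
)
```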
I've manually constructed a Parquet file with both non-null values and all nulls and confirmed I'm able to append a file with null columns to an existing table.

What about pandas indexes? For now: leave them off if a schema is supplied. How could we use the schema to match DataFrame indexes + columns? Maybe assume the indexes are the first "columns" (see the sketch below).
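A small sketch of the two index options noted above, using pyarrow directly; the DataFrame here is just an example:

```python
# Option 1: if a schema is supplied, leave indexes off entirely.
# Option 2: treat named indexes as the first "columns" and expect the supplied
# schema to cover them too (reset_index turns them into ordinary columns).
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"value": [1, 2]}, index=pd.Index(["a", "b"], name="key"))

without_index = pa.Table.from_pandas(df, preserve_index=False)
with_index_as_columns = pa.Table.from_pandas(df.reset_index(), preserve_index=False)

print(without_index.column_names)          # ['value']
print(with_index_as_columns.column_names)  # ['key', 'value']
```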
Environment details
OS = Windows 7
Python = 3.7.2
google-cloud-bigquery = 1.8.1
Steps to reproduce
Code example
Error
So I've been in contact with enterprise support, but they were unable to help me. Apparently there is a bug in google.cloud.bigquery.Client, in the line dataframe.to_parquet(buffer), that causes columns with all NaN values to be interpreted as FLOAT or INTEGER instead of STRING. This prevents the DataFrame from being uploaded, and there is no other way to introduce NULLs into the table in BigQuery. The issue does not occur in pandas-gbq; support (ticket 18371705) advised me to use that as a workaround until this is fixed and to report the issue here. If you have any questions or need more information, feel free to ask.
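A minimal, hypothetical sketch of a reproduction, assuming an existing table whose "application" column is a nullable STRING; the dataset, table, and column names are placeholders rather than the reporter's actual code:

```python
# Hypothetical reproduction: loading a DataFrame whose only values in a STRING
# column are None. Type inference serializes the column to Parquet as a non-string
# type, so the load job fails with a schema mismatch instead of writing NULLs.
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset("my_dataset").table("my_table")  # placeholder names

df = pd.DataFrame({"application": pd.Series([None, None, None], dtype="object")})

load_job = client.load_table_from_dataframe(df, table_ref)
load_job.result()  # raises google.api_core.exceptions.BadRequest on schema mismatch
```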