BigQuery: load_table_from_dataframe function fails with Unable to determine type of column #9228
@lopezvit Thanks for the report! A column whose type cannot be automatically determined issues a warning (the first few lines of the output), but that should not result in a 404 error. It appears that the target dataset does not exist, or there might be a typo in its name. Can you please check that the dataset exists and that its name is indeed correct?

**Column type warning**

It is generally recommended to use an explicit schema, as auto-detecting column types is not always reliable and has thus been deprecated recently. To enable deprecation warnings, place the following lines at the top of the script:

```python
import warnings

warnings.simplefilter("always", category=PendingDeprecationWarning)
warnings.simplefilter("always", category=DeprecationWarning)
```

With these lines added, loading dataframe data into a new table surfaces the deprecation warning in the output.
**Providing a schema**

If the target table does not exist yet, and its schema thus cannot be fetched, and the dataframe contains columns whose type cannot be autodetected, one needs to provide the (partial) schema for these columns. I managed to get the sample script working with the following modifications:

```python
# make sure that the sale_date column is recognized as a date
import datetime as dt

...

def from_iso_date(date_str):
    if not date_str:
        return None
    return dt.datetime.strptime(date_str, '%Y-%m-%d').date()

dfr['sale_date'] = dfr['sale_date'].apply(from_iso_date)

...
```
```python
# provide an explicit schema for columns that need it
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(name="neighborhood", field_type="STRING"),
        bigquery.SchemaField(name="building_class_category", field_type="STRING"),
        bigquery.SchemaField(name="tax_class_at_present", field_type="STRING"),
        bigquery.SchemaField(name="ease_ment", field_type="STRING"),
        bigquery.SchemaField(name="building_class_at_present", field_type="STRING"),
        bigquery.SchemaField(name="address", field_type="STRING"),
        bigquery.SchemaField(name="apartment_number", field_type="STRING"),
        bigquery.SchemaField(name="building_class_at_time_of_sale", field_type="STRING"),
        bigquery.SchemaField(name="sale_date", field_type="DATE"),
    ]
)

client.load_table_from_dataframe(dfr, table_ref, job_config=job_config).result()
```

Let us know if this solves your issue.
Thank you, it does solve my problem. I wasn't aware that auto-detection had been deprecated.
No worries, it is a fairly recent change. It would indeed be convenient if the schema could always be reliably autodetected, but it turned out that in practice, this is not always the case, unfortunately.
For existing tables, the schema, if not given, is already fetched automatically, so an explicit schema is not necessary. For new tables, however, when the schema is inferred solely from a given dataframe, it might indeed be useful to show what the inferred schema would look like, giving users a chance to see which subset of columns needs to be explicitly specified. @tswast what do you think about such a tool / helper method?
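A preview helper along those lines might look roughly like the following sketch. The name `preview_inferred_schema` and the dtype-to-BigQuery mapping are illustrative assumptions, not the library's actual inference rules:

```python
# Hypothetical sketch of a schema-preview helper (not part of the public
# google-cloud-bigquery API): map pandas dtypes to BigQuery type names and
# report which columns could not be inferred.
import pandas as pd
from pandas.api.types import (
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_float_dtype,
    is_integer_dtype,
)

def preview_inferred_schema(dataframe):
    """Return (inferred, unknown): BigQuery types per column, plus the
    columns that would need an explicit schema entry."""
    inferred, unknown = {}, []
    for name, dtype in dataframe.dtypes.items():
        if is_integer_dtype(dtype):
            inferred[name] = "INTEGER"
        elif is_float_dtype(dtype):
            inferred[name] = "FLOAT"
        elif is_bool_dtype(dtype):
            inferred[name] = "BOOLEAN"
        elif is_datetime64_any_dtype(dtype):
            inferred[name] = "TIMESTAMP"
        else:
            # object columns (e.g. strings) cannot be inferred reliably
            unknown.append(name)
    return inferred, unknown

df = pd.DataFrame({"price": [1.0], "address": ["5th Ave"]})
inferred, unknown = preview_inferred_schema(df)
print(inferred)  # {'price': 'FLOAT'}
print(unknown)   # ['address'] -> needs an explicit SchemaField
```

Columns that end up in `unknown` are exactly the ones that would need an explicit `SchemaField`, as in the example above.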
Do you mean automatically creating the dataset if it does not exist yet? Right now the BigQuery backend only creates a table, if it does not exist yet, when loading data into it. I suppose a dataset could be created automatically, too, although that would have to be done with additional client logic. This would require implementing the same in the BQ clients for other languages, though, as consistency across implementations is desired. @tswast Do you know if there has been such a feature request (creating datasets automatically) in the past?
I'm lukewarm on the idea of making a public function. If we do this, it'd mean having a public helper to maintain. There's some precedent for a helper method, but I think the current `dataframe_to_bq_schema` function will only be useful as a public method once we implement the changes I suggest in #9206 (comment) to continue when object dtypes are detected.
I'm wary of this request. Datasets contain many important properties that need to be set at creation time, such as location and KMS keys. If we created these automatically, they would fall back to the API defaults (usually the US location and Google-managed encryption). I'm not aware of any other requests for this feature.
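For comparison, creating the dataset explicitly before loading lets these creation-time properties be chosen deliberately rather than defaulted. A minimal sketch, where the dataset name and key path are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = bigquery.Dataset(client.dataset("my_dataset"))
dataset.location = "EU"  # the location must be set at creation time
# Optionally use a customer-managed key instead of Google-managed encryption:
# dataset.default_encryption_configuration = bigquery.EncryptionConfiguration(
#     kms_key_name="projects/.../locations/.../keyRings/.../cryptoKeys/..."
# )
client.create_dataset(dataset)
```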
Closing this, as the solution has been found and there are reservations about both suggestions for the reasons stated above (the proposals are still appreciated, though).
**Environment details**

**Steps to reproduce**

Call `load_table_from_dataframe`.

**Code example**

**Stack trace**
I think that the column type extractor should use the function `is_string_dtype` from `pandas.api.types` to determine if it is a string column, because the dtype is `object`.
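For illustration, a minimal sketch of the suggested check, with a made-up dataframe:

```python
import pandas as pd
from pandas.api.types import is_string_dtype

df = pd.DataFrame({"address": ["5th Ave", None], "units": [3, 4]})

print(is_string_dtype(df["address"]))  # True, although the dtype is object
print(is_string_dtype(df["units"]))    # False for the integer column
```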