
BigQuery: load_table_from_dataframe function fails with Unable to determine type of column #9228

Closed
lopezvit opened this issue Sep 13, 2019 · 5 comments
Labels: api: bigquery (Issues related to the BigQuery API.), type: question (Request for information or clarification. Not an issue.)


Environment details

  1. Debian GNU/Linux 9 (stretch)
  2. Python 3.5.3
  3. google-cloud-bigquery version: 1.19.0

Steps to reproduce

  1. Create a pandas dataframe from csv with text columns
  2. Execute the function load_table_from_dataframe

Code example

import pandas as pd
from google.cloud import bigquery

DATASET = 'nyc_real_estate'
TABLE = 'residential_sales'

dfr = pd.read_csv('https://storage.googleapis.com/asl-testing/data/nyc_open_data_real_estate.csv')

client = bigquery.Client()
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)
client.load_table_from_dataframe(dfr, table_ref).result()

Stack trace

/home/jupyter/.local/lib/python3.5/site-packages/google/cloud/bigquery/_pandas_helpers.py:275: UserWarning: Unable to determine type of column 'neighborhood'.
  warnings.warn(u"Unable to determine type of column '{}'.".format(column))
---------------------------------------------------------------------------
InvalidResponse                           Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, location, project, job_config)
   1447                 response = self._do_resumable_upload(
-> 1448                     file_obj, job_resource, num_retries
   1449                 )

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in _do_resumable_upload(self, stream, metadata, num_retries)
   1695         upload, transport = self._initiate_resumable_upload(
-> 1696             stream, metadata, num_retries
   1697         )

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in _initiate_resumable_upload(self, stream, metadata, num_retries)
   1738         upload.initiate(
-> 1739             transport, stream, metadata, _GENERIC_CONTENT_TYPE, stream_final=False
   1740         )

~/.local/lib/python3.5/site-packages/google/resumable_media/requests/upload.py in initiate(self, transport, stream, metadata, content_type, total_bytes, stream_final)
    326             retry_strategy=self._retry_strategy)
--> 327         self._process_initiate_response(response)
    328         return response

~/.local/lib/python3.5/site-packages/google/resumable_media/_upload.py in _process_initiate_response(self, response)
    452             self._get_status_code,
--> 453             callback=self._make_invalid,
    454         )

~/.local/lib/python3.5/site-packages/google/resumable_media/_helpers.py in require_status_code(response, status_codes, get_status_code, callback)
     92             response, u'Request failed with status code',
---> 93             status_code, u'Expected one of', *status_codes)
     94     return status_code

InvalidResponse: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>)

During handling of the above exception, another exception occurred:

NotFound                                  Traceback (most recent call last)
<ipython-input-7-3222cebc83b5> in <module>
      5 dataset_ref = client.dataset(DATASET)
      6 table_ref = dataset_ref.table(TABLE)
----> 7 client.load_table_from_dataframe(dfr, table_ref).result()

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in load_table_from_dataframe(self, dataframe, destination, num_retries, job_id, job_id_prefix, location, project, job_config, parquet_compression)
   1588                     location=location,
   1589                     project=project,
-> 1590                     job_config=job_config,
   1591                 )
   1592 

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, location, project, job_config)
   1453                 )
   1454         except resumable_media.InvalidResponse as exc:
-> 1455             raise exceptions.from_http_response(exc.response)
   1456 
   1457         return self.job_from_resource(response.json())

NotFound: 404 POST https://www.googleapis.com/upload/bigquery/v2/projects/<PROJECT-NAME>/jobs?uploadType=resumable: Not found: Dataset <PROJECT>:nyc_real_estate

I think the column type extractor should use pandas.api.types.is_string_dtype to determine whether a column is a string column, because string columns have dtype object.
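
For illustration, a minimal sketch of the behaviour in question (the sample data is made up; is_string_dtype returns True for object-dtype columns):

import pandas as pd
from pandas.api.types import is_string_dtype

# Text columns read from a CSV end up with the generic 'object' dtype.
df = pd.DataFrame({'neighborhood': ['ALPHABET CITY', 'CHELSEA']})

print(df['neighborhood'].dtype)             # object
print(is_string_dtype(df['neighborhood']))  # True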

@plamut added the api: bigquery and type: question labels on Sep 14, 2019

plamut commented Sep 14, 2019

@lopezvit Thanks for the report!

A column whose type cannot be automatically determined issues a warning (the first few lines of the output), but that should not result in a 404 error. It appears that the target dataset does not exist, or there might be a typo in its name.

Can you please check that the dataset exists and its name is indeed correct?
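
For a quick check from Python, something along these lines should work (a minimal sketch; get_dataset raises NotFound when the dataset is missing):

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()
try:
    client.get_dataset(client.dataset('nyc_real_estate'))
    print('Dataset exists.')
except NotFound:
    print('Dataset not found; create it first or fix the name.')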

Column type warning

It is generally recommended to use an explicit schema, as auto-detecting column types is not always reliable; for this reason, loading from a dataframe without a schema has recently been deprecated.

In order to enable deprecation warnings, the following lines can be placed at the top of the script:

import warnings

# Always show (pending) deprecation warnings, which are filtered out by default.
warnings.simplefilter("always", category=PendingDeprecationWarning)
warnings.simplefilter("always", category=DeprecationWarning)

With these lines added, loading dataframe data into a new table would produce the following in the output:

PendingDeprecationWarning: Schema could not be detected for all columns. Loading from a dataframe without a schema will be deprecated in the future, please provide a schema.

Providing a schema

If the target table does not exist yet, and its schema cannot be fetched, and the dataframe contains columns whose type cannot be autodetected, one needs to provide the (partial) schema for these columns.

I managed to get the sample script working with the following modifications:

# make sure that the sale_date column is recognized as date
import datetime as dt
...

def from_iso_date(date_str):
    if not date_str:
        return None
    return dt.datetime.strptime(date_str, '%Y-%m-%d').date()

dfr['sale_date'] = dfr['sale_date'].apply(from_iso_date)
...

# provide an explicit schema for columns that need it
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(name="neighborhood", field_type="STRING"),
        bigquery.SchemaField(name="building_class_category", field_type="STRING"),
        bigquery.SchemaField(name="tax_class_at_present", field_type="STRING"),
        bigquery.SchemaField(name="ease_ment", field_type="STRING"),
        bigquery.SchemaField(name="building_class_at_present", field_type="STRING"),
        bigquery.SchemaField(name="address", field_type="STRING"),
        bigquery.SchemaField(name="apartment_number", field_type="STRING"),
        bigquery.SchemaField(name="building_class_at_time_of_sale", field_type="STRING"),
        bigquery.SchemaField(name="sale_date", field_type="DATE"),
    ]
)

client.load_table_from_dataframe(dfr, table_ref, job_config=job_config).result()

Let us know if this solves your issue.

@lopezvit (Author) commented

Thank you, it does solve my problem. I wasn't aware that schema auto-detection was being deprecated.
I have mixed feelings about it: it means that in the future developers will have to make the extra effort of defining the schema, but of course it also means a safer implementation.
Would it be possible to create a small tool that infers the schema and prints it to the console? That way it could easily be copied and pasted when one needs to define a schema, modifying only the parts where the automatic guessing gets it wrong.
Regarding the 404 error, I just copied the code from a pandas_gbq example and migrated it to this library. Apparently pandas_gbq automatically creates the dataset if it doesn't exist, which is what confused me; to be honest, I didn't even read the whole trace and just focused on the first error. Would it be possible to recreate this behaviour in future releases?


plamut commented Sep 17, 2019

I wasn't aware that schema auto-detection was being deprecated.
I have mixed feelings about it: it means that in the future developers will have to make the extra effort of defining the schema, but of course it also means a safer implementation.

No worries, it is a fairly recent change. It would indeed be convenient if the schema could always be reliably autodetected, but it turned out that in practice, this is not always the case, unfortunately.

Would it be possible to create a small tool that infers the schema and prints it to the console? That way it could easily be copied and pasted when one needs to define a schema, modifying only the parts where the automatic guessing gets it wrong.

For existing tables, the schema, if not given, is already automatically fetched, and an explicit schema is not necessary.

For new tables, however, when the schema must be inferred solely from a given dataframe, it might indeed be useful to show what an inferred schema would look like, giving users a chance to see which subset of columns needs to be explicitly specified.
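
A minimal sketch of what such a helper could look like (purely hypothetical; the dtype mapping below is illustrative, not the library's):

# Illustrative mapping from pandas dtypes to BigQuery types; anything
# unrecognized (including 'object') falls back to STRING as a starting point.
DTYPE_TO_BQ = {
    'int64': 'INTEGER',
    'float64': 'FLOAT',
    'bool': 'BOOLEAN',
    'datetime64[ns]': 'TIMESTAMP',
}

def print_inferred_schema(dataframe):
    """Print a copy-pasteable SchemaField line for each column."""
    for name, dtype in dataframe.dtypes.items():
        bq_type = DTYPE_TO_BQ.get(str(dtype), 'STRING')
        print('bigquery.SchemaField(name="{}", field_type="{}"),'.format(name, bq_type))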

@tswast what do you think about such a tool / helper method?

Apparently pandas_gbq automatically creates the dataset if it doesn't exist, which is what confused me; to be honest, I didn't even read the whole trace and just focused on the first error. Would it be possible to recreate this behaviour in future releases?

Do you mean automatically creating the dataset if it does not exist yet? Right now, when loading data into a table that does not exist yet, the BigQuery backend only creates the table.

I suppose a dataset could be created automatically, too, although that would have to be done with additional client logic. This would require implementing the same in the BigQuery clients for other languages, though, as consistency across implementations is desired.
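
In the meantime, the dataset can be created explicitly before loading. A minimal sketch (create_dataset with exists_ok is available in recent versions of the library):

from google.cloud import bigquery

client = bigquery.Client()
dataset = bigquery.Dataset(client.dataset('nyc_real_estate'))
dataset.location = 'US'  # pick the location you need; it cannot be changed later
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists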

@tswast Do you know if there has been such a feature request in the past? (creating datasets automatically)

@plamut self-assigned this on Sep 23, 2019

tswast commented Sep 25, 2019

what do you think about such a tool / helper method?

I'm lukewarm on the idea of making a public function. If we do this, it'd mean having a public pandas_helpers module, since we want to somewhat isolate our dependency on pandas.

There's some precedent for a helper method (pandas_gbq.gbq._generate_bq_schema). I hope it is less necessary now that partial schemas are supported, though (#8140).

I think the current dataframe_to_bq_schema function will only be useful as a public method once we implement the changes I suggest in #9206 (comment) to continue when object dtypes are detected.
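
For reference, a rough sketch of calling that internal helper as it exists today (private API, subject to change; signature as of version 1.19):

from google.cloud.bigquery import _pandas_helpers

# Returns the detected schema, or None if any column's type cannot be
# determined (after emitting the 'Unable to determine type' warning).
schema = _pandas_helpers.dataframe_to_bq_schema(dfr, bq_schema=())
print(schema)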

I suppose a dataset could be created automatically, too, although that would have to be done with additional client logic.

I'm wary of this request. Datasets contain many important properties that need to be set at creation time, such as location and KMS keys. If we automatically create these, it'd default to the API default (which is usually US location and Google-managed encryption). I'm not aware of any other requests for this feature.


plamut commented Oct 19, 2019

Closing this, as a solution has been found and there are reservations about both suggestions for the reasons stated above (the proposals are still appreciated, though).

@plamut closed this as completed on Oct 19, 2019