
BigQuery: load_table_from_dataframe function fails with Unable to determine type of column #9228

Closed
lopezvit opened this issue Sep 13, 2019 · 5 comments
Labels: api: bigquery (Issues related to the BigQuery API.), type: question (Request for information or clarification. Not an issue.)


Environment details

  1. Debian GNU/Linux 9 (stretch)
  2. Python 3.5.3
  3. google-cloud-bigquery version: 1.19.0

Steps to reproduce

  1. Create a pandas dataframe from csv with text columns
  2. Execute the function load_table_from_dataframe

Code example

import pandas as pd
from google.cloud import bigquery

DATASET = 'nyc_real_estate'
TABLE = 'residential_sales'

dfr = pd.read_csv('https://storage.googleapis.com/asl-testing/data/nyc_open_data_real_estate.csv')

client = bigquery.Client()
dataset_ref = client.dataset(DATASET)
table_ref = dataset_ref.table(TABLE)
client.load_table_from_dataframe(dfr, table_ref).result()

Stack trace

/home/jupyter/.local/lib/python3.5/site-packages/google/cloud/bigquery/_pandas_helpers.py:275: UserWarning: Unable to determine type of column 'neighborhood'.
  warnings.warn(u"Unable to determine type of column '{}'.".format(column))
---------------------------------------------------------------------------
InvalidResponse                           Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, location, project, job_config)
   1447                 response = self._do_resumable_upload(
-> 1448                     file_obj, job_resource, num_retries
   1449                 )

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in _do_resumable_upload(self, stream, metadata, num_retries)
   1695         upload, transport = self._initiate_resumable_upload(
-> 1696             stream, metadata, num_retries
   1697         )

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in _initiate_resumable_upload(self, stream, metadata, num_retries)
   1738         upload.initiate(
-> 1739             transport, stream, metadata, _GENERIC_CONTENT_TYPE, stream_final=False
   1740         )

~/.local/lib/python3.5/site-packages/google/resumable_media/requests/upload.py in initiate(self, transport, stream, metadata, content_type, total_bytes, stream_final)
    326             retry_strategy=self._retry_strategy)
--> 327         self._process_initiate_response(response)
    328         return response

~/.local/lib/python3.5/site-packages/google/resumable_media/_upload.py in _process_initiate_response(self, response)
    452             self._get_status_code,
--> 453             callback=self._make_invalid,
    454         )

~/.local/lib/python3.5/site-packages/google/resumable_media/_helpers.py in require_status_code(response, status_codes, get_status_code, callback)
     92             response, u'Request failed with status code',
---> 93             status_code, u'Expected one of', *status_codes)
     94     return status_code

InvalidResponse: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>)

During handling of the above exception, another exception occurred:

NotFound                                  Traceback (most recent call last)
<ipython-input-7-3222cebc83b5> in <module>
      5 dataset_ref = client.dataset(DATASET)
      6 table_ref = dataset_ref.table(TABLE)
----> 7 client.load_table_from_dataframe(dfr, table_ref).result()

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in load_table_from_dataframe(self, dataframe, destination, num_retries, job_id, job_id_prefix, location, project, job_config, parquet_compression)
   1588                     location=location,
   1589                     project=project,
-> 1590                     job_config=job_config,
   1591                 )
   1592 

~/.local/lib/python3.5/site-packages/google/cloud/bigquery/client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, location, project, job_config)
   1453                 )
   1454         except resumable_media.InvalidResponse as exc:
-> 1455             raise exceptions.from_http_response(exc.response)
   1456 
   1457         return self.job_from_resource(response.json())

NotFound: 404 POST https://www.googleapis.com/upload/bigquery/v2/projects/<PROJECT-NAME>/jobs?uploadType=resumable: Not found: Dataset <PROJECT>:nyc_real_estate

I think the column type extractor should use pandas.api.types.is_string_dtype to determine whether a column is a string column, because string columns have dtype object.
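
For illustration, a minimal sketch of the behaviour in question (the sample data is made up; is_string_dtype returns True for object-dtype columns):

import pandas as pd
from pandas.api.types import is_string_dtype

# Text columns read from a CSV end up with the generic 'object' dtype.
df = pd.DataFrame({'neighborhood': ['ALPHABET CITY', 'CHELSEA']})

print(df['neighborhood'].dtype)             # object
print(is_string_dtype(df['neighborhood']))  # True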

@plamut added the api: bigquery and type: question labels on Sep 14, 2019

plamut commented Sep 14, 2019

@lopezvit Thanks for the report!

A column whose type cannot be automatically determined issues a warning (the first few lines of the output), but that should not result in a 404 error. It appears that the target dataset does not exist, or there might be a typo in its name.

Can you please check that the dataset exists and its name is indeed correct?
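
For a quick check from Python, something along these lines should work (a minimal sketch; get_dataset raises NotFound when the dataset is missing):

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()
try:
    client.get_dataset(client.dataset('nyc_real_estate'))
    print('Dataset exists.')
except NotFound:
    print('Dataset not found; create it first or fix the name.')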

Column type warning

It is generally recommended to use an explicit schema, as auto-detecting column types is not always reliable; for this reason, loading from a dataframe without a schema has recently been deprecated.

In order to enable deprecation warnings, the following lines can be placed at the top of the script:

import warnings

# Always show (pending) deprecation warnings, which are filtered out by default.
warnings.simplefilter("always", category=PendingDeprecationWarning)
warnings.simplefilter("always", category=DeprecationWarning)

With these lines added, loading dataframe data into a new table would produce the following in the output:

PendingDeprecationWarning: Schema could not be detected for all columns. Loading from a dataframe without a schema will be deprecated in the future, please provide a schema.

Providing a schema

If the target table does not exist yet, and its schema cannot be fetched, and the dataframe contains columns whose type cannot be autodetected, one needs to provide the (partial) schema for these columns.

I managed to get the sample script working with the following modifications:

# make sure that the sale_date column is recognized as date
import datetime as dt
...

def from_iso_date(date_str):
    if not date_str:
        return None
    return dt.datetime.strptime(date_str, '%Y-%m-%d').date()

dfr['sale_date'] = dfr['sale_date'].apply(from_iso_date)
...

# provide an explicit schema for columns that need it
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField(name="neighborhood", field_type="STRING"),
        bigquery.SchemaField(name="building_class_category", field_type="STRING"),
        bigquery.SchemaField(name="tax_class_at_present", field_type="STRING"),
        bigquery.SchemaField(name="ease_ment", field_type="STRING"),
        bigquery.SchemaField(name="building_class_at_present", field_type="STRING"),
        bigquery.SchemaField(name="address", field_type="STRING"),
        bigquery.SchemaField(name="apartment_number", field_type="STRING"),
        bigquery.SchemaField(name="building_class_at_time_of_sale", field_type="STRING"),
        bigquery.SchemaField(name="sale_date", field_type="DATE"),
    ]
)

client.load_table_from_dataframe(dfr, table_ref, job_config=job_config).result()

Let us know if this solves your issue.

@lopezvit (Author) commented

Thank you, it does solve my problem. I wasn't aware that schema auto-detection was being deprecated.
I have mixed feelings about it: it means that in the future developers will have to make the extra effort of defining the schema, but of course it also means a safer implementation.
Would it be possible to create a small tool that infers the schema and prints it to the console? That way it could easily be copied and pasted when one needs to define a schema, modifying only the parts where the automatic guessing gets it wrong.
Regarding the 404 error, I just copied the code from a pandas_gbq example and migrated it to this library. Apparently pandas_gbq automatically creates the dataset if it doesn't exist, which is what confused me; to be honest, I didn't even read the whole trace and just focused on the first error. Would it be possible to recreate this behaviour in future releases?


plamut commented Sep 17, 2019

I wasn't aware that schema auto-detection was being deprecated.
I have mixed feelings about it: it means that in the future developers will have to make the extra effort of defining the schema, but of course it also means a safer implementation.

No worries, it is a fairly recent change. It would indeed be convenient if the schema could always be reliably autodetected, but it turned out that in practice, this is not always the case, unfortunately.

Would it be possible to create a small tool that infers the schema and prints it to the console? That way it could easily be copied and pasted when one needs to define a schema, modifying only the parts where the automatic guessing gets it wrong.

For existing tables, the schema, if not given, is already automatically fetched, and an explicit schema is not necessary.

For new tables, however, when the schema must be inferred solely from a given dataframe, it might indeed be useful to show what an inferred schema would look like, giving users a chance to see which subset of columns needs to be explicitly specified.
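
A minimal sketch of what such a helper could look like (purely hypothetical; the dtype mapping below is illustrative, not the library's):

# Illustrative mapping from pandas dtypes to BigQuery types; anything
# unrecognized (including 'object') falls back to STRING as a starting point.
DTYPE_TO_BQ = {
    'int64': 'INTEGER',
    'float64': 'FLOAT',
    'bool': 'BOOLEAN',
    'datetime64[ns]': 'TIMESTAMP',
}

def print_inferred_schema(dataframe):
    """Print a copy-pasteable SchemaField line for each column."""
    for name, dtype in dataframe.dtypes.items():
        bq_type = DTYPE_TO_BQ.get(str(dtype), 'STRING')
        print('bigquery.SchemaField(name="{}", field_type="{}"),'.format(name, bq_type))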

@tswast what do you think about such a tool / helper method?

Apparently pandas_gbq automatically creates the dataset if it doesn't exist, which is what confused me; to be honest, I didn't even read the whole trace and just focused on the first error. Would it be possible to recreate this behaviour in future releases?

Do you mean automatically creating the dataset if it does not exist yet? Right now, when loading data into a table that does not exist yet, the BigQuery backend only creates the table.

I suppose a dataset could be created automatically, too, although that would have to be done with additional client logic. This would require implementing the same in the BigQuery clients for other languages, though, as consistency across implementations is desired.
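
In the meantime, the dataset can be created explicitly before loading. A minimal sketch (create_dataset with exists_ok is available in recent versions of the library):

from google.cloud import bigquery

client = bigquery.Client()
dataset = bigquery.Dataset(client.dataset('nyc_real_estate'))
dataset.location = 'US'  # pick the location you need; it cannot be changed later
client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists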

@tswast Do you know if there has been such a feature request in the past? (creating datasets automatically)

@plamut self-assigned this on Sep 23, 2019

tswast commented Sep 25, 2019

what do you think about such a tool / helper method?

I'm lukewarm on the idea of making a public function. If we do this, it'd mean having a public pandas_helpers module, since we want to somewhat isolate our dependency on pandas.

There's some precedent for a helper method (pandas_gbq.gbq._generate_bq_schema). I hope it is less necessary now that partial schemas are supported, though (#8140).

I think the current dataframe_to_bq_schema function will only be useful as a public method once we implement the changes I suggest in #9206 (comment) to continue when object dtypes are detected.
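
For reference, a rough sketch of calling that internal helper as it exists today (private API, subject to change; signature as of version 1.19):

from google.cloud.bigquery import _pandas_helpers

# Returns the detected schema, or None if any column's type cannot be
# determined (after emitting the 'Unable to determine type' warning).
schema = _pandas_helpers.dataframe_to_bq_schema(dfr, bq_schema=())
print(schema)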

I suppose a dataset could be created automatically, too, although that would have to be done with additional client logic.

I'm wary of this request. Datasets contain many important properties that need to be set at creation time, such as location and KMS keys. If we automatically create these, it'd default to the API default (which is usually US location and Google-managed encryption). I'm not aware of any other requests for this feature.


plamut commented Oct 19, 2019

Closing this, as a solution has been found and there are reservations about both suggestions for the reasons stated above (the proposals are still appreciated, though).

@plamut closed this as completed on Oct 19, 2019