
pandas.io.gbq verify_schema seems to be too strict. #11359

Closed
FlxVctr opened this issue Oct 18, 2015 · 11 comments

Labels
Error Reporting (Incorrect or improved errors from pandas)

Comments

@FlxVctr

FlxVctr commented Oct 18, 2015

This line seems to me to be too strict for repeated insertion, because apparently GBQ is not consistent about the order of fields in the schema (or my application scrambles the field order; either way, I would say the verification is too strict).

So for example:

    dict1 = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'}, {'name': 'created_at', 'type': 'STRING'}]}
    dict2 = {'fields': [{'name': 'created_at', 'type': 'STRING'}, {'name': 'coordinates_0', 'type': 'FLOAT'}]}
    dict1 == dict2  # gives False

would make verification fail, even though the insertion itself would work, since the data is inserted as JSON and the order of fields is therefore irrelevant.

Solved that for myself for the moment with:

    def verify_schema(self, dataset_id, table_id, schema):
        import json
        from apiclient.errors import HttpError

        try:
            bq_schema = self.service.tables().get(
                projectId=self.project_id,
                datasetId=dataset_id,
                tableId=table_id
            ).execute()['schema']
            # Compare the schemas as sets so that the order of the fields is
            # irrelevant. json.dumps is necessary to make the field dicts
            # hashable; this still fails if the key order within a field
            # differs, but GBQ seems to keep key order.
            return (set(json.dumps(field) for field in bq_schema['fields']) ==
                    set(json.dumps(field) for field in schema['fields']))

        except HttpError as ex:
            self.process_http_error(ex)
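
For what it's worth, a minimal standalone sketch of an order-insensitive comparison that also sidesteps the key-order caveat above by serialising each field with sort_keys=True (the helper name is made up for illustration, this is not part of pandas):

    import json

    def schemas_match(local_schema, remote_schema):
        # Serialise each field dict with sorted keys so that neither the
        # order of the fields nor the key order inside a field matters.
        local_fields = {json.dumps(field, sort_keys=True)
                        for field in local_schema['fields']}
        remote_fields = {json.dumps(field, sort_keys=True)
                         for field in remote_schema['fields']}
        return local_fields == remote_fields

With the dict1/dict2 example above, schemas_match(dict1, dict2) returns True.
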
FlxVctr changed the title from "pandas.io.gbq verify_schema is too strict." to "pandas.io.gbq verify_schema seems to be too strict." on Oct 18, 2015
@jorisvandenbossche
Member

cc @parthea

@parthea
Contributor

parthea commented Oct 18, 2015

The strict ordering of fields is by design (also in 0.16.2). From the docs:

The dataframe must match the destination table in column order, structure, and data types.

because apparently GBQ is not consistent in the order of fields in the schema (or my application screws the order of fields up

Can you confirm whether it is your application that is changing the order of fields? If we can understand how the ordering of fields is changing, we may be better off fixing that problem instead.

My personal preference would be to fail if the column order of the DataFrame being inserted is different. It is trivial to alter the column order of a DataFrame prior to inserting.

Certainly, the proposed change would be more flexible, but I think that the user should be aware of the column order, in case the BigQuery table schema has actually changed.
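
For illustration, a minimal sketch of reordering a DataFrame's columns prior to insertion (the column order, table, and project names are made up for the example):

    import pandas as pd

    df = pd.DataFrame({'created_at': ['2015-10-18'], 'coordinates_0': [1.23]})
    # Reorder the columns to match the destination table's schema
    # before calling to_gbq (hypothetical dataset/project names).
    destination_order = ['coordinates_0', 'created_at']
    df = df[destination_order]
    # df.to_gbq('my_dataset.my_table', project_id='my-project')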

@FlxVctr
Author

FlxVctr commented Oct 19, 2015

Thanks for your answer. Apparently I wasn't reading the docs closely enough. It's very likely that it is my application, as I am transforming raw JSON from the Twitter API into a DataFrame, and the key order there is arbitrary. I can try to verify that, though. But column order seemed (before formatting anything for readable output) pretty unnecessary to me, only costing additional programming time and introducing more opportunities for errors to sneak in.

On the other hand (though now we are talking about particular project requirements and personal preferences), this use case is not that unlikely. And I don't see how the order of fields in BigQuery would matter for most applications in the end, as you query fields by name anyway. For my project it's just another step of processing the data (which will come in large volumes) before inserting, and I need to minimise those. I have to admit that reordering is trivial, indeed. So yes, that's a design decision. I would still plead for a less strict option (maybe via an optional argument to to_gbq?) though.

@parthea
Contributor

parthea commented Oct 19, 2015

The number of optional arguments in to_gbq is growing, so it may be better to decide whether to incorporate the change by default or not. The proposed change would certainly make inserting data into BigQuery easier, as long as we are ok with not maintaining column order.

I'd like @jacobschaer to comment. The note in the docs to maintain column order was added in #6937.

@jreback
Contributor

jreback commented Oct 19, 2015

Just to note that HDF5 requires the orderings to be exactly the same, as does SQL. I don't think this should be relaxed. OK to have a 'better' error message though.

jreback added the Error Reporting (Incorrect or improved errors from pandas) label on Oct 19, 2015
@FlxVctr
Author

FlxVctr commented Oct 20, 2015

Yes, a more detailed error message would be an improvement. Because neither Python dicts, nor pandas DataFrames, nor GBQ really care about column order, I did not expect it to matter, and it took me quite a while to find out what was wrong with my schema.
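
For example, a rough sketch of the kind of diagnostic a clearer error could carry, listing both column orders side by side (illustrative only, not the actual pandas code):

    def describe_schema_mismatch(local_schema, remote_schema):
        # Return a human-readable description of how the two schemas differ,
        # or None if the column order already matches.
        local_cols = [field['name'] for field in local_schema['fields']]
        remote_cols = [field['name'] for field in remote_schema['fields']]
        if local_cols == remote_cols:
            return None
        return ('DataFrame schema does not match the destination table.\n'
                'DataFrame columns:         %s\n'
                'Destination table columns: %s' % (local_cols, remote_cols))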

@jorisvandenbossche
Member

Note that for the SQL functions the ordering does not need to be the same (as we use named parameters and a dict with the data to insert).
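
Not the pandas internals, but a tiny sqlite3 illustration of why named parameters make the key order of the supplied dict irrelevant:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE tweets (created_at TEXT, coordinates_0 REAL)')
    # The dict's key order does not matter because the placeholders are named.
    conn.execute('INSERT INTO tweets (created_at, coordinates_0) '
                 'VALUES (:created_at, :coordinates_0)',
                 {'coordinates_0': 1.23, 'created_at': '2015-10-18'})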

parthea added a commit to parthea/pandas that referenced this issue Oct 21, 2015
jreback added a commit that referenced this issue Oct 21, 2015
ENH: Improve the error message in to_gbq when schema does not match #11359
Dr-Irv added a commit to Dr-Irv/pandas that referenced this issue Oct 24, 2015
@parthea
Contributor

parthea commented Sep 10, 2016

@jreback Can this be closed, or do you expect a PR for this? I improved the error message in #11401 (already merged)

@jreback
Contributor

jreback commented Sep 10, 2016

does this example work?
e.g. maybe we just need a test to validate this

@jreback
Contributor

jreback commented Sep 10, 2016

is schema ordered?

@parthea
Contributor

parthea commented Sep 10, 2016

The verify_schema function is currently very strict. Column order matters in the schema. The example solution works. I'll submit a PR to make the verify_schema function less strict and add tests to validate this. I'll also update the docs.
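
A minimal sketch of what such a test could look like, with the order-insensitive comparison inlined (the test name is illustrative, not taken from the actual pandas test suite):

    import json

    def test_verify_schema_ignores_field_order():
        def as_set(schema):
            # Field order (and key order within a field) should not matter.
            return {json.dumps(field, sort_keys=True)
                    for field in schema['fields']}

        remote = {'fields': [{'name': 'coordinates_0', 'type': 'FLOAT'},
                             {'name': 'created_at', 'type': 'STRING'}]}
        local = {'fields': [{'name': 'created_at', 'type': 'STRING'},
                            {'name': 'coordinates_0', 'type': 'FLOAT'}]}
        assert as_set(local) == as_set(remote)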
