
BigQuery: 'RowIterator.to_dataframe' surprisingly consumes / merges all pages. #7293

Closed

kykrueger opened this issue Feb 6, 2019 · 7 comments · Fixed by #7338
Labels
api: bigquery Issues related to the BigQuery API. type: docs Improvement to the documentation for an API.

kykrueger commented Feb 6, 2019

Environment details

python version: 3.7.2
virtual environment: Conda managed

pip freeze | grep 'google'

  • google-api-core==1.7.0
  • google-auth==1.6.2
  • google-auth-oauthlib==0.2.0
  • google-cloud-bigquery==1.9.0
  • google-cloud-core==0.29.1
  • google-resumable-media==0.3.2
  • googleapis-common-protos==1.5.6

Problem

The strategy for paginating through a table in BigQuery with RowIterator.to_dataframe() does not work as expected.
Either

  • the documentation should be updated to show that it will iterate automatically over all pages,
  • or it should only fill the DataFrame with the first page, and let the user iterate and join the frames as needed.

Steps to reproduce

  1. Check the instructions for paginating through a query.
  2. Note the suggestion for how to read subsequent pages (a sketch of this manual approach follows the source excerpt below):

A page is a subset of the total number of rows. If your results are more than one page of data, the result data will have a pageToken property. To retrieve the next page of results, make another list call and include the token value as a URL parameter named pageToken.

  3. Check the response for the Client.list_rows() method.
  4. See that the iterator may be transformed to a dataframe, and assume that it will only transform the page which was already loaded.
  5. Call RowIterator.to_dataframe() and see that it loads the rest of the pages and unions them into a single dataframe: more API queries are made.
  6. Be surprised that the rest of the pages were loaded without warning, and that the native implementation gives no way to work with one page's dataframe at a time.
  7. Check the source code and confirm your suspicions:
    def _to_dataframe_tabledata_list(self, dtypes):
        """Use (slower, but free) tabledata.list to construct a DataFrame."""
        column_names = [field.name for field in self.schema]
        frames = []
        for page in iter(self.pages):
            frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
        return pandas.concat(frames)

    def to_dataframe(self, bqstorage_client=None, dtypes=None):
        . . .
        if pandas is None:
            raise ValueError(_NO_PANDAS_ERROR)
        if dtypes is None:
            dtypes = {}

        if bqstorage_client is not None:
            return self._to_dataframe_bqstorage(bqstorage_client, dtypes)
        else:
            return self._to_dataframe_tabledata_list(dtypes)
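
For reference, here is a sketch of the manual one-page-at-a-time flow which the REST docs describe, expressed with the client library (the table name is a placeholder):

# Manual paging with an explicit page token, mirroring the REST docs:
# each loop iteration issues one tabledata.list request.
page_token = None
while True:
    row_iterator = client.list_rows(
        "my-project.my_dataset.my_table",  # placeholder table
        page_size=5,
        page_token=page_token,
    )
    page = next(row_iterator.pages)  # fetch exactly one page
    for row in page:
        ...  # process this page's rows only
    page_token = row_iterator.next_page_token
    if page_token is None:
        break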

Code example

table = client.get_table(destination_table)
row_iterator = client.list_rows(destination_table,
                                selected_fields=table.schema,
                                page_size=5,
                                max_results=20)
df = row_iterator.to_dataframe()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to X
Data columns (total X columns):

@tseaver tseaver added the api: bigquery Issues related to the BigQuery API. label Feb 6, 2019
@tseaver tseaver changed the title BigQuery list_rows and RowIterator Docs BigQuery: 'RowIterator.to_dataframe' surprisingly consumes / merges all pages. Feb 6, 2019
@tseaver tseaver added the type: question Request for information or clarification. Not an issue. label Feb 6, 2019

tseaver commented Feb 6, 2019

@tswast I don't know whether this is just a docs issue (clarify that RowIterator.to_dataframe consumes / merges all results into a single dataframe), or whether we need to add a feature which allows getting a dataframe from only a subset (the page, perhaps?).


tswast commented Feb 6, 2019

That whole https://cloud.google.com/bigquery/docs/paging-results page is pretty irrelevant when using the client library. The RowIterator takes care of pagination for you.
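
For example, a minimal sketch (the table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Iterating the RowIterator fetches pages behind the scenes:
# one tabledata.list request per page, never one per row.
rows = client.list_rows("my-project.my_dataset.my_table", page_size=500)
for row in rows:
    do_something(row)  # do_something is a placeholder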

What is the reason that you'd want a dataframe for only a single page? Potentially this could be a feature request to provide a helper on the page property of the RowIterator to construct a dataframe if there is a reason this is needed.


kykrueger commented Feb 7, 2019

I couldn't find any documentation about how the RowIterator takes care of pagination; at the time of writing this issue, nothing beyond using the next_page_token. Since then, I have found the documentation about it.

With that in mind, this is really just a documentation problem. However, I'll describe what I was trying to do, and if you think it falls in line with good practices, we can open a feature request. (I am also willing to make the pull request myself, but might need some help, since I have not contributed to this repo before.)

What I was doing

We have a small dataset in the cloud which is normally processed with C, and we have some tools for that C library which wrap it in Python 3. We need to re-process this data for an analysis, but since Python 3 is not yet supported by Dataflow, the plan was to process it by paging through it and uploading results one page at a time, since we will be doing the processing locally on a RAM-limited machine.

I tried to load a page and wanted to process just that one page, for the sake of keeping memory free for the analysis. Working with a dataframe would be ideal, since some of our methods accept dataframes as input. But when using the to_dataframe method, we ended up with a lot more memory in use, and when increasing the amount of data imported by this tool, we would not have enough memory left to do anything with it.
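
Roughly what I was hoping for, as a sketch (process_page is a placeholder for our analysis step):

import pandas

# Hypothetical per-page processing: build a small DataFrame from each
# page instead of materializing the whole result set at once.
column_names = [field.name for field in row_iterator.schema]
for page in row_iterator.pages:  # one API request per page
    frame = pandas.DataFrame([dict(row) for row in page], columns=column_names)
    process_page(frame)  # memory stays bounded by one page
    del frame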

Maybe this is worth opening the feature request that @tswast suggested.

However for now,

Documentation fixes

I think we need to:

  • document in the docstring for to_dataframe that it will load the rest of the pages,
  • add a relevant link to the Python example code on the paging results page,
  • add a more obvious connection to the documentation for the RowIterator's base class, HTTPIterator, which I linked to above.

@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. type: docs Improvement to the documentation for an API. and removed type: question Request for information or clarification. Not an issue. labels Feb 7, 2019
kykrueger commented

The most needed change is the docstring one; for the generated documentation, I am unsure how to add a hyperlink, or what the best solution is. Shouldn't the documentation tooling automatically find the HTTPIterator, and let us click on the name to go to it, since it belongs to the same repo?


tswast commented Feb 12, 2019

Shouldn't the documentation language automatically find the HTTPIterator.

Sphinx doesn't show inheritance by default. I believe we'd need to add :show-inheritance: to the autoclass directive at the bottom of the template at https://github.com/googleapis/google-cloud-python/blob/master/third_party/sphinx/sphinx/ext/autosummary/templates/autosummary/class.rst
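
Something like the following at the bottom of that template (the directive's existing options are an assumption on my part):

.. autoclass:: {{ objname }}
   :members:
   :show-inheritance: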

@tswast tswast self-assigned this Feb 13, 2019
@tswast tswast removed the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Feb 13, 2019

tswast commented Feb 13, 2019

I've sent #7338 to fix the docs issue and created #7339 to track the feature request to get a DataFrame for each page.

mrn-aglic commented

Hi, I just wanted to check whether I understood correctly:

When iterating over an instance of RowIterator, is an API request sent to BQ for each row (element) in the iterator?
When iterating over RowIterator.pages, is an API request sent to BQ for each page (batch of elements/rows)?

I'm wondering how much data is loaded into application memory. In the first case, does the first iteration load only one row, while in the second case the whole batch is loaded at once?
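
Concretely, the two access patterns I mean (a sketch, with process as a placeholder):

# Pattern 1: iterate rows directly.
for row in row_iterator:
    process(row)  # does each step trigger a request, or only page boundaries?

# Pattern 2: iterate explicit pages.
for page in row_iterator.pages:
    for row in page:  # is the whole page already in memory here?
        process(row)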

Thank you!
