
BigQuery: 'RowIterator.to_dataframe' surprisingly consumes / merges all pages. #7293

Closed

kykrueger opened this issue Feb 6, 2019 · 7 comments · Fixed by #7338
Labels
api: bigquery Issues related to the BigQuery API. type: docs Improvement to the documentation for an API.

kykrueger commented Feb 6, 2019

Environment details

python version: 3.7.2
virtual environment: Conda managed

pip freeze | grep 'google'

  • google-api-core==1.7.0
  • google-auth==1.6.2
  • google-auth-oauthlib==0.2.0
  • google-cloud-bigquery==1.9.0
  • google-cloud-core==0.29.1
  • google-resumable-media==0.3.2
  • googleapis-common-protos==1.5.6

Problem

The strategy for paginating through a table in BigQuery with RowIterator.to_dataframe() does not work as expected.
Either

  • the documentation should be updated to show that it will iterate automatically over all pages,
  • or it should only fill the DataFrame with the first page, and let the user iterate and join the frames as needed.

Steps to reproduce

  1. Check the instructions for paginating through a query.
  2. Note the suggestion for how to read subsequent pages (a sketch of this manual approach follows the source excerpt below):

A page is a subset of the total number of rows. If your results are more than one page of data, the result data will have a pageToken property. To retrieve the next page of results, make another list call and include the token value as a URL parameter named pageToken.

  3. Check the response for the Client.list_rows() method.
  4. See that the iterator may be transformed to a dataframe, and assume that it will only transform the page which was already loaded.
  5. Call RowIterator.to_dataframe() and see that it loads the rest of the pages and unions them into a single dataframe: more API queries are made.
  6. Be surprised that the rest of the pages were loaded without warning, and that the native implementation gives no way to work with one page's dataframe at a time.
  7. Check the source code and confirm your suspicions:
    def _to_dataframe_tabledata_list(self, dtypes):
        """Use (slower, but free) tabledata.list to construct a DataFrame."""
        column_names = [field.name for field in self.schema]
        frames = []
        for page in iter(self.pages):
            frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
        return pandas.concat(frames)

    def to_dataframe(self, bqstorage_client=None, dtypes=None):
        . . .
        if pandas is None:
            raise ValueError(_NO_PANDAS_ERROR)
        if dtypes is None:
            dtypes = {}

        if bqstorage_client is not None:
            return self._to_dataframe_bqstorage(bqstorage_client, dtypes)
        else:
            return self._to_dataframe_tabledata_list(dtypes)
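
For reference, here is a sketch of the manual one-page-at-a-time flow which the REST docs describe, expressed with the client library (the table name is a placeholder):

# Manual paging with an explicit page token, mirroring the REST docs:
# each loop iteration issues one tabledata.list request.
page_token = None
while True:
    row_iterator = client.list_rows(
        "my-project.my_dataset.my_table",  # placeholder table
        page_size=5,
        page_token=page_token,
    )
    page = next(row_iterator.pages)  # fetch exactly one page
    for row in page:
        ...  # process this page's rows only
    page_token = row_iterator.next_page_token
    if page_token is None:
        break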

Code example

table = client.get_table(destination_table)
row_iterator = client.list_rows(destination_table,
                                selected_fields=table.schema,
                                page_size=5,
                                max_results=20)
df = row_iterator.to_dataframe()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to X
Data columns (total X columns):

@tseaver tseaver added the api: bigquery Issues related to the BigQuery API. label Feb 6, 2019
@tseaver tseaver changed the title BigQuery list_rows and RowIterator Docs BigQuery: 'RowIterator.to_dataframe' surprisingly consumes / merges all pages. Feb 6, 2019
@tseaver tseaver added the type: question Request for information or clarification. Not an issue. label Feb 6, 2019

tseaver commented Feb 6, 2019

@tswast I don't know whether this is just a docs issue (clarify that RowIterator.to_dataframe consumes / merges all results into a single dataframe), or whether we need to add a feature which allows getting a dataframe from only a subset (the page, perhaps?).


tswast commented Feb 6, 2019

That whole https://cloud.google.com/bigquery/docs/paging-results page is pretty irrelevant when using the client library. The RowIterator takes care of pagination for you.
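
For example, a minimal sketch (the table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Iterating the RowIterator fetches pages behind the scenes:
# one tabledata.list request per page, never one per row.
rows = client.list_rows("my-project.my_dataset.my_table", page_size=500)
for row in rows:
    do_something(row)  # do_something is a placeholder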

What is the reason that you'd want a dataframe for only a single page? Potentially this could be a feature request to provide a helper on the page property of the RowIterator to construct a dataframe if there is a reason this is needed.


kykrueger commented Feb 7, 2019

I couldn't find any documentation about how the RowIterator takes care of pagination; at the time of writing this issue, nothing beyond using the next_page_token. Since then, I have found the documentation about it.

With that in mind, this is really just a documentation problem. However, I'll describe what I was trying to do, and if you think it falls in line with good practices, we can open a feature request. (I am also willing to make the pull request myself, but might need some help, since I have not contributed to this repo before.)

What I was doing

We have a small dataset in the cloud which is normally processed with C, and we have some tools for that C library which wrap it in Python 3. We need to re-process this data for an analysis, but since Python 3 is not yet supported by Dataflow, the plan was to process it by paging through it and uploading results one page at a time, since we will be doing the processing locally on a RAM-limited machine.

I tried to load a page and wanted to process just that one page, for the sake of keeping memory free for the analysis. Working with a dataframe would be ideal, since some of our methods accept dataframes as input. But when using the to_dataframe method, we ended up with a lot more memory in use, and when increasing the amount of data imported by this tool, we would not have enough memory left to do anything with it.
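
Roughly what I was hoping for, as a sketch (process_page is a placeholder for our analysis step):

import pandas

# Hypothetical per-page processing: build a small DataFrame from each
# page instead of materializing the whole result set at once.
column_names = [field.name for field in row_iterator.schema]
for page in row_iterator.pages:  # one API request per page
    frame = pandas.DataFrame([dict(row) for row in page], columns=column_names)
    process_page(frame)  # memory stays bounded by one page
    del frame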

Maybe this is worth opening the feature request that @tswast suggested.

However for now,

Documentation fixes

I think we need to:

  • document in the docstring for to_dataframe that it will load the rest of the pages,
  • add a relevant link to the Python example code on the paging results page,
  • add a more obvious connection to the documentation for the RowIterator's base class, HTTPIterator, which I linked to above.

@tswast tswast added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. type: docs Improvement to the documentation for an API. and removed type: question Request for information or clarification. Not an issue. labels Feb 7, 2019
kykrueger commented

The most needed change is the docstring one; for the generated documentation, I am unsure how to add a hyperlink, or what the best solution is. Shouldn't the documentation tooling automatically find the HTTPIterator, and let us click on the name to go to it, since it belongs to the same repo?


tswast commented Feb 12, 2019

Shouldn't the documentation language automatically find the HTTPIterator.

Sphinx doesn't show inheritance by default. I believe we'd need to add :show-inheritance: to the autoclass directive at the bottom of the template at https://github.com/googleapis/google-cloud-python/blob/master/third_party/sphinx/sphinx/ext/autosummary/templates/autosummary/class.rst
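
Something like the following at the bottom of that template (the directive's existing options are an assumption on my part):

.. autoclass:: {{ objname }}
   :members:
   :show-inheritance: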

@tswast tswast self-assigned this Feb 13, 2019
@tswast tswast removed the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Feb 13, 2019

tswast commented Feb 13, 2019

I've sent #7338 to fix the docs issue and created #7339 to track the feature request to get a DataFrame for each page.

mrn-aglic commented

Hi, I just wanted to check whether I understood correctly:

When iterating over an instance of RowIterator, is an API request sent to BQ for each row (element) in the iterator?
When iterating over RowIterator.pages, is an API request sent to BQ for each page (batch of elements/rows)?

I'm wondering how much data is loaded into application memory. In the first case, does the first iteration load only one row, while in the second case the whole batch is loaded at once?
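
Concretely, the two access patterns I mean (a sketch, with process as a placeholder):

# Pattern 1: iterate rows directly.
for row in row_iterator:
    process(row)  # does each step trigger a request, or only page boundaries?

# Pattern 2: iterate explicit pages.
for page in row_iterator.pages:
    for row in page:  # is the whole page already in memory here?
        process(row)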

Thank you!
