Provide support for synchronous queries through the v2/projects/{projectId}/queries endpoint #589
@kdeggelman I'm not sure it would be feasible to change how that endpoint behaves on the backend. Even when POST-ing to it directly, it just creates a new job to run the actual query, and the code would still need to fetch the query results using the job ID obtained from the first call's response. As an alternative, have you considered using the faster BigQuery Storage API? It is more performant than the REST API, allows streaming the rows over multiple streams, and so on. Could that help with your use case?
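For reference, a minimal sketch of reading rows with the BigQuery Storage API, assuming the google-cloud-bigquery-storage package is installed; the project, dataset, and table names are placeholders, and for query results you would point it at the query's destination table:

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Fully qualified table to read (placeholder values).
table = "projects/my-project/datasets/my_dataset/tables/my_table"

session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table=table,
        data_format=types.DataFormat.AVRO,  # Avro decoding requires fastavro
    ),
    max_stream_count=1,
)

# Streams can be read in parallel; this reads the single requested stream.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
```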
Thanks for the help @plamut!
The doc you linked to for
The Storage API definitely looks promising, as I'm sure we'll run into cases where our queries go beyond the timeout of the queries endpoint. That being said, I think it could still make sense to support the queries endpoint.
This is a tricky one.
We got partway through implementing this in #362 before reverting a few of those PRs. The other problem we encountered is that to do this right, we'd have to cache that first page of results, which was unexpected and undesired for a large class of use cases: #394. That said, I think we could support this ask, but only if it's behind a feature flag. Perhaps we add an api_method option.
@tswast thanks for providing all those links for context. It's clear that there is no one-size-fits-all answer to the problem of optimally retrieving query results. I think adding some sort of explicit flag to use the synchronous endpoint makes sense.

In the meantime, I'm pretty happy with my workaround:

```python
from google.cloud.bigquery.query import _QueryResults


def run_query_synchronously(bq_client, query):
    # Hypothetical wrapper name; the original snippet assumed an enclosing
    # function. POSTs directly to the jobs.query endpoint instead of creating
    # a job and polling it. Note: _connection and _QueryResults are private.
    synchronous_api_response = bq_client._connection.api_request(
        "POST",
        f"/projects/{bq_client.project}/queries",
        data={"query": query, "useLegacySql": False},
    )
    return _QueryResults.from_api_repr(synchronous_api_response).rows
```
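A hypothetical usage of the wrapper above (project ID and query are placeholders):

```python
from google.cloud import bigquery

bq_client = bigquery.Client(project="my-project")  # placeholder project
rows = run_query_synchronously(bq_client, "SELECT 1 AS x")
for row in rows:
    print(row)
```

Note that this leans on private APIs (_connection, _QueryResults), so it may break between library releases.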
@tswast Could you elaborate on how you imagine this addition would look from the user's perspective?

Also asking because, if the existing method is updated, we need to think about what to do with parameters such as timeout and retry.
@plamut I have a sketch of a plan in #362
Very true! I'd still imagine returning a QueryJob. Whether the job is finished or not, we'd return a QueryJob constructed from the query response.
These are not irrelevant. Many (but not all, unfortunately) of the parameters in QueryJobConfig have equivalents in the query request.

As you can see from https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query#queryrequest, the request object to jobs.query supports many of the same options as a job configuration.
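To illustrate the overlap, a hedged sketch of a jobs.query request body, using field names from the linked REST reference; the comments map them to the roughly corresponding QueryJobConfig options, and the query itself is a placeholder:

```python
# A plain dict mirroring the QueryRequest schema accepted by jobs.query.
query_request = {
    "query": "SELECT 1 AS x",            # placeholder query
    "useLegacySql": False,               # QueryJobConfig.use_legacy_sql
    "useQueryCache": True,               # QueryJobConfig.use_query_cache
    "maximumBytesBilled": "1000000000",  # QueryJobConfig.maximum_bytes_billed
    "dryRun": False,                     # QueryJobConfig.dry_run
    "maxResults": 100,                   # size of the first page in the response
    "timeoutMs": 10000,                  # how long jobs.query waits before returning
}
```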
Oh, and the other thing we'll require is the "first page" hack that I initially implemented in RowIterator. We might need some additional hooks there, because we won't have the destination table ID available.
This is 1/2 done. That said, it's missing all the optimizations to make it useful: it still calls getQueryResults to fetch the rows after the initial request.
This issue has been open for three years, and while some effort has been made, there appears to be a significant amount of work remaining. Besides the initial request three years ago, there does not appear to be a large user base requesting this feature. Closing due to workload and other priorities.
#1723 finished up the work planned for this project. With that change, when you run client.query with api_method="QUERY", the first page of rows from the query response is used directly.

Note: it still does a call to the jobs.get API to fetch the job metadata.
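A minimal sketch of the new path, assuming a release that includes #1723; the client project and query are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# api_method="QUERY" issues jobs.query, so the first page of rows arrives
# with the query response instead of a later getQueryResults call.
job = client.query("SELECT 1 AS x", api_method="QUERY")
for row in job.result():
    print(row.x)
```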
@tswast thank you for working on this! I am reading your PR and wanted to clarify the usage, as we're working to reduce the latency between query creation time and serving of results from (typically) single-page queries. Our usage at the moment looks roughly like:

```python
job = client.query(query=sql, job_config=job_config)
return [result for result in job.result(
    start_index=start_index, max_results=limit, retry=retry, timeout=timeout)
]
```

My understanding is we will need to update our client.query call to take advantage of this.
@nickmarx12345678 Yes, to avoid an extra call to getQueryResults, use api_method="QUERY".

Also note: my in-flight PR #1722 optimized this further by introducing a query_and_wait method that can avoid an unnecessary call to get the job metadata as well.
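For illustration, a minimal sketch of the query_and_wait path from PR #1722, assuming a client library version that includes it; the project ID and query are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# query_and_wait returns a RowIterator directly, skipping the separate
# job-metadata fetch when the query finishes within the server-side wait.
rows = client.query_and_wait("SELECT 1 AS x")
for row in rows:
    print(row.x)
```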
Thanks! We have already started testing with the prerelease version and can see the difference in traces. One observation I wanted to share: it appears that in the optimized case, the resulting row iterator may have its total_rows attribute unset, e.g.:

```python
job = client.query(query=sql, job_config=job_config, api_method="QUERY")
result = job.result(start_index=start_index, max_results=limit, retry=retry, timeout=timeout)
rows = [row for row in result]
print(result.total_rows)
# None
```
@nickmarx12345678 I noticed that issue too in https://github.com/googleapis/python-bigquery/pull/1722/files#diff-1172a63bba9d23e3a24e43736ef1275452ef2ef2d4a50d054c209bdd2792fdf5 and have a fix included in that PR. I can pull that fix out into a separate PR so it's not blocked, if you prefer.
#1748 is the fix for total_rows.
Is your feature request related to a problem? Please describe.
We are part of the BI Engine preview, so we're trying to minimize the latency between our Python application and BigQuery. The current approach of creating an asynchronous job and then waiting for it to complete adds significant latency and hurts the value proposition of BI Engine.
Describe the solution you'd like
We'd like a way to utilize the v2/projects/{projectId}/queries endpoint to execute a query and wait for the response.

Describe alternatives you've considered
We could call the REST API directly. The main drawbacks are: 1) handling authentication with a service account ourselves, and 2) the existing Python client does a nice job of creating Row objects with the results. I've also used the snippet shown in my workaround comment above.