
BigQueryInsertJobOperator sometimes fails to acquire impersonated credentials when in deferred mode #38532

Closed
nathadfield opened this issue Mar 27, 2024 · 10 comments · Fixed by #38961
Labels
area:providers · good first issue · kind:bug · provider:google

Comments

@nathadfield
Collaborator

nathadfield commented Mar 27, 2024

Apache Airflow version

main (development)

If "Other Airflow 2 version" selected, which one?

No response

What happened?

Occasionally, a BigQueryInsertJobOperator task can fail when in deferred mode due to an inability to acquire impersonated credentials when checking the job status.
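For context, the affected setup is a deferrable insert job that authenticates via an impersonation chain, roughly like the following sketch (the project, service account, and query are hypothetical placeholders, not taken from this report):

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

my_task = BigQueryInsertJobOperator(
    task_id="my-task",
    configuration={
        "query": {
            "query": "SELECT 1",
            "useLegacySql": False,
        }
    },
    location="EU",
    # While the task is deferred, the trigger refreshes these impersonated
    # credentials on every status poll; that refresh is what fails below.
    impersonation_chain="my-sa@my-project.iam.gserviceaccount.com",  # placeholder
    deferrable=True,
)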

Here is an example of the task log.

[2024-03-26, 05:38:04 GMT] {credentials_provider.py:353} INFO - Getting connection using `google.auth.default()` since no explicit credentials are provided.
[2024-03-26, 05:38:04 GMT] {bigquery.py:62} INFO - Using the connection  google_cloud_default .
[2024-03-26, 05:38:04 GMT] {taskinstance.py:2370} INFO - Pausing task as DEFERRED. dag_id=my-dag, task_id=my-task, execution_date=20240325T000000, start_date=20240326T053802
[2024-03-26, 05:38:04 GMT] {local_task_job_runner.py:231} INFO - Task exited with return code 100 (task deferral)
[2024-03-26, 05:38:07 GMT] {bigquery.py:110} INFO - Bigquery job status is running. Sleeping for 4.0 seconds.
[2024-03-26, 05:38:08 GMT] {gcs_task_handler.py:158} INFO - Error checking for previous log; if exists, may be overwritten: 'NoneType' object has no attribute 'decode'
[2024-03-26, 05:38:12 GMT] {bigquery.py:110} INFO - Bigquery job status is running. Sleeping for 4.0 seconds.
[2024-03-26, 05:38:16 GMT] {bigquery.py:110} INFO - Bigquery job status is running. Sleeping for 4.0 seconds.
[2024-03-26, 05:38:21 GMT] {bigquery.py:110} INFO - Bigquery job status is running. Sleeping for 4.0 seconds.
...
...
[2024-03-26, 06:05:44 GMT] {bigquery.py:110} INFO - Bigquery job status is running. Sleeping for 4.0 seconds.
[2024-03-26, 06:05:48 GMT] {bigquery.py:117} ERROR - Exception occurred while checking for query completion
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/google/cloud/triggers/bigquery.py", line 94, in run
    job_status = await hook.get_job_status(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 3292, in get_job_status
    job = await self._get_job(job_id=job_id, project_id=project_id, location=location)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 3268, in _get_job
    job = await loop.run_in_executor(None, self._get_job_sync, job_id, project_id, location)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 3287, in _get_job_sync
    return hook.get_job(job_id=job_id, project_id=project_id, location=location)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/google/common/hooks/base_google.py", line 485, in inner_wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1546, in get_job
    job = client.get_job(job_id=job_id, project=project_id, location=location)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/bigquery/client.py", line 2107, in get_job
    resource = self._call_api(
               ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/bigquery/client.py", line 827, in _call_api
    return call()
           ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "/usr/local/lib/python3.11/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File "/usr/local/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/_http/__init__.py", line 482, in api_request
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/_http/__init__.py", line 341, in _make_request
    return self._do_request(
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/_http/__init__.py", line 379, in _do_request
    return self.http.request(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/auth/transport/requests.py", line 537, in request
    self.credentials.before_request(auth_request, method, url, request_headers)
  File "/usr/local/lib/python3.11/site-packages/google/auth/credentials.py", line 230, in before_request
    self._blocking_refresh(request)
  File "/usr/local/lib/python3.11/site-packages/google/auth/credentials.py", line 193, in _blocking_refresh
    self.refresh(request)
  File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 250, in refresh
    self._update_token(request)
  File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 282, in _update_token
    self.token, self.expiry = _make_iam_token_request(
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 100, in _make_iam_token_request
    raise exceptions.RefreshError(_REFRESH_ERROR, response_body)

This also returns an exception message such as the following:

 ('Unable to acquire impersonated credentials', '<!DOCTYPE html>\n<html lang=en>\n  <meta charset=utf-8>\n  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">\n  <title>Error 502 (Server Error)!!1</title>\n  <style>\n    {margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px} > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_colo

What you think should happen instead?

A problem with this is that, although the task can enter the retry state, the initial BQ job can still be running, which can have secondary effects, such as the retry failing because it attempts concurrent updates against the same table.

airflow.exceptions.AirflowException: Query error: Transaction is aborted due to concurrent update against table

Ideally, an issue with acquiring the impersonated credentials when checking the job status wouldn't immediately result in the task failing.

How to reproduce

Unfortunately, this is not possible to replicate consistently due to the unpredictable nature of the scenario.

Operating System

PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"

Versions of Apache Airflow Providers

apache-airflow-providers-google==10.16.0

Deployment

Astronomer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@nathadfield added the kind:bug, area:core, needs-triage, provider:google, and area:providers labels and removed area:core on Mar 27, 2024
@eladkal added the good first issue label and removed needs-triage on Mar 28, 2024
@collinmcnulty
Contributor

As a workaround, setting force_rerun=False along with retries should allow a retry to pick up and monitor the original job instead of starting a new one.
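In operator terms, that workaround would look roughly like this (a sketch; the task id and query are placeholders):

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

bq_task = BigQueryInsertJobOperator(
    task_id="my-task",
    configuration={"query": {"query": "SELECT 1", "useLegacySql": False}},
    deferrable=True,
    # With force_rerun=False the operator derives a deterministic job_id,
    # so a retried task instance reattaches to the still-running job
    # instead of submitting a new, potentially conflicting one.
    force_rerun=False,
    retries=3,
)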

@shahar1
Contributor

shahar1 commented Mar 30, 2024

What location do you provide as an input? (checking if it might be related to #37282)

@nathadfield
Collaborator Author

As a workaround, setting force_rerun=False along with retries should allow a retry to pick up and monitor the original job instead of starting a new one.

@collinmcnulty Thanks, but this doesn't occur often enough to warrant that change, and it would likely have its own side effects, because forcing a rerun is what we want in most scenarios.

@nathadfield
Collaborator Author

What location do you provide as an input? (checking if it might be related to #37282)

It's EU, so I don't think it would be.

@dondaum
Contributor

dondaum commented Apr 9, 2024

I may have some time to look at it.

A few questions:

  • What authentication method do you use? Application Default Credentials, a service account key, or a credential configuration file?
  • Are you running other GCP-related asynchronous tasks on the triggerer when you see these exceptions (e.g. multiple BigQuery tasks at the same time)?
  • It seems that the exception is thrown after about 30 minutes. Do these exceptions only occur after some time, or are they random and can also occur within a few minutes?

@nathadfield
Collaborator Author

I may have some time to look at it.

A few questions:

  • What authentication method do you use? Application Default Credentials, a service account key, or a credential configuration file?
  • Are you running other GCP-related asynchronous tasks on the triggerer when you see these exceptions (e.g. multiple BigQuery tasks at the same time)?
  • It seems that the exception is thrown after about 30 minutes. Do these exceptions only occur after some time, or are they random and can also occur within a few minutes?

@dondaum This is using ADC for authentication. Yes, there could be multiple BigQuery tasks running at the same time. I don't think time is a factor; the 30 mins in the log I shared might just be how long that query ran, because I have examples that occur within a couple of minutes.

@dondaum
Contributor

dondaum commented Apr 11, 2024

I tried to reproduce the exact error, but without success, using the following DAG:

import datetime
import os

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow import DAG



# Scripted query that busy-waits in 90-second rounds so each job runs long
# enough to exercise the trigger's polling loop.
WAIT_QUERY = """
DECLARE retry_count INT64;
DECLARE success BOOL;
DECLARE size_bytes INT64;
DECLARE row_count INT64;
DECLARE DELAY_TIME DATETIME;
DECLARE WAIT STRING;
SET retry_count = 2;
SET success = FALSE;


WHILE retry_count <=  3 AND success = FALSE DO
BEGIN
  SET row_count = (with a as (SELECT 1 as b) SELECT * FROM a WHERE 1 = 2);
  IF row_count > 0  THEN
    SELECT 'Table Exists!' as message, retry_count as retries;
    SET success = TRUE;
  ELSE
    SELECT 'Table does not exist' as message, retry_count as retries, row_count;
    SET retry_count = retry_count + 1;
--      WAITFOR DELAY '00:00:10';
    SET WAIT = 'TRUE';
    SET DELAY_TIME = DATETIME_ADD(CURRENT_DATETIME,INTERVAL 90 SECOND);
    WHILE WAIT = 'TRUE' DO
      IF (DELAY_TIME < CURRENT_DATETIME) THEN
         SET WAIT = 'FALSE';
      END IF;
    END WHILE;
  END IF;
END;
END WHILE;
"""


with DAG(
    dag_id=os.path.splitext(os.path.basename(__file__))[0],
    schedule=None,
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
    tags=["testing"],
) as dag:
    
    # Run several deferrable jobs in parallel to increase the chance of
    # hitting the credential-refresh path in the triggerer.
    for i in range(10):
        bq_task = BigQueryInsertJobOperator(
            task_id=f"debug_query_{i}",
            configuration={
                "query": {
                    "query": WAIT_QUERY,
                    "useLegacySql": False,
                    "priority": "BATCH",
                }
            },
            location="europe-west3",
            deferrable=True,
        )

Also, I set the retry option in the GCP connection to 0 so as not to implicitly retry on failure.

Could you perhaps create a DAG that reproduces the error? And maybe you could also check which apache-airflow-providers-google version you are using?

My setup:

Apache Airflow
version                | 2.7.3                                                 
executor               | LocalExecutor                                         
task_logging_handler   | airflow.utils.log.file_task_handler.FileTaskHandler   
sql_alchemy_conn       | postgresql+psycopg2://airflow:airflow@postgres/airflow
dags_folder            | /opt/airflow/dags                                     
plugins_folder         | /opt/airflow/plugins                                  
base_log_folder        | /opt/airflow/logs                                     
remote_base_log_folder |                                                       
                                                                               

System info
OS              | Linux                                                                                                                                                         
architecture    | x86_64                                                                                                                                                        
uname           | uname_result(system='Linux', node='42d8cf034cfc', release='5.10.16.3-microsoft-standard-WSL2', version='#1 SMP Fri Apr 2 22:23:49 UTC 2021', machine='x86_64')
locale          | ('en_US', 'UTF-8')                                                                                                                                            
python_version  | 3.11.6 (main, Nov  1 2023, 14:02:22) [GCC 10.2.1 20210110]                                                                                                    
python_location | /usr/local/bin/python                                                                                                                                         
                                                                                                                                                                                

Tools info
git             | NOT AVAILABLE                                                                              
ssh             | OpenSSH_8.4p1 Debian-5+deb11u2, OpenSSL 1.1.1w  11 Sep 2023                                
kubectl         | NOT AVAILABLE                                                                              
gcloud          | NOT AVAILABLE                                                                              
cloud_sql_proxy | NOT AVAILABLE                                                                              
mysql           | mysql  Ver 8.0.35 for Linux on x86_64 (MySQL Community Server - GPL)                       
sqlite3         | 3.34.1 2021-01-20 14:10:07 10e20c0b43500cfb9bbc0eaa061c57514f715d87238f4d835880cd846b9ealt1
psql            | psql (PostgreSQL) 16.0 (Debian 16.0-1.pgdg110+1)                                           
                                                                                                             

Paths info
airflow_home    | /opt/airflow                                                                                                                                                            
system_path     | /root/bin:/home/airflow/.local/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin                                                          
python_path     | /home/airflow/.local/bin:/usr/local/lib/python311.zip:/usr/local/lib/python3.11:/usr/local/lib/python3.11/lib-dynload:/home/airflow/.local/lib/python3.11/site-packages:
                | /usr/local/lib/python3.11/site-packages:/opt/airflow/dags:/opt/airflow/config:/opt/airflow/plugins                                                                      
airflow_on_path | True                                                                                                                                                                    
                                                                                                                                                                                          

Providers info
apache-airflow-providers-amazon          | 8.10.0 
apache-airflow-providers-apache-beam     | 5.6.2  
apache-airflow-providers-celery          | 3.6.0  
apache-airflow-providers-cncf-kubernetes | 8.0.1  
apache-airflow-providers-common-sql      | 1.11.1 
apache-airflow-providers-daskexecutor    | 1.1.0  
apache-airflow-providers-dbt-cloud       | 3.7.0  
apache-airflow-providers-docker          | 3.8.0  
apache-airflow-providers-elasticsearch   | 5.1.0  
apache-airflow-providers-ftp             | 3.7.0  
apache-airflow-providers-google          | 10.16.0
apache-airflow-providers-grpc            | 3.3.0  
apache-airflow-providers-hashicorp       | 3.6.4  
apache-airflow-providers-http            | 4.10.0 
apache-airflow-providers-imap            | 3.5.0  
apache-airflow-providers-microsoft-azure | 8.1.0  
apache-airflow-providers-mysql           | 5.5.4  
apache-airflow-providers-odbc            | 4.1.0  
apache-airflow-providers-openlineage     | 1.2.0  
apache-airflow-providers-postgres        | 5.10.2 
apache-airflow-providers-redis           | 3.4.0  
apache-airflow-providers-sendgrid        | 3.4.0  
apache-airflow-providers-sftp            | 4.9.0  
apache-airflow-providers-slack           | 8.3.0  
apache-airflow-providers-snowflake       | 5.1.0  
apache-airflow-providers-sqlite          | 3.7.1  
apache-airflow-providers-ssh             | 3.10.1 

@nathadfield
Collaborator Author

@dondaum Thanks but, as I mentioned, it isn't possible to replicate this consistently. The error returned is an HTTP 502 from Google, meaning the problem was ultimately on their side when the trigger tried to obtain impersonated credentials in order to check the status of a BigQuery job. It doesn't have anything to do with query duration or whether the table exists.

Perhaps it is possible to simulate the exception that is received by the trigger though?

 ('Unable to acquire impersonated credentials', '<!DOCTYPE html>\n<html lang=en>\n  <meta charset=utf-8>\n  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">\n  <title>Error 502 (Server Error)!!1</title>\n  <style>\n    {margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px} > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_colo
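One way might be a unit test that makes the hook raise the same error the IAM endpoint produced, along these lines (a sketch assuming pytest-asyncio; the trigger's constructor arguments may differ between provider versions):

from unittest import mock

import pytest
from google.auth.exceptions import RefreshError

from airflow.providers.google.cloud.triggers.bigquery import BigQueryInsertJobTrigger


@pytest.mark.asyncio
async def test_trigger_on_refresh_error():
    trigger = BigQueryInsertJobTrigger(
        conn_id="google_cloud_default",
        job_id="hypothetical_job_id",  # placeholder
        project_id="my-project",  # placeholder
        location="EU",
        poll_interval=4.0,
    )
    # Fail every status poll the way a 502 from the IAM token endpoint does.
    with mock.patch(
        "airflow.providers.google.cloud.hooks.bigquery.BigQueryAsyncHook.get_job_status",
        side_effect=RefreshError("Unable to acquire impersonated credentials", "502"),
    ):
        # Without retry handling, the trigger yields an error event on the
        # first poll instead of treating the failure as transient.
        event = await trigger.run().asend(None)
        assert event.payload["status"] == "error"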

Oh, and for clarity on the Google provider.

$ pip freeze | grep apache-airflow-providers-google
apache-airflow-providers-google==10.16.0

@dondaum
Contributor

dondaum commented Apr 12, 2024

@nathadfield Thanks. I think I got it now.

I worked on a change that adds a retry in such cases. Can you perhaps have a look and check?
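Independent of the actual patch, the general shape of the idea is to treat RefreshError as transient when polling, e.g. (a sketch using tenacity; the helper name is illustrative):

from google.auth.exceptions import RefreshError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(RefreshError),
    wait=wait_exponential(multiplier=1, max=30),
    stop=stop_after_attempt(5),
    reraise=True,
)
def _get_job_with_retry(hook, job_id, project_id, location):
    # hook is a BigQueryHook. A transient 502 from the IAM token endpoint
    # surfaces as a RefreshError; back off and retry instead of failing
    # the task on the first occurrence.
    return hook.get_job(job_id=job_id, project_id=project_id, location=location)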

@nathadfield
Collaborator Author

@dondaum Looks reasonable from what I can tell. I'd suggest trying to get some eyes on it from the committers.
