BigQueryInsertJobOperator sometimes fails to acquire impersonated credentials when in deferred mode #38532
Comments
@collinmcnulty Thanks but this doesn't occur often enough to warrant this change and would likely have its own side effects, because forcing a rerun is what we want to do in most scenarios.
It's EU, so I don't think it would be.
Maybe I have some time to look at it. A few questions:
@dondaum This will be using ADC as authentication. Yes, there could be multiple BigQuery tasks running at the same time. I don't think that the time is a factor. The 30 mins in the log I gave might just be how long the query ran for, because I have examples that occur in the space of a couple of minutes.
I tried to reproduce the exact error but with no success. I tried to reproduce it with the following DAG:

```python
import datetime
import os

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

WAIT_QUERY = """
DECLARE retry_count INT64;
DECLARE success BOOL;
DECLARE size_bytes INT64;
DECLARE row_count INT64;
DECLARE DELAY_TIME DATETIME;
DECLARE WAIT STRING;

SET retry_count = 2;
SET success = FALSE;

WHILE retry_count <= 3 AND success = FALSE DO
  BEGIN
    SET row_count = (WITH a AS (SELECT 1 AS b) SELECT * FROM a WHERE 1 = 2);
    IF row_count > 0 THEN
      SELECT 'Table Exists!' AS message, retry_count AS retries;
      SET success = TRUE;
    ELSE
      SELECT 'Table does not exist' AS message, retry_count AS retries, row_count;
      SET retry_count = retry_count + 1;
      -- WAITFOR DELAY '00:00:10';
      SET WAIT = 'TRUE';
      SET DELAY_TIME = DATETIME_ADD(CURRENT_DATETIME, INTERVAL 90 SECOND);
      WHILE WAIT = 'TRUE' DO
        IF (DELAY_TIME < CURRENT_DATETIME) THEN
          SET WAIT = 'FALSE';
        END IF;
      END WHILE;
    END IF;
  END;
END WHILE;
"""

with DAG(
    dag_id=os.path.splitext(os.path.basename(__file__))[0],
    schedule=None,
    start_date=datetime.datetime(2024, 1, 1),
    catchup=False,
    tags=["testing"],
) as dag:
    for i in range(10):
        bq_task = BigQueryInsertJobOperator(
            task_id=f"debug_query_{i}",
            configuration={
                "query": {
                    "query": WAIT_QUERY,
                    "useLegacySql": False,
                    "priority": "BATCH",
                }
            },
            location="europe-west3",
            deferrable=True,
        )
```

Also, I set the retry option in the GCP connection to 0 so as not to implicitly retry on failure. Could you perhaps create a DAG that reproduces the error? And could you also check which apache-airflow-providers-google version you are using? My setup:
@dondaum Thanks but, as I mentioned, it isn't possible to replicate consistently. The error that is returned is an HTTP 502 from Google, which means the problem was ultimately on their side when the trigger was trying to obtain impersonated credentials in order to check the status of a BigQuery job. It doesn't have anything to do with query times or whether the table exists or not. Perhaps it is possible to simulate the exception that is received by the trigger, though?
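A minimal sketch of how such a transient failure could be simulated in a test, without calling any Google APIs. Everything here is hypothetical scaffolding, not the provider's actual API: `TransportError` stands in for the real credential-refresh exception, and `FlakyCredentials` / `check_job_status` are invented names for illustration.

```python
class TransportError(Exception):
    """Stand-in for the transient error raised during credential refresh."""


class FlakyCredentials:
    """Fake credentials object that fails the first N refresh attempts."""

    def __init__(self, failures=1):
        self.failures = failures
        self.token = None

    def refresh(self, request=None):
        if self.failures > 0:
            self.failures -= 1
            raise TransportError("502 Server Error: Bad Gateway")
        self.token = "fake-token"


def check_job_status(credentials, retries=0):
    """Refresh credentials before polling the job; optionally retry on transient errors."""
    for attempt in range(retries + 1):
        try:
            credentials.refresh()
            return "DONE"
        except TransportError:
            if attempt == retries:
                raise
```

With `retries=0` the simulated 502 propagates, matching the failure seen in the issue; with `retries=1` the second refresh succeeds and the status check completes.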
Oh, and for clarity on the Google provider.
@nathadfield Thanks. I think I got it now. I worked on a change that adds a retry in such cases. Could you perhaps have a look and check?
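For context, the kind of change being discussed amounts to wrapping the credential acquisition in a retry with backoff. Below is an illustrative sketch of such a helper, not the actual provider code; `call_with_retries` and its parameters are invented for this example.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, retryable=(Exception,)):
    """Call fn, retrying on retryable exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise
            # back off 1s, 2s, 4s, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A transient 502 during credential refresh would then be absorbed as long as one of the later attempts succeeds, instead of immediately failing the deferred task.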
@dondaum Looks reasonable from what I can tell. I'd suggest trying to get some eyes on it from the committers.
Apache Airflow version
main (development)
If "Other Airflow 2 version" selected, which one?
No response
What happened?
Occasionally, a BigQueryInsertJobOperator task can fail when in deferred mode due to an inability to acquire impersonated credentials when checking the job status. Here is an example of the task log.
This also returns an exception error such as the following:
What you think should happen instead?
A problem with this is that, although the task can enter the retry state, the initial BQ job can still be running, which can have secondary effects, such as the retry failing because it tries to perform concurrent updates on the same table.
Ideally, an issue with acquiring the impersonated credentials when checking the job status wouldn't immediately result in the task failing.
How to reproduce
Unfortunately, this is not possible to replicate consistently due to the unpredictable nature of the scenario.
Operating System
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/"
Versions of Apache Airflow Providers
apache-airflow-providers-google==10.16.0
Deployment
Astronomer
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct