It's not unheard of for requests to the GCE metadata server to fail with a transient error, like a 503, 500, or 429. These requests can and should be retried, and in certain code paths they are. However, one very important code path, the credentials.refresh(...) method for credentials taken from the GCE metadata server, does not retry them.
Environment details
OS: Debian 11
Python version: 3.9.2
pip version: 20.3.4
google-auth version: 2.32.0
Steps to reproduce
Set up an HTTP proxy on http://localhost:8080/ to inject transient 429 errors. Here is an example that will make every other request to a /token endpoint fail with status code 429.
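The proxy script itself did not survive in this copy of the report. As a stand-in, here is a minimal sketch of such a proxy using only the Python standard library. It is not the original script: the names EveryOtherTokenFailure, FlakyProxyHandler, and flaky_proxy.py are made up for illustration, and it only handles plain-HTTP GET requests, which is all the metadata calls in the log below need.

```python
import sys
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EveryOtherTokenFailure:
    """Hypothetical policy: fail every other request whose path ends in /token."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def should_fail(self, url):
        path = url.split("?", 1)[0]
        if not path.endswith("/token"):
            return False
        with self._lock:
            self._count += 1
            # Fail the 1st, 3rd, 5th, ... /token request.
            return self._count % 2 == 1


POLICY = EveryOtherTokenFailure()
# Headers that should not be passed through to the upstream server.
SKIPPED_HEADERS = ("proxy-connection", "connection", "accept-encoding")


class FlakyProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A forward proxy receives the absolute URL in the request line.
        url = self.path
        if POLICY.should_fail(url):
            self._reply(429, b"Too many requests\n", {})
            return
        headers = {k: v for k, v in self.headers.items()
                   if k.lower() not in SKIPPED_HEADERS}
        upstream = urllib.request.Request(url, headers=headers)
        # An empty ProxyHandler keeps urllib from looping back through this proxy.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        try:
            with opener.open(upstream, timeout=10) as resp:
                self._reply(resp.status, resp.read(), resp.headers)
        except urllib.error.HTTPError as err:
            # Pass real upstream error statuses through unchanged.
            self._reply(err.code, err.read(), err.headers)
        except OSError as err:
            self._reply(502, str(err).encode() + b"\n", {})

    def _reply(self, status, body, headers):
        self.send_response(status)
        for name in ("Content-Type", "Metadata-Flavor"):
            if headers.get(name):
                self.send_header(name, headers[name])
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main(port):
    ThreadingHTTPServer(("localhost", port), FlakyProxyHandler).serve_forever()


# Only start serving when a port is given, e.g.: python3 flaky_proxy.py 8080
if __name__ == "__main__" and len(sys.argv) == 2:
    main(int(sys.argv[1]))
```

Started with python3 flaky_proxy.py 8080, this should listen on the address used in the steps here. The failure decision is kept in a small class so it can be swapped for a different pattern (for example, failing with 503 instead).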
Run while true ; do http_proxy=http://localhost:8080/ python3 getcreds.py; sleep 1; done
The first or second request should fail. Here is some example output:
Checking None for explicit credentials as part of auth process...
Checking Cloud SDK credentials as part of auth process...
Cloud SDK credentials not found on disk; not using them
Making request: GET http://169.254.169.254
Making request: GET http://metadata.google.internal/computeMetadata/v1/project/project-id
Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/REDACTED/token
Traceback (most recent call last):
  File "/home/REDACTED/.local/lib/python3.9/site-packages/google/auth/compute_engine/credentials.py", line 127, in refresh
    self.token, self.expiry = _metadata.get_service_account_token(
  File "/home/REDACTED/.local/lib/python3.9/site-packages/google/auth/compute_engine/_metadata.py", line 351, in get_service_account_token
    token_json = get(request, path, params=params, headers=metrics_header)
  File "/home/REDACTED/.local/lib/python3.9/site-packages/google/auth/compute_engine/_metadata.py", line 243, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/REDACTED/token from the Google Compute Engine metadata service. Status: 429 Response:\nb'Too many requests\\n'", <google.auth.transport.requests._Response object at 0x7fe0a13849d0>)
Note how only 1 attempt is logged for the /token endpoint.
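We do retry on certain types of errors (google-auth-library-python/google/auth/compute_engine/_metadata.py, lines 199 to 212 in d2ab3af).
But in the scenario I'm complaining about, the HTTP request completes successfully and the response code indicates a transient error, so we hit the code path at lines 235 to 242 of the same file instead.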
The following patch to google/auth/compute_engine/_metadata.py "fixes" the reproduction:
--- _metadata.py.orig 2024-07-25 01:51:12.567167923 +0000
+++ _metadata.py 2024-07-25 01:51:57.026072333 +0000
@@ -28,6 +28,7 @@
 from google.auth import environment_vars
 from google.auth import exceptions
 from google.auth import metrics
+from google.auth import transport
 
 _LOGGER = logging.getLogger(__name__)
 
@@ -202,7 +203,11 @@
     while retries < retry_count:
         try:
             response = request(url=url, method="GET", headers=headers_to_use)
-            break
+            if response.status in transport.DEFAULT_RETRYABLE_STATUS_CODES:
+                retries += 1
+                continue
+            else:
+                break
 
         except exceptions.TransportError as e:
             _LOGGER.warning(
I put "fixes" in quotes because in a real failure, a transient error is likely caused by the GCE metadata server or one of its dependencies being overwhelmed, and some degree of exponential backoff should be used. The existing logic makes sense for a timeout, because some time has already been spent waiting.
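To make the backoff point concrete, here is a rough sketch of a retry loop that waits between attempts when the status code looks transient. It is not google-auth code: get_with_backoff, its parameters, and the RETRYABLE_STATUS_CODES set (just the three codes named in this report) are assumptions for illustration, and send stands in for whatever performs the HTTP request.

```python
import random
import time

# Assumed set of transient statuses: the codes named in this report.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 503})


def get_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=8.0,
                     sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is any zero-argument callable returning an object with a .status
    attribute. max_attempts is expected to be at least 1.
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status not in RETRYABLE_STATUS_CODES:
            return response
        if attempt + 1 < max_attempts:
            # Full jitter: wait a random time below an exponentially growing cap.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
    # Out of attempts: hand back the last (still failing) response.
    return response
```

The sleep and rng parameters exist only so the delays can be replaced in tests. Randomizing the wait (jitter) is a common way to keep many clients from retrying in lockstep against an already overloaded server.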
A separate but related request would be for the RefreshError raised to have the retryable property set appropriately, so library users can decide what to do on transient failures.
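For illustration, this is roughly what a caller could do if that property were set reliably. The sketch assumes the property is readable as exc.retryable on the raised exception; refresh_with_policy and StubRefreshError are hypothetical names, and the stub stands in for the real RefreshError so the example does not depend on google-auth being installed.

```python
class StubRefreshError(Exception):
    """Hypothetical stand-in for RefreshError carrying a retryable flag."""

    def __init__(self, message, retryable=False):
        super().__init__(message)
        self.retryable = retryable


def refresh_with_policy(refresh, max_attempts=3):
    """Call refresh(); retry only when the raised exception reports retryable=True."""
    for attempt in range(1, max_attempts + 1):
        try:
            return refresh()
        except Exception as exc:
            # Exceptions without a retryable attribute are treated as permanent.
            if not getattr(exc, "retryable", False) or attempt == max_attempts:
                raise
```

In real use, refresh would be something like a lambda wrapping credentials.refresh(request), and a delay between attempts would be sensible for the reasons given above.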