Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Over eager redirect caching in S3RegionRedirector? #1708

Closed
gardenia opened this issue Apr 4, 2019 · 3 comments
Closed

Over eager redirect caching in S3RegionRedirector? #1708

gardenia opened this issue Apr 4, 2019 · 3 comments

Comments

@gardenia
Copy link

gardenia commented Apr 4, 2019

Hi,

I have written a proxy between boto and S3. One of the features of the proxy is to send redirects (status 307) back to the client so they can go direct and offload large payloads from the proxy. examples of which are,. GETs on large files or PUTs of multipart upload parts.

NOTE: for reasons I won't go into: I am using the --endpoint feature to direct traffic to the proxy (rather than HTTP_PROXY).

The specific problem I'm having is that in the multipart case I have implemented the proxy to only send redirects for the PUTs of the individual parts but NOT the POST for the completion step (which I want the proxy to see for some stats reporting).

The problem I'm having with boto is due to the caching in S3RegionRedirector (botocore/utils.py) when the POST request comes for the multipart completion step boto doesn't send it to the proxy at all but uses cached knowledge to say that the associated bucket has a different endpoint and therefore bypasses the proxy endpoint.

Here is the relevant section of debug trace:

2019-04-04 14:32:02,750 - ThreadPoolExecutor-0_5 - s3transfer.tasks - DEBUG - Executing task CompleteMultipartUploadTask(transfer_id=0, {'bucket': 'redacted-01', 'key': '20MB.zip', 'extra_args': {}}) with kwargs {'client': <botocore.client.S3 object at 0x7f96fde987f0>, 'bucket': 'redacted-01', 'key': '20MB.zip', 'extra_args': {}, 'upload_id': 'BBBNeXssQ8Ic1A4MJgHTWJ6CDmNH2ccBL.LuDDcxvRK_pjV.SYfg_plrN7JAQaneO1xal_pPgQMBHDyueq8JUrGnOWUsqHc89_PU5f1LJ7nQYeEaIKCmqBd.tDJ4dB8d', 'parts': [{'ETag': '"1e75c940623dff7d67d1c37cf82921dd"', 'PartNumber': 1}, {'ETag': '"02f26c2176578d80975de57d73e95f5a"', 'PartNumber': 2}, {'ETag': '"1e75c940623dff7d67d1c37cf82921dd"', 'PartNumber': 3}, {'ETag': '"02f26c2176578d80975de57d73e95f5a"', 'PartNumber': 4}]}
2019-04-04 14:32:02,751 - ThreadPoolExecutor-0_5 - botocore.hooks - DEBUG - Event before-parameter-build.s3.CompleteMultipartUpload: calling handler <function validate_bucket_name at 0x7f96ff20f730>
2019-04-04 14:32:02,751 - ThreadPoolExecutor-0_5 - botocore.hooks - DEBUG - Event before-parameter-build.s3.CompleteMultipartUpload: calling handler <bound method S3RegionRedirector.redirect_from_cache of <botocore.utils.S3RegionRedirector object at 0x7f96fde98c88>>
2019-04-04 14:32:02,751 - ThreadPoolExecutor-0_5 - botocore.hooks - DEBUG - Event before-parameter-build.s3.CompleteMultipartUpload: calling handler <function generate_idempotent_uuid at 0x7f96ff20f378>
2019-04-04 14:32:02,753 - ThreadPoolExecutor-0_5 - botocore.hooks - DEBUG - Event before-call.s3.CompleteMultipartUpload: calling handler <function add_expect_header at 0x7f96ff20fbf8>
2019-04-04 14:32:02,753 - ThreadPoolExecutor-0_5 - botocore.hooks - DEBUG - Event before-call.s3.CompleteMultipartUpload: calling handler <bound method S3RegionRedirector.set_request_url of <botocore.utils.S3RegionRedirector object at 0x7f96fde98c88>>
2019-04-04 14:32:02,754 - ThreadPoolExecutor-0_5 - botocore.utils - DEBUG - Updating URI from http://localhost:8081/redacted-01/20MB.zip?uploadId=BBBNeXssQ8Ic1A4MJgHTWJ6CDmNH2ccBL.LuDDcxvRK_pjV.SYfg_plrN7JAQaneO1xal_pPgQMBHDyueq8JUrGnOWUsqHc89_PU5f1LJ7nQYeEaIKCmqBd.tDJ4dB8d to http://s3.eu-west-1.amazonaws.com/redacted-01/20MB.zip?uploadId=BBBNeXssQ8Ic1A4MJgHTWJ6CDmNH2ccBL.LuDDcxvRK_pjV.SYfg_plrN7JAQaneO1xal_pPgQMBHDyueq8JUrGnOWUsqHc89_PU5f1LJ7nQYeEaIKCmqBd.tDJ4dB8d

in S3RegionRedirector.redirect_from_cache it finds a cached endpoint remapping for the bucket in question. This cached remapping was created as a result of S3RegionRedirector.redirect_from_error processing the 307 redirect from the proxy for one of the earlier PUT responses .

My question is: is this a bug? I assume the point of the redirect caching is the case where people use the wrong region in their bucket request and S3 sends a 301 redirect to the correct region then this is work caching. However, in the specific case I have outlined it seems potentially not correct for it to use knowledge cached from a 307 redirect from a multipart PUT and apply that cached result to a later POST on the same bucket.

Shouldn't the bucket caching be 301 based only? Do server implementations even send 302 or 307 responses?

I have implemented a local workaround which solves the problem for me:

diff -Bru venv.orig/lib/python3.6/site-packages/botocore/utils.py venv/lib/python3.6/site-packages/botocore/utils.py
--- venv.orig/lib/python3.6/site-packages/botocore/utils.py	2019-04-04 14:56:41.033490779 +0100
+++ venv/lib/python3.6/site-packages/botocore/utils.py	2019-04-04 14:58:19.061330741 +0100
@@ -1167,7 +1167,8 @@
         }
         request_dict['context']['signing'] = signing_context
 
-        self._cache[bucket] = signing_context
+        if response[0].status_code not in [302, 307]:
+            self._cache[bucket] = signing_context
         self.set_request_url(request_dict, request_dict['context'])
 
         request_dict['context']['s3_redirected'] = True

Any feedback appreciated. Thanks.

@kyleknap
Copy link
Contributor

kyleknap commented Apr 5, 2019

I'm a little hesitant in changing the logic in the redirector unless there is an issue with the current implementation when communicating directly to S3. Region redirection logic is fairly complicated and could break a good amount of people if we miss an edge case in changing it. What's your thoughts instead of introducing a configuration option to disable region redirection altogether?

@gardenia
Copy link
Author

gardenia commented Apr 7, 2019

The difficulty with a clientside switch is it requires users to change their workflow. Also, maybe the caching is useful sometimes so it seems like it could be more granular somehow.

What about if there was a way to indicate on the redirected response that it should not be cached. e.g. by disabling region redirection if the redirect has a "cache-control: no-cache" on it or some such?

Thanks.

@github-actions
Copy link

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. Because it has been longer than one year since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment to prevent automatic closure, or if the issue is already closed, please feel free to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants