This is the Cloudflare Workers proxy component of Gargantuan Takeout Rocket (GTR), a toolkit to quickly back up Google Takeout archives to Azure Storage at extremely high speeds and low cost.
This proxy is required because:
- Microsoft's Azure Storage cannot download directly from the download URLs used by Google Takeout, due to a URL-escaping issue: Azure "helpfully" mangles the escaped characters in Google's URLs. 3xx redirects are not accepted either.
- To transfer fast, the extension tells Cloudflare Workers to fetch from Google in 1000MB chunks, with nearly 50 connections open at a time for a 50GB file, and to put the data onto Azure as chunks. Unfortunately, a web browser talking directly to Azure is limited to 6 connections, and thus 6 requests at a time, because Azure Storage's endpoints only support HTTP/1.1.
Cloudflare Workers can be used to address these issues:
- Downloading of the offending URLs is offloaded to Cloudflare. The Takeout URL's escaped characters are specially encoded and then decoded back into the real URL inside Cloudflare, which circumvents Azure's mangling of Google's URLs in its "server-to-server" download capability. Cloudflare charges nothing for ingress or egress and there is little to no worker CPU usage, so the bandwidth for this proxying is pretty much free.
- Cloudflare Workers are accessed over HTTP/3 or HTTP/2, where web browsers multiplex requests over a single connection and are not bound by the browser's 6-connection limit. This effectively converts Azure's HTTP/1.1 endpoint to HTTP/3 or HTTP/2, so the GTR extension in the browser can command Azure to download more chunks simultaneously through the proxy. Speeds of up to around 8.7GB/s can be achieved with this proxy from the browser, versus 180MB/s with a direct connection to Azure's endpoint. For reliability reasons, this is limited to 1.0GB/s, but that is still fairly fast.
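To make the chunking above concrete, here is a minimal sketch of how a 50GB file could be split into 1000MB byte ranges. The function name `planChunks`, the `ByteRange` type, and the decimal MB/GB convention are assumptions for illustration and are not taken from the GTR source.

```typescript
// Sketch only: names and the decimal MB/GB convention are assumptions.
const MB = 1000 * 1000;
const GB = 1000 * MB;

interface ByteRange {
  start: number; // first byte of the chunk, inclusive
  end: number; // last byte of the chunk, inclusive
}

// Split a file of totalBytes into consecutive ranges of at most chunkBytes.
function planChunks(totalBytes: number, chunkBytes: number): ByteRange[] {
  if (totalBytes < 0 || chunkBytes <= 0) {
    throw new RangeError("totalBytes must be >= 0 and chunkBytes must be > 0");
  }
  const ranges: ByteRange[] = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    // The last chunk may be shorter than chunkBytes.
    ranges.push({ start, end: Math.min(start + chunkBytes, totalBytes) - 1 });
  }
  return ranges;
}

const chunks = planChunks(50 * GB, 1000 * MB);
console.log(chunks.length); // 50
```

Each of these 50 ranges corresponds to one request. Over HTTP/1.1 a browser would work through them 6 at a time, while over HTTP/2 or HTTP/3 they can all be multiplexed on a single connection to the proxy.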
A public instance of this service is provided, but you may want to run your own private instance of this proxy for privacy reasons. If so, here is the source.
In general, you are expected to use the Gargantuan Takeout Rocket (GTR) extension with this.
A public instance is hosted at https://gtr-proxy.677472.xyz that anybody may use with GTR. The front page of https://gtr-proxy.677472.xyz just redirects to the GitHub repository for the proxy. The 677472.xyz domain (67=g, 74=t, and 72=r in hexadecimal ASCII) was chosen because numeric-only .xyz domains cost $0.75 per year and I wanted the bandwidth metrics for my personal site separated from this service.
You are welcome to use the public instance for any load. You should mind the privacy policy though.
Logs are not stored on this service but I reserve the right to stream the logs temporarily to observe and curb abuse if necessary.
You may be interested in running your own private instance so your data does not go through my public proxy.
Before setting up a private instance of this proxy for your actual sensitive, non-public Takeout data, please get familiar with the GTR toolkit on the public instance first. Create a Google Takeout of a small amount of non-sensitive or already public data from your Google account, and use that archive to test the public instance.
Use this easy-to-use button:
Out of the box, you should be able to use your workers.dev domain.
Updates to this proxy may or may not be required in the future. If so, simply delete the old repository and old worker and redeploy.
The proxy should be usable within the free tier limits of Cloudflare Workers at a personal scale.
A real Google Takeout URL would look like this:
- Get your original SAS URL from Azure and append a blob name to it in the path. For our example, we'll use this: https://urlcopytest.blob.core.windows.net/some-container/data.dat?sp=r&st=2022-04-02T18:23:20Z&se=2022-04-03T06:24:20Z&spr=https&sv=2020-08-04&sr=c&sig=KNz4a1xHnmfi7afzrnkBFtls52YIZ0xtzn1Y7udqXBw%3D
- The account name is `urlcopytest`. Construct a new proxified URL as such: https://gtr-proxy.677472.xyz/p-azb/urlcopytest/some-container/data.dat?sp=r&st=2022-04-02T18:23:20Z&se=2022-04-03T06:24:20Z&spr=https&sv=2020-08-04&sr=c&sig=KNz4a1xHnmfi7afzrnkBFtls52YIZ0xtzn1Y7udqXBw%3D
- Construct a proxified Google Takeout URL.
  - Replace all "%2F" with "%252F".
  - Remove the scheme and prepend the proxy URL of `https://gtr-proxy.677472.xyz/p/`.
- Perform any `PUT` operations you wish through the proxified Azure URL, with an `x-ms-copy-source` header whose value is the proxified Google Takeout URL. That value will survive traversing Azure and hit the proxy, where it will be converted back to the original Takeout URL.
- You can observe in the Network tab that the endpoint of the proxy is HTTP/3 after the initial connection. This has much higher limits for simultaneous connections than HTTP/1.1.
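The two URL rewrites in the steps above can be sketched as follows. This is an illustration of the documented steps only: the function names `proxifyAzureUrl` and `proxifyTakeoutUrl` and the `PROXY_BASE` constant are hypothetical and are not taken from the GTR extension's source.

```typescript
// Hypothetical constant: base URL of the proxy instance to use.
const PROXY_BASE = "https://gtr-proxy.677472.xyz";

// Rewrite https://<account>.blob.core.windows.net/<path>?<sas>
// into    <proxy>/p-azb/<account>/<path>?<sas>
function proxifyAzureUrl(sasUrl: string, proxyBase: string = PROXY_BASE): string {
  const match = /^https:\/\/([^./]+)\.blob\.core\.windows\.net(\/.*)$/.exec(sasUrl);
  if (match === null) {
    throw new Error("not an Azure Blob Storage URL: " + sasUrl);
  }
  const [, account, pathAndQuery] = match;
  return `${proxyBase}/p-azb/${account}${pathAndQuery}`;
}

// Replace every "%2F" with "%252F", drop the scheme, and prepend <proxy>/p/.
// Only the uppercase "%2F" form named in the steps is handled here.
function proxifyTakeoutUrl(sourceUrl: string, proxyBase: string = PROXY_BASE): string {
  const escaped = sourceUrl.replace(/%2F/g, "%252F");
  const withoutScheme = escaped.replace(/^https?:\/\//, "");
  return `${proxyBase}/p/${withoutScheme}`;
}

console.log(proxifyTakeoutUrl("https://gtr-test.677472.xyz/200MB.zip"));
// https://gtr-proxy.677472.xyz/p/gtr-test.677472.xyz/200MB.zip
```

The SAS query string is passed through untouched, so the signature in `sig` stays valid for the underlying Azure endpoint.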
The example URL has expired, but you can use the above steps to construct your own.
You can try an alternative URL that is not expired:
https://gtr-test.677472.xyz/200MB.zip
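As a rough sketch of the `PUT` step, the request for one chunk could be assembled as below. It assumes the operation is Azure's Put Block From URL (`comp=block` with a `blockid`, plus the `x-ms-copy-source` and `x-ms-source-range` headers) and that the proxy forwards the query string and headers unchanged. The `buildPutBlockFromUrl` helper and the `CopyRequest` type are hypothetical names, and the block ID is expected to be base64-encoded already.

```typescript
// Hypothetical description of one request; pass it to fetch() to send it.
interface CopyRequest {
  method: "PUT";
  url: string;
  headers: Record<string, string>;
}

function buildPutBlockFromUrl(
  proxiedBlobUrl: string, // proxified Azure SAS URL (.../p-azb/<account>/...)
  proxiedSourceUrl: string, // proxified source URL (.../p/<host>/...)
  blockId: string, // base64-encoded block ID
  start: number, // first byte to copy, inclusive
  end: number, // last byte to copy, inclusive
): CopyRequest {
  // The SAS URL normally already carries a query string.
  const separator = proxiedBlobUrl.includes("?") ? "&" : "?";
  return {
    method: "PUT",
    url: `${proxiedBlobUrl}${separator}comp=block&blockid=${encodeURIComponent(blockId)}`,
    headers: {
      "x-ms-copy-source": proxiedSourceUrl,
      "x-ms-source-range": `bytes=${start}-${end}`,
    },
  };
}

const request = buildPutBlockFromUrl(
  "https://gtr-proxy.677472.xyz/p-azb/urlcopytest/some-container/data.dat?sp=r&sig=EXAMPLE",
  "https://gtr-proxy.677472.xyz/p/gtr-test.677472.xyz/200MB.zip",
  "QUFBQQ==",
  0,
  999,
);
console.log(request.url);
```

The descriptor can then be sent with `fetch(request.url, { method: request.method, headers: request.headers })`. The SAS token in this usage example is a placeholder and will not authenticate.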
For anti-abuse reasons, the service is limited to test servers and to Google Takeout download URLs, which are the URLs affected by the aforementioned pathing issue. Unrestricted open proxies on the internet may be abused.
- One of the following must be true:
  - The source URL is a test URL from `*-3vngqvvpoq-uc.a.run.app`, which can respond with paths that can cause issues for Azure direct downloads. The source for this can be found at: https://github.com/nelsonjchen/put-block-from-url-esc-issue-demo-server/blob/master/main.go
  - The source URL is a test URL from a test download location from `gtr-test.677472.xyz`.
  - The source URL is a valid Google Takeout download URL. Regions may have different data policies. Please create an issue if your region is unsupported.
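The allowlist above could be checked along these lines. This is a sketch rather than the proxy's actual validation code: `isAllowedSource` and the two constants are hypothetical names, and because this document does not spell out what a valid Google Takeout download URL looks like, that check is left to a caller-supplied `isTakeoutUrl` predicate.

```typescript
// Hostname rules taken from the list above; the names are hypothetical.
const TEST_SERVER_SUFFIX = "-3vngqvvpoq-uc.a.run.app";
const TEST_DOWNLOAD_HOST = "gtr-test.677472.xyz";

function isAllowedSource(
  source: string,
  // Assumption: the caller knows how to recognize a Takeout download URL.
  isTakeoutUrl: (url: URL) => boolean,
): boolean {
  let url: URL;
  try {
    url = new URL(source);
  } catch {
    return false; // not a parseable absolute URL
  }
  const host = url.hostname;
  // Matches *-3vngqvvpoq-uc.a.run.app with a non-empty prefix.
  if (host.length > TEST_SERVER_SUFFIX.length && host.endsWith(TEST_SERVER_SUFFIX)) {
    return true;
  }
  if (host === TEST_DOWNLOAD_HOST) {
    return true;
  }
  return isTakeoutUrl(url);
}

console.log(isAllowedSource("https://gtr-test.677472.xyz/200MB.zip", () => false)); // true
console.log(isAllowedSource("https://example.com/file.zip", () => false)); // false
```

Comparing the parsed hostname, rather than searching the raw string, avoids being fooled by an allowed name that appears in the path or in the userinfo part of a URL.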
This tool is implemented to run on Cloudflare Workers because:
- Cloudflare does not charge for incoming or outgoing data. No egress or ingress charges.
- Cloudflare does not charge for CPU or memory once the request has finished processing, the response headers have been sent, and the worker is just shoveling bytes between two sockets. Other providers may charge for allocated CPU time while all that is being done is shoveling bytes. Most connections in GTR tend to last about 50 seconds. You are "charged" 1 ms per connection, whereas other providers may charge for the full 50 seconds.
- Cloudflare has the peering, compute, and scalability to handle the massive transfer from Google Takeout to Azure Storage. Many of its peering points are peered with Azure and Google with high capacity links.
- Cloudflare Workers are serverless.
- Cloudflare's free tier is generous.
- The worker can be deployed with a button.
- Cloudflare allows fetching and streaming of data from other URLs programmatically.
- Cloudflare Worker endpoints are HTTP/3 compatible and workers can comfortably connect to HTTP/1.1 endpoints.
- Cloudflare Workers are globally deployed. If you transfer from Google in the EU to Azure in the EU, the worker proxy is also in the EU and your data stays in the EU the whole time. The same goes for Australia, the US, and so on. Other providers force users to choose a region, and a wrong choice means either a large bandwidth bill or unknowingly transferring data across undesired borders.
I am not aware of any other provider with the same characteristics as Cloudflare.
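To put the billing difference into numbers, here is a small back-of-the-envelope sketch using the figures quoted above: about 50 seconds per connection and 1 ms "charged" per connection. The count of 50 connections assumes one connection per 1000MB chunk of a 50GB file, and the function name is hypothetical.

```typescript
// Compare time billed per CPU use against time billed per open connection.
function compareBilledTime(
  connections: number,
  connectionSeconds: number, // how long each connection stays open
  workerMsPerConnection: number, // CPU time "charged" per connection
): { workerMs: number; wallClockMs: number; ratio: number } {
  const workerMs = connections * workerMsPerConnection;
  const wallClockMs = connections * connectionSeconds * 1000;
  return { workerMs, wallClockMs, ratio: wallClockMs / workerMs };
}

const billed = compareBilledTime(50, 50, 1);
console.log(billed.workerMs); // 50
console.log(billed.wallClockMs); // 2500000
console.log(billed.ratio); // 50000
```

Under these assumptions, a provider that bills for the whole lifetime of each connection would account for 2500 seconds on a single 50GB file, against 50 ms on Workers.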
graph LR
A[Google Takeout]--4. Download Data from Google .-> B[Cloudflare Worker]
B --2. Command to Download from CF Worker.-> C[Azure Storage]
B --3. Download from CF Worker.-> C[Azure Storage]
Browser -- 1. Control CF Worker / Azure Storage Signed SAS.-> B