Make HTTP.download support redirects when the redirected URL does not set Content-Disposition #761

oxinabox · 2021-09-18T12:04:08Z

Closes #760

I do not love this solution as it used HEAD requrests.
And I am pretty sure those are not supported everywhere.
Even though the standard requires that they are supported everywhere GET is.
But I don't have a counter-example handy.

Possibly the better way to do this is to first try and fully do the download with redirect off.
If that fails, remember the filename,
and then retry with redirect on.

Does anyone have thoughts (or a URL I can use to break this?)

This uses the URL from #760 as go-httpbin doesn't have a way to create a redirect chain that has the first with the content-disposition, and the last without.
postmanlabs/httpbin#652

codecov-commenter · 2021-09-18T12:08:40Z

Codecov Report

Merging #761 (f96574d) into master (10f408c) will decrease coverage by 0.09%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #761      +/-   ##
==========================================
- Coverage   77.59%   77.50%   -0.10%     
==========================================
  Files          38       38              
  Lines        2557     2560       +3     
==========================================
  Hits         1984     1984              
- Misses        573      576       +3

Impacted Files	Coverage Δ
src/download.jl	`0.00% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10f408c...f96574d. Read the comment docs.

oxinabox · 2021-09-19T12:09:30Z

Perhaps the ideal solution would be for a variant of request to return a collection of all headers from redirects

fredrikekre · 2021-09-21T07:56:16Z

Would it be better to just move it if the initial guess was wrong?

fredrikekre · 2021-09-21T07:57:29Z

I think I misunderstood. Is the case that the correct filename is only in the original 302 headers but not the actual 200 response header? That seems a bit strange of the server, no?

oxinabox · 2021-09-21T13:59:25Z

Is the case that the correct filename is only in the original 302 headers but not the actual 200 response header?

Correct.

That seems a bit strange of the server, no?

I am honestly not sure.
On the surface it sounds like one should expect the redirect to have this name.
But perhaps not.

The 302 comes from a FigShare server, it has a pretty rich header, with a fair bit of metadata.
It is probably running a bespoke webserver so they have a lot of control over what it tells us.
It redirected to web-facing AWS S3 Bucket, which are not very configurable at all.
Though it is possible to set it via the AWS commandline tool or API.

302 is nominally a temporary redirect.
So it could be that the AWS bucket is not the final storage location for this, and is a stop-gap til they set something up properly.

Regardless of reasonable or not, when i go to https://ndownloader.figshare.com/files/6294558
in FireFox, it does want to download with the filename rsta20150293_si_001.xlsx
and matching behavior of web-browser is basically my goal for HTTP.download

fredrikekre · 2021-09-21T14:04:37Z

I believe you have access to the original headers before the early return btw, so don't think you need the HEAD request.

oxinabox · 2021-09-21T14:13:40Z

I believe you have access to the original headers before the early return btw, so don't think you need the HEAD request.

Really? That would be great.
I couldn't find a way to get them
It looked to me like the redirect layer eats them up.

HTTP.jl/src/RedirectRequest.jl

Line 28 in 81ac504

    
           res = request(Next, method, url, headers, body; reached_redirect_limit=(count == redirect_limit), kw...)

Notes that we need the headers(res) which are the response headers,
not the headers which are the request headers (which are forwarded onwards by default)

fredrikekre · 2021-09-22T14:44:35Z

resp just before eofcontains the intermediate headers:

diff --git a/src/download.jl b/src/download.jl
index f75be99..1d7bdef 100644
--- a/src/download.jl
+++ b/src/download.jl
@@ -16,9 +16,8 @@ function safer_joinpath(basepart, parts...)
     joinpath(basepart, parts...)
 end
 
-function try_get_filename_from_headers(resp)
-    content_disp = header(resp, "Content-Disposition")
-    if content_disp != ""
+function try_get_filename_from_headers(hdrs)
+    for content_disp in hdrs
         # extract out of Content-Disposition line
         # rough version of what is needed in https://github.com/JuliaWeb/HTTP.jl/issues/179
         filename_part = match(r"filename\s*=\s*(.*)", content_disp)
@@ -55,16 +54,16 @@ function try_get_filename_from_request(req)
 end
 
 
-determine_file(::Nothing, resp) = determine_file(tempdir(), resp)
+determine_file(::Nothing, resp, hdrs) = determine_file(tempdir(), resp, hdrs)
 # ^ We want to the filename if possible because extension is useful for FileIO.jl
 
-function determine_file(path, resp)
+function determine_file(path, resp, hdrs)
     # get the name
     name = if isdir(path)
         # we have been given a path to a directory
         # got to to workout what file to put there
         filename = something(
-                        try_get_filename_from_headers(resp),
+                        try_get_filename_from_headers(hdrs),
                         try_get_filename_from_request(resp.request),
                         basename(tempname())  # fallback, basically a random string
                     )
@@ -107,11 +106,15 @@ function download(url::AbstractString, local_path=nothing, headers=Header[]; upd
 
     @debug 1 "downloading $url"
     local file
+    hdrs = String[]
     HTTP.open("GET", url, headers; kw...) do stream
         resp = startread(stream)
+        # Store intermediate header from redirects
+        content_disp = header(resp, "Content-Disposition")
+        !isempty(content_disp) && push!(hdrs, content_disp)
         eof(stream) && return  # don't do anything for streams we can't read (yet)
 
-        file = determine_file(local_path, resp)
+        file = determine_file(local_path, resp, hdrs)
         total_bytes = parse(Float64, header(resp, "Content-Length", "NaN"))
         downloaded_bytes = 0
         start_time = now()

oxinabox · 2021-09-22T17:28:48Z

Oh wow cool.
I clearly have no idea how HTTP.open(f, "GET", url) works.

in HTTP.download, fixes #760. Co-authored-by: Lyndon White <[email protected]> Co-authored-by: Fredrik Ekre <[email protected]>

oxinabox · 2021-09-26T13:07:00Z

Thanks @fredrikekre I meant to push those changes into this branch, but hasn't got around to it yet.

Use Content-Disposition for 3xx requests for filename detection

f96574d

in HTTP.download, fixes #760. Co-authored-by: Lyndon White <[email protected]> Co-authored-by: Fredrik Ekre <[email protected]>

fredrikekre force-pushed the ox/figshareagain branch from c5086c2 to f96574d Compare September 26, 2021 11:18

fredrikekre merged commit f5957aa into master Sep 26, 2021

fredrikekre deleted the ox/figshareagain branch September 26, 2021 11:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make HTTP.download support redirects when the redirected URL does not set Content-Disposition #761

Make HTTP.download support redirects when the redirected URL does not set Content-Disposition #761

oxinabox commented Sep 18, 2021 •

edited

Loading

codecov-commenter commented Sep 18, 2021 •

edited

Loading

oxinabox commented Sep 19, 2021

fredrikekre commented Sep 21, 2021

fredrikekre commented Sep 21, 2021

oxinabox commented Sep 21, 2021

fredrikekre commented Sep 21, 2021

oxinabox commented Sep 21, 2021 •

edited

Loading

fredrikekre commented Sep 22, 2021 •

edited

Loading

oxinabox commented Sep 22, 2021

oxinabox commented Sep 26, 2021

Make HTTP.download support redirects when the redirected URL does not set Content-Disposition #761

Make HTTP.download support redirects when the redirected URL does not set Content-Disposition #761

Conversation

oxinabox commented Sep 18, 2021 • edited Loading

codecov-commenter commented Sep 18, 2021 • edited Loading

Codecov Report

oxinabox commented Sep 19, 2021

fredrikekre commented Sep 21, 2021

fredrikekre commented Sep 21, 2021

oxinabox commented Sep 21, 2021

fredrikekre commented Sep 21, 2021

oxinabox commented Sep 21, 2021 • edited Loading

fredrikekre commented Sep 22, 2021 • edited Loading

oxinabox commented Sep 22, 2021

oxinabox commented Sep 26, 2021

oxinabox commented Sep 18, 2021 •

edited

Loading

codecov-commenter commented Sep 18, 2021 •

edited

Loading

oxinabox commented Sep 21, 2021 •

edited

Loading

fredrikekre commented Sep 22, 2021 •

edited

Loading