Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HTTP.download support redirects when the redirected URL does not set Content-Disposition #761

Merged
merged 1 commit into from
Sep 26, 2021

Conversation

oxinabox
Copy link
Member

@oxinabox oxinabox commented Sep 18, 2021

Closes #760

I do not love this solution as it used HEAD requrests.
And I am pretty sure those are not supported everywhere.
Even though the standard requires that they are supported everywhere GET is.
But I don't have a counter-example handy.

Possibly the better way to do this is to first try and fully do the download with redirect off.
If that fails, remember the filename,
and then retry with redirect on.

Does anyone have thoughts (or a URL I can use to break this?)

This uses the URL from #760 as go-httpbin doesn't have a way to create a redirect chain that has the first with the content-disposition, and the last without.
postmanlabs/httpbin#652

@codecov-commenter
Copy link

codecov-commenter commented Sep 18, 2021

Codecov Report

Merging #761 (f96574d) into master (10f408c) will decrease coverage by 0.09%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #761      +/-   ##
==========================================
- Coverage   77.59%   77.50%   -0.10%     
==========================================
  Files          38       38              
  Lines        2557     2560       +3     
==========================================
  Hits         1984     1984              
- Misses        573      576       +3     
Impacted Files Coverage Δ
src/download.jl 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10f408c...f96574d. Read the comment docs.

@oxinabox
Copy link
Member Author

Perhaps the ideal solution would be for a variant of request to return a collection of all headers from redirects

@fredrikekre
Copy link
Member

Would it be better to just move it if the initial guess was wrong?

@fredrikekre
Copy link
Member

I think I misunderstood. Is the case that the correct filename is only in the original 302 headers but not the actual 200 response header? That seems a bit strange of the server, no?

@oxinabox
Copy link
Member Author

Is the case that the correct filename is only in the original 302 headers but not the actual 200 response header?

Correct.

That seems a bit strange of the server, no?

I am honestly not sure.
On the surface it sounds like one should expect the redirect to have this name.
But perhaps not.

The 302 comes from a FigShare server, it has a pretty rich header, with a fair bit of metadata.
It is probably running a bespoke webserver so they have a lot of control over what it tells us.
It redirected to web-facing AWS S3 Bucket, which are not very configurable at all.
Though it is possible to set it via the AWS commandline tool or API.

302 is nominally a temporary redirect.
So it could be that the AWS bucket is not the final storage location for this, and is a stop-gap til they set something up properly.

Regardless of reasonable or not, when i go to https://ndownloader.figshare.com/files/6294558
in FireFox, it does want to download with the filename rsta20150293_si_001.xlsx
and matching behavior of web-browser is basically my goal for HTTP.download

@fredrikekre
Copy link
Member

I believe you have access to the original headers before the early return btw, so don't think you need the HEAD request.

@oxinabox
Copy link
Member Author

oxinabox commented Sep 21, 2021

I believe you have access to the original headers before the early return btw, so don't think you need the HEAD request.

Really? That would be great.
I couldn't find a way to get them
It looked to me like the redirect layer eats them up.

res = request(Next, method, url, headers, body; reached_redirect_limit=(count == redirect_limit), kw...)

Notes that we need the headers(res) which are the response headers,
not the headers which are the request headers (which are forwarded onwards by default)

@fredrikekre
Copy link
Member

fredrikekre commented Sep 22, 2021

resp just before eofcontains the intermediate headers:

diff --git a/src/download.jl b/src/download.jl
index f75be99..1d7bdef 100644
--- a/src/download.jl
+++ b/src/download.jl
@@ -16,9 +16,8 @@ function safer_joinpath(basepart, parts...)
     joinpath(basepart, parts...)
 end
 
-function try_get_filename_from_headers(resp)
-    content_disp = header(resp, "Content-Disposition")
-    if content_disp != ""
+function try_get_filename_from_headers(hdrs)
+    for content_disp in hdrs
         # extract out of Content-Disposition line
         # rough version of what is needed in https://github.com/JuliaWeb/HTTP.jl/issues/179
         filename_part = match(r"filename\s*=\s*(.*)", content_disp)
@@ -55,16 +54,16 @@ function try_get_filename_from_request(req)
 end
 
 
-determine_file(::Nothing, resp) = determine_file(tempdir(), resp)
+determine_file(::Nothing, resp, hdrs) = determine_file(tempdir(), resp, hdrs)
 # ^ We want to the filename if possible because extension is useful for FileIO.jl
 
-function determine_file(path, resp)
+function determine_file(path, resp, hdrs)
     # get the name
     name = if isdir(path)
         # we have been given a path to a directory
         # got to to workout what file to put there
         filename = something(
-                        try_get_filename_from_headers(resp),
+                        try_get_filename_from_headers(hdrs),
                         try_get_filename_from_request(resp.request),
                         basename(tempname())  # fallback, basically a random string
                     )
@@ -107,11 +106,15 @@ function download(url::AbstractString, local_path=nothing, headers=Header[]; upd
 
     @debug 1 "downloading $url"
     local file
+    hdrs = String[]
     HTTP.open("GET", url, headers; kw...) do stream
         resp = startread(stream)
+        # Store intermediate header from redirects
+        content_disp = header(resp, "Content-Disposition")
+        !isempty(content_disp) && push!(hdrs, content_disp)
         eof(stream) && return  # don't do anything for streams we can't read (yet)
 
-        file = determine_file(local_path, resp)
+        file = determine_file(local_path, resp, hdrs)
         total_bytes = parse(Float64, header(resp, "Content-Length", "NaN"))
         downloaded_bytes = 0
         start_time = now()

@oxinabox
Copy link
Member Author

Oh wow cool.
I clearly have no idea how HTTP.open(f, "GET", url) works.

in HTTP.download, fixes #760.

Co-authored-by: Lyndon White <[email protected]>
Co-authored-by: Fredrik Ekre <[email protected]>
@fredrikekre fredrikekre merged commit f5957aa into master Sep 26, 2021
@fredrikekre fredrikekre deleted the ox/figshareagain branch September 26, 2021 11:33
@oxinabox
Copy link
Member Author

Thanks @fredrikekre I meant to push those changes into this branch, but hasn't got around to it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong file name for download from a FigShare URL
3 participants