Make HTTP.download support redirects when the redirected URL does not set Content-Disposition #761
Codecov Report
@@            Coverage Diff             @@
##           master     #761      +/-   ##
==========================================
- Coverage   77.59%   77.50%   -0.10%
==========================================
  Files          38       38
  Lines        2557     2560       +3
==========================================
  Hits         1984     1984
- Misses        573      576       +3
|
Perhaps the ideal solution would be for a variant of |
Would it be better to just move it if the initial guess was wrong? |
I think I misunderstood. Is it the case that the correct filename is only in the original 302 headers and not in the actual 200 response headers? That seems a bit strange of the server, no? |
Correct.
I am honestly not sure. The 302 comes from a FigShare server; it has a pretty rich header, with a fair bit of metadata. 302 is nominally a temporary redirect. Regardless of whether that is reasonable, when I go to https://ndownloader.figshare.com/files/6294558 |
I believe you have access to the original headers before the early return btw, so don't think you need the HEAD request. |
Really? That would be great. HTTP.jl/src/RedirectRequest.jl Line 28 in 81ac504
Notes that we need the |
diff --git a/src/download.jl b/src/download.jl
index f75be99..1d7bdef 100644
--- a/src/download.jl
+++ b/src/download.jl
@@ -16,9 +16,8 @@ function safer_joinpath(basepart, parts...)
joinpath(basepart, parts...)
end
-function try_get_filename_from_headers(resp)
- content_disp = header(resp, "Content-Disposition")
- if content_disp != ""
+function try_get_filename_from_headers(hdrs)
+ for content_disp in hdrs
# extract out of Content-Disposition line
# rough version of what is needed in https://github.com/JuliaWeb/HTTP.jl/issues/179
filename_part = match(r"filename\s*=\s*(.*)", content_disp)
@@ -55,16 +54,16 @@ function try_get_filename_from_request(req)
end
-determine_file(::Nothing, resp) = determine_file(tempdir(), resp)
+determine_file(::Nothing, resp, hdrs) = determine_file(tempdir(), resp, hdrs)
# ^ We want to the filename if possible because extension is useful for FileIO.jl
-function determine_file(path, resp)
+function determine_file(path, resp, hdrs)
# get the name
name = if isdir(path)
# we have been given a path to a directory
# got to to workout what file to put there
filename = something(
- try_get_filename_from_headers(resp),
+ try_get_filename_from_headers(hdrs),
try_get_filename_from_request(resp.request),
basename(tempname()) # fallback, basically a random string
)
@@ -107,11 +106,15 @@ function download(url::AbstractString, local_path=nothing, headers=Header[]; upd
@debug 1 "downloading $url"
local file
+ hdrs = String[]
HTTP.open("GET", url, headers; kw...) do stream
resp = startread(stream)
+ # Store intermediate header from redirects
+ content_disp = header(resp, "Content-Disposition")
+ !isempty(content_disp) && push!(hdrs, content_disp)
eof(stream) && return # don't do anything for streams we can't read (yet)
- file = determine_file(local_path, resp)
+ file = determine_file(local_path, resp, hdrs)
total_bytes = parse(Float64, header(resp, "Content-Length", "NaN"))
downloaded_bytes = 0
start_time = now() |
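The idea in the diff above, collecting every Content-Disposition value seen along the redirect chain and taking the first one that carries a filename, can be sketched in isolation. This is a minimal illustration, not the code in the PR: `first_filename` is a hypothetical name, and the regex is a slightly stricter variant of the rough one in the diff (it stops at a quote or semicolon).

```julia
# Hypothetical helper, not part of HTTP.jl: scan Content-Disposition values
# in the order they were collected and return the first filename found,
# or `nothing` when no value carries one.
function first_filename(content_disps::Vector{String})
    for content_disp in content_disps
        # Assumption: a plain `filename=...` parameter, optionally quoted.
        # Extended forms such as `filename*=` are not handled here.
        m = match(r"filename\s*=\s*\"?([^\";]+)\"?", content_disp)
        m === nothing && continue
        name = strip(m.captures[1])
        isempty(name) || return String(name)
    end
    return nothing
end

# Values as they might be collected: the redirect names the file,
# the final response does not.
hdrs = ["attachment; filename=\"iris.csv\"", "inline"]
first_filename(hdrs)
```

Because earlier entries win, a filename announced by the 302 response is kept even when the final 200 response omits the header.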
Oh wow cool. |
in HTTP.download, fixes #760. Co-authored-by: Lyndon White <[email protected]> Co-authored-by: Fredrik Ekre <[email protected]>
c5086c2 to f96574d
Thanks @fredrikekre, I meant to push those changes into this branch but hadn't got around to it yet. |
Closes #760
I do not love this solution, as it uses HEAD requests.
And I am pretty sure those are not supported everywhere,
even though the standard requires that they are supported everywhere GET is.
But I don't have a counter-example handy.
Possibly the better way to do this is to first try to do the full download with redirect off.
If that fails, remember the filename,
and then retry with redirect on.
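
A rough sketch of that two-pass idea, to make the control flow concrete. Everything here is an assumption for illustration: `request` is a hypothetical callable `(url; redirect) -> (status, headers, body)` standing in for the real HTTP call, and `header_value` and `download_two_pass` are made-up names, not HTTP.jl functions.

```julia
# Hypothetical helper: case-insensitive header lookup, "" when absent.
function header_value(headers, name)
    for (k, v) in headers
        lowercase(k) == lowercase(name) && return v
    end
    return ""
end

# Sketch of "try with redirect off, remember the filename, retry with
# redirect on". `request` is an assumed stand-in for the HTTP client.
function download_two_pass(request, url)
    status, headers, body = request(url; redirect=false)
    remembered = header_value(headers, "Content-Disposition")
    if 300 <= status < 400
        # The first attempt hit a redirect: retry and let the client follow it.
        status, headers, body = request(url; redirect=true)
        # Only use the final response's header if nothing was remembered.
        if isempty(remembered)
            remembered = header_value(headers, "Content-Disposition")
        end
    end
    return (status=status, content_disposition=remembered, body=body)
end
```

This avoids the HEAD request, at the cost of a second GET whenever the first response is a redirect.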
Does anyone have thoughts (or a URL I can use to break this)?
This uses the URL from #760, as go-httpbin doesn't have a way to create a redirect chain in which the first response has a Content-Disposition header and the last does not.
postmanlabs/httpbin#652