Provide more informative feedback when targets with format = "url" fail #303

Closed
11 tasks done
petrbouchal opened this issue Feb 7, 2021 · 2 comments

petrbouchal commented Feb 7, 2021

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • Confirm that your issue is most likely a genuine bug in the targets package itself and not a user error or known limitation. For usage issues and troubleshooting, please post to the discussions instead.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

Currently, when a URL-type target runs into any non-200 HTTP status code, it throws the same error ("could not access url") regardless of the cause. More specific feedback would help, since such failures can be difficult to debug (see below). Also, the current wording could be read to suggest a 404 error, which is not always the case.

Specifically, I have come across a case where a server (a) does not permit the HEAD method and (b) unhelpfully returns 400 instead of the appropriate 405 status, so the status code alone does not reveal the cause. This is quite hard to debug, because a plain curl command or a web browser returns the resource correctly, with ETag and Last-Modified headers.

Unfortunately this particular error relates to an instance of Socrata, a data publication solution which is used by quite a few open data publishers, so this issue may affect a broader range of users.
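
To illustrate the kind of feedback that would help, here is a minimal sketch of an error message that reports the status code and the response headers. The helper name describe_http_failure() and its arguments are hypothetical and not part of {targets}; it takes the status code and header lines as plain values so that it runs without network access.

```r
# Hypothetical helper, not part of {targets}: build an error message that
# reports the status code and response headers instead of a generic failure.
describe_http_failure <- function(url, status_code, header_lines) {
  paste(
    c(
      paste("HTTP response status code", status_code),
      "Could not access url:",
      paste0("  ", url),
      "HTTP response headers:",
      paste0("  ", header_lines)
    ),
    collapse = "\n"
  )
}

# Example values copied from the HEAD response reported in this issue.
msg <- describe_http_failure(
  url = "https://cohesiondata.ec.europa.eu/api/views/f5vn-zy5i/rows.csv?accessType=DOWNLOAD",
  status_code = 400L,
  header_lines = c(
    "X-Error-Code: invalid_request",
    "X-Error-Message: HEAD is not supported"
  )
)
cat(msg, "\n")
```

With a live response from curl::curl_fetch_memory(), the status_code element and curl::parse_headers() applied to the headers element could be passed in directly.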

Reproducible example

Run tar_make() using the following _targets.R:

library(targets)

tar_option_set(packages = c("curl"))

list(
  tar_target(
    euurl,
    "https://cohesiondata.ec.europa.eu/api/views/f5vn-zy5i/rows.csv?accessType=DOWNLOAD",
    format = "url"
  ),
  tar_target(eufile, curl_download(euurl, "x.csv"), format = "file")
)

Here is a quick investigation of what headers and content the server returns using different curl options:

library(curl)

eu_url <- "https://cohesiondata.ec.europa.eu/api/views/f5vn-zy5i/rows.csv?accessType=DOWNLOAD"

# HEAD request (nobody = TRUE), the same handle option as in {targets}

handle_head <- curl::new_handle(nobody = TRUE)
eucurl_head <- curl::curl_fetch_memory(eu_url, handle = handle_head)
parse_headers(eucurl_head$headers)
#>  [1] "HTTP/1.1 400 Bad Request"                                      
#>  [2] "Server: nginx"                                                 
#>  [3] "Date: Sun, 07 Feb 2021 13:31:39 GMT"                           
#>  [4] "Connection: keep-alive"                                        
#>  [5] "Access-Control-Allow-Origin: *"                                
#>  [6] "X-Error-Code: invalid_request"                                 
#>  [7] "X-Error-Message: HEAD is not supported"                        
#>  [8] "Cache-Control: private, no-cache, must-revalidate"             
#>  [9] "Age: 0"                                                        
#> [10] "X-Socrata-Region: aws-eu-west-1-prod"                          
#> [11] "Strict-Transport-Security: max-age=31536000; includeSubDomains"
#> [12] "X-Socrata-RequestId: 3bfa74fbb6428f8041490b947faa3daa"

# try range

handle_range <- curl::new_handle(range = "0-500")
eucurl_range <- curl::curl_fetch_memory(eu_url, handle = handle_range)
parse_headers(eucurl_range$headers)
#>  [1] "HTTP/1.1 200 OK"                                                                                                 
#>  [2] "Server: nginx"                                                                                                   
#>  [3] "Date: Sun, 07 Feb 2021 13:31:39 GMT"                                                                             
#>  [4] "Content-Type: text/csv; charset=utf-8"                                                                           
#>  [5] "Transfer-Encoding: chunked"                                                                                      
#>  [6] "Connection: keep-alive"                                                                                          
#>  [7] "Access-Control-Allow-Origin: *"                                                                                  
#>  [8] "Content-disposition: attachment; filename=ESIF_2014-2020_CCI_Lookup_Table.csv"                                   
#>  [9] "Cache-Control: public, must-revalidate, max-age=21600"                                                           
#> [10] "ETag: \"YWxwaGEuODcyNl8yXzM0dXFuLV94THplQW5ibnpUbFZjb3RSdU9UQlE4---gzipaD5k_28eYLb6f_TDQ6CW1zFETdY--gzip--gzip\""
#> [11] "X-SODA2-Data-Out-Of-Date: false"                                                                                 
#> [12] "X-SODA2-Truth-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                      
#> [13] "X-SODA2-Secondary-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                  
#> [14] "Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                                    
#> [15] "Vary: Accept-Encoding"                                                                                           
#> [16] "Content-Encoding: gzip"                                                                                          
#> [17] "Age: 0"                                                                                                          
#> [18] "X-Socrata-Region: aws-eu-west-1-prod"                                                                            
#> [19] "Strict-Transport-Security: max-age=31536000; includeSubDomains"                                                  
#> [20] "X-Socrata-RequestId: 2ef05052589a050482e28dd74ac665ae"

# setting range does not make a difference: the body is far larger than the 0-500 byte range requested

object.size(eucurl_range$content)
#> 194248 bytes

# check that basic GET request works

eucurl_plain <- curl::curl_fetch_memory(eu_url)
parse_headers(eucurl_plain$headers)
#>  [1] "HTTP/1.1 200 OK"                                                                                                 
#>  [2] "Server: nginx"                                                                                                   
#>  [3] "Date: Sun, 07 Feb 2021 13:31:40 GMT"                                                                             
#>  [4] "Content-Type: text/csv; charset=utf-8"                                                                           
#>  [5] "Transfer-Encoding: chunked"                                                                                      
#>  [6] "Connection: keep-alive"                                                                                          
#>  [7] "Access-Control-Allow-Origin: *"                                                                                  
#>  [8] "Content-disposition: attachment; filename=ESIF_2014-2020_CCI_Lookup_Table.csv"                                   
#>  [9] "Cache-Control: public, must-revalidate, max-age=21600"                                                           
#> [10] "ETag: \"YWxwaGEuODcyNl8yXzM0dXFuLV94THplQW5ibnpUbFZjb3RSdU9UQlE4---gzipaD5k_28eYLb6f_TDQ6CW1zFETdY--gzip--gzip\""
#> [11] "X-SODA2-Data-Out-Of-Date: false"                                                                                 
#> [12] "X-SODA2-Truth-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                      
#> [13] "X-SODA2-Secondary-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                  
#> [14] "Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                                    
#> [15] "Vary: Accept-Encoding"                                                                                           
#> [16] "Content-Encoding: gzip"                                                                                          
#> [17] "Age: 2"                                                                                                          
#> [18] "X-Socrata-Region: aws-eu-west-1-prod"                                                                            
#> [19] "Strict-Transport-Security: max-age=31536000; includeSubDomains"                                                  
#> [20] "X-Socrata-RequestId: 5880b7d90421809aa70dbca33080d81a"

Created on 2021-02-07 by the reprex package (v0.3.0)

Expected result

It would be useful to see more about what error occurred - status code and error text. Even the current text could be adapted so as not to suggest a 404 HTTP error.

I don't see a way to get around the fact that the server does not accept HEAD requests: the same server also ignores the range curl option asking it to return only, e.g., the first 500 bytes, which suggests that this option will not be reliable. Failing that, the only other option is to GET the whole resource from the URL, which would defeat most of the purpose of the URL-format target.
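
One way to check programmatically whether a server honored a range request is to look at the status code: under HTTP semantics a partial response comes back as 206 Partial Content, whereas 200 OK means the full resource was sent. The sketch below assumes a response shaped like the list returned by curl::curl_fetch_memory() (with status_code and content elements). The helper name range_was_honored() and the simulated responses are hypothetical.

```r
# Hypothetical helper: decide whether a ranged GET was honored.
# `response` is assumed to look like the list returned by
# curl::curl_fetch_memory(), i.e. with `status_code` and raw `content`.
range_was_honored <- function(response, max_bytes) {
  response$status_code == 206L && length(response$content) <= max_bytes
}

# Simulated responses (made-up sizes), so the example runs offline.
partial <- list(status_code = 206L, content = raw(501))
ignored <- list(status_code = 200L, content = raw(1000))

range_was_honored(partial, max_bytes = 501)
#> [1] TRUE
range_was_honored(ignored, max_bytes = 501)
#> [1] FALSE
```

A check like this only detects that the range was ignored; it does not avoid downloading the full body in that case.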

@wlandau wlandau closed this as completed in 0da7cd9 Feb 7, 2021

wlandau commented Feb 7, 2021

Thanks, good idea. Should be fixed now. The error message now returns the status code and the whole header.

library(targets)
tar_script(tar_target(abc, "https://httpbin.org/status/404", format = "url"))
tar_make()
#> ● run target abc
#> Error : HTTP response status code 404
#> Could not access url:
#>   https://httpbin.org/status/404
#> HTTP response headers:
#>   date = Sun, 07 Feb 2021 14:43:20 GMT
#>   content-type = text/html; charset=utf-8
#>   content-length = 0
#>   connection = keep-alive
#>   server = gunicorn/19.9.0
#>   access-control-allow-origin = *
#>   access-control-allow-credentials = true
#> Error: callr subprocess failed: HTTP response status code 404
#> Could not access url:
#>   https://httpbin.org/status/404
#> HTTP response headers:
#>   date = Sun, 07 Feb 2021 14:43:20 GMT
#>   content-type = text/html; charset=utf-8
#>   content-length = 0
#>   connection = keep-alive
#>   server = gunicorn/19.9.0
#>   access-control-allow-origin = *
#>   access-control-allow-credentials = true

Created on 2021-02-07 by the reprex package (v0.3.0)

@petrbouchal
Author

Thanks for such a quick fix! Can confirm this now returns the headers on an HTTP 400 error.
