Provide more informative feedback when targets with `format = "url"` fail #303

petrbouchal · 2021-02-07T13:43:06Z

Prework

Read and agree to the code of conduct and contributing guidelines.
Confirm that your issue is most likely a genuine bug in the targets package itself and not a user error or known limitation. For usage issues and troubleshooting, please post to the discussions instead.
If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- Readable: format your code according to the tidyverse style guide.

Description

Currently, when a target with `format = "url"` receives any non-200 HTTP status code, it fails with the same generic error ("could not access url"). More specific feedback would help, since such failures can be difficult to debug (see below). The current wording can also be read as implying a 404 error, which is not always the case.

Specifically, I have come across a case where a server (a) does not permit the HEAD method and (b) unhelpfully returns 400 instead of the appropriate 405 status, so even just looking at the status code is not useful. This is quite hard to debug because a plain curl command or a web browser returns the resource correctly, with `ETag` and `Last-Modified` headers.

Unfortunately, this particular error comes from an instance of Socrata, a data publication platform used by quite a few open data publishers, so this issue may affect a broader range of users.

Reproducible example

Run tar_make() using the following _targets.R:

library(targets)

tar_option_set(packages = c("curl"))

list(
  tar_target(
    euurl,
    "https://cohesiondata.ec.europa.eu/api/views/f5vn-zy5i/rows.csv?accessType=DOWNLOAD",
    format = "url"
  ),
  tar_target(eufile, curl_download(euurl, "x.csv"), format = "file")
)

Here is a quick investigation of what headers and content the server returns using different curl options:

library(curl)

eu_url <- "https://cohesiondata.ec.europa.eu/api/views/f5vn-zy5i/rows.csv?accessType=DOWNLOAD"

# same handle options as in {targets}: nobody = TRUE sends a HEAD request

handle_head <- curl::new_handle(nobody = TRUE)
eucurl_head <- curl::curl_fetch_memory(eu_url, handle = handle_head)
parse_headers(eucurl_head$headers)
#>  [1] "HTTP/1.1 400 Bad Request"                                      
#>  [2] "Server: nginx"                                                 
#>  [3] "Date: Sun, 07 Feb 2021 13:31:39 GMT"                           
#>  [4] "Connection: keep-alive"                                        
#>  [5] "Access-Control-Allow-Origin: *"                                
#>  [6] "X-Error-Code: invalid_request"                                 
#>  [7] "X-Error-Message: HEAD is not supported"                        
#>  [8] "Cache-Control: private, no-cache, must-revalidate"             
#>  [9] "Age: 0"                                                        
#> [10] "X-Socrata-Region: aws-eu-west-1-prod"                          
#> [11] "Strict-Transport-Security: max-age=31536000; includeSubDomains"
#> [12] "X-Socrata-RequestId: 3bfa74fbb6428f8041490b947faa3daa"

# try range

handle_range <- curl::new_handle(range = "0-500")
eucurl_range <- curl::curl_fetch_memory(eu_url, handle = handle_range)
parse_headers(eucurl_range$headers)
#>  [1] "HTTP/1.1 200 OK"                                                                                                 
#>  [2] "Server: nginx"                                                                                                   
#>  [3] "Date: Sun, 07 Feb 2021 13:31:39 GMT"                                                                             
#>  [4] "Content-Type: text/csv; charset=utf-8"                                                                           
#>  [5] "Transfer-Encoding: chunked"                                                                                      
#>  [6] "Connection: keep-alive"                                                                                          
#>  [7] "Access-Control-Allow-Origin: *"                                                                                  
#>  [8] "Content-disposition: attachment; filename=ESIF_2014-2020_CCI_Lookup_Table.csv"                                   
#>  [9] "Cache-Control: public, must-revalidate, max-age=21600"                                                           
#> [10] "ETag: \"YWxwaGEuODcyNl8yXzM0dXFuLV94THplQW5ibnpUbFZjb3RSdU9UQlE4---gzipaD5k_28eYLb6f_TDQ6CW1zFETdY--gzip--gzip\""
#> [11] "X-SODA2-Data-Out-Of-Date: false"                                                                                 
#> [12] "X-SODA2-Truth-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                      
#> [13] "X-SODA2-Secondary-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                  
#> [14] "Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                                    
#> [15] "Vary: Accept-Encoding"                                                                                           
#> [16] "Content-Encoding: gzip"                                                                                          
#> [17] "Age: 0"                                                                                                          
#> [18] "X-Socrata-Region: aws-eu-west-1-prod"                                                                            
#> [19] "Strict-Transport-Security: max-age=31536000; includeSubDomains"                                                  
#> [20] "X-Socrata-RequestId: 2ef05052589a050482e28dd74ac665ae"

# setting range does not make a difference: far more than 500 bytes come back

object.size(eucurl_range$content)
#> 194248 bytes

# check that basic GET request works

eucurl_plain <- curl::curl_fetch_memory(eu_url)
parse_headers(eucurl_plain$headers)
#>  [1] "HTTP/1.1 200 OK"                                                                                                 
#>  [2] "Server: nginx"                                                                                                   
#>  [3] "Date: Sun, 07 Feb 2021 13:31:40 GMT"                                                                             
#>  [4] "Content-Type: text/csv; charset=utf-8"                                                                           
#>  [5] "Transfer-Encoding: chunked"                                                                                      
#>  [6] "Connection: keep-alive"                                                                                          
#>  [7] "Access-Control-Allow-Origin: *"                                                                                  
#>  [8] "Content-disposition: attachment; filename=ESIF_2014-2020_CCI_Lookup_Table.csv"                                   
#>  [9] "Cache-Control: public, must-revalidate, max-age=21600"                                                           
#> [10] "ETag: \"YWxwaGEuODcyNl8yXzM0dXFuLV94THplQW5ibnpUbFZjb3RSdU9UQlE4---gzipaD5k_28eYLb6f_TDQ6CW1zFETdY--gzip--gzip\""
#> [11] "X-SODA2-Data-Out-Of-Date: false"                                                                                 
#> [12] "X-SODA2-Truth-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                      
#> [13] "X-SODA2-Secondary-Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                  
#> [14] "Last-Modified: Mon, 15 Jun 2020 16:29:15 GMT"                                                                    
#> [15] "Vary: Accept-Encoding"                                                                                           
#> [16] "Content-Encoding: gzip"                                                                                          
#> [17] "Age: 2"                                                                                                          
#> [18] "X-Socrata-Region: aws-eu-west-1-prod"                                                                            
#> [19] "Strict-Transport-Security: max-age=31536000; includeSubDomains"                                                  
#> [20] "X-Socrata-RequestId: 5880b7d90421809aa70dbca33080d81a"

Created on 2021-02-07 by the reprex package (v0.3.0)
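
The useful diagnostic in the HEAD response above sits in the `X-Error-Message` header rather than in the status line. As a rough sketch of how it could be pulled out programmatically, the base R helpers below work on header lines of the kind returned by `curl::parse_headers()`. The names `header_value()` and `status_code()` are hypothetical and are not part of curl or targets.

```r
# Hypothetical helper (not part of curl or targets): look up one header
# value, case-insensitively, in lines like those from curl::parse_headers().
header_value <- function(lines, name) {
  pattern <- paste0("^", name, ":\\s*")
  hit <- grep(pattern, lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0L) {
    return(NA_character_)
  }
  sub(pattern, "", hit[[1L]], ignore.case = TRUE)
}

# Hypothetical helper: numeric status code taken from the first status line.
status_code <- function(lines) {
  status_line <- grep("^HTTP/", lines, value = TRUE)[[1L]]
  as.integer(strsplit(status_line, " ", fixed = TRUE)[[1L]][[2L]])
}

# A subset of the header lines from the HEAD response shown above.
head_lines <- c(
  "HTTP/1.1 400 Bad Request",
  "Server: nginx",
  "X-Error-Code: invalid_request",
  "X-Error-Message: HEAD is not supported"
)

status_code(head_lines)
#> [1] 400
header_value(head_lines, "X-Error-Message")
#> [1] "HEAD is not supported"
```

In a live session the status is also available directly as the `status_code` element of the list returned by `curl::curl_fetch_memory()`.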

Expected result

It would be useful to see more about the error that occurred: the status code and the error text. Even the current text could be reworded so that it does not suggest a 404 HTTP error.
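
To illustrate the kind of message meant here, the following is a minimal sketch in base R. It is not how targets is implemented. The function names `url_error_message()` and `check_url_status()` are hypothetical, and treating every non-200 status as a failure is an assumption taken from the description above.

```r
# Hypothetical helper, not the targets implementation: build an error
# message that reports the status code and the response headers.
url_error_message <- function(url, status, headers) {
  paste(
    c(
      paste("HTTP response status code", status),
      "Could not access url:",
      paste0("  ", url),
      "HTTP response headers:",
      paste0("  ", headers)
    ),
    collapse = "\n"
  )
}

# Hypothetical check: stop with the detailed message on any non-200 status
# (an assumption based on the behavior described in this issue).
check_url_status <- function(url, status, headers) {
  if (status != 200L) {
    stop(url_error_message(url, status, headers), call. = FALSE)
  }
  invisible(TRUE)
}
```

With a message like this, the Socrata case would surface both the 400 status and the `X-Error-Message: HEAD is not supported` header, which is enough to see that the request method is the problem.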

I don't see a way to get around the fact that the server does not accept HEAD requests: the same server does not respect the range curl option telling the server to only return e.g. the first 500 bytes, which suggests using this option will not be reliable. Failing that, the only other option is to GET the whole resource from the URL, which would defeat most of the purpose of the URL-format target.
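
One way to detect whether a server respects a range request is sketched below. Under HTTP range requests (RFC 7233), a server that satisfies a range answers with 206 Partial Content and sends no more than the requested bytes. The function `range_honored()` is a hypothetical helper, and the byte count passed in is the `object.size()` figure from the reprex, which slightly overstates the body length because it includes R object overhead.

```r
# Hypothetical check: did the server honor a byte-range request?
# A satisfied range request is answered with 206 Partial Content and
# at most (last - first + 1) bytes of content.
range_honored <- function(status, n_bytes, first, last) {
  status == 206L && n_bytes <= (last - first + 1)
}

# Values from the range request above: status 200, about 194 kB returned
# for the requested range 0-500.
range_honored(200L, 194248, 0, 500)
#> [1] FALSE
```

In a live session, `length(eucurl_range$content)` would give the exact number of bytes received.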

Diagnostic information

- A reproducible example.
- Using current CRAN targets.

wlandau · 2021-02-07T14:44:48Z

Thanks, good idea. Should be fixed now. The error message now returns the status code and the whole header.

library(targets)
tar_script(tar_target(abc, "https://httpbin.org/status/404", format = "url"))
tar_make()
#> ● run target abc
#> Error : HTTP response status code 404
#> Could not access url:
#>   https://httpbin.org/status/404
#> HTTP response headers:
#>   date = Sun, 07 Feb 2021 14:43:20 GMT
#>   content-type = text/html; charset=utf-8
#>   content-length = 0
#>   connection = keep-alive
#>   server = gunicorn/19.9.0
#>   access-control-allow-origin = *
#>   access-control-allow-credentials = true
#> Error: callr subprocess failed: HTTP response status code 404
#> Could not access url:
#>   https://httpbin.org/status/404
#> HTTP response headers:
#>   date = Sun, 07 Feb 2021 14:43:20 GMT
#>   content-type = text/html; charset=utf-8
#>   content-length = 0
#>   connection = keep-alive
#>   server = gunicorn/19.9.0
#>   access-control-allow-origin = *
#>   access-control-allow-credentials = true

Created on 2021-02-07 by the reprex package (v0.3.0)

petrbouchal · 2021-02-07T21:13:45Z

Thanks for such a quick fix! Can confirm this now returns the headers upon HTTP 400 error.

wlandau closed this as completed in 0da7cd9 Feb 7, 2021
