Encoding problems in pandoc_citeproc_convert() with Windows #2195

mitchelloharawild · 2021-07-27T02:01:04Z

A couple of issues have been raised in {vitae} about bibliography encoding problems on Windows (mitchelloharawild/vitae#167, mitchelloharawild/vitae#158). So far I've narrowed it down to rmarkdown::pandoc_citeproc_convert(), and I'm raising an issue here in the hope that you have more experience with pandoc and encoding issues on Windows.

MRE:

bib <- c(
  "@article{conc2021,",
  "  title={História da Habitação},",
  "  author={Conceição, Sérgio},", 
  "  journal={Portuguese History},",
  "  number={1},", 
  "  year={2021}", 
  "}"
)

writeLines(enc2utf8(bib), "test.bib", useBytes = TRUE)
rmarkdown::pandoc_citeproc_convert("test.bib")
#> [[1]]
#> [[1]]$author
#> [[1]]$author[[1]]
#> [[1]]$author[[1]]$family
#> [1] "ConceiÃ§Ã£o"
#> 
#> [[1]]$author[[1]]$given
#> [1] "SÃ©rgio"
#> 
#> 
#> 
#> [[1]]$`container-title`
#> [1] "Portuguese History"
#> 
#> [[1]]$id
#> [1] "conc2021"
#> 
#> [[1]]$issue
#> [1] "1"
#> 
#> [[1]]$issued
#> [[1]]$issued$`date-parts`
#> [[1]]$issued$`date-parts`[[1]]
#> [[1]]$issued$`date-parts`[[1]][[1]]
#> [1] 2021
#> 
#> 
#> 
#> 
#> [[1]]$title
#> [1] "HistÃ³ria da habitaÃ§Ã£o"
#> 
#> [[1]]$type
#> [1] "article-journal"

Created on 2021-07-26 by the reprex package (v2.0.0)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/Los_Angeles         
#>  date     2021-07-26                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source                            
#>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.0.5)                    
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)                    
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.2)                    
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)                    
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)                    
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.0.5)                    
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.5)                    
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.4)                    
#>  knitr         1.33    2021-04-24 [1] CRAN (R 4.0.5)                    
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.4)                    
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.5)                    
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.0.5)                    
#>  rmarkdown     2.9.5   2021-07-27 [1] Github (rstudio/rmarkdown@bc936f7)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.5)                    
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)                    
#>  stringi       1.7.3   2021-07-16 [1] CRAN (R 4.0.2)                    
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)                    
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.0.5)                    
#>  xfun          0.24    2021-06-15 [1] CRAN (R 4.0.5)                    
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)                    
#> 
#> [1] C:/Users/Admin/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.2/library

Checklist

When filing a bug report, please check the boxes below to confirm that you have provided us with the information we need. Have you:

formatted your issue so it is easier for us to read?
included a minimal, self-contained, and reproducible example?
pasted the output from xfun::session_info('rmarkdown') in your issue?
upgraded all your packages to their latest versions (including your versions of R, the RStudio IDE, and relevant R packages)?
installed and tested your bug with the development version of the rmarkdown package using remotes::install_github("rstudio/rmarkdown")?

xfun::session_info('rmarkdown')
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363), RStudio 1.4.1103

Locale:
  LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
  LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
  LC_TIME=English_United States.1252    

Package version:
  base64enc_0.1.3   digest_0.6.27     evaluate_0.14     glue_1.4.2        graphics_4.0.2   
  grDevices_4.0.2   highr_0.9         htmltools_0.5.1.1 jsonlite_1.7.2    knitr_1.33       
  magrittr_2.0.1    markdown_1.1      methods_4.0.2     mime_0.11         rlang_0.4.11     
  rmarkdown_2.9     stats_4.0.2       stringi_1.7.3     stringr_1.4.0     tinytex_0.32     
  tools_4.0.2       utils_4.0.2       xfun_0.24         yaml_2.2.1       

Pandoc version: 2.11.2


cderv · 2021-07-28T09:12:12Z

Hi @mitchelloharawild !

Thanks for opening this issue. Here are some notes on my investigation.

Does it come from Pandoc?

The first thing I did was check whether the problem comes from Pandoc itself.

Writing test.bib from R

bib <- c(
  "@article{conc2021,",
  "  title={História da Habitação},",
  "  author={Conceição, Sérgio},", 
  "  journal={Portuguese History},",
  "  number={1},", 
  "  year={2021}", 
  "}"
)
xfun::write_utf8(bib, "test.bib")

Calling pandoc directly from the command line in a terminal

pandoc -t markdown -s -o test.md .\test.bib
pandoc -t csljson -s -o test.json .\test.bib

Trying to read the file from R as UTF-8

xfun::read_utf8("test.md")
#>  [1] "---"                                   "nocite: \"[@*]\""                     
#>  [3] "references:"                           "- author:"                            
#>  [5] "  - family: Conceição"                 "    given: Sérgio"                    
#>  [7] "  container-title: Portuguese History" "  id: conc2021"                       
#>  [9] "  issue: 1"                            "  issued: 2021"                       
#> [11] "  title: História da habitação"        "  type: article-journal"              
#> [13] "---"                                   ""
xfun::read_utf8("test.json")
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"Conceição\","              
#>  [6] "        \"given\": \"Sérgio\""                   
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"História da habitação\","       
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"

This works fine, so Pandoc itself writes correct UTF-8 output.

What happens in R, then?

In the R function, we don't write to a file; we capture the output from a call to system():

rmarkdown/R/pandoc.R

Lines 150 to 153 in 0af6b35

    
  # run the conversion
  with_pandoc_safe_environment({
    result <- system(command, intern = TRUE)
  })

I think this is the cause: on Windows, UTF-8 is not the default encoding, and the captured string prints incorrectly because it is not marked as UTF-8.

rmarkdown:::with_pandoc_safe_environment({
  result <- system("pandoc -t csljson -s test.bib", intern = TRUE)
})
# incorrect result
result
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"ConceiÃ§Ã£o\","            
#>  [6] "        \"given\": \"SÃ©rgio\""                  
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"HistÃ³ria da habitaÃ§Ã£o\","    
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"
# the pandoc output is not marked as UTF-8, I believe
Encoding(result)
#>  [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
#>  [9] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
#> [17] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# Mark it as UTF-8
Encoding(result) <- 'UTF-8'
# it works OK!
result
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"Conceição\","              
#>  [6] "        \"given\": \"Sérgio\""                   
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"História da habitação\","       
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"

Because the string carries the wrong encoding, jsonlite::fromJSON() also produces wrongly encoded strings in the resulting list.
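The garbling can be reproduced on any platform without pandoc. The sketch below is only an illustration, not code from rmarkdown, and its variable names are made up for the example. It takes the UTF-8 bytes of the family name from the report and uses iconv() to show what they look like when each byte is read as one Latin-1 character (Latin-1 agrees with Windows-1252 for the bytes involved here).

```r
# UTF-8 bytes of the family name from the report (c-cedilla, a-tilde)
utf8_bytes <- charToRaw(enc2utf8("Concei\u00e7\u00e3o"))

# A string built from raw bytes carries no encoding mark, much like the
# output captured by system(..., intern = TRUE) in the thread
captured <- rawToChar(utf8_bytes)
Encoding(captured)

# Reading every byte as a single Latin-1 character: each two-byte UTF-8
# sequence turns into two separate characters
misread <- iconv(captured, from = "latin1", to = "UTF-8")
misread
```

Here `misread` holds the same garbled form of the name that appears in the report, which is consistent with the bytes being correct UTF-8 that is simply interpreted in the wrong encoding.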

A workaround for you?

Basically, the current workaround for you is to convert to JSON, mark the result as UTF-8, and convert it to a list yourself.

res <- rmarkdown::pandoc_citeproc_convert("test.bib", type = "json")
res
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"ConceiÃ§Ã£o\","            
#>  [6] "        \"given\": \"SÃ©rgio\""                  
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"HistÃ³ria da habitaÃ§Ã£o\","    
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"
Encoding(res) <- "UTF-8"
res
#>  [1] "["                                               
#>  [2] "  {"                                             
#>  [3] "    \"author\": ["                               
#>  [4] "      {"                                         
#>  [5] "        \"family\": \"Conceição\","              
#>  [6] "        \"given\": \"Sérgio\""                   
#>  [7] "      }"                                         
#>  [8] "    ],"                                          
#>  [9] "    \"container-title\": \"Portuguese History\","
#> [10] "    \"id\": \"conc2021\","                       
#> [11] "    \"issue\": \"1\","                           
#> [12] "    \"issued\": {"                               
#> [13] "      \"date-parts\": ["                         
#> [14] "        ["                                       
#> [15] "          2021"                                  
#> [16] "        ]"                                       
#> [17] "      ]"                                         
#> [18] "    },"                                          
#> [19] "    \"title\": \"História da habitação\","       
#> [20] "    \"type\": \"article-journal\""               
#> [21] "  }"                                             
#> [22] "]"
jsonlite::fromJSON(res, simplifyVector = FALSE)
#> [[1]]
#> [[1]]$author
#> [[1]]$author[[1]]
#> [[1]]$author[[1]]$family
#> [1] "Conceição"
#> 
#> [[1]]$author[[1]]$given
#> [1] "Sérgio"
#> 
#> 
#> 
#> [[1]]$`container-title`
#> [1] "Portuguese History"
#> 
#> [[1]]$id
#> [1] "conc2021"
#> 
#> [[1]]$issue
#> [1] "1"
#> 
#> [[1]]$issued
#> [[1]]$issued$`date-parts`
#> [[1]]$issued$`date-parts`[[1]]
#> [[1]]$issued$`date-parts`[[1]][[1]]
#> [1] 2021
#> 
#> 
#> 
#> 
#> [[1]]$title
#> [1] "História da habitação"
#> 
#> [[1]]$type
#> [1] "article-journal"

This workaround applies until we fix this, and afterwards if you don't want to update your dependency to a later rmarkdown version.
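For use in a package, the workaround above could be wrapped in a small helper. This is a sketch under assumptions: `parse_csl_json_utf8()` and `bib_to_list()` are hypothetical names invented here, not functions from rmarkdown or {vitae}. It only combines the calls already shown in this comment.

```r
# Hypothetical helper: mark pandoc's JSON lines as UTF-8, then parse them
parse_csl_json_utf8 <- function(json_lines) {
  Encoding(json_lines) <- "UTF-8"
  jsonlite::fromJSON(json_lines, simplifyVector = FALSE)
}

# Hypothetical wrapper around the workaround: ask rmarkdown for JSON
# instead of a list, so the encoding can be marked before parsing
bib_to_list <- function(file) {
  json_lines <- rmarkdown::pandoc_citeproc_convert(file, type = "json")
  parse_csl_json_utf8(json_lines)
}
```

Calling `bib_to_list("test.bib")` requires rmarkdown, jsonlite and pandoc. Marking a string that is already marked as UTF-8 changes nothing, so the helper should stay harmless with an rmarkdown version that marks the output itself.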

What do we need to do in rmarkdown?

We need to mark the result with the correct encoding. I believe Pandoc always uses UTF-8 for input and output. Since Pandoc 2.11, pandoc-citeproc is no longer used, and pandoc is used directly for the conversion (as in the example above). However, I believe the same would apply to pandoc-citeproc.

I see two solutions:

Mark the output as I did using Encoding()
Write output of command to a temp file and read this file as UTF-8 using xfun::read_utf8()

I wonder whether the latter is safer, since it avoids any direct handling of encoding by R while capturing the output of the system call.
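The second solution might look roughly like the sketch below. It is an assumption-laden illustration, not rmarkdown's implementation: both helper names are hypothetical, base readLines() stands in for xfun::read_utf8(), and the pandoc flags are the ones used in the command line example earlier in this comment. The last lines write a stand-in file so that the read-back step can be run without pandoc.

```r
# Hypothetical helper: read a file's lines and mark them as UTF-8
read_lines_utf8 <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# Hypothetical helper: let pandoc write to a temporary file with -o,
# then read that file back as UTF-8 instead of capturing stdout
pandoc_to_utf8_lines <- function(input, to = "csljson", pandoc = "pandoc") {
  out <- tempfile()
  on.exit(unlink(out), add = TRUE)
  args <- c("-t", to, "-s", "-o", shQuote(out), shQuote(input))
  status <- system2(pandoc, args)
  if (status != 0) stop("pandoc exited with status ", status)
  read_lines_utf8(out)
}

# Stand-in for pandoc's output file: one line of UTF-8 encoded JSON
demo_file <- tempfile(fileext = ".json")
writeBin(charToRaw(enc2utf8("\"family\": \"Concei\u00e7\u00e3o\"\n")), demo_file)
demo_lines <- read_lines_utf8(demo_file)
Encoding(demo_lines)
```

With this approach the text never passes through the captured output of system(), so R does not have to guess the encoding of the output.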

@yihui, do you have a preference in this matter?

Thanks for opening this issue @mitchelloharawild, I was not aware of this problem!

mitchelloharawild · 2021-07-28T10:49:46Z

Thanks for figuring this out, great description and investigation. I'll look into the most appropriate fix for {vitae} with this in mind.

yihui · 2021-07-29T15:48:07Z

Mark the output as I did using Encoding()

@cderv Do you mean this?

diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..3bca22e1 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -161,6 +161,7 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
   if (type == "list") {
     jsonlite::fromJSON(result, simplifyVector = FALSE)
   } else {
+    Encoding(result) <- "UTF-8"
     result
   }
 }

That sounds simple and safe to me since you have tested it. If we want to be conservative, we can certainly use the second solution (i.e. write to a file and read it back).

cderv · 2021-07-29T15:53:44Z

Not exactly: the marking needs to be applied to the output of the call to system().

diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..36b86336 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -150,6 +150,7 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
   # run the conversion
   with_pandoc_safe_environment({
     result <- system(command, intern = TRUE)
+    Encoding(result) <- "UTF-8"
   })
   status <- attr(result, "status")
   if (!is.null(status)) {

or maybe this

diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..bb715b91 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -150,20 +150,22 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
   # run the conversion
   with_pandoc_safe_environment({
     result <- system(command, intern = TRUE)
   })
   status <- attr(result, "status")
   if (!is.null(status)) {
     cat(result, sep = "\n")
     stop("Error ", status, " occurred building shared library.")
   }

+  Encoding(result) <- "UTF-8"
+
   # convert the output if requested
   if (type == "list") {
     jsonlite::fromJSON(result, simplifyVector = FALSE)

This is because fromJSON() must be called on input strings whose encoding is already marked.

yihui · 2021-07-29T17:29:06Z

Okay. Either way seems to be fine to me.

Pandoc outputs UTF-8 content, but on systems where UTF-8 is not the default encoding (such as Windows), system() returns the result string as if it were in the native encoding. We need to mark it as UTF-8 before further processing. Fixes #2195. Another option would be to write the output to a file and read it back into R.
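To spell out what "marking" means here, the following base R sketch (an illustration with invented variable names, not part of the actual change) shows that assigning to Encoding() only changes the declared encoding of a string and leaves its bytes untouched.

```r
# UTF-8 bytes of the given name from the report, held in an unmarked string
utf8_bytes <- charToRaw(enc2utf8("S\u00e9rgio"))
unmarked <- rawToChar(utf8_bytes)
Encoding(unmarked)

# Marking changes the declared encoding only
marked <- unmarked
Encoding(marked) <- "UTF-8"
Encoding(marked)

# The underlying bytes are exactly the ones that were captured
identical(charToRaw(marked), utf8_bytes)
```

This is why marking is enough in this case: the bytes coming from pandoc are already valid UTF-8, and only the label R attaches to them is missing.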

cderv · 2021-08-18T09:20:58Z

@mitchelloharawild I pushed the fix in the dev version of rmarkdown.

This should solve your issue in vitae. Thanks for the report.

github-actions · 2022-02-15T05:13:27Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.

mitchelloharawild mentioned this issue Jul 28, 2021

Special Characters in References not rendering correctly mitchelloharawild/vitae#158

Closed

cderv added the labels bug (an unexpected problem or unintended behavior) and next (to consider for next release) Jul 28, 2021

cderv mentioned this issue Aug 18, 2021

Mark result of citeproc conversion as UTF-8 #2202

Merged

cderv closed this as completed in #2202 Aug 18, 2021

github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2022
