Encoding problems in pandoc_citeproc_convert() with Windows #2195
Comments
Hi @mitchelloharawild! Thanks for opening this issue. Here are some notes on my investigation.
Does it come from Pandoc?
The first thing I did was check whether this comes from Pandoc itself. Writing a test file:
bib <- c(
"@article{conc2021,",
" title={História da Habitação},",
" author={Conceição, Sérgio},",
" journal={Portuguese History},",
" number={1},",
" year={2021}",
"}"
)
xfun::write_utf8(bib, "test.bib")
Using pandoc directly from the command line in a terminal:
pandoc -t markdown -s -o test.md .\test.bib
pandoc -t csljson -s -o test.json .\test.bib
Trying to read the files from R as UTF-8:
xfun::read_utf8("test.md")
#> [1] "---" "nocite: \"[@*]\""
#> [3] "references:" "- author:"
#> [5] " - family: Conceição" " given: Sérgio"
#> [7] " container-title: Portuguese History" " id: conc2021"
#> [9] " issue: 1" " issued: 2021"
#> [11] " title: História da habitação" " type: article-journal"
#> [13] "---" ""
xfun::read_utf8("test.json")
#> [1] "["
#> [2] " {"
#> [3] " \"author\": ["
#> [4] " {"
#> [5] " \"family\": \"Conceição\","
#> [6] " \"given\": \"Sérgio\""
#> [7] " }"
#> [8] " ],"
#> [9] " \"container-title\": \"Portuguese History\","
#> [10] " \"id\": \"conc2021\","
#> [11] " \"issue\": \"1\","
#> [12] " \"issued\": {"
#> [13] " \"date-parts\": ["
#> [14] " ["
#> [15] " 2021"
#> [16] " ]"
#> [17] " ]"
#> [18] " },"
#> [19] " \"title\": \"História da habitação\","
#> [20] " \"type\": \"article-journal\""
#> [21] " }"
#> [22] "]"
This works OK.
What happens with R then?
In the R function, we don't write to a file. We capture the output from a system call (lines 150 to 153 in 0af6b35).
I think the cause is here: on Windows, UTF-8 is not the default encoding, and I think this prints incorrectly because the captured string is not marked as UTF-8.
rmarkdown:::with_pandoc_safe_environment({
result <- system("pandoc -t csljson -s test.bib", intern = TRUE)
})
# incorrect result
result
#> [1] "["
#> [2] " {"
#> [3] " \"author\": ["
#> [4] " {"
#> [5] " \"family\": \"Conceição\","
#> [6] " \"given\": \"Sérgio\""
#> [7] " }"
#> [8] " ],"
#> [9] " \"container-title\": \"Portuguese History\","
#> [10] " \"id\": \"conc2021\","
#> [11] " \"issue\": \"1\","
#> [12] " \"issued\": {"
#> [13] " \"date-parts\": ["
#> [14] " ["
#> [15] " 2021"
#> [16] " ]"
#> [17] " ]"
#> [18] " },"
#> [19] " \"title\": \"História da habitação\","
#> [20] " \"type\": \"article-journal\""
#> [21] " }"
#> [22] "]"
# the result is not marked as UTF-8, which is what pandoc outputs, I believe
Encoding(result)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
#> [9] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
#> [17] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
# Mark it as UTF-8
Encoding(result) <- 'UTF-8'
# it works OK!
result
#> [1] "["
#> [2] " {"
#> [3] " \"author\": ["
#> [4] " {"
#> [5] " \"family\": \"Conceição\","
#> [6] " \"given\": \"Sérgio\""
#> [7] " }"
#> [8] " ],"
#> [9] " \"container-title\": \"Portuguese History\","
#> [10] " \"id\": \"conc2021\","
#> [11] " \"issue\": \"1\","
#> [12] " \"issued\": {"
#> [13] " \"date-parts\": ["
#> [14] " ["
#> [15] " 2021"
#> [16] " ]"
#> [17] " ]"
#> [18] " },"
#> [19] " \"title\": \"História da habitação\","
#> [20] " \"type\": \"article-journal\""
#> [21] " }"
#> [22] "]"
So as the string gets the incorrect encoding,
Workaround for you?
Basically, the current workaround for you would be to convert to JSON, mark the result as UTF-8, and convert it to a list yourself.
res <- rmarkdown::pandoc_citeproc_convert("test.bib", type = "json")
res
#> [1] "["
#> [2] " {"
#> [3] " \"author\": ["
#> [4] " {"
#> [5] " \"family\": \"Conceição\","
#> [6] " \"given\": \"Sérgio\""
#> [7] " }"
#> [8] " ],"
#> [9] " \"container-title\": \"Portuguese History\","
#> [10] " \"id\": \"conc2021\","
#> [11] " \"issue\": \"1\","
#> [12] " \"issued\": {"
#> [13] " \"date-parts\": ["
#> [14] " ["
#> [15] " 2021"
#> [16] " ]"
#> [17] " ]"
#> [18] " },"
#> [19] " \"title\": \"História da habitação\","
#> [20] " \"type\": \"article-journal\""
#> [21] " }"
#> [22] "]"
Encoding(res) <- "UTF-8"
res
#> [1] "["
#> [2] " {"
#> [3] " \"author\": ["
#> [4] " {"
#> [5] " \"family\": \"Conceição\","
#> [6] " \"given\": \"Sérgio\""
#> [7] " }"
#> [8] " ],"
#> [9] " \"container-title\": \"Portuguese History\","
#> [10] " \"id\": \"conc2021\","
#> [11] " \"issue\": \"1\","
#> [12] " \"issued\": {"
#> [13] " \"date-parts\": ["
#> [14] " ["
#> [15] " 2021"
#> [16] " ]"
#> [17] " ]"
#> [18] " },"
#> [19] " \"title\": \"História da habitação\","
#> [20] " \"type\": \"article-journal\""
#> [21] " }"
#> [22] "]"
jsonlite::fromJSON(res, simplifyVector = FALSE)
#> [[1]]
#> [[1]]$author
#> [[1]]$author[[1]]
#> [[1]]$author[[1]]$family
#> [1] "Conceição"
#>
#> [[1]]$author[[1]]$given
#> [1] "Sérgio"
#>
#>
#>
#> [[1]]$`container-title`
#> [1] "Portuguese History"
#>
#> [[1]]$id
#> [1] "conc2021"
#>
#> [[1]]$issue
#> [1] "1"
#>
#> [[1]]$issued
#> [[1]]$issued$`date-parts`
#> [[1]]$issued$`date-parts`[[1]]
#> [[1]]$issued$`date-parts`[[1]][[1]]
#> [1] 2021
#>
#>
#>
#>
#> [[1]]$title
#> [1] "História da habitação"
#>
#> [[1]]$type
#> [1] "article-journal"
This is useful while we work on the fix, and if you don't want to update your dependency to a later rmarkdown version.
What do we need to do in rmarkdown?
We need to mark the result with the correct encoding. I believe Pandoc always uses UTF-8 for input and output. Since Pandoc 2.11+, pandoc-citeproc is no longer used, and I see two solutions:
I wonder whether the latter is safer, since it avoids any direct handling of encoding in R while capturing the output of the system call. @yihui, do you have a preference in this matter?
Thanks for opening this issue @mitchelloharawild, I was not aware of this problem! |
Thanks for figuring this out, great description and investigation. I'll look into the most appropriate fix for |
@cderv Do you mean this?
diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..3bca22e1 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -161,6 +161,7 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
if (type == "list") {
jsonlite::fromJSON(result, simplifyVector = FALSE)
} else {
+ Encoding(result) <- "UTF-8"
result
}
}
That sounds simple and safe to me since you have tested it. If we want to be conservative, we can certainly use the second solution (i.e. write to a file and read it back). |
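For illustration, the second solution (write to a file and read it back) could look roughly like the sketch below. This is not the rmarkdown implementation: `read_back_utf8()` is a hypothetical helper, and the raw bytes are an assumed stand-in for what pandoc would write with `-o`, so the example runs without pandoc installed.

```r
# Hypothetical helper (not part of rmarkdown): read a file known to
# contain UTF-8 text and mark the lines as UTF-8, whatever the native
# encoding of the R session is.
read_back_utf8 <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# Assumed stand-in for pandoc's output file: the UTF-8 bytes of the line
#   "family": "Concei\u00e7\u00e3o"
bytes <- c(
  charToRaw('"family": "Concei'),
  as.raw(c(0xc3, 0xa7, 0xc3, 0xa3)),  # c-cedilla and a-tilde in UTF-8
  charToRaw('o"')
)
out <- tempfile(fileext = ".json")
writeBin(bytes, out)

# The strings come back already marked, so no Encoding<- call is needed
res <- read_back_utf8(out)
Encoding(res)
unlink(out)
```

The trade-off is an extra temporary file per call in exchange for never touching the encoding of a captured string directly.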
Not exactly, it would need to be the output resulting from the call to
diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..36b86336 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -150,6 +150,7 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
# run the conversion
with_pandoc_safe_environment({
result <- system(command, intern = TRUE)
+ Encoding(result) <- "UTF-8"
})
status <- attr(result, "status")
 if (!is.null(status)) {
or maybe this
diff --git a/R/pandoc.R b/R/pandoc.R
index bacb4a18..bb715b91 100644
--- a/R/pandoc.R
+++ b/R/pandoc.R
@@ -150,20 +150,22 @@ pandoc_citeproc_convert <- function(file, type = c("list", "json", "yaml")) {
# run the conversion
with_pandoc_safe_environment({
result <- system(command, intern = TRUE)
})
status <- attr(result, "status")
if (!is.null(status)) {
cat(result, sep = "\n")
stop("Error ", status, " occurred building shared library.")
}
+ Encoding(result) <- "UTF-8"
+
# convert the output if requested
if (type == "list") {
 jsonlite::fromJSON(result, simplifyVector = FALSE)
This is because the call to |
Okay. Either way seems to be fine to me. |
Pandoc outputs UTF-8 content, but on systems where UTF-8 is not the default encoding (like Windows), system() returns the result string as if it were in the native encoding. We need to mark it as UTF-8 before further processing. Fixes #2195. Another option would be to convert to a file and read it back into R.
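A minimal sketch of what "marking" means here, assuming the captured bytes really are UTF-8: `Encoding<-` does not convert anything, it only declares how the existing bytes should be interpreted. The raw vector below is an assumed stand-in for one line captured by `system(intern = TRUE)`, so the example runs without pandoc.

```r
# UTF-8 bytes of "S\u00e9rgio" (the e-acute is the two bytes c3 a9),
# standing in for a line of pandoc output captured by system().
captured <- rawToChar(as.raw(c(0x53, 0xc3, 0xa9, 0x72, 0x67, 0x69, 0x6f)))

# Declare the bytes to be UTF-8; the bytes themselves are left unchanged
Encoding(captured) <- "UTF-8"
Encoding(captured)

# Seven bytes, but six characters once R knows the string is UTF-8
nchar(captured, type = "bytes")
nchar(captured, type = "chars")
```

If the bytes were not actually UTF-8, the mark would be wrong and the text would display incorrectly, which is why this relies on pandoc always emitting UTF-8.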
@mitchelloharawild I pushed the fix in the dev version of rmarkdown. This should solve your issue in vitae. Thanks for the report. |
This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary. |
A couple of issues have been raised in {vitae} about encoding issues for bibliographies on Windows (mitchelloharawild/vitae#167, mitchelloharawild/vitae#158). So far I've narrowed it down to rmarkdown::pandoc_citeproc_convert(), and I'm raising an issue here in the hopes that you have more experience in using pandoc with Windows encoding related issues.
MRE:
Created on 2021-07-26 by the reprex package (v2.0.0)
Session info
Checklist
When filing a bug report, please check the boxes below to confirm that you have provided us with the information we need. Have you:
formatted your issue so it is easier for us to read?
included a minimal, self-contained, and reproducible example?
pasted the output from xfun::session_info('rmarkdown') in your issue?
upgraded all your packages to their latest versions (including your versions of R, the RStudio IDE, and relevant R packages)?
installed and tested your bug with the development version of the rmarkdown package using remotes::install_github("rstudio/rmarkdown")?