Byte order marks #1
Comments
Yeah, that is tricky. UTF-8 of course does not need a BOM because it is byte order independent. Some tools, however, use one anyway.

It is a really tough question what to do with it in R, because R does not need it; in fact it messes up all R functions:

❯ x <- paste0("\xef\xbb\xbfword ", "\u30de")
❯ Encoding(x)
[1] "UTF-8"
❯ x
[1] "word マ"
❯ nchar(x)
[1] 7
❯ substr(x, 1, 4)
[1] "wor"
❯ grepl("^word", x)
[1] FALSE

Why pasting strings with [...]

So yes, ideally you would remove the BOM when manipulating the strings in R. On the other hand, if you are downloading a file from Google Drive that you will use in some (MS) tool later, then you'd want to keep it, otherwise that tool might not be able to read the file.

I am not sure what the right solution is here. I am afraid that if you want to handle all use cases, then you'd need to make BOM handling explicit when downloading text files from Google Drive, e.g. have an option and/or function argument for it. Maybe the default of the option could be to remove it and mark the string as UTF-8.
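For illustration, removing the BOM from strings that are already in R could look roughly like the sketch below. `strip_bom()` is a hypothetical helper name, not an existing function, and the sketch assumes the input is valid UTF-8 (either marked as UTF-8 or read in a UTF-8 locale).

```r
# Hypothetical helper (not from any package): drop one leading BOM
# (U+FEFF, i.e. bytes EF BB BF in UTF-8) from each element of a character vector.
strip_bom <- function(x) {
  sub("^\ufeff", "", x)
}

x <- "\ufeffword \u30de"   # same content as the example above, BOM included
y <- strip_bom(x)

nchar(x) - nchar(y)        # difference in length after removing the BOM
grepl("^word", y)          # check whether the anchor matches now
```

Strings without a BOM pass through unchanged, so the helper should be safe to apply unconditionally.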
I think the suggestion for the R Encoding FAQ, then, is just to create awareness of the potential for these marks to exist. When two strings look the same but clearly are not the same, as usual, [...]
FWIW readr / vroom have code to skip the byte order marks at https://github.com/r-lib/vroom/blob/b3ba15212978253174c9f99f1098799cca9a6f74/src/utils.h#L215-L266, since they are pretty common in CSVs created using Microsoft programs.
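Base R can also skip the mark at read time: connections accept the encoding name "UTF-8-BOM", which drops a leading BOM if one is present. A minimal sketch, where the temporary file and its contents are made up for illustration:

```r
# Write a tiny CSV that starts with a UTF-8 BOM (bytes EF BB BF).
path <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xef, 0xbb, 0xbf))
writeBin(c(bom, charToRaw("a,b\n1,2\n")), path)

# "UTF-8-BOM" asks R to discard the BOM while reading.
df <- read.csv(path, fileEncoding = "UTF-8-BOM")
names(df)

unlink(path)
```

Without the `fileEncoding` argument, the BOM typically ends up glued to the first column name.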
I recently got to spend some quality time with my best friend charToRaw(), courtesy of a byte order mark 😬

I was doing a round trip like so:

local plain text file --> upload to Google Drive & convert to a Google Doc --> export from Google Drive as text/plain --> read into memory in R --> parse back to character vector

While developing a test I see:
And thus I found the BOM on the text returning from the round trip.
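A rough sketch of that kind of check is below. `has_bom()` is a hypothetical helper, not part of any package, and it assumes no `NA` elements.

```r
# Hypothetical helper: TRUE for each string whose raw bytes start with the
# UTF-8 BOM (EF BB BF).
has_bom <- function(x) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  vapply(x, function(s) {
    b <- charToRaw(s)
    length(b) >= 3 && identical(b[1:3], bom)
  }, logical(1), USE.NAMES = FALSE)
}

original  <- "abc"
roundtrip <- "\ufeffabc"   # prints like "abc", but carries a BOM

charToRaw(original)        # raw bytes of the plain string
charToRaw(roundtrip)       # raw bytes including the leading BOM
has_bom(c(original, roundtrip))
```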
Do you have anything to say about ... when you're likely to encounter BOMs? Should you get rid of them? If so, how? Or can you compare two strings in a way that ignores them?
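On the last question, one possible approach is to normalise both sides before comparing. This is only a sketch: `equal_ignoring_bom()` is a made-up name, and it assumes valid UTF-8 input.

```r
# Hypothetical helper: compare two character vectors element-wise after
# dropping a leading BOM (U+FEFF) from each side.
equal_ignoring_bom <- function(a, b) {
  drop_bom <- function(s) sub("^\ufeff", "", s)
  drop_bom(a) == drop_bom(b)
}

sent     <- "abc"
received <- "\ufeffabc"              # what came back from the round trip

sent == received                     # plain comparison sees the BOM
equal_ignoring_bom(sent, received)   # comparison with the BOM stripped
```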