Rework filepath (re-)encoding #438
I think the high-level feel of this is right to me, and it better supports the "UTF-8 everywhere" idea: you always live in a UTF-8 world except right before you hand off to a third-party library or something like fopen().
I have a small feeling we may discover more places where we need to add enc2utf8() on the R side to support this, but I feel more confident in what we are doing on the C++ side for sure.
Full analysis of file path handling in vroom. The encodings below should reflect reality with this PR (although I'm about to add some more [...]
Filepaths are handed from R to C++ and back again many, many times and pass through both base R's file path functions (like [...]
This analysis also resulted in a small empirical study of [...]
The complexity seen below is why this PR implements UTF-8 (almost) everywhere.
This guards against the scenario where the tempdir's path has non-ASCII characters in it. Presumably that could arise on, say, Windows if the user name has non-ASCII characters, since the user name is part of the path:
> tempdir()
[1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\Rtmpg30qBQ"
Every time we use a base R path-handling function, explicitly re-encode the result as UTF-8.
Skip this test in non-UTF-8 locales for now
This PR basically reverts #434, which was proving to be an untenable basis for truly fixing things on Linux in a non-UTF-8 locale. I don't really think that situation is important for real usage; I'm using it more as a way to make sure that the filepath handling works by design and not just by coincidence.
This PR embraces UTF-8 encoding of file paths, then re-encodes the path to the native encoding just prior to any call to fopen() or mio::make_mmap_source() (on a non-Windows OS). (The relevant path gymnastics for Windows were already here, mercifully.)
Review of basic facts:
[...] as_cpp(), but can also be quite invisible / implicit. It's extremely difficult to predict, prevent, or reverse this. I've decided to remove any uncertainty about how a path is currently encoded by making it UTF-8 (almost) everywhere.
This PR also addresses file writing and reading fixed width files.
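To make the "UTF-8 until the last moment" idea concrete, here is a minimal sketch of re-encoding a path right before fopen() on a POSIX system. It is not the code in this PR: reencode_from_utf8() and open_utf8_path() are hypothetical names, the native encoding is passed in by the caller rather than discovered from the locale, and the conversion uses POSIX iconv as a stand-in for whatever mechanism vroom actually relies on.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

#include <iconv.h>  // POSIX iconv; an assumption, not necessarily what vroom uses

// Hypothetical helper: convert a UTF-8 string to `to_encoding`.
// Assumes a stateless target encoding (no final flush call is made).
std::string reencode_from_utf8(const std::string& utf8, const char* to_encoding) {
  iconv_t cd = iconv_open(to_encoding, "UTF-8");
  if (cd == (iconv_t)-1) {
    throw std::runtime_error("unsupported target encoding");
  }

  // Generous output buffer: no common encoding needs more than 4 bytes per input byte.
  std::string out(utf8.size() * 4 + 4, '\0');
  char* in_ptr = const_cast<char*>(utf8.data());
  std::size_t in_left = utf8.size();
  char* out_ptr = &out[0];
  std::size_t out_left = out.size();

  std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
  iconv_close(cd);
  if (rc == (std::size_t)-1) {
    throw std::runtime_error("path cannot be represented in the target encoding");
  }

  out.resize(out.size() - out_left);
  return out;
}

// Hypothetical wrapper: the path stays UTF-8 everywhere else and is
// re-encoded only here, immediately before the call to fopen().
std::FILE* open_utf8_path(const std::string& utf8_path, const char* mode,
                          const char* native_encoding) {
  const std::string native = reencode_from_utf8(utf8_path, native_encoding);
  return std::fopen(native.c_str(), mode);
}
```

A caller would keep the UTF-8 path for all bookkeeping and error messages and call something like open_utf8_path(path, "rb", "ISO-8859-1") only at the point of opening the file. Keeping the conversion in one wrapper means there is a single place where the encoding of a path can differ from UTF-8.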
Still having to skip some tests due to r-lib/archive#75.
Sidebar: In my explorations of other solutions, I think I've seen an example where cpp11 marks a string as UTF-8 even though the bytes are actually latin1. That's a matter to nail down and report upstream in cpp11, but it's part of the motivation for the approach taken in this PR.
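One way to investigate a suspected mismatch like this is to check whether the bytes of a string that is marked UTF-8 are actually well-formed UTF-8. The sketch below is a standalone check, not part of cpp11 or of this PR; is_valid_utf8() is a hypothetical helper name. It is only a heuristic: some latin1 byte sequences happen to be valid UTF-8 as well, so a passing check does not prove the marking is right, but a failing check does prove it is wrong.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong
// encodings, surrogate code points, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;

  while (i < n) {
    const unsigned char c = p[i];
    std::size_t len = 0;      // total bytes in this sequence
    unsigned long cp = 0;     // decoded code point
    unsigned long min_cp = 0; // smallest code point allowed for this length

    if (c < 0x80) {
      ++i;  // plain ASCII byte
      continue;
    } else if ((c & 0xE0) == 0xC0) {
      len = 2; cp = c & 0x1F; min_cp = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
      len = 3; cp = c & 0x0F; min_cp = 0x800;
    } else if ((c & 0xF8) == 0xF0) {
      len = 4; cp = c & 0x07; min_cp = 0x10000;
    } else {
      return false;  // continuation byte or invalid lead byte
    }

    if (i + len > n) return false;  // sequence runs past the end

    for (std::size_t k = 1; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }

    if (cp < min_cp) return false;                  // overlong encoding
    if (cp > 0x10FFFF) return false;                // outside Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate

    i += len;
  }
  return true;
}
```

For example, the latin1 bytes for "café" end in the single byte 0xE9, which UTF-8 treats as the start of a three-byte sequence, so the check fails for that string even if it carries a UTF-8 mark.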