Implement `r_normalise_encoding()` #1187

DavisVaughan · 2021-05-04T15:08:43Z

Pulled over from https://github.com/r-lib/vctrs/blob/master/src/translate.c

I've left in the maybe-referenced copy semantics, as that seems to work best with vctrs.

Otherwise, I've done minor tweaks to use more of the rlang lib, but nothing major.

I've pulled in all the relevant tests from vctrs, with a few testthat utilities. There are quite a few tests, but I went through each one and none of them feels redundant. I think they are required to ensure we hit all the code paths, but we could move them out of test-c-api.R if you feel they clutter it too much.

lionel- · 2021-05-05T08:46:20Z

Thanks!

I added it to the C library under the name r_obj_fix_encoding(). Do you like it? If so I'll change the C callable as well.

Merging now so I can depend on this in the unique branch.

DavisVaughan · 2021-05-05T13:53:26Z

I like the obj_ prefix, but I'd prefer something a bit more specific than "fix". "Normalise" was nice because I could document what a "normalised" CHARSXP was (NA, ASCII, or UTF-8 marked).
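
To make the "normalised" definition concrete, here is a minimal standalone sketch of the predicate. It does not use R's headers: `struct str_model`, `enum enc_mark`, `is_ascii()` and `is_normalised()` are hypothetical stand-ins for illustration, not the rlang or vctrs implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the encoding mark carried by an R string. */
enum enc_mark {
  ENC_UNKNOWN, /* no mark: native encoding, or plain ASCII */
  ENC_LATIN1,
  ENC_UTF8,
  ENC_BYTES
};

/* Hypothetical model of a CHARSXP; this is not R's real memory layout. */
struct str_model {
  const char* bytes; /* NULL models NA_character_ */
  enum enc_mark mark;
};

/* True if every byte of the NUL-terminated string is below 0x80. */
bool is_ascii(const char* bytes) {
  for (const unsigned char* p = (const unsigned char*) bytes; *p; ++p) {
    if (*p >= 0x80) {
      return false;
    }
  }
  return true;
}

/* "Normalised" in the sense of the comment above: NA, ASCII, or UTF-8 marked. */
bool is_normalised(const struct str_model* x) {
  if (x->bytes == NULL) {
    return true; /* NA never needs translation */
  }
  if (x->mark == ENC_UTF8) {
    return true; /* already marked as UTF-8 */
  }
  return is_ascii(x->bytes); /* ASCII is fine whatever the mark */
}
```

Under this model, a Latin-1 marked string containing a byte such as `0xE9` is the only kind of input in the examples that would need re-encoding.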

Based on the naming scheme we used in arg_match(), I could see:

r_normalise_encoding() -> obj_normalise_encoding()
r_attrib_normalise_encoding() -> obj_attrib_normalise_encoding()
rlang_normalise_encoding() -> rlang_obj_normalise_encoding()
r_obj_fix_encoding() -> r_obj_normalise_encoding()

lionel- · 2021-05-05T14:10:37Z

I try to avoid "normalise" and "standardise" because they are long names and have the "s" vs "z" spelling issue. I think "fixing encoding" could still be documented as producing "an object with standard encoding, meaning that ...".

DavisVaughan · 2021-05-05T14:18:25Z

Alternatively, I considered naming this obj_encode_utf8(), since that is the intention of the function. Fully ASCII strings are left with an "unknown" encoding (as seen by Encoding()), but they are always valid UTF-8, so that's probably not too confusing.
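
The point that ASCII strings are already valid UTF-8 can be illustrated with a small standalone sketch. It re-encodes Latin-1 bytes as UTF-8 and leaves ASCII bytes untouched. `latin1_to_utf8()` is a hypothetical helper written for this illustration; it is not part of rlang, vctrs, or R's C API, and the real function works on R objects rather than raw buffers.

```c
#include <stddef.h>

/*
 * Re-encode a NUL-terminated Latin-1 string as UTF-8.
 * Returns the number of bytes written (excluding the terminator),
 * or (size_t) -1 if `out` is too small.
 * Hypothetical helper for illustration only.
 */
size_t latin1_to_utf8(const char* in, char* out, size_t out_size) {
  size_t n = 0;

  for (const unsigned char* p = (const unsigned char*) in; *p; ++p) {
    if (*p < 0x80) {
      /* ASCII bytes are already valid UTF-8: copy them unchanged */
      if (n + 1 >= out_size) {
        return (size_t) -1;
      }
      out[n++] = (char) *p;
    } else {
      /* U+0080..U+00FF become two bytes: 110000xx 10xxxxxx */
      if (n + 2 >= out_size) {
        return (size_t) -1;
      }
      out[n++] = (char) (0xC0 | (*p >> 6));
      out[n++] = (char) (0x80 | (*p & 0x3F));
    }
  }

  if (n >= out_size) {
    return (size_t) -1; /* no room for the terminator */
  }
  out[n] = '\0';
  return n;
}
```

For a pure ASCII input the output is byte-for-byte identical to the input, which is why such strings can keep their "unknown" mark without any translation.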

lionel- · 2021-05-05T14:27:30Z

Right, the "unknown" thing is just an implementation detail of UTF-8 strings in R so that shouldn't have an impact on the naming scheme or documentation. I think I like r_obj_encode_utf8().

lionel- · 2024-07-02T09:56:42Z

Initial vctrs PR: r-lib/vctrs#565

DavisVaughan and others added 5 commits May 5, 2021 10:28

Add r_clone_shared()

ee9e1df

Implement r_normalise_encoding()

ca8d856

Add C callable for r_normalise_encoding()

440ad5b

Ensure R <= 3.6 uses stringsAsFactors = FALSE

fef4c27

Add r_obj_fix_encoding() to the C library

044fb22

lionel- force-pushed the feature/normalize-encoding branch from bb3c8a8 to 044fb22 Compare May 5, 2021 08:35

lionel- merged commit c19c1d6 into r-lib:master May 5, 2021

lionel- mentioned this pull request May 6, 2021

Move unique names repair to rlang #1193

Merged

lionel- mentioned this pull request Jul 2, 2024

Remove LEVELS() accessor #1726

Open
