Implement r_normalise_encoding() #1187

Merged
merged 5 commits into from
May 5, 2021

DavisVaughan
Member

Pulled over from https://github.com/r-lib/vctrs/blob/master/src/translate.c

I've left in the maybe-referenced copy semantics, as that seems to work best with vctrs.

Otherwise, I've done minor tweaks to use more of the rlang lib, but nothing major.

I've pulled in all the relevant tests from vctrs, with a few testthat utilities. There are quite a few tests, but I went through each one and none of them feel redundant. I think they are required to ensure we hit all the code paths, but we could move them out of test-c-api.R if you feel it clutters it too much.

lionel- force-pushed the feature/normalize-encoding branch from bb3c8a8 to 044fb22 on May 5, 2021 08:35
lionel- merged commit c19c1d6 into r-lib:master on May 5, 2021
@lionel-
Member

lionel- commented May 5, 2021

Thanks!

I added it to the C library under the name r_obj_fix_encoding(). Do you like it? If so I'll change the C callable as well.

Merging now so I can depend on this in the unique branch.

@DavisVaughan
Member Author

I like the obj_ prefix, but I'd prefer something a bit more specific than fix. Normalise was nice because I could document what a "normalised" CHARSXP was (NA, ASCII, or UTF-8 marked).
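
To make the definition of a "normalised" CHARSXP (NA, ASCII, or UTF-8 marked) concrete, here is a minimal, self-contained C sketch of that predicate. It deliberately does not use R's headers: `mock_charsxp`, `enc_mark`, `is_ascii()`, and `is_normalised()` are hypothetical names invented for illustration, not part of R, rlang, or vctrs, and the real implementation may check things differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the encoding mark stored on a CHARSXP.
   Illustrative names only; not R's or rlang's API. */
enum enc_mark { ENC_NATIVE, ENC_UTF8, ENC_LATIN1 };

/* Hypothetical stand-in for a CHARSXP: bytes, an encoding mark, an NA flag. */
struct mock_charsxp {
  const char* bytes;
  enum enc_mark mark;
  bool is_na;
};

/* True when every byte of the NUL-terminated string is 7-bit ASCII. */
bool is_ascii(const char* bytes) {
  assert(bytes != NULL);
  for (const unsigned char* p = (const unsigned char*) bytes; *p != '\0'; ++p) {
    if (*p > 0x7F) {
      return false;
    }
  }
  return true;
}

/* "Normalised" in the sense used above: NA, ASCII, or UTF-8 marked. */
bool is_normalised(const struct mock_charsxp* x) {
  assert(x != NULL);
  if (x->is_na) {
    return true;  /* NA needs no translation */
  }
  if (x->mark == ENC_UTF8) {
    return true;  /* already marked as UTF-8 */
  }
  return is_ascii(x->bytes);  /* pure ASCII is fine whatever the mark */
}
```

Under these assumptions, a Latin-1 marked string containing a non-ASCII byte is the case that would need translating; everything else passes through untouched.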

Based on the naming scheme we used in arg_match(), I could see:

  • r_normalise_encoding() -> obj_normalise_encoding()
  • r_attrib_normalise_encoding() -> obj_attrib_normalise_encoding()
  • rlang_normalise_encoding() -> rlang_obj_normalise_encoding()
  • r_obj_fix_encoding() -> r_obj_normalise_encoding()

@lionel-
Member

lionel- commented May 5, 2021

I try to avoid "normalise" and "standardise" because they are long names and have the "s" vs "z" spelling issue. I think "fixing encoding" could still be documented as producing "an object with standard encoding, meaning that ...".

@DavisVaughan
Member Author

Alternatively, I considered naming this obj_encode_utf8(), since that is the intention of the function. Fully ASCII strings are left with an "unknown" encoding (as seen by Encoding()), but they are always valid UTF-8, so that's probably not too confusing.
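
The claim that fully ASCII strings are always valid UTF-8 can be checked with a small standalone validator. This is a sketch of a general UTF-8 well-formedness check, not code from rlang or vctrs; `is_valid_utf8()` is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when bytes[0..n) is well-formed UTF-8.
   Checks lead/continuation structure and rejects overlong forms,
   surrogates, and code points above U+10FFFF. */
bool is_valid_utf8(const unsigned char* bytes, size_t n) {
  assert(bytes != NULL || n == 0);
  size_t i = 0;
  while (i < n) {
    unsigned char b = bytes[i];
    size_t extra;      /* number of continuation bytes expected */
    unsigned long cp;  /* code point being decoded */

    if (b <= 0x7F) {
      i++;             /* ASCII: a single byte, always valid */
      continue;
    } else if ((b & 0xE0) == 0xC0) {
      extra = 1; cp = b & 0x1F;
    } else if ((b & 0xF0) == 0xE0) {
      extra = 2; cp = b & 0x0F;
    } else if ((b & 0xF8) == 0xF0) {
      extra = 3; cp = b & 0x07;
    } else {
      return false;    /* stray continuation byte or invalid lead byte */
    }

    if (n - i <= extra) {
      return false;    /* sequence is truncated */
    }
    for (size_t k = 1; k <= extra; k++) {
      unsigned char c = bytes[i + k];
      if ((c & 0xC0) != 0x80) {
        return false;  /* not a continuation byte */
      }
      cp = (cp << 6) | (c & 0x3F);
    }

    if (extra == 1 && cp < 0x80) return false;      /* overlong */
    if (extra == 2 && cp < 0x800) return false;     /* overlong */
    if (extra == 3 && cp < 0x10000) return false;   /* overlong */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */
    if (cp > 0x10FFFF) return false;                /* out of range */

    i += extra + 1;
  }
  return true;
}
```

Any input made only of bytes at or below 0x7F takes the first branch on every iteration, so it can never be rejected. That is why leaving ASCII strings with the "unknown" mark is harmless.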

@lionel-
Member

lionel- commented May 5, 2021

Right, the "unknown" thing is just an implementation detail of UTF-8 strings in R so that shouldn't have an impact on the naming scheme or documentation. I think I like r_obj_encode_utf8().

@lionel-
Member

lionel- commented Jul 2, 2024

Initial vctrs PR: r-lib/vctrs#565
