Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly Require / Convert UTF-8 #113

Closed
spgarbet opened this issue Feb 14, 2022 · 7 comments
Closed

Properly Require / Convert UTF-8 #113

spgarbet opened this issue Feb 14, 2022 · 7 comments

Comments

@spgarbet
Copy link
Member

spgarbet commented Feb 14, 2022

Problem Statement

Ideas

  • Change documentation to reflect that UTF-8 strings are required as input.
  • Automatically convert anything that isn't utf8 or unknown string type.
  • Possibly add warning for unknown string type (however, this being the default could be more noisy than useful). Leaning against at present.
  • Make UTF-8 autoconversion an option that defaults to present behavior.
@spgarbet
Copy link
Member Author

Okay this is seriously messed up.

> x <- "\xab"
> cat(x)
�
> Encoding(x)
[1] "unknown"
> Encoding(x) <- "latin1"
> cat(x)
«
> Encoding(x)
[1] "latin1"
> x <- "x"
> cat(x)
x
> Encoding(x)
[1] "unknown"
> Encoding(x) <- "latin1"
> cat(x)
x
> Encoding(x)
[1] "unknown"

@spgarbet
Copy link
Member Author

@kevinushey said this in his pull request:

On the C side, you can use Rf_translateCharUTF8() to convert a SEXP into a UTF-8 string. Or, if necessary, you can manually use Rf_eval() and construct the appropriate R call to enc2utf8() from the C side, but that's a bit clunkier.

@spgarbet
Copy link
Member Author

@spgarbet
Copy link
Member Author

I think the utf8 branch has a working solution. It doesn't crash when encountering odd characters. It doesn't always convert them cleanly, but the results are more predictable. ec838db

@spgarbet
Copy link
Member Author

Need to do some performance evaluations and valgrind tests next.

@spgarbet
Copy link
Member Author

Passes val-grind tests.

@spgarbet
Copy link
Member Author

This passes the tests in data.table which tests yaml functionality as well.

You just wait, I'll sin till I blow up!
		-- Dylan Thomas
garbetsp@Hubble:~/Projects/cran/data.table$ make test
R -e 'require(data.table); test.data.table()'

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> require(data.table); test.data.table()
Loading required package: data.table
getDTthreads(verbose=TRUE):
  OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            8
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          8
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 4 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: /usr/local/lib/R/site-library/data.table/tests/tests.Rraw.bz2 

Tue Feb 22 16:18:55 2022  endian==little, sizeof(long double)==16, longdouble.digits==64, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/Chicago', Sys.getlocale()=='LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201511; omp_get_num_procs()==8; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==8; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 4 threads with throttle==1024. See ?setDTthreads.', zlibVersion()==1.2.11 ZLIB_VERSION==1.2.11
10 longest running tests took 15s (40% of 37s)
      ID  time nTest
 1: 2155 2.844     5
 2: 1438 1.885   738
 3: 1888 1.759     9
 4: 1648 1.461    91
 5: 1848 1.449     2
 6: 1650 1.363    91
 7: 1652 1.309    91
 8: 1437 1.124    36
 9: 1912 1.037     2
10: 1223 1.012   728
All 10038 tests (last 2163) in tests/tests.Rraw.bz2 completed ok in 38.0s elapsed (48.4s cpu)
> 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant