Add tests for encoding #19

Merged 3 commits on Feb 17, 2022

dieghernan
Member

Thanks @santiagomota

@santiagomota

You're welcome

@santiagomota

santiagomota commented Feb 17, 2022

Some of the municipalities with problems (in the 2021-09-13 version):
23078 line 829761 column 49
03050 line 203847 column 49
23051 line 117301 column 48
A character (\xd1) or (\xdf) appears. When the municipality is read with st_read(), a warning is raised and everything before that character is read correctly, but nothing after it. I found this in the cadastral parcels.
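
The reported line numbers could be cross-checked with a small base R sketch like the one below. It assumes the downloaded file is meant to be UTF-8 and that the problem characters are stray bytes that are not valid UTF-8. The helper find_invalid_lines() and the temporary demo file are hypothetical and not part of CatastRo.

```r
# Hypothetical helper (not part of CatastRo): return the line numbers of a
# text file whose content is not valid UTF-8.
find_invalid_lines <- function(path) {
  # Read the lines as they are, without re-encoding them.
  lines <- readLines(path, warn = FALSE)
  which(!validUTF8(lines))
}

# Demo file: the second line carries a stray 0xd1 byte.
tmp <- tempfile(fileext = ".gml")
writeBin(
  c(charToRaw("<name>ok</name>\n<name>CA"), as.raw(0xd1), charToRaw("ADA</name>\n")),
  tmp
)

bad_lines <- find_invalid_lines(tmp)
print(bad_lines)
```

Running the helper on a real municipality file should list the lines that need attention before st_read() is called.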

@santiagomota

My solution is to read the file and replace those characters:
file_municipio_bu_temp <- readLines(file_municipio_bu, encoding = "ISO-8859-1")
file_municipio_bu_temp <- gsub("�", "", file_municipio_bu_temp)
file_municipio_bu_temp <- gsub("\xd1", "
", file_municipio_bu_temp)
file_municipio_bu_temp <- gsub("\xbf", "_", file_municipio_bu_temp)
writeLines(file_municipio_bu_temp, file_municipio_bu, useBytes = TRUE)

@dieghernan
Member Author

https://github.com/dieghernan/CatastRo/blob/4b1dc6ceb3565e06a2b67c01688fa369c6bebf7c/R/utils_read.R#L12-L24

Thanks!

@santiagomota

Test it, because you will quite possibly need to include the , useBytes = TRUE argument in the writeLines call
https://stackoverflow.com/questions/31432560/readlines-and-writelines-r
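
To illustrate the useBytes = TRUE advice: with that flag, writeLines() writes the bytes of each string as they are instead of translating them to the session's native encoding first. The sketch below assumes a UTF-8 string and a temporary file; both are illustrative and not taken from the package.

```r
# Illustrative only: check which bytes writeLines() puts on disk.
x <- enc2utf8("Espa\u00f1a")  # "España", stored as UTF-8

tmp <- tempfile()
# useBytes = TRUE: write the string's bytes untouched (no re-encoding).
writeLines(x, tmp, useBytes = TRUE)

# Inspect the first 7 bytes of the file; the ñ should be the UTF-8 pair c3 b1.
written <- readBin(tmp, what = "raw", n = 7)
print(written)
```

Without useBytes = TRUE, a session whose native encoding is not UTF-8 may re-encode the text on output, which is the situation the warning above is about.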

@codecov

codecov bot commented Feb 17, 2022

Codecov Report

Merging #19 (6a3040f) into master (9cdfb42) will increase coverage by 0.01%.
The diff coverage is 90.32%.

❗ Current head 6a3040f differs from pull request most recent head 4b1dc6c. Consider uploading reports for the commit 4b1dc6c to get more accurate results

@@            Coverage Diff             @@
##           master      #19      +/-   ##
==========================================
+ Coverage   98.22%   98.23%   +0.01%     
==========================================
  Files          18       19       +1     
  Lines         788      794       +6     
==========================================
+ Hits          774      780       +6     
  Misses         14       14              
Impacted Files    Coverage Δ
R/atom_ad_db.R    100.00% <ø> (ø)
R/atom_bu_db.R    100.00% <ø> (ø)
R/atom_cp_db.R    100.00% <ø> (ø)
R/utils_read.R    88.88% <88.88%> (ø)
R/atom_ad.R       100.00% <100.00%> (ø)
R/atom_bu.R       100.00% <100.00%> (ø)
R/atom_cp.R       100.00% <100.00%> (ø)
R/utils_wfs.R     97.72% <100.00%> (+2.53%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2d76051...4b1dc6c.

@dieghernan
Member Author

I added some tests and it works perfectly, thanks. If it causes problems later on, we will fix them then:

https://github.com/dieghernan/CatastRo/blob/4b1dc6ceb3565e06a2b67c01688fa369c6bebf7c/tests/testthat/test-catr_atom_cp.R#L25-L27

@dieghernan dieghernan merged commit 53c7bb4 into rOpenSpain:master Feb 17, 2022
@santiagomota

Great. Best regards

@santiagomota

santiagomota commented Feb 18, 2022

I keep thinking about this and found the following: https://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
On line 19 of utils_read.R it would probably be better to change
newlines <- gsub("\xd1|\xbf", "_", newlines)
to
newlines <- stringi::stri_trans_general(newlines, "latin-ascii")
In this case the erroneous values contain a "�" instead of a "_" character, but the data are read correctly.
The advantage is that it replaces not only (\xd1) or (\xdf), but all special characters.

dieghernan added a commit that referenced this pull request Feb 21, 2022
@santiagomota

santiagomota commented Mar 2, 2022

I think the stringi::stri_trans_general solution creates a problem I had not noticed: it also changes ñ, accented letters, etc.:
stringi::stri_trans_general("España á ç", "latin-ascii")
[1] "Espana a c"
I need to find a transformation that works, or go back to
newlines <- gsub("\xd1|\xbf", "_", newlines)

@santiagomota

santiagomota commented Mar 2, 2022

... This one seems to work:
stringi::stri_trans_general("España á ç", "Any-Latn")
[1] "España á ç"

although I'm not 100% sure
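
For comparison, the requirement described above (replace every invalid byte while leaving ñ, accents and ç alone) might also be met with base R's iconv() and its sub argument. This is a different technique from the ones proposed in this thread, shown only as a sketch: replace_invalid_utf8() is a hypothetical helper, it assumes the input is meant to be UTF-8, and the result can depend on the platform's iconv implementation.

```r
# Hypothetical helper (not part of CatastRo): replace every byte that is not
# valid UTF-8 with a placeholder, leaving valid characters untouched.
replace_invalid_utf8 <- function(x, placeholder = "_") {
  iconv(x, from = "UTF-8", to = "UTF-8", sub = placeholder)
}

# Valid text with ñ, á and ç.
ok <- enc2utf8("Espa\u00f1a \u00e1 \u00e7")
# "CA", a stray 0xd1 byte, then "ADA".
bad <- rawToChar(as.raw(c(0x43, 0x41, 0xd1, 0x41, 0x44, 0x41)))

print(replace_invalid_utf8(ok))
print(replace_invalid_utf8(bad))
```

Because only non-convertible bytes are substituted, this would not need a list of specific byte values such as \xd1 or \xbf, but it should be tested against the affected municipality files before being relied on.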

2 participants