Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't read xls with non UTF-8 encoding #564

Closed
MeoWoo6 opened this issue Apr 2, 2019 · 5 comments
Closed

Can't read xls with non UTF-8 encoding #564

MeoWoo6 opened this issue Apr 2, 2019 · 5 comments
Labels
bug an unexpected problem or unintended behavior libxls xls 👵

Comments

@MeoWoo6
Copy link

MeoWoo6 commented Apr 2, 2019

I want to read xls files in loop but the read_xls seems can only read the first xls file, but when I open the second or other xls file in EXCEL2016 window(PS:NO file broken or other infomation occured), then close it without any changes.
Rerun the code below , then the read_xls can read the second xls file. I am so confused.
I have put the data here Data
My code show as below

library(readxl)
library(reprex)
setwd("C:\\Users\\len\\Desktop\\Data")
files<-list.files(recursive = T)
files<-files[grepl(".XLS",files)]
files
#> [1] "20190326.seq/33.0000.XLS" "20190327.seq/01.0000.XLS"
#> [3] "20190328.seq/04.0000.XLS" "20190329.seq/15.0000.XLS"
#> [5] "20190330.seq/09.0000.XLS" "20190331.seq/03.0000.XLS"
packageVersion("readxl")
#> [1] '1.3.1'
for(i in 1:length(files))
{
  #dat<-read_xls(files[i],sheet = 1,na = "n.a.",skip = 0,col_names = F)[1:52,]
  dat<-read_xls(files[i])
  message(paste("Read",files[i]))
  print(head(dat))
}
#> New names:
#> * Area -> Area...3
#> * Area -> Area...4
#> * Area -> Area...5
#> * Area -> Area...6
#> * Area -> Area...7
#> * ... and 13 more problems
#> Read 20190326.seq/33.0000.XLS
#> # A tibble: 6 x 20
#>   No.   Time  Area...3 Area...4 Area...5 Area...6 Area...7 Area...8
#>   <chr> <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
#> 1 <NA>  <NA>  "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~
#> 2 <NA>  <NA>  AOC      Cl       NO2      SO4      NO3      NH4     
#> 3 <NA>  <NA>  ECD_1    ECD_1    ECD_1    ECD_1    ECD_1    ECD_2   
#> 4 1     4355~ n.a.     0.11860~ 0.02600~ 0.36162~ 0.36177~ 0.54646~
#> 5 2     4355~ n.a.     0.13981~ 0.03278~ 0.36993~ 0.38382~ 0.56742~
#> 6 3     4355~ n.a.     n.a.     n.a.     0.35741~ 0.34220~ 0.56133~
#> # ... with 12 more variables: Area...9 <chr>, Area...10 <chr>,
#> #   Area...11 <chr>, Area...12 <chr>, Area...13 <chr>, Area...14 <chr>,
#> #   Area...15 <chr>, Area...16 <chr>, Area...17 <chr>, Area...18 <chr>,
#> #   Area...19 <chr>, Area...20 <chr>
#> Read 20190327.seq/01.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190328.seq/04.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190329.seq/15.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190330.seq/09.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190331.seq/03.0000.XLS
#> # A tibble: 0 x 0

rerun the code after I open the second XLS file

library(readxl)
library(reprex)
setwd("C:\\Users\\len\\Desktop\\Data")
files<-list.files(recursive = T)
files<-files[grepl(".XLS",files)]
files
#> [1] "20190326.seq/33.0000.XLS" "20190327.seq/01.0000.XLS"
#> [3] "20190328.seq/04.0000.XLS" "20190329.seq/15.0000.XLS"
#> [5] "20190330.seq/09.0000.XLS" "20190331.seq/03.0000.XLS"
packageVersion("readxl")
#> [1] '1.3.1'
for(i in 1:length(files))
{
  #dat<-read_xls(files[i],sheet = 1,na = "n.a.",skip = 0,col_names = F)[1:52,]
  dat<-read_xls(files[i])
  message(paste("Read",files[i]))
  print(head(dat))
}
#> New names:
#> * Area -> Area...3
#> * Area -> Area...4
#> * Area -> Area...5
#> * Area -> Area...6
#> * Area -> Area...7
#> * ... and 13 more problems
#> Read 20190326.seq/33.0000.XLS
#> # A tibble: 6 x 20
#>   No.   Time  Area...3 Area...4 Area...5 Area...6 Area...7 Area...8
#>   <chr> <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
#> 1 <NA>  <NA>  "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~
#> 2 <NA>  <NA>  AOC      Cl       NO2      SO4      NO3      NH4     
#> 3 <NA>  <NA>  ECD_1    ECD_1    ECD_1    ECD_1    ECD_1    ECD_2   
#> 4 1     4355~ n.a.     0.11860~ 0.02600~ 0.36162~ 0.36177~ 0.54646~
#> 5 2     4355~ n.a.     0.13981~ 0.03278~ 0.36993~ 0.38382~ 0.56742~
#> 6 3     4355~ n.a.     n.a.     n.a.     0.35741~ 0.34220~ 0.56133~
#> # ... with 12 more variables: Area...9 <chr>, Area...10 <chr>,
#> #   Area...11 <chr>, Area...12 <chr>, Area...13 <chr>, Area...14 <chr>,
#> #   Area...15 <chr>, Area...16 <chr>, Area...17 <chr>, Area...18 <chr>,
#> #   Area...19 <chr>, Area...20 <chr>
#> New names:
#> * Area -> Area...3
#> * Area -> Area...4
#> * Area -> Area...5
#> * Area -> Area...6
#> * Area -> Area...7
#> * ... and 13 more problems
#> Read 20190327.seq/01.0000.XLS
#> # A tibble: 6 x 20
#>   No.   Time  Area...3 Area...4 Area...5 Area...6 Area...7 Area...8
#>   <chr> <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
#> 1 <NA>  <NA>  "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~
#> 2 <NA>  <NA>  AOC      Cl       NO2      SO4      NO3      NH4     
#> 3 <NA>  <NA>  ECD_1    ECD_1    ECD_1    ECD_1    ECD_1    ECD_2   
#> 4 1     4355~ n.a.     0.05765~ 0.01526~ 0.43807~ 0.37693~ 0.57607~
#> 5 2     4355~ n.a.     0.05745~ 0.01683~ 0.42131~ 0.36903~ 0.57463~
#> 6 3     4355~ n.a.     0.05793~ 0.01444~ 0.42888~ 0.38235~ 0.58119~
#> # ... with 12 more variables: Area...9 <chr>, Area...10 <chr>,
#> #   Area...11 <chr>, Area...12 <chr>, Area...13 <chr>, Area...14 <chr>,
#> #   Area...15 <chr>, Area...16 <chr>, Area...17 <chr>, Area...18 <chr>,
#> #   Area...19 <chr>, Area...20 <chr>
#> Read 20190328.seq/04.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190329.seq/15.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190330.seq/09.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190331.seq/03.0000.XLS
#> # A tibble: 0 x 0
@jennybc
Copy link
Member

jennybc commented Apr 2, 2019

I can reproduce what you're seeing. It has nothing to do with the loop. It's the files, which seem to have Latin-1 encoded contents. And readxl (libxls, maybe?) can't deal with it.

I assume these xls files were written by some 3rd party tool. Sometimes 3rd party tools write files that technically comply with the spec but are very unusual. Excel itself can read them but other xls-reading tools -- like readxl/liblxs -- may struggle.

Opening them in Excel normalizes the encoding problem away and subsequent attempts to read them from readxl work. I suspect you see success with your first file, then not with your next 5, because someone had already done this once with the first file.

I'll report this upstream. You could report your problems to whatever ?instrument vendor? is producing these xls files. If you have a way to get this data with a less exotic format, e.g. csv, I would take it.

@jennybc jennybc changed the title read xls file in loop error Can't read xls with non UTF-8 encoding Apr 2, 2019
@jennybc jennybc added bug an unexpected problem or unintended behavior libxls xls 👵 labels Apr 2, 2019
@jennybc
Copy link
Member

jennybc commented Jul 29, 2021

Having updated libxls in 2c20f5c, I tried these files again and now I get a segfault:

> dat <- read_xls(files[[1]])

 *** caught segfault ***
address 0x0, cause 'memory not mapped'

Traceback:
 1: read_fun(path = path, sheet_i = sheet, limits = limits, shim = shim,     col_names = col_names, col_types = col_types, na = na, trim_ws = trim_ws,     guess_max = guess_max, progress = progress)
 2: tibble::as_tibble(l, .name_repair = .name_repair)
 3: set_readxl_names(read_fun(path = path, sheet_i = sheet, limits = limits,     shim = shim, col_names = col_names, col_types = col_types,     na = na, trim_ws = trim_ws, guess_max = guess_max, progress = progress),     .name_repair = .name_repair)
 4: read_excel_(path = path, sheet = sheet, range = range, col_names = col_names,     col_types = col_types, na = na, trim_ws = trim_ws, skip = skip,     n_max = n_max, guess_max = guess_max, progress = progress,     .name_repair = .name_repair, format = "xls")
 5: read_xls(files[[1]])

Given the claim in libxls that it can read these files (libxls/libxls#55), I need to build xls2csv and see if that still holds / holds for me. Then I can figure out if the problem lies in readxl.

@jennybc
Copy link
Member

jennybc commented Jul 29, 2021

I can read these files with xls2csv built from libxls v1.6.2, so the problem lies in readxl.

@jennybc
Copy link
Member

jennybc commented Mar 18, 2022

Just replicated both the segfault with readxl and the success with xls2csv (no great surprise).

@jennybc
Copy link
Member

jennybc commented Mar 19, 2022

When I open Data/20190326.seq/33.0000.XLS with Microsoft Excel, I see:

Screen Shot 2022-03-18 at 8 17 42 PM

so I suspect this is truly a malformed file. I.e. a file written by some "creative" 3rd party tool that would never be created by Excel and that, after opening in Excel, does not retain its weirdness. In the end, I don't think this was really about encoding, but rather a malformed cell format.

However I have found a relatively harmless null pointer check that allows the file, in all of it peculiar glory, to be read without a segfault.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior libxls xls 👵
Projects
None yet
Development

No branches or pull requests

2 participants