Can't read xls with non UTF-8 encoding #564

MeoWoo6 · 2019-04-02T06:54:07Z

I want to read xls files in loop but the read_xls seems can only read the first xls file, but when I open the second or other xls file in EXCEL2016 window(PS:NO file broken or other infomation occured), then close it without any changes.
Rerun the code below , then the read_xls can read the second xls file. I am so confused.
I have put the data here Data
My code show as below

library(readxl)
library(reprex)
setwd("C:\\Users\\len\\Desktop\\Data")
files<-list.files(recursive = T)
files<-files[grepl(".XLS",files)]
files
#> [1] "20190326.seq/33.0000.XLS" "20190327.seq/01.0000.XLS"
#> [3] "20190328.seq/04.0000.XLS" "20190329.seq/15.0000.XLS"
#> [5] "20190330.seq/09.0000.XLS" "20190331.seq/03.0000.XLS"
packageVersion("readxl")
#> [1] '1.3.1'
for(i in 1:length(files))
{
  #dat<-read_xls(files[i],sheet = 1,na = "n.a.",skip = 0,col_names = F)[1:52,]
  dat<-read_xls(files[i])
  message(paste("Read",files[i]))
  print(head(dat))
}
#> New names:
#> * Area -> Area...3
#> * Area -> Area...4
#> * Area -> Area...5
#> * Area -> Area...6
#> * Area -> Area...7
#> * ... and 13 more problems
#> Read 20190326.seq/33.0000.XLS
#> # A tibble: 6 x 20
#>   No.   Time  Area...3 Area...4 Area...5 Area...6 Area...7 Area...8
#>   <chr> <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
#> 1 <NA>  <NA>  "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~
#> 2 <NA>  <NA>  AOC      Cl       NO2      SO4      NO3      NH4     
#> 3 <NA>  <NA>  ECD_1    ECD_1    ECD_1    ECD_1    ECD_1    ECD_2   
#> 4 1     4355~ n.a.     0.11860~ 0.02600~ 0.36162~ 0.36177~ 0.54646~
#> 5 2     4355~ n.a.     0.13981~ 0.03278~ 0.36993~ 0.38382~ 0.56742~
#> 6 3     4355~ n.a.     n.a.     n.a.     0.35741~ 0.34220~ 0.56133~
#> # ... with 12 more variables: Area...9 <chr>, Area...10 <chr>,
#> #   Area...11 <chr>, Area...12 <chr>, Area...13 <chr>, Area...14 <chr>,
#> #   Area...15 <chr>, Area...16 <chr>, Area...17 <chr>, Area...18 <chr>,
#> #   Area...19 <chr>, Area...20 <chr>
#> Read 20190327.seq/01.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190328.seq/04.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190329.seq/15.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190330.seq/09.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190331.seq/03.0000.XLS
#> # A tibble: 0 x 0

rerun the code after I open the second XLS file

library(readxl)
library(reprex)
setwd("C:\\Users\\len\\Desktop\\Data")
files<-list.files(recursive = T)
files<-files[grepl(".XLS",files)]
files
#> [1] "20190326.seq/33.0000.XLS" "20190327.seq/01.0000.XLS"
#> [3] "20190328.seq/04.0000.XLS" "20190329.seq/15.0000.XLS"
#> [5] "20190330.seq/09.0000.XLS" "20190331.seq/03.0000.XLS"
packageVersion("readxl")
#> [1] '1.3.1'
for(i in 1:length(files))
{
  #dat<-read_xls(files[i],sheet = 1,na = "n.a.",skip = 0,col_names = F)[1:52,]
  dat<-read_xls(files[i])
  message(paste("Read",files[i]))
  print(head(dat))
}
#> New names:
#> * Area -> Area...3
#> * Area -> Area...4
#> * Area -> Area...5
#> * Area -> Area...6
#> * Area -> Area...7
#> * ... and 13 more problems
#> Read 20190326.seq/33.0000.XLS
#> # A tibble: 6 x 20
#>   No.   Time  Area...3 Area...4 Area...5 Area...6 Area...7 Area...8
#>   <chr> <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
#> 1 <NA>  <NA>  "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~
#> 2 <NA>  <NA>  AOC      Cl       NO2      SO4      NO3      NH4     
#> 3 <NA>  <NA>  ECD_1    ECD_1    ECD_1    ECD_1    ECD_1    ECD_2   
#> 4 1     4355~ n.a.     0.11860~ 0.02600~ 0.36162~ 0.36177~ 0.54646~
#> 5 2     4355~ n.a.     0.13981~ 0.03278~ 0.36993~ 0.38382~ 0.56742~
#> 6 3     4355~ n.a.     n.a.     n.a.     0.35741~ 0.34220~ 0.56133~
#> # ... with 12 more variables: Area...9 <chr>, Area...10 <chr>,
#> #   Area...11 <chr>, Area...12 <chr>, Area...13 <chr>, Area...14 <chr>,
#> #   Area...15 <chr>, Area...16 <chr>, Area...17 <chr>, Area...18 <chr>,
#> #   Area...19 <chr>, Area...20 <chr>
#> New names:
#> * Area -> Area...3
#> * Area -> Area...4
#> * Area -> Area...5
#> * Area -> Area...6
#> * Area -> Area...7
#> * ... and 13 more problems
#> Read 20190327.seq/01.0000.XLS
#> # A tibble: 6 x 20
#>   No.   Time  Area...3 Area...4 Area...5 Area...6 Area...7 Area...8
#>   <chr> <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
#> 1 <NA>  <NA>  "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~ "\xa6\x~
#> 2 <NA>  <NA>  AOC      Cl       NO2      SO4      NO3      NH4     
#> 3 <NA>  <NA>  ECD_1    ECD_1    ECD_1    ECD_1    ECD_1    ECD_2   
#> 4 1     4355~ n.a.     0.05765~ 0.01526~ 0.43807~ 0.37693~ 0.57607~
#> 5 2     4355~ n.a.     0.05745~ 0.01683~ 0.42131~ 0.36903~ 0.57463~
#> 6 3     4355~ n.a.     0.05793~ 0.01444~ 0.42888~ 0.38235~ 0.58119~
#> # ... with 12 more variables: Area...9 <chr>, Area...10 <chr>,
#> #   Area...11 <chr>, Area...12 <chr>, Area...13 <chr>, Area...14 <chr>,
#> #   Area...15 <chr>, Area...16 <chr>, Area...17 <chr>, Area...18 <chr>,
#> #   Area...19 <chr>, Area...20 <chr>
#> Read 20190328.seq/04.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190329.seq/15.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190330.seq/09.0000.XLS
#> # A tibble: 0 x 0
#> Read 20190331.seq/03.0000.XLS
#> # A tibble: 0 x 0

jennybc · 2019-04-02T16:02:14Z

I can reproduce what you're seeing. It has nothing to do with the loop. It's the files, which seem to have Latin-1 encoded contents. And readxl (libxls, maybe?) can't deal with it.

I assume these xls files were written by some 3rd party tool. Sometimes 3rd party tools write files that technically comply with the spec but are very unusual. Excel itself can read them but other xls-reading tools -- like readxl/liblxs -- may struggle.

Opening them in Excel normalizes the encoding problem away and subsequent attempts to read them from readxl work. I suspect you see success with your first file, then not with your next 5, because someone had already done this once with the first file.

I'll report this upstream. You could report your problems to whatever ?instrument vendor? is producing these xls files. If you have a way to get this data with a less exotic format, e.g. csv, I would take it.

jennybc · 2021-07-29T21:30:55Z

Having updated libxls in 2c20f5c, I tried these files again and now I get a segfault:

> dat <- read_xls(files[[1]])

 *** caught segfault ***
address 0x0, cause 'memory not mapped'

Traceback:
 1: read_fun(path = path, sheet_i = sheet, limits = limits, shim = shim,     col_names = col_names, col_types = col_types, na = na, trim_ws = trim_ws,     guess_max = guess_max, progress = progress)
 2: tibble::as_tibble(l, .name_repair = .name_repair)
 3: set_readxl_names(read_fun(path = path, sheet_i = sheet, limits = limits,     shim = shim, col_names = col_names, col_types = col_types,     na = na, trim_ws = trim_ws, guess_max = guess_max, progress = progress),     .name_repair = .name_repair)
 4: read_excel_(path = path, sheet = sheet, range = range, col_names = col_names,     col_types = col_types, na = na, trim_ws = trim_ws, skip = skip,     n_max = n_max, guess_max = guess_max, progress = progress,     .name_repair = .name_repair, format = "xls")
 5: read_xls(files[[1]])

Given the claim in libxls that it can read these files (libxls/libxls#55), I need to build xls2csv and see if that still holds / holds for me. Then I can figure out if the problem lies in readxl.

jennybc · 2021-07-29T21:50:43Z

I can read these files with xls2csv built from libxls v1.6.2, so the problem lies in readxl.

jennybc · 2022-03-18T17:49:11Z

Just replicated both the segfault with readxl and the success with xls2csv (no great surprise).

jennybc · 2022-03-19T03:26:33Z

When I open Data/20190326.seq/33.0000.XLS with Microsoft Excel, I see:

so I suspect this is truly a malformed file. I.e. a file written by some "creative" 3rd party tool that would never be created by Excel and that, after opening in Excel, does not retain its weirdness. In the end, I don't think this was really about encoding, but rather a malformed cell format.

However I have found a relatively harmless null pointer check that allows the file, in all of it peculiar glory, to be read without a segfault.

jennybc mentioned this issue Apr 2, 2019

Unreadable files with, I think, Latin-1 encoded contents libxls/libxls#55

Closed

jennybc changed the title ~~read xls file in loop error~~ Can't read xls with non UTF-8 encoding Apr 2, 2019

jennybc added bug an unexpected problem or unintended behavior libxls xls 👵 labels Apr 2, 2019

jennybc mentioned this issue Apr 9, 2019

Check the return value of xls_parseWorkSheet #567

Closed

jennybc mentioned this issue May 11, 2020

read_excel losing first few rows of .xls files. #611

Closed

jennybc closed this as completed in b43ab60 Mar 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't read xls with non UTF-8 encoding #564

Can't read xls with non UTF-8 encoding #564

MeoWoo6 commented Apr 2, 2019

jennybc commented Apr 2, 2019 •

edited

Loading

jennybc commented Jul 29, 2021

jennybc commented Jul 29, 2021

jennybc commented Mar 18, 2022

jennybc commented Mar 19, 2022 •

edited

Loading

Can't read xls with non UTF-8 encoding #564

Can't read xls with non UTF-8 encoding #564

Comments

MeoWoo6 commented Apr 2, 2019

jennybc commented Apr 2, 2019 • edited Loading

jennybc commented Jul 29, 2021

jennybc commented Jul 29, 2021

jennybc commented Mar 18, 2022

jennybc commented Mar 19, 2022 • edited Loading

jennybc commented Apr 2, 2019 •

edited

Loading

jennybc commented Mar 19, 2022 •

edited

Loading