readr


The goal of readr is to provide a fast and friendly way to read tabular data into R. The most important functions are:

  • Read delimited files: read_delim(), read_csv(), read_tsv(), read_csv2().
  • Read fixed width files: read_fwf(), read_table().
  • Read lines: read_lines().
  • Read whole file: read_file().
  • Re-parse existing data frame: type_convert().

Installation

readr is now available from CRAN.

install.packages("readr")

You can try out the dev version with:

# install.packages("devtools")
devtools::install_github("hadley/readr")

Usage

library(readr)
library(dplyr)

mtcars_path <- tempfile(fileext = ".csv")
write_csv(mtcars, mtcars_path)

# Read a csv file into a data frame
read_csv(mtcars_path)
# Read lines into a vector
read_lines(mtcars_path)
# Read whole file into a single string
read_file(mtcars_path)

See vignette("column-types") on how readr parses columns, and how you can override the defaults.
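As a minimal sketch of overriding the defaults, the column types can be spelled out with cols() instead of being guessed. The file contents and column names here (id, score, n) are made up for illustration.

```r
library(readr)

# Hypothetical example data, written to a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("id,score,n", "001,3.5,10", "002,4.0,12"), path)

# Spell out the type of every column rather than relying on guessing.
df <- read_csv(path, col_types = cols(
  id = col_character(),  # keep identifiers as text, leading zeros intact
  score = col_double(),
  n = col_integer()
))

df$id
```

Supplying col_types also makes a script more robust: a change in the input data then shows up as a parsing problem rather than as a silently different column type.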

Output

read_csv() produces a data frame with the following properties:

  • Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE).

  • Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE).

  • The data frame is given class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you'll get an enhanced display.

  • Row names are never set.
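These properties can be checked directly. The sketch below uses a made-up file whose column names are not valid R identifiers; the names and values are assumptions chosen for illustration.

```r
library(readr)

# Hypothetical example data with awkward column names.
path <- tempfile(fileext = ".csv")
writeLines(c("first name,2015 total", "Ann,10", "Bob,12"), path)

# "ci" is the compact form of: character column, then integer column.
df <- read_csv(path, col_types = "ci")

names(df)                         # names are kept verbatim, spaces included
is.character(df[["first name"]])  # strings stay character, not factor
inherits(df, "tbl_df")            # the result carries the tbl_df class
```

Because the names are not valid identifiers, refer to such columns with df[["first name"]] or backticks rather than df$first.name.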

Problems

If there are any problems parsing the file, the read_*() function issues a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:

df <- read_csv(col_types = "dd", col_names = c("x", "y"), skip = 1, "
1,2
a,b
")
#> Warning message: There were 2 problems. See problems(x) for more details
problems(df)
#>   row col expected actual
#> 1   2   1 a double      a
#> 2   2   2 a double      b

It's likely that there will be cases that you can never load without some manual regexp-based munging in R. Load those columns with col_character(), fix them up as needed, then use type_convert() to re-run the automated conversion on every character column in the data frame. Alternatively, you can use parse_integer(), parse_numeric(), parse_date() etc. to parse a single character vector at a time.
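The workflow above might look like the following sketch. The file, its columns and the currency-symbol clean-up are hypothetical; only the read, fix, re-convert pattern comes from the text.

```r
library(readr)

# Hypothetical example data: prices carry a currency symbol.
path <- tempfile(fileext = ".csv")
writeLines(c("item,price", "a,$1.50", "b,$20.00"), path)

# Step 1: load both columns as character ("cc") so nothing fails to parse.
raw <- read_csv(path, col_types = "cc")

# Step 2: manual clean-up in R, here stripping the literal "$".
raw$price <- gsub("$", "", raw$price, fixed = TRUE)

# Step 3: re-run the automated conversion on the character columns.
clean <- type_convert(raw)

# A single character vector can also be parsed on its own.
ints <- parse_integer(c("1", "2", "3"))
day <- parse_date("2015-04-09")
```

After step 3, price should come back as a double column while item stays character.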

Compared to base functions

Compared to the corresponding base functions, readr functions:

  • Use a consistent naming scheme for the parameters (e.g. col_names and col_types not header and colClasses).

  • Are much faster (up to 10x faster).

  • Have a helpful progress bar if loading is going to take a while.

  • Work exactly the same way regardless of the current locale. To override the US-centric defaults, use locale().
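To illustrate the naming difference, here is the same headerless file read once with base R and once with readr. The file and the column names (id, label) are invented for the example.

```r
library(readr)

# Hypothetical example data without a header row.
path <- tempfile(fileext = ".csv")
writeLines(c("1,a", "2,b"), path)

# Base R: header, col.names and colClasses.
base_df <- read.csv(
  path,
  header = FALSE,
  col.names = c("id", "label"),
  colClasses = c("integer", "character")
)

# readr: col_names and col_types ("ic" = integer, character).
readr_df <- read_csv(path, col_names = c("id", "label"), col_types = "ic")
```

Both calls should yield an integer id column and a character label column; the difference is in how the arguments are named and in the class of the returned data frame.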

Compared to fread()

data.table has a function similar to read_csv() called fread. Compared to fread, readr:

  • Is slower (currently ~1.2-2x slower). If you want absolutely the best performance, use data.table::fread().

  • Has a slightly more sophisticated parser, recognising both doubled ("""") and backslash ("\"") escapes. Readr also allows you to read factors and date times directly from disk.

  • fread() saves you work by automatically guessing the delimiter, whether or not the file has a header, how many lines to skip by default and more. Readr forces you to supply these parameters.

  • The underlying designs are quite different. Readr is designed to be general, and dealing with new types of rectangular data just requires implementing a new tokenizer. fread() is designed to be as fast as possible. fread() is pure C, readr is C++ (and Rcpp).
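A small sketch of two of the parser points above: a field with doubled quotes, and factor and date columns read directly from the file. The data and column names are made up, and the factor levels are supplied explicitly as an assumption about the data.

```r
library(readr)

# Hypothetical example data; the first text field contains doubled quotes.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "text,size,day",
  "\"She said \"\"hi\"\"\",small,2015-04-09",
  "\"plain\",large,2015-04-10"
), path)

df <- read_csv(path, col_types = cols(
  text = col_character(),
  size = col_factor(levels = c("small", "large")),  # read straight to factor
  day = col_date()                                  # ISO 8601 dates
))

df$text[1]  # the doubled quotes collapse to single quote characters
```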

Acknowledgements

Thanks to:

  • Joe Cheng for showing me the beauty of deterministic finite automata for parsing, and for teaching me why I should write a tokenizer.

  • JJ Allaire for helping me come up with a design that makes very few copies, and is easy to extend.

  • Dirk Eddelbuettel for coming up with the name!
