replace csvread/write with DataFrames reader code #1893

HarlanH · 2013-01-05T18:05:40Z

Per #1852, it seems like a good idea to have dlmread/write be for the simple, numeric case, and csvread/write be for the arbitrarily/annoyingly complex general case. This issue is for design of the CSV methods.

The relevant DataFrames code is roughly read_separated_text() and parts of read_table(). The plan would be to separate out the code so that Base Julia would be able to read a CSV file into an optional header vector and a String array. The DataFrames package would take that result and do the additional work to convert_to_dataframe().

Some questions:

What are proper names for the top-level functions? csvread for Base and read_table for DataFrames seems a little weird.
csvread should return a (Array{String,2}, Vector{String}) tuple, I think? The latter being null if there's no header.
- The current read_table methods may not have the lines drawn in quite the right places. There's a single-argument filename method, and a method with the following arguments: io::IOStream, separator::Char, quotation_character::Char, missingness_indicators::Vector{R}, header::Bool, column_names::Vector{S}, nrows::Int. The filename method opens the file, then extracts the column names and number of rows (possibly an overestimate because of embedded newlines) before passing that information, the IO stream, and some defaults into the specific method. It seems to me that this should be rethought a bit.
Missingness indicators don't apply to csvread, but separators, quotes, and headers do.
Keyword arguments...
Right now, the code always builds UTF-8 objects, which hurts performance if the file is strictly ASCII. We've talked about improving performance by defaulting to ASCII, then failing and restarting in UTF-8 mode if a non-ASCII character is seen. An option to force UTF-8 mode would make sense too.
readline() is currently defined as readuntil(io, '\n'). At some point EOL processing will have to be improved.
It would be desirable to have a variant that returns an iterator giving one row at a time.

...what else?

The text was updated successfully, but these errors were encountered:

johnmyleswhite · 2013-01-05T22:14:12Z

One major requirement that I think you're missing is the ability to reuse the Array{String,2} across many calls. The I/O code we have now was designed to handle the streaming data case in which you read in a portion of the file on each call. Reallocating memory instead of reusing will slow down the I/O system substantially.

quinnj · 2013-06-18T18:01:20Z

This can probably be closed after the discussion on #3350. I think the consensus there was to keep readdlm and readcsv fairly simple in Base, and leave DataFrames the more complicated stuff.

johnmyleswhite · 2013-06-18T18:53:10Z

Yes, I would close this and #1899 as well.

HarlanH mentioned this issue Jan 5, 2013

default splitter in dlmread #1852

Closed

johnmyleswhite mentioned this issue Jan 6, 2013

RFC: Draft of code for new csvread/csvwrite #1899

Closed

JeffBezanson closed this as completed Jun 19, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

replace csvread/write with DataFrames reader code #1893

replace csvread/write with DataFrames reader code #1893

HarlanH commented Jan 5, 2013

johnmyleswhite commented Jan 5, 2013

quinnj commented Jun 18, 2013

johnmyleswhite commented Jun 18, 2013

replace csvread/write with DataFrames reader code #1893

replace csvread/write with DataFrames reader code #1893

Comments

HarlanH commented Jan 5, 2013

johnmyleswhite commented Jan 5, 2013

quinnj commented Jun 18, 2013

johnmyleswhite commented Jun 18, 2013