Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace csvread/write with DataFrames reader code #1893

Closed
HarlanH opened this issue Jan 5, 2013 · 3 comments
Closed

replace csvread/write with DataFrames reader code #1893

HarlanH opened this issue Jan 5, 2013 · 3 comments

Comments

@HarlanH
Copy link
Contributor

HarlanH commented Jan 5, 2013

Per #1852, it seems like a good idea to have dlmread/write be for the simple, numeric case, and csvread/write be for the arbitrarily/annoyingly complex general case. This issue is for design of the CSV methods.

The relevant DataFrames code is roughly read_separated_text() and parts of read_table(). The plan would be to separate out the code so that Base Julia would be able to read a CSV file into an optional header vector and a String array. The DataFrames package would take that result and do the additional work to convert_to_dataframe().

Some questions:

  • What are proper names for the top-level functions? csvread for Base and read_table for DataFrames seems a little weird.
  • csvread should return a (Array{String,2}, Vector{String}) tuple, I think? The latter being null if there's no header.
    • The current read_table methods may not have the lines drawn in quite the right places. There's a single-argument filename method, and a method with the following arguments: io::IOStream, separator::Char, quotation_character::Char, missingness_indicators::Vector{R}, header::Bool, column_names::Vector{S}, nrows::Int. The filename method opens the file, then extracts the column names and number of rows (possibly an overestimate because of embedded newlines) before passing that information, the IO stream, and some defaults into the specific method. It seems to me that this should be rethought a bit.
  • Missingness indicators don't apply to csvread, but separators, quotes, and headers do.
  • Keyword arguments...
  • Right now, the code always builds UTF-8 objects, which hurts performance if the file is strictly ASCII. We've talked about improving performance by defaulting to ASCII, then failing and restarting in UTF-8 mode if a non-ASCII character is seen. An option to force UTF-8 mode would make sense too.
  • readline() is currently defined as readuntil(io, '\n'). At some point EOL processing will have to be improved.
  • It would be desirable to have a variant that returns an iterator giving one row at a time.

...what else?

@johnmyleswhite
Copy link
Member

One major requirement that I think you're missing is the ability to reuse the Array{String,2} across many calls. The I/O code we have now was designed to handle the streaming data case in which you read in a portion of the file on each call. Reallocating memory instead of reusing will slow down the I/O system substantially.

@quinnj
Copy link
Member

quinnj commented Jun 18, 2013

This can probably be closed after the discussion on #3350. I think the consensus there was to keep readdlm and readcsv fairly simple in Base, and leave DataFrames the more complicated stuff.

@johnmyleswhite
Copy link
Member

Yes, I would close this and #1899 as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants