You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Per #1852, it seems like a good idea to have dlmread/write be for the simple, numeric case, and csvread/write be for the arbitrarily/annoyingly complex general case. This issue is for design of the CSV methods.
The relevant DataFrames code is roughly read_separated_text() and parts of read_table(). The plan would be to separate out the code so that Base Julia would be able to read a CSV file into an optional header vector and a String array. The DataFrames package would take that result and do the additional work to convert_to_dataframe().
Some questions:
What are proper names for the top-level functions? csvread for Base and read_table for DataFrames seems a little weird.
csvread should return a (Array{String,2}, Vector{String}) tuple, I think? The latter being null if there's no header.
The current read_table methods may not have the lines drawn in quite the right places. There's a single-argument filename method, and a method with the following arguments: io::IOStream, separator::Char, quotation_character::Char, missingness_indicators::Vector{R}, header::Bool, column_names::Vector{S}, nrows::Int. The filename method opens the file, then extracts the column names and number of rows (possibly an overestimate because of embedded newlines) before passing that information, the IO stream, and some defaults into the specific method. It seems to me that this should be rethought a bit.
Missingness indicators don't apply to csvread, but separators, quotes, and headers do.
Keyword arguments...
Right now, the code always builds UTF-8 objects, which hurts performance if the file is strictly ASCII. We've talked about improving performance by defaulting to ASCII, then failing and restarting in UTF-8 mode if a non-ASCII character is seen. An option to force UTF-8 mode would make sense too.
readline() is currently defined as readuntil(io, '\n'). At some point EOL processing will have to be improved.
It would be desirable to have a variant that returns an iterator giving one row at a time.
...what else?
The text was updated successfully, but these errors were encountered:
One major requirement that I think you're missing is the ability to reuse the Array{String,2} across many calls. The I/O code we have now was designed to handle the streaming data case in which you read in a portion of the file on each call. Reallocating memory instead of reusing will slow down the I/O system substantially.
This can probably be closed after the discussion on #3350. I think the consensus there was to keep readdlm and readcsv fairly simple in Base, and leave DataFrames the more complicated stuff.
Per #1852, it seems like a good idea to have
dlmread/write
be for the simple, numeric case, andcsvread/write
be for the arbitrarily/annoyingly complex general case. This issue is for design of the CSV methods.The relevant DataFrames code is roughly
read_separated_text()
and parts ofread_table()
. The plan would be to separate out the code so that Base Julia would be able to read a CSV file into an optional header vector and a String array. The DataFrames package would take that result and do the additional work toconvert_to_dataframe()
.Some questions:
csvread
for Base andread_table
for DataFrames seems a little weird.csvread
should return a(Array{String,2}, Vector{String})
tuple, I think? The latter being null if there's no header.read_table
methods may not have the lines drawn in quite the right places. There's a single-argument filename method, and a method with the following arguments:io::IOStream, separator::Char, quotation_character::Char, missingness_indicators::Vector{R}, header::Bool, column_names::Vector{S}, nrows::Int
. The filename method opens the file, then extracts the column names and number of rows (possibly an overestimate because of embedded newlines) before passing that information, the IO stream, and some defaults into the specific method. It seems to me that this should be rethought a bit.readline()
is currently defined asreaduntil(io, '\n')
. At some point EOL processing will have to be improved....what else?
The text was updated successfully, but these errors were encountered: