default splitter in dlmread #1852

JeffBezanson · 2012-12-30T22:10:35Z

Somebody brought this up on the list. dlmwrite uses ',' by default, but dlmread calls split(line) by default, which splits on whitespace and also discards empty fields. If a delimiter is specified, it uses split(line, dlm, true). I don't remember why we did this, but perhaps the default should just be split(line, ',', true). The handling of empty fields can be changed, but if so, then for both cases.
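To make the mismatch concrete, here is a minimal sketch in Python rather than Julia. Python is used only as a runnable stand-in: its `str.split` happens to have the same two modes described above (no argument splits on whitespace and drops empty fields; an explicit delimiter keeps them). The sample rows are made up for illustration.

```python
# Python stand-in for the two splitting modes discussed above.
line = "1,,3"  # hypothetical row, as a comma-writing dlmwrite might emit it

# No delimiter: split on runs of whitespace and drop empty fields.
# Commas are not whitespace, so the whole row comes back as one field.
whitespace_fields = line.split()

# Explicit delimiter: split on every comma and keep empty fields.
comma_fields = line.split(",")

print(whitespace_fields)  # ['1,,3']
print(comma_fields)       # ['1', '', '3']

# Whitespace mode also collapses repeated separators, so an empty
# column silently disappears and later columns shift left.
collapsed = "1  3".split()
print(collapsed)          # ['1', '3']
```

The last case is why the empty-field behavior matters independently of the default delimiter: a reader that drops empty fields cannot round-trip a row with a missing value.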

HarlanH · 2013-01-05T14:59:11Z

I definitely think that dlmread should default to comma separation, matching dlmwrite, and I tend to think that empty fields should not be discarded in either case.

StefanKarpinski · 2013-01-05T15:22:02Z

I think splitting on whitespace when reading and writing output using tabs would be sensible. If you want to read csv, you should use a real csv reader.

HarlanH · 2013-01-05T16:46:29Z

OK. Worth noting here that DataFrames.jl has a souped-up CSV reader that now supports escaped quotes, newlines in quoted fields, UTF-8, etc. It might be worth replacing the existing csvread (now just dlmread with different defaults) with the non-DF-specific parts of that code at some point. read_separated_text() returns an Array of strings.

https://github.com/HarlanH/DataFrames.jl/blob/master/src/io.jl

johnmyleswhite · 2013-01-05T16:53:00Z

+1 for @HarlanH's idea. If our read_separated_text() function were used to replace csvread, it would:

Give Julia a much more robust default reader for delimited data
Increase the number of eyeballs looking at that code, which would help the performance of DataFrames

JeffBezanson · 2013-01-05T17:17:38Z

Is there a clear difference between TSV and CSV, e.g. that TSV is just tab-separated numbers, but CSV might include various other formatting? We might want to have a separate function for simple delimited data if "real csv" is complex.
In any case, using your parsing code sounds good to me. Feel free to replace/incorporate with csvread.

HarlanH · 2013-01-05T17:41:45Z

My understanding is that it's all a holy mess with no standards and no consistency. It's not as simple as "TSV is simple and CSV is complex". But we could pretend it is, with dlmread/write being just whitespace-separation and no quotes, and csvread (aka read_separated_text) being arbitrarily complex. Will start an issue to discuss some details...
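The gap between a simple delimiter split and a quote-aware reader can be sketched as follows. This is Python, not the DataFrames.jl code: it uses Python's standard `csv` module as a stand-in for a "real csv reader", and the sample record is hypothetical.

```python
import csv
import io

# Hypothetical record whose second field contains the delimiter
# and an escaped (doubled) quote.
text = 'id,comment\n1,"said ""hi"", then left"\n'

# Simple delimited reading: split each line on the delimiter, no quoting rules.
naive = [line.split(",") for line in text.splitlines()]

# Quote-aware reading using Python's standard csv module.
parsed = list(csv.reader(io.StringIO(text)))

print(naive[1])   # ['1', '"said ""hi""', ' then left"']
print(parsed[1])  # ['1', 'said "hi", then left']
```

The simple split is fine for plain numeric data but breaks the quoted field into two pieces, which is the kind of input the more complex reader would be responsible for.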

johnmyleswhite · 2013-01-05T18:04:50Z

I think delimited files are like HTML: there are standards, but lots of buggy examples that systems like Excel have evolved to accept. I think that TSV can be used with all of the complexity of CSV, but that this is required less often because of the relatively low frequency with which a field contains a tab that would have to be escaped.

Let's open another issue. I'll prepare a pull request for read_separated_text() before the end of the day.

HarlanH · 2013-01-05T18:08:59Z

#1893.

Sounds like the resolution to the original issue is as Stefan suggests: dlmread splits on whitespace; dlmwrite defaults to tabs.

johnmyleswhite · 2013-01-05T19:01:11Z

For future reference: the IANA standard for TSV disallows tabs within fields. See http://www.iana.org/assignments/media-types/text/tab-separated-values

StefanKarpinski · 2013-01-05T19:13:26Z

Presumably newlines are also not allowed in fields, although it doesn't say that.
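If fields can contain neither tabs nor newlines, a TSV reader needs no quoting or escaping logic at all. A minimal sketch under that assumption, in Python; `parse_tsv` is a hypothetical helper name, not a function from Julia or DataFrames.jl:

```python
def parse_tsv(text):
    """Split TSV text into rows of string fields, keeping empty fields.

    Assumes no field contains a tab or a newline, so every tab is a
    field separator and every line break is a record separator.
    """
    return [line.split("\t") for line in text.splitlines()]


rows = parse_tsv("name\tvalue\nx\t\ny\t2\n")
print(rows)  # [['name', 'value'], ['x', ''], ['y', '2']]
```

Note that the empty value in the second record survives as an empty string, which is the keep-empty-fields behavior discussed earlier in the thread.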

HarlanH mentioned this issue Jan 5, 2013

replace csvread/write with DataFrames reader code #1893

Closed

JeffBezanson closed this as completed in 273b46d Jan 5, 2013